Text Analysis in Python3


In this assignment we work with files. Files are everywhere in computing: they are an essential part of any computer system, and an operating system itself consists of a large number of files.

Python handles two types of files: text files and binary files.
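As a small illustration of the difference, the sketch below writes a sample file and reads it back in both modes. The file name sample.txt and the temporary directory are assumptions made only for this example.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

# Text mode ("r") decodes the bytes and returns a str.
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode ("rb") returns the raw bytes without decoding them.
with open(path, "rb") as f:
    as_bytes = f.read()

print(type(as_text).__name__, len(as_text))
print(type(as_bytes).__name__, len(as_bytes))
```

The two lengths differ because the accented letter is one character in text mode but two bytes in UTF-8.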

Text Analysis

Here we discuss text files.

We focus on some important measures that can be computed from a text file.

  • Number of words
  • Number of characters
  • Average word length
  • Number of stop words
  • Number of special characters
  • Number of numeric
  • Number of uppercase words

We have a test file, "css3.txt", and all the examples below work on that file.

Number of words

To count the number of words in a sentence, the easiest way is the split function, which breaks a string into a list of words at whitespace. We apply the same function to the contents of the file.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   # Read the whole file into a single string
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, file not found: " + filename
   print(message)
else:
   # split() with no arguments splits on any run of whitespace
   words = contents.split()
   number_words = len(words)
   print("Total words of", filename, "is", number_words)

Output

Total words of C:/Users/TP/Desktop/css3.txt is 3574
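To see what split does without needing the test file, here is a minimal sketch on an in-memory string. The sample text and the helper name count_words are made up for illustration.

```python
def count_words(text):
    # split() with no arguments treats spaces, tabs and newlines alike
    # and ignores repeated whitespace.
    return len(text.split())

sample = "CSS3  is the latest\tversion\nof CSS"
print(sample.split())
print(count_words(sample))
```

Note that punctuation stays attached to the words, so "CSS." and "CSS" would be counted as words either way but would not compare equal.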

Number of characters

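Here we count the total number of characters in the words of the file. The length of a word is its number of characters (a word of length 5 has 5 characters), so we add up the lengths of all the words. The whitespace between words is not counted.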

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, file not found: " + filename
   print(message)
else:
   wordslist = contents.split()
   # Add up the length of every word; whitespace is not counted
   characters = sum(len(word) for word in wordslist)
   print("TOTAL CHARACTERS IN A TEXT FILE =", characters)

Output

TOTAL CHARACTERS IN A TEXT FILE = 17783
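The count above leaves out whitespace. The following sketch, using a made-up sample string and a hypothetical helper character_counts, compares it with the length of the whole text.

```python
def character_counts(text):
    # Characters inside words only (spaces, tabs and newlines excluded).
    without_whitespace = sum(len(word) for word in text.split())
    # Every character in the text, whitespace included.
    with_whitespace = len(text)
    return without_whitespace, with_whitespace

sample = "Hello big world\n"
print(character_counts(sample))
```

Which of the two numbers is wanted depends on the analysis; the examples in this article use the first.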

Average word length

Here, we calculate the sum of the lengths of all the words and divide it by the total number of words.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, file not found: " + filename
   print(message)
else:
   wordslist = contents.split()
   words = len(wordslist)
   # Total length of all words divided by the number of words
   # (assumes the file contains at least one word)
   average = sum(len(word) for word in wordslist) / words
   print("Average=", average)

Output

Average= 4.97
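An empty file would make the division fail with ZeroDivisionError. A minimal sketch of one way to guard against that follows; the helper name average_word_length and the choice to return 0.0 for empty text are assumptions.

```python
def average_word_length(text):
    words = text.split()
    if not words:
        # No words at all: return 0.0 instead of dividing by zero.
        return 0.0
    return sum(len(word) for word in words) / len(words)

print(round(average_word_length("one three five"), 2))
print(average_word_length(""))
```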

Number of stop words

Stop words are very common words such as "is" or "a". To find them we use NLTK, a natural language processing (NLP) library for Python.

Example code

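from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# The stop word list and the tokenizer data must be downloaded once, e.g.
# nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
my_example_sent = "This is a sample sentence"
mystop_words = set(stopwords.words('english'))
my_word_tokens = word_tokenize(my_example_sent)
# Compare in lowercase, because the stop word list is lowercase
my_stop_words_found = [w for w in my_word_tokens if w.lower() in mystop_words]
my_filtered_sentence = [w for w in my_word_tokens if w.lower() not in mystop_words]
print(my_word_tokens)
print(my_filtered_sentence)
print("Number of stop words =", len(my_stop_words_found))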
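If NLTK is not available, the same idea can be sketched with a hand-written stop word set. The set SAMPLE_STOP_WORDS below is a tiny made-up list for illustration only, not the NLTK list, and count_stop_words is a hypothetical helper.

```python
# Tiny illustrative stop word set (an assumption, far smaller than NLTK's).
SAMPLE_STOP_WORDS = {"a", "an", "the", "is", "this", "of", "in", "and"}

def count_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    count = 0
    for word in text.split():
        # Lowercase the word and drop surrounding punctuation before the lookup.
        if word.lower().strip(".,!?;:") in stop_words:
            count += 1
    return count

print(count_stop_words("This is a sample sentence."))
```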

Number of special characters

Here we calculate the number of hashtags or mentions present in the text. This helps to extract extra information from our text data.

Example code

import collections as ct
filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, file not found: " + filename
   print(message)
else:
   words = contents.split()
   # A hashtag starts with "#" and a mention starts with "@"
   special_chars = ("#", "@")
   # Count every word that begins with one of the special characters
   new = sum(v for k, v in ct.Counter(words).items() if k.startswith(special_chars))
   print("Total Special Characters", new)

Output

Total Special Characters 0
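Hashtags and mentions can also be picked out with a regular expression, which reports them separately and ignores trailing punctuation. This is only a sketch: the pattern, the sample text and the helper name count_hashtags_and_mentions are assumptions.

```python
import re

def count_hashtags_and_mentions(text):
    # "#" or "@" followed by word characters, and not preceded by a word
    # character (so the "@" inside an e-mail address is skipped).
    hashtags = re.findall(r"(?<!\w)#\w+", text)
    mentions = re.findall(r"(?<!\w)@\w+", text)
    return len(hashtags), len(mentions)

sample = "Learning #css3 and #python with @karthikeya, mail me at a@b.com"
print(count_hashtags_and_mentions(sample))
```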

Number of numeric

Here we calculate the number of numeric tokens present in the text file. The approach is similar to the previous counts: we split the text into words and count the words that consist only of digits.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, file not found: " + filename
   print(message)
else:
   # str.isdigit is True only for words made entirely of digits
   words = sum(map(str.isdigit, contents.split()))
   print("TOTAL NUMERIC IN A TEXT FILE =", words)

Output

TOTAL NUMERIC IN A TEXT FILE = 2
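str.isdigit() is true only for tokens made entirely of digits, so a token such as "2019," or "3.14" is not counted. If those should count as well, one possible sketch is shown below; the punctuation set, the pattern and the helper name count_numeric_tokens are assumptions.

```python
import re

# An integer, optionally followed by a decimal part.
NUMBER_PATTERN = re.compile(r"\d+(\.\d+)?")

def count_numeric_tokens(text):
    count = 0
    for token in text.split():
        # Drop punctuation around the token before testing it.
        if NUMBER_PATTERN.fullmatch(token.strip(".,;:!?()")):
            count += 1
    return count

sample = "CSS3 was updated in 2019, version 3.14 has 2 parts."
print(sum(map(str.isdigit, sample.split())))  # strict isdigit count
print(count_numeric_tokens(sample))           # more lenient count
```

"CSS3" is rejected by both counts because it mixes letters and digits.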

Number of uppercase words

Using the isupper() function, we can calculate the number of uppercase words in the text.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, file not found: " + filename
   print(message)
else:
   # str.isupper is True when every cased letter in the word is uppercase
   words = sum(map(str.isupper, contents.split()))
   print("TOTAL UPPERCASE WORDS IN A TEXT FILE =", words)

Output

TOTAL UPPERCASE WORDS IN A TEXT FILE = 121
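Since every measure above starts from the same list of words, they can be gathered in one pass. The function below is a minimal sketch under that assumption; the name summarize, the dictionary keys and the optional stop_words parameter are made up for illustration.

```python
def summarize(text, stop_words=frozenset()):
    words = text.split()
    total_chars = sum(len(w) for w in words)
    return {
        "words": len(words),
        "characters": total_chars,
        "average_word_length": total_chars / len(words) if words else 0.0,
        # Stop words are matched in lowercase against the given set.
        "stop_words": sum(w.lower() in stop_words for w in words),
        # Words that start like a hashtag or a mention.
        "special": sum(w.startswith(("#", "@")) for w in words),
        "numeric": sum(w.isdigit() for w in words),
        "uppercase": sum(w.isupper() for w in words),
    }

sample = "CSS3 is the #1 styling language of 2019"
print(summarize(sample, stop_words={"is", "the", "of"}))
```

To use it on a file, read the contents as in the earlier examples and pass the resulting string to summarize.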

karthikeya Boyini

Updated on: 30-Jul-2019
