Classification of Text Documents using the Naive Bayes approach in Python


The Naive Bayes algorithm is a powerful tool for classifying documents or text into different categories. For example, if a document contains words like ‘humid’, ‘rainy’, or ‘cloudy’, we can use Bayes’ theorem to check whether the document falls into the ‘sunny day’ category or the ‘rainy day’ category.

Note that the Naive Bayes algorithm works on the assumption that the words of a document are independent of each other given its class. Given the nuances of language, this is rarely true, which is why the algorithm’s name contains the term ‘naive’; nonetheless, it performs well enough in practice.
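Concretely, the independence assumption means each class is scored as the class prior multiplied by the per-word probabilities. A minimal sketch of this scoring rule, using made-up probabilities purely for illustration:

```python
# Naive Bayes scores each class as P(class) * product of P(word | class),
# treating every word as independent of the others given the class.
# All numbers below are made-up, purely illustrative values.
priors = {"rainy day": 0.5, "sunny day": 0.5}
word_probs = {
    "rainy day": {"humid": 0.30, "rainy": 0.40, "cloudy": 0.20},
    "sunny day": {"humid": 0.10, "rainy": 0.05, "cloudy": 0.10},
}

document = ["humid", "rainy", "cloudy"]
scores = {}
for label in priors:
    score = priors[label]
    for word in document:
        score *= word_probs[label][word]  # the independence assumption
    scores[label] = score

best = max(scores, key=scores.get)
print(best)  # → rainy day
```

The class with the larger product wins; the full example below builds these per-word probabilities from word counts instead of hard-coding them.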

Algorithm

  • Step 1 − Input the number of documents, the text strings and their corresponding classes. Split the text into keywords using lists, and input the string/text to be classified.

  • Step 2 − Create a list storing the frequency of every keyword in each document. Print it in tabular form using the prettytable library, with suitable column headings.

  • Step 3 − Count the total number of words and documents belonging to each class, Positive and Negative.

  • Step 4 − Find the probability of each word for the Positive class and round it off to 4-digit precision.

  • Step 5 − Find the probability of the Positive class using the Bayes formula and round it off to 8-digit precision.

  • Step 6 − Repeat the above two steps for the Negative class.

  • Step 7 − Compare the resulting probabilities of both classes and print the result.
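The steps above can be condensed into a short sketch using `collections.Counter` (variable names here are illustrative; the data matches the worked example that follows):

```python
from collections import Counter

docs = [("they love laugh and pray", "Positive"),
        ("without faith you suffer", "Negative")]
to_find = "suffer without love laugh and pray"

# Steps 1-3: split documents, build the vocabulary and per-class word counts
vocab = sorted({w for text, _ in docs for w in text.split()})
class_counts = {c: Counter() for _, c in docs}
for text, c in docs:
    class_counts[c].update(text.split())

# Steps 4-6: Laplace-smoothed word probabilities and class scores
scores = {}
for c, counts in class_counts.items():
    total = sum(counts.values())
    prior = sum(1 for _, cc in docs if cc == c) / len(docs)
    score = prior
    for w in to_find.split():
        score *= (counts[w] + 1) / (total + len(vocab))
    scores[c] = score

# Step 7: pick the class with the higher score
print(max(scores, key=scores.get))  # → Positive
```

The longer listing below performs the same computation step by step, with intermediate rounding and a printed frequency table.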

Example

In this example, for the sake of simplicity, we will take only two documents containing one sentence each and perform Naive Bayes classification on a string similar to both sentences. Each document has a class, and our aim is to conclude which class the string under test belongs to.

#Step 1 - Input the required data and split the text and keywords
total_documents = 2
text_list = ["they love laugh and pray", "without faith you suffer"]
category_list = ["Positive", "Negative"]
doc_class = []
keywords = []
for i in range(total_documents):
   doc_class.append([])
   text = text_list[i]
   category = category_list[i]
   doc_class[i].append(text.split())
   doc_class[i].append(category)
   keywords.extend(text.split())
keywords = sorted(set(keywords))
to_find = "suffer without love laugh and pray"

#step 2 - make frequency table for keywords and print the table
probability_table = []
for i in range(total_documents):
   probability_table.append([0] * len(keywords))
for i in range(total_documents):
   for k in range(len(keywords)):
      probability_table[i][k] = doc_class[i][0].count(keywords[k])
print('\n')
import prettytable
keywords.insert(0, 'Document Number')
keywords.append("Class/Category")
Prob_Table = prettytable.PrettyTable()
Prob_Table.field_names = keywords
Prob_Table.title = 'Probability table'
x=0
for i in probability_table:
   i.insert(0,x+1)
   i.append(doc_class[x][1])
   Prob_Table.add_row(i)
   x=x+1
print(Prob_Table)
print('\n')
for i in probability_table:
   i.pop(0)
    
#step 3 - count the words and documents based on categories    
totalpluswords=0
totalnegwords=0
totalplus=0
totalneg=0
vocabulary=len(keywords)-2   #exclude the two heading entries added in step 2
for i in probability_table:
   if i[len(i)-1]=="Positive":
      totalplus+=1
      totalpluswords+=sum(i[0:len(i)-1])
   else:
      totalneg+=1
      totalnegwords+=sum(i[0:len(i)-1])
keywords.pop(0)
keywords.pop(len(keywords)-1)

#step - 4 Find probability of each word for positive class
temp=[]
for i in to_find.split():
   count=0
   x=keywords.index(i)
   for j in probability_table:
      if j[len(j)-1]=="Positive":
         count=count+j[x]
   temp.append(count)
   count=0
for i in range(len(temp)):
   temp[i]=format((temp[i]+1)/(vocabulary+totalpluswords),".4f")
print()
temp=[float(f) for f in temp]
print("Probabilities of each word in the 'Positive' category are: ")
h=0
for i in to_find.split():
   print(f"P({i}/+) = {temp[h]}")
   h=h+1
print()

#step - 5 Find probability of class using Bayes formula
prob_pos=float(format((totalplus)/(totalplus+totalneg),".8f"))
for i in temp:
   prob_pos=prob_pos*i
prob_pos=format(prob_pos,".8f")
print("Probability of text in 'Positive' class is :",prob_pos)
print()

#step - 6 Repeat above two steps for the negative class
temp=[]
for i in to_find.split():
   count=0
   x=keywords.index(i)
   for j in probability_table:
      if j[len(j)-1]=="Negative":
         count=count+j[x]
   temp.append(count)
   count=0
for i in range(len(temp)):
   temp[i]=format((temp[i]+1)/(vocabulary+totalnegwords),".4f")
print()
temp=[float(f) for f in temp]
print("Probabilities of each word in the 'Negative' category are: ")
h=0
for i in to_find.split():
   print(f"P({i}/-) = {temp[h]}")
   h=h+1
print()
prob_neg=float(format((totalneg)/(totalplus+totalneg),".8f"))
for i in temp:
   prob_neg=prob_neg*i
prob_neg=format(prob_neg,".8f")
print("Probability of text in 'Negative' class is :",prob_neg)
print('\n')

#step - 7 Compare the probabilities and print the result 
if float(prob_pos) > float(prob_neg):
   print(f"By Naive Bayes Classification, we can conclude that the given text belongs to the 'Positive' class with probability {prob_pos}")
else:
   print(f"By Naive Bayes Classification, we can conclude that the given text belongs to the 'Negative' class with probability {prob_neg}")
print('\n')

We iterate over each document to store the keywords in a separate list, then record the frequency of each keyword in every document and print a frequency table. The code counts the words and documents belonging to the Positive and Negative classes and determines the size of the vocabulary of unique keywords.

Then, we calculate the probability of each keyword in the Positive category: we iterate over the keywords of the input text, count their occurrences in the Positive documents, and store the resulting Laplace-smoothed probabilities in a new list. Using the Bayes formula, we then calculate the probability that the input text belongs to the Positive category. Similarly, we calculate and store the probability of each keyword in the Negative category. Finally, we compare the probabilities of both categories and report the category with the higher probability.
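One practical caveat: multiplying many probabilities smaller than 1 drives the class scores toward zero and will eventually underflow for longer texts. A common remedy, not used in the code above, is to sum log-probabilities instead. A sketch with illustrative word probabilities:

```python
import math

# Summing logs is numerically safer than multiplying raw probabilities.
# The per-word probabilities below are illustrative placeholder values.
word_probs = [0.0714, 0.0714, 0.1429, 0.1429, 0.1429, 0.1429]
prior = 0.5

log_score = math.log(prior) + sum(math.log(p) for p in word_probs)
print(log_score)            # comparable across classes without underflow
print(math.exp(log_score))  # back to a raw probability if needed
```

Comparing log scores between classes gives the same winner as comparing the raw products, since the logarithm is monotonic.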

Output

Probabilities of each word in the 'Positive' category are: 
P(suffer/+) = 0.0714
P(without/+) = 0.0714
P(love/+) = 0.1429
P(laugh/+) = 0.1429
P(and/+) = 0.1429
P(pray/+) = 0.1429

Probability of text in 'Positive' class is : 0.00000106

Probabilities of each word in the 'Negative' category are: 
P(suffer/-) = 0.1538
P(without/-) = 0.1538
P(love/-) = 0.0769
P(laugh/-) = 0.0769
P(and/-) = 0.0769
P(pray/-) = 0.0769

Probability of text in 'Negative' class is : 0.00000041

By Naive Bayes Classification, we can conclude that the given text belongs to the 'Positive' class with probability 0.00000106

Conclusion

The Naive Bayes algorithm works remarkably well without much training data. However, for new words that are not present in the training documents, it may give poor results or errors unless smoothing is applied. Nonetheless, the algorithm finds great use in real-time prediction and filtering-based functionality. Other classification algorithms include Logistic Regression, Decision Trees, Random Forests, etc.
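The unseen-word problem is commonly softened by the same Laplace (add-one) smoothing used in the example above: a word never observed in a class still receives a small nonzero probability instead of zeroing out the entire product. A minimal sketch (the function and counts below are illustrative, not from any library):

```python
def smoothed_prob(word, class_counts, vocab_size):
    """Laplace-smoothed P(word | class); an unseen word gets a small
    nonzero probability instead of 0."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + 1) / (total + vocab_size)

# Word counts of the Positive document from the example above
positive_counts = {"they": 1, "love": 1, "laugh": 1, "and": 1, "pray": 1}
print(smoothed_prob("love", positive_counts, 9))    # a seen word
print(smoothed_prob("banana", positive_counts, 9))  # unseen, but still > 0
```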

Updated on: 07-Aug-2023
