Classification of Text Documents using the Naive Bayes approach in Python
The Naive Bayes algorithm is a simple but powerful tool for classifying text documents into categories. For example, if a document contains words like 'humid', 'rainy', or 'cloudy', the algorithm can use those word frequencies to decide whether the document more likely belongs to a 'rainy day' category than a 'sunny day' one.
The algorithm works on the assumption that the words in a document are independent of each other. This assumption is rarely true in natural language, which is why the algorithm is called 'naive', yet it still performs well enough in practice.
Algorithm Steps
Step 1: Input the training documents: the text strings and their corresponding classes. Split each text into keywords, and prepare the string to be classified.
Step 2: Create a frequency table showing how often each keyword appears in each document.
Step 3: Count the total words and documents belonging to each class (Positive and Negative).
Step 4: Calculate the probability of each word appearing in each class.
Step 5: Apply Bayes' formula to find the probability that the input text belongs to the Positive class.
Step 6: Apply Bayes' formula to find the probability that the input text belongs to the Negative class.
Step 7: Compare the two probabilities and assign the text to the class with the higher probability.
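The steps above can be sketched compactly with the standard library's collections.Counter before walking through the full program (variable names here are illustrative):

```python
from collections import Counter

# Training documents and their classes (Step 1)
docs = [("they love laugh and pray", "Positive"),
        ("without faith you suffer", "Negative")]
test_words = "suffer without love laugh and pray".split()

# Word frequencies and totals per class (Steps 2-3)
counts = {label: Counter() for _, label in docs}
for text, label in docs:
    counts[label].update(text.split())
vocabulary = {word for text, _ in docs for word in text.split()}

# Smoothed word probabilities and class scores (Steps 4-6)
scores = {}
for label, word_counts in counts.items():
    total = sum(word_counts.values())
    prior = sum(1 for _, c in docs if c == label) / len(docs)
    score = prior
    for word in test_words:
        # Add-one (Laplace) smoothing, as in the full program below
        score *= (word_counts[word] + 1) / (total + len(vocabulary))
    scores[label] = score

# Pick the class with the higher score (Step 7)
print(max(scores, key=scores.get))  # -> Positive
```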
Example
In this example, we'll classify text using two sample documents: one positive and one negative. We'll then determine which category a test sentence belongs to.
import prettytable

# Step 1 - Input data and split text into keywords
total_documents = 2
text_list = ["they love laugh and pray", "without faith you suffer"]
category_list = ["Positive", "Negative"]
doc_class = []
keywords = []
for i in range(total_documents):
    doc_class.append([])
    text = text_list[i]
    category = category_list[i]
    doc_class[i].append(text.split())
    doc_class[i].append(category)
    keywords.extend(text.split())
keywords = sorted(set(keywords))
to_find = "suffer without love laugh and pray".split()

# Step 2 - Create frequency table
probability_table = []
for i in range(total_documents):
    probability_table.append([])
    for keyword in keywords:
        count = doc_class[i][0].count(keyword)
        probability_table[i].append(count)

# Display frequency table
table_keywords = ["Document"] + keywords + ["Class"]
prob_table = prettytable.PrettyTable()
prob_table.field_names = table_keywords
prob_table.title = 'Word Frequency Table'
for i in range(total_documents):
    row = [i + 1] + probability_table[i] + [doc_class[i][1]]
    prob_table.add_row(row)
print(prob_table)

# Step 3 - Count words and documents by class
total_pos_words = 0
total_neg_words = 0
total_pos_docs = 0
total_neg_docs = 0
vocabulary = len(keywords)
for i in range(total_documents):
    if doc_class[i][1] == "Positive":
        total_pos_docs += 1
        total_pos_words += sum(probability_table[i])
    else:
        total_neg_docs += 1
        total_neg_words += sum(probability_table[i])

# Steps 4 & 5 - Calculate probabilities for the Positive class
pos_word_probs = []
for word in to_find:
    if word in keywords:
        word_index = keywords.index(word)
        count = 0
        for i in range(total_documents):
            if doc_class[i][1] == "Positive":
                count += probability_table[i][word_index]
        # Add-one smoothing
        prob = (count + 1) / (vocabulary + total_pos_words)
        pos_word_probs.append(prob)
    else:
        # Unseen word: smoothed probability with a count of zero
        pos_word_probs.append(1 / (vocabulary + total_pos_words))

print("\nProbabilities of each word in the 'Positive' category:")
for i, word in enumerate(to_find):
    print(f"P({word}/+) = {pos_word_probs[i]:.4f}")

# Calculate class probability for Positive
prob_pos = total_pos_docs / total_documents
for prob in pos_word_probs:
    prob_pos *= prob
print(f"\nProbability of text in 'Positive' class: {prob_pos:.8f}")

# Step 6 - Calculate probabilities for the Negative class
neg_word_probs = []
for word in to_find:
    if word in keywords:
        word_index = keywords.index(word)
        count = 0
        for i in range(total_documents):
            if doc_class[i][1] == "Negative":
                count += probability_table[i][word_index]
        # Add-one smoothing
        prob = (count + 1) / (vocabulary + total_neg_words)
        neg_word_probs.append(prob)
    else:
        # Unseen word: smoothed probability with a count of zero
        neg_word_probs.append(1 / (vocabulary + total_neg_words))

print("\nProbabilities of each word in the 'Negative' category:")
for i, word in enumerate(to_find):
    print(f"P({word}/-) = {neg_word_probs[i]:.4f}")

# Calculate class probability for Negative
prob_neg = total_neg_docs / total_documents
for prob in neg_word_probs:
    prob_neg *= prob
print(f"\nProbability of text in 'Negative' class: {prob_neg:.8f}")

# Step 7 - Compare and classify
if prob_pos > prob_neg:
    print(f"\nClassification: 'Positive' class with probability {prob_pos:.8f}")
else:
    print(f"\nClassification: 'Negative' class with probability {prob_neg:.8f}")
Output
+-----------------------------------------------------------------------------------------+
|                                   Word Frequency Table                                  |
+----------+-----+-------+-------+------+------+--------+------+---------+-----+----------+
| Document | and | faith | laugh | love | pray | suffer | they | without | you |  Class   |
+----------+-----+-------+-------+------+------+--------+------+---------+-----+----------+
|    1     |  1  |   0   |   1   |  1   |  1   |   0    |  1   |    0    |  0  | Positive |
|    2     |  0  |   1   |   0   |  0   |  0   |   1    |  0   |    1    |  1  | Negative |
+----------+-----+-------+-------+------+------+--------+------+---------+-----+----------+

Probabilities of each word in the 'Positive' category:
P(suffer/+) = 0.0714
P(without/+) = 0.0714
P(love/+) = 0.1429
P(laugh/+) = 0.1429
P(and/+) = 0.1429
P(pray/+) = 0.1429

Probability of text in 'Positive' class: 0.00000106

Probabilities of each word in the 'Negative' category:
P(suffer/-) = 0.1538
P(without/-) = 0.1538
P(love/-) = 0.0769
P(laugh/-) = 0.0769
P(and/-) = 0.0769
P(pray/-) = 0.0769

Probability of text in 'Negative' class: 0.00000041

Classification: 'Positive' class with probability 0.00000106
How It Works
The algorithm calculates the probability of each word belonging to each class using the formula:
P(word|class) = (count of word in class + 1) / (total words in class + vocabulary size)
The final classification score is calculated by multiplying the class prior probability by the smoothed probabilities of all words in the input text. The text is then assigned to the class with the higher score.
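As a quick numeric check of the formula, here is the smoothed probability of the word 'love' in the Positive class, using the counts from the example above (it appears once among the 5 Positive words, with 9 distinct words in the vocabulary):

```python
# Worked example of the add-one smoothing formula, using counts
# taken from the two-document example above.
count_in_class = 1        # 'love' appears once in the Positive document
total_words_in_class = 5  # the Positive document has 5 words
vocabulary_size = 9       # 9 distinct words across both documents

p = (count_in_class + 1) / (total_words_in_class + vocabulary_size)
print(round(p, 4))  # 2/14 -> 0.1429
```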
Conclusion
Naive Bayes is an effective algorithm for text classification that works well with minimal training data. While it assumes word independence (which isn't always true), it provides reliable results for document categorization and spam filtering tasks.
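In practice you would rarely hand-roll these steps. As a minimal sketch (assuming scikit-learn is installed; class and parameter names below are from that library, not from the tutorial code), the same example can be reproduced with CountVectorizer and MultinomialNB, where alpha=1.0 corresponds to the add-one smoothing used above:

```python
# Sketch of the same example with scikit-learn (assumes sklearn is installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["they love laugh and pray", "without faith you suffer"]
train_labels = ["Positive", "Negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)  # word-frequency matrix
model = MultinomialNB(alpha=1.0)           # alpha=1.0 is add-one smoothing
model.fit(X, train_labels)

test = vectorizer.transform(["suffer without love laugh and pray"])
print(model.predict(test)[0])
```

This follows the same bag-of-words model as the hand-written version, so it produces the same classification.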
