TF-IDF in Sentiment Analysis
Sentiment analysis is a natural language processing technique used to recognize and classify the emotions conveyed in text, such as social media posts or product reviews. Businesses use it to discover customer attitudes toward their goods or services, which helps them improve their offerings and make data-driven decisions. A popular technique in sentiment analysis is Term Frequency-Inverse Document Frequency (TF-IDF). It measures the significance of each word in a document relative to the corpus as a whole, helping to identify the terms that express positive or negative sentiment. By using TF-IDF features, sentiment analysis algorithms can classify the sentiment of text more accurately. In this article, we will explore TF-IDF and its use in sentiment analysis.
What is TF-IDF?
TF-IDF is a statistical measure that evaluates how important a term is to a document relative to the entire corpus of documents. It has two components:
- Term Frequency (TF): Measures how frequently a term appears in a specific document
- Inverse Document Frequency (IDF): Measures how rare a term is across the corpus; words that appear in nearly every document receive a low score
The two components are multiplied together: TF-IDF(t, d) = TF(t, d) × IDF(t), where a common formulation is IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing term t. TF-IDF is beneficial for sentiment analysis because it can handle large amounts of text data, highlights the distinctive words and phrases within a document, and gives rare, informative terms more weight than common ones. Its computational efficiency also makes it a practical choice for processing big datasets.
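To make the definitions above concrete, here is a minimal sketch that computes TF-IDF by hand for a toy two-document corpus, using the textbook formulation IDF(t) = log(N / df(t)). Note that scikit-learn's TfidfVectorizer (used later in this article) applies a smoothed variant of IDF plus normalization, so its values differ slightly.

```python
import math

# Toy corpus: two short tokenized "reviews"
docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
]

def tf(term, doc):
    # Term frequency: count of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Textbook IDF: log of total documents over documents containing the term
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "movie" appears in both documents, so its IDF (and TF-IDF) is 0
print(round(tf_idf("movie", docs[0], docs), 3))   # 0.0
# "great" appears only in the first document, so it gets a positive weight
print(round(tf_idf("great", docs[0], docs), 3))   # 0.347
```

The zero weight for "movie" is the key behavior: a word that occurs in every document carries no discriminative information, no matter how often it appears.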
Implementing TF-IDF for Sentiment Analysis
In this example, we will classify text documents as positive or negative. We use the Python programming language, a real-world dataset structure, and machine learning libraries. The procedure involves loading the libraries and the IMDb movie reviews dataset, performing preprocessing steps such as stopword removal and tokenization, creating a TF-IDF matrix with scikit-learn's TfidfVectorizer, splitting the dataset into training and testing sets, and training a logistic regression model.
Importing Necessary Libraries
We will use the IMDb movie review dataset, which consists of 50,000 film reviews and their sentiments. For demonstration purposes, we'll create a small sample dataset with a similar structure −
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Sample movie review dataset
sample_reviews = [
    "This movie is absolutely fantastic and amazing",
    "Terrible film with poor acting and bad plot",
    "Great storyline and excellent performances",
    "Boring and disappointing experience",
    "Outstanding cinematography and direction",
    "Worst movie I have ever seen",
    "Brilliant acting and wonderful script",
    "Complete waste of time and money"
]
sample_sentiments = ["positive", "negative", "positive", "negative",
                     "positive", "negative", "positive", "negative"]

# Create DataFrame
df = pd.DataFrame({
    'review': sample_reviews,
    'sentiment': sample_sentiments
})
print("Sample dataset:")
print(df.head())
Sample dataset:
                                           review sentiment
0  This movie is absolutely fantastic and amazing  positive
1     Terrible film with poor acting and bad plot  negative
2      Great storyline and excellent performances  positive
3             Boring and disappointing experience  negative
4        Outstanding cinematography and direction  positive
Preprocessing the Dataset
We'll clean the text data by removing punctuation, converting to lowercase, and removing common stop words −
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
def preprocess_text(text):
    # Remove non-alphabetic characters
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split into words and remove stop words
    words = [word for word in text.split()
             if word not in ENGLISH_STOP_WORDS and len(word) > 2]
    return ' '.join(words)

# Apply preprocessing
df['cleaned_review'] = df['review'].apply(preprocess_text)
print("Preprocessed reviews:")
for i, row in df.head(3).iterrows():
    print(f"Original: {row['review']}")
    print(f"Cleaned: {row['cleaned_review']}")
    print()
Preprocessed reviews:
Original: This movie is absolutely fantastic and amazing
Cleaned: movie absolutely fantastic amazing

Original: Terrible film with poor acting and bad plot
Cleaned: terrible film poor acting bad plot

Original: Great storyline and excellent performances
Cleaned: great storyline excellent performances
Creating the TF-IDF Matrix
We'll transform the preprocessed text data into a TF-IDF matrix, which represents the relative importance of each term in each document −
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000, lowercase=True)
# Fit and transform the cleaned reviews
X = vectorizer.fit_transform(df['cleaned_review'])
y = df['sentiment']
# Display TF-IDF matrix shape and feature names
print(f"TF-IDF matrix shape: {X.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")
print(f"Sample features: {vectorizer.get_feature_names_out()[:10]}")
# Convert to dense array for better visualization
X_dense = X.toarray()
print(f"\nFirst review TF-IDF values (first 5 features):")
print(X_dense[0][:5])
TF-IDF matrix shape: (8, 22)
Number of features: 22
Sample features: ['absolutely' 'acting' 'amazing' 'bad' 'boring' 'brilliant'
 'cinematography' 'complete' 'direction' 'disappointing']

First review TF-IDF values (first 5 features):
[0.5 0.  0.5 0.  0. ]
Training and Evaluating the Model
We'll split the data into training and testing sets, then train a logistic regression model −
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.3f}")
# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Test with new example
new_review = ["This movie is excellent and wonderful"]
new_review_cleaned = [preprocess_text(new_review[0])]
new_review_tfidf = vectorizer.transform(new_review_cleaned)
prediction = model.predict(new_review_tfidf)
print(f"\nNew review: '{new_review[0]}'")
print(f"Predicted sentiment: {prediction[0]}")
Model Accuracy: 1.000

Classification Report:
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
    positive       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
New review: 'This movie is excellent and wonderful'
Predicted sentiment: positive
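Beyond the predicted label, a logistic regression classifier can report how confident it is via predict_proba. Here is a minimal, self-contained sketch of this idea; the toy corpus and labels below are illustrative, not the dataset from above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus (hypothetical labels for demonstration)
texts = ["fantastic amazing movie", "terrible boring film",
         "wonderful excellent acting", "awful bad plot"]
labels = ["positive", "negative", "positive", "negative"]

# Build TF-IDF features and fit a classifier
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# predict_proba returns a probability per class, giving a confidence score
new = vec.transform(["amazing excellent movie"])
probs = clf.predict_proba(new)[0]
for cls, p in zip(clf.classes_, probs):
    print(f"{cls}: {p:.2f}")
```

Reporting probabilities rather than bare labels is useful in practice: reviews near 0.5 can be flagged for human review instead of being silently misclassified.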
Key Advantages of TF-IDF
| Advantage | Description |
|---|---|
| Importance Weighting | Gives higher weight to unique, discriminative terms |
| Noise Reduction | Reduces impact of common words like "the", "and" |
| Scalability | Efficient for large text corpora |
| Interpretability | Easy to understand feature importance |
Conclusion
TF-IDF is a powerful technique for extracting features from text data and is widely used in sentiment analysis applications. It identifies important terms while down-weighting common words, which makes it more effective than raw word-count approaches for text classification tasks.
