TF-IDF in Sentiment Analysis
Sentiment analysis is a natural language processing technique used to recognize and classify the emotions conveyed in text, such as social media posts or product reviews. Businesses use it to discover customer attitudes toward their goods or services, which helps them improve their offerings and make data-driven decisions. A popular technique in sentiment analysis is Term Frequency-Inverse Document Frequency (TF-IDF). It measures the significance of each word in a document relative to the corpus as a whole, helping to identify the terms that express positive or negative sentiment. By using TF-IDF features, sentiment analysis algorithms can classify the sentiment of text more accurately. In this article, we will explore TF-IDF and its use in sentiment analysis.
What is TF-IDF?
TF-IDF is a statistical measure that evaluates how important a term is to a document relative to the entire corpus of documents. It has two components:
- Term Frequency (TF): Measures how frequently a term appears in a specific document
- Inverse Document Frequency (IDF): Measures how rare a term is across the corpus; words that appear in nearly every document receive a low score
The two components are multiplied together: TF-IDF(t, d) = TF(t, d) × IDF(t), where a common formulation is IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing term t. TF-IDF is beneficial for sentiment analysis because it can handle large amounts of text data, highlights the distinctive words and phrases within a document, and gives rare, informative terms more weight than common ones. Its computational efficiency also makes it a practical choice for processing big datasets.
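To make the definitions above concrete, here is a minimal sketch that computes TF-IDF by hand for a toy two-document corpus, using the textbook formulation IDF(t) = log(N / df(t)). Note that scikit-learn's TfidfVectorizer (used later in this article) applies a smoothed variant of IDF plus normalization, so its values differ slightly.

```python
import math

# Toy corpus: two short tokenized "reviews"
docs = [
    ["great", "movie", "great", "acting"],
    ["boring", "movie"],
]

def tf(term, doc):
    # Term frequency: count of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Textbook IDF: log of total documents over documents containing the term
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "movie" appears in both documents, so its IDF (and TF-IDF) is 0
print(round(tf_idf("movie", docs[0], docs), 3))   # 0.0
# "great" appears only in the first document, so it gets a positive weight
print(round(tf_idf("great", docs[0], docs), 3))   # 0.347
```

The zero weight for "movie" is the key behavior: a word that occurs in every document carries no discriminative information, no matter how often it appears.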
Implementing TF-IDF for Sentiment Analysis
In this example, we will classify text documents as positive or negative. We use the Python programming language, a real-world dataset structure, and machine learning libraries. The procedure involves loading the libraries and the IMDb movie reviews dataset, performing preprocessing steps such as stopword removal and tokenization, creating a TF-IDF matrix with scikit-learn's TfidfVectorizer, splitting the dataset into training and testing sets, and training a logistic regression model.
Importing Necessary Libraries
We will use the IMDb movie review dataset, which consists of 50,000 film reviews and their sentiments. For demonstration purposes, we'll create a small sample dataset with a similar structure −
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Sample movie review dataset
sample_reviews = [
    "This movie is absolutely fantastic and amazing",
    "Terrible film with poor acting and bad plot",
    "Great storyline and excellent performances",
    "Boring and disappointing experience",
    "Outstanding cinematography and direction",
    "Worst movie I have ever seen",
    "Brilliant acting and wonderful script",
    "Complete waste of time and money"
]
sample_sentiments = ["positive", "negative", "positive", "negative",
                     "positive", "negative", "positive", "negative"]

# Create DataFrame
df = pd.DataFrame({
    'review': sample_reviews,
    'sentiment': sample_sentiments
})
print("Sample dataset:")
print(df.head())
Sample dataset:
                                           review sentiment
0  This movie is absolutely fantastic and amazing  positive
1     Terrible film with poor acting and bad plot  negative
2      Great storyline and excellent performances  positive
3             Boring and disappointing experience  negative
4        Outstanding cinematography and direction  positive
Preprocessing the Dataset
We'll clean the text data by removing punctuation, converting to lowercase, and removing common stop words −
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
def preprocess_text(text):
    # Remove non-alphabetic characters
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split into words and remove stop words
    words = [word for word in text.split()
             if word not in ENGLISH_STOP_WORDS and len(word) > 2]
    return ' '.join(words)

# Apply preprocessing
df['cleaned_review'] = df['review'].apply(preprocess_text)
print("Preprocessed reviews:")
for i, row in df.head(3).iterrows():
    print(f"Original: {row['review']}")
    print(f"Cleaned: {row['cleaned_review']}")
    print()
Preprocessed reviews:
Original: This movie is absolutely fantastic and amazing
Cleaned: movie absolutely fantastic amazing

Original: Terrible film with poor acting and bad plot
Cleaned: terrible film poor acting bad plot

Original: Great storyline and excellent performances
Cleaned: great storyline excellent performances
Creating the TF-IDF Matrix
We'll transform the preprocessed text data into a TF-IDF matrix, which represents the relative importance of each term in each document −
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000, lowercase=True)
# Fit and transform the cleaned reviews
X = vectorizer.fit_transform(df['cleaned_review'])
y = df['sentiment']
# Display TF-IDF matrix shape and feature names
print(f"TF-IDF matrix shape: {X.shape}")
print(f"Number of features: {len(vectorizer.get_feature_names_out())}")
print(f"Sample features: {vectorizer.get_feature_names_out()[:10]}")
# Convert to dense array for better visualization
X_dense = X.toarray()
print(f"\nFirst review TF-IDF values (first 5 features):")
print(X_dense[0][:5])
TF-IDF matrix shape: (8, 22)
Number of features: 22
Sample features: ['absolutely' 'acting' 'amazing' 'bad' 'boring' 'brilliant'
 'cinematography' 'complete' 'direction' 'disappointing']

First review TF-IDF values (first 5 features):
[0.5 0.  0.5 0.  0. ]
Training and Evaluating the Model
We'll split the data into training and testing sets, then train a logistic regression model −
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.3f}")
# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Test with new example
new_review = ["This movie is excellent and wonderful"]
new_review_cleaned = [preprocess_text(new_review[0])]
new_review_tfidf = vectorizer.transform(new_review_cleaned)
prediction = model.predict(new_review_tfidf)
print(f"\nNew review: '{new_review[0]}'")
print(f"Predicted sentiment: {prediction[0]}")
Model Accuracy: 1.000

Classification Report:
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
    positive       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
New review: 'This movie is excellent and wonderful'
Predicted sentiment: positive
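Beyond the predicted label, a logistic regression classifier can report how confident it is via predict_proba. Here is a minimal, self-contained sketch of this idea; the toy corpus and labels below are illustrative, not the dataset from above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus (hypothetical labels for demonstration)
texts = ["fantastic amazing movie", "terrible boring film",
         "wonderful excellent acting", "awful bad plot"]
labels = ["positive", "negative", "positive", "negative"]

# Build TF-IDF features and fit a classifier
vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# predict_proba returns a probability per class, giving a confidence score
new = vec.transform(["amazing excellent movie"])
probs = clf.predict_proba(new)[0]
for cls, p in zip(clf.classes_, probs):
    print(f"{cls}: {p:.2f}")
```

Reporting probabilities rather than bare labels is useful in practice: reviews near 0.5 can be flagged for human review instead of being silently misclassified.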
Key Advantages of TF-IDF
| Advantage | Description |
|---|---|
| Importance Weighting | Gives higher weight to unique, discriminative terms |
| Noise Reduction | Reduces impact of common words like "the", "and" |
| Scalability | Efficient for large text corpora |
| Interpretability | Easy to understand feature importance |
Conclusion
TF-IDF is a powerful technique for extracting features from text data and is widely used in sentiment analysis applications. It identifies important terms while down-weighting common words, which makes it more effective than raw word-count approaches for text classification tasks.
