TF-IDF in Sentiment Analysis


Sentiment analysis is a natural language processing technique for recognizing and classifying the emotions expressed in text such as social media posts or product reviews. Businesses use it to learn how customers feel about their products or services, which helps them improve their offerings and make data-driven decisions. A popular feature-extraction technique for sentiment analysis is Term Frequency-Inverse Document Frequency (TF-IDF). It scores how important a word is to a document relative to the corpus as a whole, which helps surface the terms that carry positive or negative sentiment. A classifier trained on TF-IDF features can then categorize the sentiment of a text. This article explains TF-IDF and applies it to sentiment analysis.

What is TF-IDF?

TF-IDF is a statistical measure of how important a term is to a document relative to an entire corpus of documents. It has two components: term frequency (TF), which measures how often a term appears in a specific document, and inverse document frequency (IDF), which measures how rare the term is across the corpus, so that terms appearing in many documents receive a lower weight. A term's TF-IDF score is the product of the two. TF-IDF is useful for sentiment analysis because it gives distinctive terms more weight than words that occur everywhere, and it is cheap to compute, which makes it practical for large datasets.
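
To make the two components concrete, here is a minimal sketch that computes TF-IDF by hand for a toy corpus. The three "reviews" and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration, and the formula is the plain textbook variant; scikit-learn's TfidfVectorizer, used later in this article, applies a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): three tiny "reviews", tokenized on whitespace.
docs = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great fun".split(),
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that are `term`.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of documents containing `term`).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("great", "movie", "cast"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

In the first review, "cast" gets the highest score even though "great" occurs twice, because "cast" appears in no other document while "great" and "movie" each appear in two of the three.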

TF-IDF in Sentiment Analysis

In this project, movie reviews are classified as positive or negative. We use Python, a real-world dataset, and standard machine learning libraries. The procedure is: load the libraries and the IMDb movie reviews dataset; preprocess the text (stop-word removal, tokenization, and stemming); build a TF-IDF matrix with scikit-learn's TfidfVectorizer; split the data into training and testing sets with train_test_split; train a logistic regression model on the training set, with the TF-IDF matrix as features and the sentiment labels as targets; and evaluate the model on the testing set.

Importing necessary libraries & collecting the dataset

We will use the IMDb movie review dataset, which consists of 50,000 movie reviews, each labeled with its sentiment. The dataset can be downloaded as a CSV file.

import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Fetch NLTK's stop-word lists (only downloads if they are not already present)
nltk.download('stopwords')

# Step 1 − Collecting the Dataset
# Adjust the path to wherever the CSV file is stored
df = pd.read_csv('/content/sample_data/IMDB_Dataset.csv')

Preprocessing the dataset

Preprocessing removes punctuation and other non-letter characters, converts the text to lowercase, and drops stop words. We also tokenize each review and stem the tokens, which shrinks the vocabulary and therefore the dimensionality of the feature matrix.

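# Step 2− Preprocessing the Data
corpus = []
stemmer = PorterStemmer()
# Build the stop-word set once; rebuilding it for every word is very slow
stop_words = set(stopwords.words('english'))
for text in df['review']:
   review = re.sub('[^a-zA-Z]', ' ', text)   # replace non-letters with spaces
   review = review.lower()                   # normalize case
   review = review.split()                   # tokenize on whitespace
   # Drop stop words, then stem what remains
   review = [stemmer.stem(word) for word in review if word not in stop_words]
   review = ' '.join(review)
   corpus.append(review)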
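
The cleaning steps are easier to follow on a single sentence. The sketch below is a simplified stand-in for Step 2, not the article's exact pipeline: `clean_review` is a hypothetical helper, the stop-word set is a tiny hand-picked one rather than NLTK's list, and stemming is left out so the example runs without NLTK. It also strips HTML tags before the letter filter, on the assumption that raw reviews may contain tags such as `<br />`; without that step the regex would leave a stray "br" token behind.

```python
import re

# Tiny illustrative stop-word set (assumption); the article uses NLTK's English list.
STOP_WORDS = {"the", "a", "was", "and", "it", "is", "but"}

def clean_review(text, stop_words=STOP_WORDS):
    text = re.sub(r"<[^>]+>", " ", text)     # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only, as in Step 2
    tokens = text.lower().split()            # lowercase and tokenize on whitespace
    return " ".join(t for t in tokens if t not in stop_words)

print(clean_review("The plot was GREAT!<br /><br />But the ending... not so much."))
# prints: plot great ending not so much
```

Note that "not" survives here only because it is missing from the toy stop-word set. NLTK's English list includes negations such as "not" and "no", so it may be worth keeping those words when the task is sentiment classification.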

Creating the TF-IDF Matrix

Next, we convert the preprocessed text into a TF-IDF matrix. Each row is a document, each column is a term, and each entry is the weight of that term in that document relative to the whole corpus. Setting max_features=5000 restricts the vocabulary to the 5,000 most frequent terms.

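# Step 3− Creating the TF-IDF Matrix
vectorizer = TfidfVectorizer(max_features=5000)
# Keep the sparse matrix: a dense 50,000 x 5,000 float array needs roughly 2 GB
X = vectorizer.fit_transform(corpus)
y = df.iloc[:, 1].values   # second column holds the sentiment labels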
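
To see what the vectorizer produces, it helps to run it on a corpus small enough to print. This is a sketch with a made-up, already-stemmed toy corpus; the `toy_` names are hypothetical and not part of the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed toy corpus.
toy_corpus = ["great movi great cast", "bore movi", "great fun"]

toy_vectorizer = TfidfVectorizer(max_features=5000)
toy_X = toy_vectorizer.fit_transform(toy_corpus)   # SciPy sparse matrix

print(toy_X.shape)                          # (documents, vocabulary terms)
print(sorted(toy_vectorizer.vocabulary_))   # the learned terms
print(toy_X.toarray().round(2))             # a dense view is fine at this size
```

The matrix has one row per document and one column per distinct term. With the default settings each row is scaled to unit length, so the weights are comparable across reviews of different lengths.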

Splitting the Dataset

We split the data into training and test sets: 80% is used to train the machine learning model and the remaining 20% to test it. Fixing random_state makes the split reproducible.

# Step 4− Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
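
The article fits the vectorizer on the full corpus before splitting, which means the vocabulary and IDF weights are computed with the test reviews included. A common alternative is to split the texts first and fit the vectorizer on the training part only. The sketch below shows that ordering on made-up data; the `toy_` names are hypothetical, and the effect on the reported scores is likely small for a dataset of this size.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the preprocessed corpus and its labels.
toy_texts = ["great film", "bad film", "love it", "hate it", "great fun", "bad plot"]
toy_labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Split the texts first ...
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42, stratify=toy_labels
)

# ... then learn the vocabulary and IDF weights from the training texts only.
toy_vec = TfidfVectorizer()
toy_X_train = toy_vec.fit_transform(texts_train)
toy_X_test = toy_vec.transform(texts_test)   # reuses the training vocabulary

print(toy_X_train.shape, toy_X_test.shape)
```

Both matrices end up with the same number of columns, because `transform` maps the test texts onto the vocabulary learned from the training texts and ignores unseen words.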

Training the Model

We train the model on the training set, using the TF-IDF matrix as features and the sentiment labels as targets. We use logistic regression for this problem.

# Step 5− Training the Model
model = LogisticRegression()
model.fit(X_train, y_train)
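
One benefit of pairing TF-IDF with logistic regression is that the fitted weights can be read directly: each term gets one coefficient, and its sign shows which class the term pushes towards. The sketch below illustrates this on made-up data; the `toy_` names are hypothetical, and `max_iter=1000` is an illustrative setting rather than something the article uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews and labels, for illustration only.
toy_texts = ["great film", "great fun", "love great cast",
             "bad film", "bad plot", "hate bad cast"]
toy_labels = ["positive"] * 3 + ["negative"] * 3

toy_vec = TfidfVectorizer()
toy_clf = LogisticRegression(max_iter=1000)
toy_clf.fit(toy_vec.fit_transform(toy_texts), toy_labels)

# Column i of the matrix corresponds to the term with vocabulary index i.
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
weights = toy_clf.coef_[0]   # binary case: one weight per term

# Positive weights push towards classes_[1], negative ones towards classes_[0].
print("classes:", list(toy_clf.classes_))
print("most", toy_clf.classes_[1], "term:", terms[int(np.argmax(weights))])
print("most", toy_clf.classes_[0], "term:", terms[int(np.argmin(weights))])
```

On this toy data the strongest positive term is "great" and the strongest negative term is "bad", while words that occur equally in both classes, such as "film", end up with weights near zero.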

Evaluating the Model

We assess the model on the testing set with four metrics: accuracy, precision, recall, and F1 score.

# Step 6− Evaluating the Model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 score: {f1}")

Results

Accuracy: 0.886
Precision: 0.8863485349216157
Recall: 0.886
F1 score: 0.8859583626410477
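
Recall and accuracy are identical in these results (0.886), and that is expected rather than a coincidence. With average='weighted', each class's recall is weighted by its share of the test set, and the weighted sum reduces to the overall fraction of correct predictions. The sketch below checks this by hand on made-up labels; `recall_for` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical true and predicted labels for eight test reviews.
y_true = ["pos", "pos", "pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "pos"]

def recall_for(label):
    # Of the reviews whose true label is `label`, the share predicted correctly.
    hits = sum(t == p == label for t, p in zip(y_true, y_pred))
    return hits / y_true.count(label)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Weight each class's recall by how often that class occurs in y_true.
weighted_recall = sum(
    recall_for(label) * y_true.count(label) / len(y_true)
    for label in set(y_true)
)

print(accuracy, weighted_recall)
```

Both values come out at 0.75 here (up to floating-point rounding). To see how the model does on each class separately, per-class precision and recall are more informative than the weighted averages.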

This project used TF-IDF features to perform sentiment analysis on the IMDb movie review dataset. We preprocessed the raw text by removing punctuation, lowercasing, tokenizing, removing stop words, and stemming. We then built a TF-IDF matrix from the preprocessed text and split the data into training and testing sets. After training a logistic regression model on the training set, we measured its performance on the testing set with accuracy, precision, recall, and F1 score; the model scored about 0.886 on each.

Conclusion

In conclusion, TF-IDF is an effective feature-extraction method for text and is widely used in NLP applications such as sentiment analysis, text classification, and information retrieval. It often works better than raw term counts because it weighs each term's frequency in a document against how common the term is across the whole corpus.

Jay Singh

Updated on: 31-Jul-2023
