TF-IDF in Sentiment Analysis
Sentiment analysis is a natural language processing technique for recognizing and classifying the emotions expressed in text such as social media posts or product reviews. Businesses use it to learn how customers feel about their products or services, improve their offerings, and make data-driven decisions. A popular feature-extraction technique in sentiment analysis is Term Frequency-Inverse Document Frequency (TF-IDF). It scores how important each word is to a document relative to the corpus as a whole, which helps identify the terms that carry positive or negative sentiment. A classifier trained on TF-IDF features can then categorize the sentiment of new text. This article explains TF-IDF and walks through its use in sentiment analysis.
What is TF-IDF?
TF-IDF is a statistical measure of how important a term is to a document relative to a whole corpus of documents. It has two components: term frequency (TF), which measures how often a term appears in a specific document, and inverse document frequency (IDF), which measures how rare the term is across the corpus, so that terms occurring in many documents receive a lower weight. The TF-IDF score of a term is the product of the two. TF-IDF is useful for sentiment analysis because it gives distinctive words more weight than words that appear everywhere, and it is cheap to compute, which makes it practical for large datasets.
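To make the two components concrete, here is a minimal sketch that computes TF-IDF by hand with the textbook definitions (TF as count divided by document length, IDF as the natural log of the number of documents divided by the number of documents containing the term). The function name `tf_idf` and the three toy documents are assumptions made for illustration. scikit-learn's TfidfVectorizer, used later in this article, applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized.
docs = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "the plot was great fun".split(),
]
scores = tf_idf(docs)

for doc_scores in scores:
    print({term: round(weight, 3) for term, weight in doc_scores.items()})
```

In this toy corpus "the" and "was" occur in every document, so their IDF is log(1) = 0 and they get zero weight, while "terrible", which occurs in only one document, gets the highest weight in its document.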
TF-IDF in Sentiment Analysis
This project classifies written reviews as positive or negative (the IMDb dataset used here has only these two labels). It uses Python, a real-world dataset, and standard machine learning libraries. The procedure is: load the libraries and the IMDb movie reviews dataset; preprocess the text with steps such as stop-word removal and tokenization; create a TF-IDF matrix with scikit-learn's TfidfVectorizer; divide the data into training and testing sets with train_test_split; and fit a logistic regression model on the training set, using the TF-IDF matrix as features and the sentiment labels as targets.
Importing necessary libraries & collecting the dataset
We will use the IMDb movie review dataset, which consists of 50,000 film reviews and their sentiment labels. The dataset is publicly available for download.
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Collecting the Dataset
df = pd.read_csv('/content/sample_data/IMDB_Dataset.csv')
Preprocessing the dataset
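We preprocess the raw text by replacing punctuation and other non-letter characters with spaces, converting everything to lowercase, and removing stop words. We also tokenize the text and stem each word, which reduces the dimensionality of the data because different forms of the same word collapse into one term.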
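# Step 2: Preprocessing the Data
corpus = []
stemmer = PorterStemmer()
# Build the stop-word set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))
for i in range(len(df)):
    # Replace everything except letters with spaces (drops punctuation and digits)
    review = re.sub('[^a-zA-Z]', ' ', df['review'][i])
    # Lowercase and split into tokens
    review = review.lower().split()
    # Remove stop words and stem the remaining tokens
    review = [stemmer.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))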
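To see what the cleaning steps do to a single review, here is a small standard-library sketch. It is only an illustration: the `preprocess` function, the tiny `STOP_WORDS` set, and the sample sentence are assumptions, and stemming is left out so the example runs without NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# NLTK's English list is much larger.
STOP_WORDS = {"the", "a", "an", "is", "was", "it", "and", "of", "this"}

def preprocess(text, stop_words=STOP_WORDS):
    """Strip non-letters, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"[^a-zA-Z]", " ", text)  # punctuation and digits become spaces
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = preprocess("This movie was GREAT, and the ending is a 10/10!")
print(cleaned)  # prints: movie great ending
```

With a Porter stemmer added, tokens such as "ending" would be reduced further (for example to "end"), which is how stemming shrinks the vocabulary.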
Creating the TF-IDF Matrix
Next, we turn the preprocessed text into a TF-IDF matrix. Each row of the matrix represents one review and each column one term, and each entry is the weight of that term in that review relative to the whole corpus. Setting max_features=5000 keeps only the 5,000 most frequent terms.
# Step 3: Creating the TF-IDF Matrix
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus).toarray()
y = df.iloc[:, 1].values
Splitting the Dataset
We now split the data into training and test sets: 80% of the reviews are used to train the machine learning model, and the remaining 20% are held out to test it. Fixing random_state makes the split reproducible.
# Step 4: Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
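One caveat about the order of operations: Step 3 fits the vectorizer on all reviews before the split, so IDF statistics from the test reviews influence the features the model is trained on. A common alternative is to split the raw text first and fit TF-IDF on the training portion only. The sketch below shows one way to do that with a scikit-learn pipeline; the toy `texts` and `labels` are hypothetical stand-ins for the preprocessed reviews and their sentiment labels, and this is not the approach the article's code takes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the preprocessed reviews and labels.
texts = ["great movi", "love film", "wonder act", "enjoy plot", "great fun",
         "terribl movi", "hate film", "bad act", "bore plot", "aw wast"]
labels = ["positive"] * 5 + ["negative"] * 5

# Split the text first; stratify keeps the class balance in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts_train, y_train)

# At prediction time the test text is transformed with the training vocabulary.
predictions = clf.predict(texts_test)
print(list(zip(texts_test, predictions)))
```

On a dataset as large as IMDb the difference in measured accuracy is likely to be small, but fitting on the training data only keeps the evaluation honest.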
Training the Model
We train a machine learning model on the training set, using the TF-IDF matrix as features and the sentiment labels as targets. We use logistic regression, a simple linear classifier that works well with sparse, high-dimensional text features.
# Step 5: Training the Model
model = LogisticRegression()
model.fit(X_train, y_train)
Evaluating the Model
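We assess the model on the test set with four metrics: accuracy, precision, recall, and F1 score. With average='weighted', precision, recall, and F1 are computed per class and then averaged, weighted by the number of test samples in each class.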
# Step 6: Evaluating the Model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 score: {f1}")
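For readers who want to see what these metrics measure, here is a standard-library sketch that computes them for a single class from counts of true positives, false positives, and false negatives. The `binary_metrics` helper and the six toy labels are assumptions for illustration; the scikit-learn functions above additionally average the per-class values.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return (accuracy, precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "negative", "positive"]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, "
      f"Recall: {recall:.3f}, F1: {f1:.3f}")
```

Here four of six predictions are correct, and two of the three predicted positives are truly positive, so all four metrics come out to 2/3.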
Results
Accuracy: 0.886
Precision: 0.8863485349216157
Recall: 0.886
F1 score: 0.8859583626410477
The project used TF-IDF to perform sentiment analysis on the IMDb movie review dataset. We preprocessed the raw text by removing punctuation, converting it to lowercase, removing stop words, tokenizing, and stemming. We then created a TF-IDF matrix from the preprocessed text and split the data into training and testing sets. After training a logistic regression model on the training set, we measured its performance on the testing set with accuracy, precision, recall, and F1 score, all of which came out at roughly 0.886.
Conclusion
In conclusion, TF-IDF is an effective method for extracting features from text and is widely used in NLP applications including sentiment analysis, text classification, and information retrieval. It is often more informative than raw term counts because it weighs each term's frequency in a document against how common the term is across the whole corpus.