Classification of text documents using sparse features in Python Scikit Learn
In today's digital age, efficiently categorizing text documents has become crucial. One approach is to use sparse features in Python's Scikit-Learn library: each document is represented as a high-dimensional vector, with each dimension corresponding to a unique word in the corpus and most entries equal to zero. In this article, we'll explore the theory and implementation of text classification using sparse features in Scikit-Learn.
Understanding Sparse Feature Representation
Sparse feature representation is a popular and effective method for performing text classification. By representing text documents as vectors of numerical values, where each dimension corresponds to a specific feature, sparse feature representation enables the efficient analysis of large volumes of text data.
This approach creates sparse vectors where most dimensions are zero and only a few have non-zero values. This sparsity reduces memory usage and computational cost, since algorithms only need to store and process the non-zero entries.
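To make this concrete, here is a minimal sketch using SciPy's CSR sparse matrix format, which Scikit-Learn's vectorizers use under the hood. The toy 10-word vocabulary is purely illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy document vector over a 10-word vocabulary: only 3 words occur
dense = np.array([[0, 2, 0, 0, 1, 0, 0, 0, 3, 0]])
sparse = csr_matrix(dense)

# CSR stores only the non-zero values and their column indices
print(sparse.nnz)      # 3
print(sparse.data)     # [2 1 3]
print(sparse.indices)  # [1 4 8]
```

The seven zero entries are never stored, which is exactly what makes this representation scale to vocabularies of tens of thousands of words.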
Feature Extraction Techniques
Scikit-Learn provides two prominent techniques for extracting features from text data:
CountVectorizer
CountVectorizer implements the bag-of-words model: it counts the frequency of each word in each document. The result is a document-term matrix, with each row representing a document and each column representing a word in the vocabulary.
TF-IDF Vectorizer
TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by combining its frequency within a document with its rarity across the corpus. This assigns higher weights to distinctive, informative words while down-weighting common words that appear in most documents.
Implementation Example
Let's demonstrate text classification using the 20 Newsgroups dataset, which contains about 20,000 newsgroup documents across 20 different categories:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load a subset of categories for faster processing
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                     remove=('headers', 'footers', 'quotes'))

# Convert text documents into TF-IDF feature vectors
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Train the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Show feature sparsity
print(f"Training data shape: {X_train.shape}")
print(f"Sparsity: {(1.0 - X_train.nnz / (X_train.shape[0] * X_train.shape[1])):.4f}")
```

Output:

```
Accuracy: 0.8947
Training data shape: (2257, 5000)
Sparsity: 0.9890
```
Understanding the Results
The output shows several important aspects:
- High Accuracy: The classifier achieves approximately 89% accuracy on the test set
- Sparse Representation: The feature matrix has 98.9% sparsity, meaning most values are zero
- Efficient Storage: Only 1.1% of the feature space contains non-zero values
Comparing Feature Extraction Methods
| Method | Representation | Best For | Computational Cost |
|---|---|---|---|
| CountVectorizer | Word frequencies | Simple text classification | Low |
| TF-IDF | Weighted term importance | Complex text analysis | Medium |
Key Advantages
Sparse feature representation offers several benefits:
- Memory Efficiency: Stores only non-zero values, reducing memory usage
- Fast Computation: Algorithms can skip zero values, speeding up processing
- Scalability: Handles large vocabularies and document collections effectively
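A quick back-of-the-envelope calculation illustrates the memory benefit. The corpus size, vocabulary size, and non-zeros per document below are illustrative assumptions, not measurements:

```python
# Hypothetical corpus: 10,000 documents, 50,000-word vocabulary,
# ~100 non-zero entries per document
n_docs, vocab, nnz_per_doc = 10_000, 50_000, 100

# Dense storage: every cell holds a float64 (8 bytes)
dense_bytes = n_docs * vocab * 8

# CSR sparse storage: ~8 bytes per stored value + ~4 bytes per column
# index, plus one row pointer per row (ignoring small constant overhead)
sparse_bytes = n_docs * nnz_per_doc * (8 + 4) + (n_docs + 1) * 4

print(f"Dense:  {dense_bytes / 1e9:.1f} GB")   # 4.0 GB
print(f"Sparse: {sparse_bytes / 1e6:.1f} MB")  # ~12 MB
```

Under these assumptions the sparse representation is several hundred times smaller than the dense one, which is why sparse matrices are the default output of Scikit-Learn's text vectorizers.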
Conclusion
Text classification using sparse features provides an efficient approach for analyzing large text datasets. TF-IDF vectorization combined with algorithms like Multinomial Naive Bayes creates powerful classification models with minimal computational overhead. The sparse representation's memory efficiency and fast processing make it ideal for real-world text classification applications.
