Classification of text documents using sparse features in Python Scikit-Learn

In today's digital age, efficiently categorizing text documents has become crucial. One approach is to use sparse features in Python's Scikit-Learn library: each document is represented as a high-dimensional vector, with each dimension corresponding to a unique word in the corpus. In this article, we'll explore the theory and implementation of text classification using sparse features in Scikit-Learn.

Understanding Sparse Feature Representation

Sparse feature representation is a popular and effective method for performing text classification. By representing text documents as vectors of numerical values, where each dimension corresponds to a specific feature, sparse feature representation enables the efficient analysis of large volumes of text data.

This approach creates sparse vectors in which most dimensions are zero and only a few have non-zero values. Because only the non-zero entries need to be stored and processed, this sparsity reduces memory usage and computational cost.
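As a minimal sketch of this idea (the document-term matrix below is a toy example, not real data), SciPy's compressed sparse row format, which Scikit-Learn's vectorizers return, stores only the non-zero entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term matrix: 3 documents, 6 vocabulary terms (illustrative values)
dense = np.array([
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
])

sparse = csr_matrix(dense)  # CSR stores only the 5 non-zero entries
sparsity = 1.0 - sparse.nnz / (dense.shape[0] * dense.shape[1])
print(f"Non-zeros stored: {sparse.nnz}")   # 5
print(f"Sparsity: {sparsity:.2f}")         # 0.72
```

Here 13 of the 18 cells are zero, so the sparse format keeps only 5 values plus their positions.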

Feature Extraction Techniques

Scikit-Learn provides two prominent techniques for extracting features from text data:

CountVectorizer

CountVectorizer processes text data in a bag-of-words format, counting the frequency of each word in each document. The result is a document-term matrix, with each row representing a document and each column representing a word in the vocabulary.

TF-IDF Vectorizer

TF-IDF (Term Frequency-Inverse Document Frequency) computes the significance of each word by considering both its frequency in the document and its frequency in the entire corpus. This assigns higher weights to unique and meaningful words while lowering the significance of common words.
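A toy sketch of this weighting (the three documents below are invented for illustration): "space" appears in two documents, so TF-IDF weights it lower than the rarer "orbit" within the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "space" occurs in two documents, "orbit" in only one
docs = ["space shuttle launch", "space station orbit", "graphics rendering engine"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_
row = X.toarray()[1]  # TF-IDF weights for "space station orbit"

# The corpus-wide term "space" gets a lower weight than the rarer "orbit"
print(f"space: {row[vocab['space']]:.3f}, orbit: {row[vocab['orbit']]:.3f}")
```

Both words occur once in the second document, so the difference in weight comes entirely from the inverse-document-frequency factor.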

Implementation Example

Let's demonstrate text classification using the 20 Newsgroups dataset, which contains about 20,000 newsgroup documents across 20 different categories:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Load a subset of categories for faster processing
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, 
                                     remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                    remove=('headers', 'footers', 'quotes'))

# Convert text documents into TF-IDF feature vectors
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Train the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Show feature sparsity
print(f"Training data shape: {X_train.shape}")
print(f"Sparsity: {(1.0 - X_train.nnz / (X_train.shape[0] * X_train.shape[1])):.4f}")

Output:

Accuracy: 0.8947
Training data shape: (2257, 5000)
Sparsity: 0.9890

Understanding the Results

The output shows several important aspects:

  • High Accuracy: The classifier achieves approximately 89% accuracy on the test set
  • Sparse Representation: The feature matrix has 98.9% sparsity, meaning most values are zero
  • Efficient Storage: Only 1.1% of the feature space contains non-zero values

Comparing Feature Extraction Methods

Method           Representation            Best For                     Computational Cost
---------------  ------------------------  ---------------------------  ------------------
CountVectorizer  Word frequencies          Simple text classification   Low
TF-IDF           Weighted term importance  Complex text analysis        Medium

Key Advantages

Sparse feature representation offers several benefits:

  • Memory Efficiency: Stores only non-zero values, reducing memory usage
  • Fast Computation: Algorithms can skip zero values, speeding up processing
  • Scalability: Handles large vocabularies and document collections effectively
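A quick sketch of the memory savings (the matrix dimensions and density below are arbitrary, chosen only for illustration):

```python
from scipy.sparse import random as sparse_random

# Random sparse matrix: 1,000 "documents" x 5,000 "features", 1% non-zero
X = sparse_random(1000, 5000, density=0.01, format='csr', random_state=0)

# CSR memory = non-zero values + their column indices + row pointers
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.toarray().nbytes  # the same matrix stored densely

print(f"Sparse: {sparse_bytes:,} bytes")
print(f"Dense:  {dense_bytes:,} bytes")  # 40,000,000 bytes (1000 * 5000 * 8)
```

At 1% density, the sparse representation needs well under a tenth of the dense matrix's memory.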

Conclusion

Text classification using sparse features provides an efficient approach for analyzing large text datasets. TF-IDF vectorization combined with algorithms like Multinomial Naive Bayes creates powerful classification models with minimal computational overhead. The sparse representation's memory efficiency and fast processing make it ideal for real-world text classification applications.

Updated on: 2026-03-27T08:58:32+05:30
