Chi-Square Test for Feature Selection in Machine Learning
Feature selection is an important aspect of machine learning that involves selecting a subset of features from a larger set to improve model performance. It helps reduce complexity, improve accuracy, and make models more interpretable.
A common approach to feature selection is the Chi-Square test. This tutorial explains what the Chi-Square test is, how it's used for feature selection, and provides a Python implementation.
What is the Chi-Square Test?
The Chi-Square test is a statistical test used to determine if there is a significant association between two categorical variables. It's based on the Chi-Square distribution, which describes the distribution of the sum of squared standard normal deviates.
The test evaluates the null hypothesis that there is no association between two categorical variables. If the p-value is less than a predetermined significance level (typically 0.05), we reject the null hypothesis and conclude there is a significant association.
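As a minimal sketch of this hypothesis test (the counts below are hypothetical, chosen only for illustration), SciPy's `chi2_contingency` computes the statistic and p-value directly from a contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table:
# rows = categories of a feature, columns = classes of the target
observed = np.array([[30, 70],
                     [60, 40]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-Square statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.6f}")

# Reject the null hypothesis of independence at the 0.05 level
if p_value < 0.05:
    print("Significant association between the two variables")
```

Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to get the uncorrected statistic.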
Chi-Square Test for Feature Selection
In machine learning, the Chi-Square test determines if there's a significant association between each feature and the target variable. Features highly associated with the target are more likely to be useful for prediction.
How It Works
The Chi-Square test uses a contingency table that shows the distribution of a categorical variable across different groups. For feature selection, we count how often each feature appears in each class of the target variable.
The Chi-Square statistic is calculated as:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
- Oᵢ = observed frequency for the i-th cell
- Eᵢ = expected frequency for the i-th cell
The expected frequency is calculated by: (row total × column total) / grand total
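The two formulas above can be worked through by hand with NumPy (the 2x2 table here uses hypothetical counts):

```python
import numpy as np

# Observed frequencies: rows = feature categories, columns = target classes
O = np.array([[30, 70],
              [60, 40]], dtype=float)

# Expected frequency per cell: (row total * column total) / grand total
row_totals = O.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = O.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = O.sum()
E = row_totals * col_totals / grand_total

# Chi-Square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((O - E) ** 2 / E).sum()
print("Expected frequencies:\n", E)
print(f"Chi-Square = {chi2_stat:.4f}")
```

Each cell's expected count is what we would see if the feature and the target were perfectly independent; large deviations from it inflate the statistic.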
Example
Consider a dataset predicting disease status with four features: age, gender, smoking status, and family history. The table below shows sample counts for the age and gender features:
| Feature | Category | Disease | No Disease |
|---|---|---|---|
| Age | 0-40 | 20 | 80 |
| Age | 40-60 | 30 | 70 |
| Age | >60 | 40 | 60 |
| Gender | Male | 30 | 70 |
| Gender | Female | 60 | 40 |
After calculating Chi-Square statistics:
- Age: Chi-Square = 15.67, p-value = 0.0004
- Gender: Chi-Square = 24.5, p-value = 7.5e-07
- Smoking: Chi-Square = 8.33, p-value = 0.0159
- Family history: Chi-Square = 2.5, p-value = 0.1131
Gender has the highest Chi-Square statistic and lowest p-value, making it the most relevant feature. Age and smoking also show significant associations, while family history does not.
Python Implementation
Let's implement Chi-Square feature selection using scikit-learn's SelectKBest class with the Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.3, random_state=0
)
# Create Chi-Square selector for top 2 features
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X_train, y_train)
# Transform datasets to include only selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Train model on selected features
clf = LogisticRegression(max_iter=200)  # raise max_iter to avoid convergence warnings
clf.fit(X_train_selected, y_train)
# Evaluate performance
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with selected features: {accuracy:.4f}")
# Show which features were selected
selected_features = selector.get_support(indices=True)
feature_names = data.feature_names
print("Selected features:", [feature_names[i] for i in selected_features])
Accuracy with selected features: 0.9778
Selected features: ['petal length (cm)', 'petal width (cm)']
Feature Scores
You can also view the Chi-Square scores for all features:
# Get feature scores
scores = selector.scores_
feature_scores = list(zip(data.feature_names, scores))
print("Feature Chi-Square Scores:")
for feature, score in feature_scores:
print(f"{feature}: {score:.4f}")
Feature Chi-Square Scores:
sepal length (cm): 10.8171
sepal width (cm): 3.7107
petal length (cm): 116.3129
petal width (cm): 67.0483
Key Points
- Chi-Square test works only with categorical variables or discretized continuous variables
- Higher Chi-Square scores indicate stronger association with the target
- The test assumes independence between observations
- For continuous features, consider binning or other feature selection methods
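One way to act on the last point is to discretize continuous features before scoring them. A sketch using scikit-learn's `KBinsDiscretizer` on the Iris data (the bin count of 4 is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Discretize each continuous feature into 4 ordinal bins;
# bin indices are non-negative, as chi2 requires
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# Score the binned features against the target
selector = SelectKBest(score_func=chi2, k=2).fit(X_binned, y)
print("Chi-Square scores on binned features:", selector.scores_)
```

The absolute scores differ from those on the raw measurements, but the ranking of features by association with the target is what matters for selection.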
Conclusion
The Chi-Square test is a powerful feature selection method for categorical data that identifies features most associated with the target variable. It helps reduce model complexity while maintaining predictive performance, making it valuable for preprocessing in machine learning pipelines.
---