Chi-Square Test for Feature Selection in Machine Learning

Feature selection is an important aspect of machine learning that involves selecting a subset of features from a larger set to improve model performance. It helps reduce complexity, improve accuracy, and make models more interpretable.

A common approach to feature selection is the Chi-Square test. This tutorial explains what the Chi-Square test is, how it's used for feature selection, and provides a Python implementation.

What is the Chi-Square Test?

The Chi-Square test is a statistical test used to determine if there is a significant association between two categorical variables. It's based on the Chi-Square distribution, which describes the distribution of the sum of squared standard normal deviates.

The test evaluates the null hypothesis that there is no association between two categorical variables. If the p-value is less than a predetermined significance level (typically 0.05), we reject the null hypothesis and conclude there is a significant association.
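As a quick illustration, SciPy's chi2_contingency function runs exactly this test on a contingency table (the counts below are made up for the example):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = groups, columns = outcomes
observed = [[20, 30],
            [40, 10]]

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected frequencies under the null hypothesis
stat, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-Square statistic: {stat:.4f}")
print(f"p-value: {p_value:.6f}")
if p_value < 0.05:
    print("Reject the null hypothesis: significant association")
else:
    print("Fail to reject the null hypothesis")
```

Note that for 2x2 tables SciPy applies the Yates continuity correction by default, so the statistic it reports can differ slightly from the hand-computed formula shown later.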

Chi-Square Test for Feature Selection

In machine learning, the Chi-Square test determines if there's a significant association between each feature and the target variable. Features highly associated with the target are more likely to be useful for prediction.

How It Works

The Chi-Square test uses a contingency table that shows the distribution of a categorical variable across different groups. For feature selection, we count how often each feature appears in each class of the target variable.
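A contingency table like this can be built with pandas.crosstab; the tiny sample below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample: smoking status (feature) vs. disease status (target)
df = pd.DataFrame({
    "smoker":  ["yes", "yes", "no", "no", "yes", "no", "no", "yes"],
    "disease": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
})

# Cross-tabulate feature categories against target classes
table = pd.crosstab(df["smoker"], df["disease"])
print(table)
```

Each cell of the resulting table counts how often a feature category co-occurs with a target class, which is exactly the input the Chi-Square test needs.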

The Chi-Square statistic is calculated as:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

Where:

  • Oᵢ = observed frequency for the i-th cell
  • Eᵢ = expected frequency for the i-th cell

The expected frequency is calculated by: (row total × column total) / grand total
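The two formulas above can be sketched directly in NumPy; the observed counts below are made up for illustration:

```python
import numpy as np

# Hypothetical observed frequencies: rows = feature categories, cols = classes
observed = np.array([[10, 20],
                     [30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected frequency per cell: (row total x column total) / grand total
expected = row_totals * col_totals / grand_total

# Chi-Square statistic: sum of (O - E)^2 / E over all cells
chi_square = ((observed - expected) ** 2 / expected).sum()
print(f"Chi-Square: {chi_square:.4f}")
```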

Example

Consider a dataset predicting disease status with four features: age, gender, smoking status, and family history. Here's a sample contingency table for two of the features, age and gender:

Feature   Category   Disease   No Disease
Age       0-40         20         80
          40-60        30         70
          >60          40         60
Gender    Male         30         70
          Female       60         40

After calculating Chi-Square statistics:

  • Age: Chi-Square = 9.52, p-value = 0.0085
  • Gender: Chi-Square = 18.18, p-value = 2.0e-05
  • Smoking: Chi-Square = 8.33, p-value = 0.0159
  • Family history: Chi-Square = 2.5, p-value = 0.1131

Gender has the highest Chi-Square statistic and lowest p-value, making it the most relevant feature. Age and smoking also show significant associations, while family history does not.
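The statistic for any of these contingency tables can be computed directly with SciPy; here are the gender rows from the sample table (correction=False disables the Yates continuity correction that SciPy applies to 2x2 tables by default, so the result matches the plain formula):

```python
from scipy.stats import chi2_contingency

# Gender rows from the sample table: columns are [Disease, No Disease]
gender = [[30, 70],   # Male
          [60, 40]]   # Female

# Disable the default 2x2 continuity correction to match the raw formula
stat, p_value, dof, expected = chi2_contingency(gender, correction=False)
print(f"Chi-Square: {stat:.2f}, p-value: {p_value:.1e}")
```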

Python Implementation

Let's implement Chi-Square feature selection using scikit-learn's SelectKBest class with the Iris dataset. Note that scikit-learn's chi2 scorer requires non-negative feature values; the Iris measurements satisfy this.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

# Create Chi-Square selector for top 2 features
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X_train, y_train)

# Transform datasets to include only selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Train model on selected features
clf = LogisticRegression()
clf.fit(X_train_selected, y_train)

# Evaluate performance
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with selected features: {accuracy:.4f}")

# Show which features were selected
selected_features = selector.get_support(indices=True)
feature_names = data.feature_names
print("Selected features:", [feature_names[i] for i in selected_features])
Output:

Accuracy with selected features: 0.9778
Selected features: ['petal length (cm)', 'petal width (cm)']

Feature Scores

You can also view the Chi-Square scores for all features:

# Get feature scores
scores = selector.scores_
feature_scores = list(zip(data.feature_names, scores))

print("Feature Chi-Square Scores:")
for feature, score in feature_scores:
    print(f"{feature}: {score:.4f}")
Output:

Feature Chi-Square Scores:
sepal length (cm): 10.8171
sepal width (cm): 3.7107
petal length (cm): 116.3129
petal width (cm): 67.0483
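Alongside scores_, a fitted SelectKBest also exposes the corresponding p-values through its pvalues_ attribute; the sketch below fits on the full Iris dataset for simplicity:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()

# Fit on the full dataset for illustration; pvalues_ lines up with scores_
selector = SelectKBest(score_func=chi2, k=2).fit(data.data, data.target)

for name, p in zip(data.feature_names, selector.pvalues_):
    print(f"{name}: p-value = {p:.3e}")
```

The features with the highest scores (the petal measurements) also have the smallest p-values, so the two views of the ranking agree.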

Key Points

  • Chi-Square test works only with categorical variables or discretized continuous variables
  • Higher Chi-Square scores indicate stronger association with the target
  • The test assumes independence between observations
  • For continuous features, consider binning or other feature selection methods
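One way to handle the last point is to discretize continuous features before scoring them; a minimal sketch using scikit-learn's KBinsDiscretizer (the choice of 4 uniform bins is arbitrary here):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import chi2

data = load_iris()

# Discretize each continuous feature into 4 ordinal bins of equal width
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(data.data)

# Chi-Square scores on the binned (now categorical-like) features
scores, p_values = chi2(X_binned, data.target)
for name, score in zip(data.feature_names, scores):
    print(f"{name}: {score:.2f}")
```

The number of bins and the binning strategy affect the scores, so they are worth tuning rather than taking the values above as fixed.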

Conclusion

The Chi-Square test is a powerful feature selection method for categorical data that identifies features most associated with the target variable. It helps reduce model complexity while maintaining predictive performance, making it valuable for preprocessing in machine learning pipelines.

---
Updated on: 2026-03-27T16:42:30+05:30
