Chi-Square Test for Feature Selection in Machine Learning
Feature selection is an important aspect of machine learning that involves selecting a subset of features from a larger set to improve model performance. It helps reduce complexity, improve accuracy, and make models more interpretable.
A common approach to feature selection is the Chi-Square test. This tutorial explains what the Chi-Square test is, how it's used for feature selection, and provides a Python implementation.
What is the Chi-Square Test?
The Chi-Square test is a statistical test used to determine if there is a significant association between two categorical variables. It's based on the Chi-Square distribution, which describes the distribution of the sum of squared standard normal deviates.
The test evaluates the null hypothesis that there is no association between two categorical variables. If the p-value is less than a predetermined significance level (typically 0.05), we reject the null hypothesis and conclude there is a significant association.
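As a minimal sketch of this hypothesis test (the counts below are hypothetical, chosen only for illustration), SciPy's `chi2_contingency` computes the statistic and p-value directly from a contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table:
# rows = categories of a feature, columns = classes of the target
observed = np.array([[30, 70],
                     [60, 40]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"Chi-Square statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.6f}")

# Reject the null hypothesis of independence at the 0.05 level
if p_value < 0.05:
    print("Significant association between the two variables")
```

Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to get the uncorrected statistic.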
Chi-Square Test for Feature Selection
In machine learning, the Chi-Square test determines if there's a significant association between each feature and the target variable. Features highly associated with the target are more likely to be useful for prediction.
How It Works
The Chi-Square test uses a contingency table that shows the distribution of a categorical variable across different groups. For feature selection, we count how often each feature appears in each class of the target variable.
The Chi-Square statistic is calculated as:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
Where:
- Oᵢ = observed frequency for the i-th cell
- Eᵢ = expected frequency for the i-th cell
The expected frequency is calculated by: (row total × column total) / grand total
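The two formulas above can be worked through by hand with NumPy (the 2x2 table here uses hypothetical counts):

```python
import numpy as np

# Observed frequencies: rows = feature categories, columns = target classes
O = np.array([[30, 70],
              [60, 40]], dtype=float)

# Expected frequency per cell: (row total * column total) / grand total
row_totals = O.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = O.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = O.sum()
E = row_totals * col_totals / grand_total

# Chi-Square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((O - E) ** 2 / E).sum()
print("Expected frequencies:\n", E)
print(f"Chi-Square = {chi2_stat:.4f}")
```

Each cell's expected count is what we would see if the feature and the target were perfectly independent; large deviations from it inflate the statistic.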
Example
Consider a dataset predicting disease status with four features: age, gender, smoking status, and family history. The table below shows sample counts for the age and gender features:
| Feature | Category | Disease | No Disease |
|---|---|---|---|
| Age | 0-40 | 20 | 80 |
| Age | 40-60 | 30 | 70 |
| Age | >60 | 40 | 60 |
| Gender | Male | 30 | 70 |
| Gender | Female | 60 | 40 |
After calculating Chi-Square statistics:
- Age: Chi-Square = 15.67, p-value = 0.0004
- Gender: Chi-Square = 24.5, p-value = 7.5e-07
- Smoking: Chi-Square = 8.33, p-value = 0.0159
- Family history: Chi-Square = 2.5, p-value = 0.1131
Gender has the highest Chi-Square statistic and lowest p-value, making it the most relevant feature. Age and smoking also show significant associations, while family history does not.
Python Implementation
Let's implement Chi-Square feature selection using scikit-learn's SelectKBest class with the Iris dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.3, random_state=0
)
# Create Chi-Square selector for top 2 features
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X_train, y_train)
# Transform datasets to include only selected features
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
# Train model on selected features
clf = LogisticRegression(max_iter=200)  # raise max_iter to avoid convergence warnings
clf.fit(X_train_selected, y_train)
# Evaluate performance
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with selected features: {accuracy:.4f}")
# Show which features were selected
selected_features = selector.get_support(indices=True)
feature_names = data.feature_names
print("Selected features:", [feature_names[i] for i in selected_features])
Accuracy with selected features: 0.9778
Selected features: ['petal length (cm)', 'petal width (cm)']
Feature Scores
You can also view the Chi-Square scores for all features:
# Get feature scores
scores = selector.scores_
feature_scores = list(zip(data.feature_names, scores))
print("Feature Chi-Square Scores:")
for feature, score in feature_scores:
print(f"{feature}: {score:.4f}")
Feature Chi-Square Scores:
sepal length (cm): 10.8171
sepal width (cm): 3.7107
petal length (cm): 116.3129
petal width (cm): 67.0483
Key Points
- Chi-Square test works only with categorical variables or discretized continuous variables
- Higher Chi-Square scores indicate stronger association with the target
- The test assumes independence between observations
- For continuous features, consider binning or other feature selection methods
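One way to act on the last point is to discretize continuous features before scoring them. A sketch using scikit-learn's `KBinsDiscretizer` on the Iris data (the bin count of 4 is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Discretize each continuous feature into 4 ordinal bins;
# bin indices are non-negative, as chi2 requires
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# Score the binned features against the target
selector = SelectKBest(score_func=chi2, k=2).fit(X_binned, y)
print("Chi-Square scores on binned features:", selector.scores_)
```

The absolute scores differ from those on the raw measurements, but the ranking of features by association with the target is what matters for selection.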
Conclusion
The Chi-Square test is a powerful feature selection method for categorical data that identifies features most associated with the target variable. It helps reduce model complexity while maintaining predictive performance, making it valuable for preprocessing in machine learning pipelines.
---