Chi-Square Test for Feature Selection in Machine Learning


Feature selection is an important aspect of machine learning. It involves selecting a subset of features from a larger set of available features to improve the performance of a model. Feature selection matters because it can reduce the complexity of the model, improve its accuracy, and make it more interpretable.

A common approach to feature selection is the Chi-Square test. This tutorial explains what the Chi-Square test is, how it is used for feature selection with a worked example, and how to implement Chi-Square feature selection in Python.

What is the Chi-Square Test?

The Chi-Square test is a statistical test that is used to determine if there is a significant association between two categorical variables. It is based on the Chi-Square distribution, which is a probability distribution that describes the distribution of the sum of squared standard normal deviates.

The Chi-Square test is used to test the null hypothesis that there is no association between two categorical variables. If the test produces a p-value that is less than a pre-determined significance level, the null hypothesis is rejected, and it is concluded that there is a significant association between the two variables.

How is the Chi-Square Test used for Feature Selection?

In machine learning, the Chi-Square test is often used for feature selection. The goal of feature selection is to select a subset of features that are most relevant to the prediction task. The Chi-Square test can be used to determine whether there is a significant association between each feature and the target variable. Features that are strongly associated with the target variable are more likely to be useful for prediction, while features that show no association are less likely to be useful.

The Chi-Square test is typically performed on a contingency table. A contingency table shows the distribution of a categorical variable across two or more groups. In the context of feature selection, the contingency table is constructed by counting how many times each category of a feature occurs in each class of the target variable. The contingency table is then used to calculate the Chi-Square statistic and the p-value.

The Chi-Square statistic is calculated as follows −

$$\chi^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}}$$

Where $O_{i}$ is the observed frequency for the i-th cell in the contingency table, and $E_{i}$ is the expected frequency for the i-th cell. The expected frequency is calculated by multiplying the row total and the column total for the i-th cell and dividing by the grand total.

The Chi-Square statistic measures the difference between the observed frequency and the expected frequency for each cell in the contingency table. If the Chi-Square statistic is large, it suggests that there is a significant association between the feature and the target variable.

The p-value is calculated from the Chi-Square statistic and the degrees of freedom. The degrees of freedom for a contingency table are calculated as (r-1)(c-1), where r is the number of rows and c is the number of columns in the contingency table. The p-value represents the probability of observing a Chi-Square statistic as extreme as the one observed, assuming that the null hypothesis is true. If the p-value is less than the significance level, the null hypothesis is rejected, and it is concluded that there is a significant association between the feature and the target variable.
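As a minimal sketch of these calculations (assuming numpy and scipy are available, and using a small made-up contingency table), the expected frequencies, Chi-Square statistic, degrees of freedom, and p-value can be computed directly from the formula above −

import numpy as np
from scipy.stats import chi2

# A made-up 3x2 contingency table: rows = feature categories, columns = target classes
observed = np.array([[10, 20],
                     [20, 20],
                     [30, 10]])

# Expected frequency of each cell = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-Square statistic and degrees of freedom (r-1)(c-1)
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large under the null hypothesis
p_value = chi2.sf(chi2_stat, dof)
print(f"Chi-Square = {chi2_stat:.2f}, dof = {dof}, p-value = {p_value:.4f}")

In practice, scipy.stats.chi2_contingency performs the same computation in a single call.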

Example

To illustrate how the Chi-Square test can be used for feature selection, let's consider a simple example. Suppose we have a dataset of patients and a binary classification task: whether or not each patient has a particular disease. We also have four categorical features: age (under 40, 40-60, over 60), gender (male, female), smoking status (never smoked, current smoker, former smoker), and family history (positive, negative). We want to determine which features are most relevant for predicting the disease status.

We start by constructing a contingency table that shows the distribution of each feature across the two classes of the target variable (disease status).

Feature   Category   Disease   No Disease
Age       0-40       20        80
          40-60      30        70
          >60        40        60
Gender    Male       30        70
          Female     60        40
Smoking   Never      50        50
          Current    20        80
          Former     40        60
Family    Positive   40        60
          Negative   30        70
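In practice, a contingency table like this is usually built directly from raw data; as a quick sketch, assuming the raw records live in a hypothetical pandas DataFrame, pandas.crosstab does the counting −

import pandas as pd

# Hypothetical raw data: one row per patient
records = pd.DataFrame({
    "smoking": ["Never", "Current", "Former", "Never", "Current"],
    "disease": ["Yes", "No", "Yes", "No", "No"],
})

# Cross-tabulate the feature categories against the target classes
contingency = pd.crosstab(records["smoking"], records["disease"])
print(contingency)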

We can then calculate the Chi-Square statistic and the p-value for each feature using the formula mentioned earlier.

Age: Chi-Square = 9.52, p-value = 0.0086
Gender: Chi-Square = 18.18, p-value = 2.0e-05
Smoking: Chi-Square = 20.10, p-value = 4.3e-05
Family: Chi-Square = 2.20, p-value = 0.138

From the results, we can see that smoking status has the highest Chi-Square statistic and the lowest p-value, indicating that it is most strongly associated with the disease status. Gender and age also show significant associations (p < 0.05), while family history does not. Therefore, we may choose to select age, gender, and smoking status as our features for the prediction task.
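As a sanity check, these figures can be reproduced with scipy (assuming it is installed); chi2_contingency returns the statistic, p-value, and degrees of freedom for each table, and correction=False disables the continuity correction so the result matches the raw formula above −

from scipy.stats import chi2_contingency

# Observed counts from the contingency table above: [Disease, No Disease] per category
tables = {
    "Age":     [[20, 80], [30, 70], [40, 60]],
    "Gender":  [[30, 70], [60, 40]],
    "Smoking": [[50, 50], [20, 80], [40, 60]],
    "Family":  [[40, 60], [30, 70]],
}

for feature, observed in tables.items():
    stat, p, dof, _ = chi2_contingency(observed, correction=False)
    print(f"{feature}: Chi-Square = {stat:.2f}, p-value = {p:.4g}, dof = {dof}")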

Python Implementation of Chi-Square Feature Selection

Now let's discuss how to perform Chi-Square feature selection in Python. We will use the scikit-learn library, which provides a class called SelectKBest for selecting the k best features based on a given scoring function. In our case, the scoring function will be the Chi-Square test (the chi2 function, which expects non-negative feature values).

First, let's load a dataset and split it into training and testing sets −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the dataset
data = load_iris()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

Next, we need to import the SelectKBest class and the chi2 function from the sklearn.feature_selection module −

from sklearn.feature_selection import SelectKBest, chi2 

We can then instantiate the SelectKBest class and specify the number of features to select. In this example, we will select the top two features −

# Instantiate the SelectKBest class
selector = SelectKBest(score_func=chi2, k=2)
# Fit the selector to the training data
selector.fit(X_train, y_train) 
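After fitting, the selector exposes the Chi-Square score and p-value it computed for each original feature through its scores_ and pvalues_ attributes, which is useful for understanding why particular features were kept −

# Chi-Square score and p-value for each of the four iris features
for name, score, p in zip(data.feature_names, selector.scores_, selector.pvalues_):
    print(f"{name}: score = {score:.2f}, p-value = {p:.4g}")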

We can then use the transform method to transform the training and testing sets to only include the selected features −

# Transform the training and testing sets
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test) 
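We can also check which of the original columns were kept; the selector's get_support method returns a boolean mask over the input features −

import numpy as np

# Names of the features selected by the Chi-Square test
selected_features = np.array(data.feature_names)[selector.get_support()]
print("Selected features:", selected_features)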

Finally, we can train a model on the selected features and evaluate its performance on the testing set −

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Train a logistic regression model on the selected features
clf = LogisticRegression()
clf.fit(X_train_selected, y_train)
# Evaluate the performance of the model on the testing set
y_pred = clf.predict(X_test_selected)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Output

Here is its output −

Accuracy: 0.9777777777777777 

We performed Chi-Square feature selection and trained a model on the selected features. Note that this is just one example of how to perform Chi-Square feature selection in Python, and there are many other ways to implement it depending on the specific requirements of your project.

Conclusion

In conclusion, the Chi-Square test is a powerful and widely used feature selection method in machine learning. With its ability to identify the most relevant features for predicting the target variable, it can help reduce the complexity of the model, improve its accuracy, and enhance its interpretability.

In this tutorial, we discussed the Chi-Square test in detail, including its mathematical foundation, application in feature selection, and Python implementation using the scikit-learn library.
