Correlation Between Categorical and Continuous Variables

Introduction

In machine learning, understanding the data and its behavior is essential before working with any kind of data. No two datasets share the same parameters and behavior, so it is necessary to carry out some pre-training stages, that is, to learn about the data before training the model.

Correlations are something every data scientist or data analyst wants to know about, since they reveal essential information about the data and can guide feature engineering. This article discusses the correlation between categorical and continuous variables and the methods for calculating it.

What is Correlation?

Correlation is a statistical measure of how strongly two variables are associated, meaning that it gives us an idea of how one variable tends to change as the value of another variable in the data changes.

Correlation is very helpful for feature engineering and feature selection, as we can quickly see which features are correlated with the target column, and weakly correlated variables can be considered for removal from the data.

Various techniques are known for measuring correlation, such as the Pearson and Spearman correlations. However, both are designed for pairs of numeric or ordinal variables, so they do not directly apply when one variable is categorical and the other is continuous.
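To make the two measures named above concrete, here is a minimal sketch using scipy.stats. The data is hypothetical and chosen only for illustration: y grows monotonically but not linearly with x, so the rank-based Spearman coefficient is 1 while the Pearson coefficient, which measures linear association, is lower.

```python
import numpy as np
from scipy import stats

# Hypothetical illustrative data: y increases with x, but not along a straight line.
x = np.arange(1, 11, dtype=float)
y = x ** 3

# Pearson measures linear association between two continuous variables.
pearson_r, pearson_p = stats.pearsonr(x, y)

# Spearman measures monotonic association using the ranks of the values.
spearman_r, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```

Both functions expect two numeric sequences, which is why a categorical column cannot simply be passed to them.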

Table guiding which test is suitable for what conditions

Independent Variable	Dependent Variable: Categorical	Dependent Variable: Continuous
Categorical	Chi-Square Test	ANOVA Test
Continuous	Logistic Regression	Linear Regression

In the table above, we can see the methods that suit each combination of variable types. The chi-square test can be used to test for an association between two categorical variables, and linear regression can be used for two continuous variables, as it calculates the slope and intercept of the best-fit line.
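As a rough sketch of the categorical-versus-categorical case, the chi-square test can be run with scipy.stats.chi2_contingency on a table of counts. The contingency table below is made up for illustration; its rows and columns stand for the categories of two hypothetical variables.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table of counts:
# rows = categories of variable A, columns = categories of variable B.
observed = np.array([
    [30, 10],
    [10, 30],
])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the counts expected if the two variables were independent.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p-value = {p_value:.5f}, dof = {dof}")
print("expected counts under independence:")
print(expected)
```

A small p-value suggests that the two categorical variables are not independent of each other.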

Now, if you want to measure the relationship between a categorical and a continuous variable, the ANOVA test can be used. The logistic regression approach is the better fit if the target column is categorical and the feature is continuous. Other than these, the point biserial method can be used when the categorical variable has exactly two categories.

ANOVA Test

The ANOVA, or analysis of variance, test checks whether the mean of a continuous variable differs across the categories of a categorical variable by comparing the variance between groups with the variance within groups. The ANOVA test is a parametric test with certain assumptions −

The data within each group needs to be normally distributed.
The groups have equal variance.
There are no drastic outliers in the data.
The groups are independent of each other.
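The first two assumptions in the list above can be checked before running the test. The sketch below is one possible approach, not the only one: it uses the Shapiro-Wilk test for normality and Levene's test for equal variance from scipy.stats, on hypothetical groups generated with a fixed seed.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions
# with different means but the same spread.
rng = np.random.default_rng(0)
groups = {
    "cat1": rng.normal(loc=10.0, scale=2.0, size=30),
    "cat2": rng.normal(loc=12.0, scale=2.0, size=30),
    "cat3": rng.normal(loc=15.0, scale=2.0, size=30),
}

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed.
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"{name}: Shapiro-Wilk p-value = {p:.3f}")

# Levene: a small p-value suggests the group variances are not equal.
_, levene_p = stats.levene(*groups.values())
print(f"Levene p-value = {levene_p:.3f}")
```

For both tests the null hypothesis is that the assumption holds, so a p-value above the chosen threshold means there is no evidence against it.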

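If the data is not normally distributed, it can often be brought closer to a normal distribution with a transform. The log and square root transforms both help when the data is right-skewed, with the log transform suited to stronger skew. If the data is left-skewed, a square or other power transform, or reflecting the data before applying a log or square root transform, is the usual choice.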
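One way to judge whether a transform helped is to compare the skewness before and after applying it. The following is a small sketch on hypothetical right-skewed data (a log-normal sample with a fixed seed), using scipy.stats.skew, where a value near 0 indicates a roughly symmetric distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample (log-normal), strictly positive
# so that both transforms are defined.
rng = np.random.default_rng(42)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Skewness of the raw data and of each transformed version.
skew_raw = stats.skew(right_skewed)
skew_sqrt = stats.skew(np.sqrt(right_skewed))
skew_log = stats.skew(np.log(right_skewed))

print(f"raw:         {skew_raw:.2f}")
print(f"square root: {skew_sqrt:.2f}")
print(f"log:         {skew_log:.2f}")
```

If the data contains zeros, np.log1p is a common substitute for np.log, since the logarithm of zero is undefined.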

Example

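import pingouin as pg
import pandas as pd
import numpy as np
# create DataFrame: 8 values split across three categories (group sizes 3, 3 and 2)
df = pd.DataFrame({'values': [1,2,5,6,89,67,54,34], 'groups': np.repeat(['cat1','cat2','cat3'], repeats=[3,3,2])})
# perform Welch's ANOVA (dv = continuous column, between = categorical column)
result = pg.welch_anova(dv='values', between='groups', data=df)
print(result)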

In the above code, we pass the column holding the values (dv) and the column holding the categories (between). Note that Welch's ANOVA is the variant of the test that does not require equal variances across the groups. The output is a table in which the F-value and the p-value are present.

If the p-value obtained from the above code is less than 0.05, we reject the null hypothesis that the mean of the values is the same for all categories. In other words, the values are affected by the category, so the two variables are related. If the p-value is 0.05 or greater, we fail to reject the null hypothesis, and there is no evidence that the values change with the category.
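The decision rule can be sketched with the classic one-way ANOVA from scipy.stats.f_oneway, which is a different function from the Welch variant used in the example. The three groups below are hypothetical. The sketch also computes eta squared (the share of total variance explained by the categories), a commonly used effect size that indicates how strong the relationship is rather than only whether it exists.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements of one continuous variable in three categories.
cat1 = [20, 21, 19, 22, 20]
cat2 = [30, 29, 31, 32, 30]
cat3 = [40, 41, 39, 42, 40]
samples = [cat1, cat2, cat3]

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(*samples)

alpha = 0.05
reject_null = bool(p_value < alpha)

# Eta squared = between-group sum of squares / total sum of squares.
all_values = np.concatenate(samples)
grand_mean = all_values.mean()
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in samples)
ss_total = ((all_values - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p-value = {p_value:.2e}")
print(f"reject null hypothesis: {reject_null}")
print(f"eta squared = {eta_squared:.3f}")
```

Here the group means are far apart relative to the spread inside each group, so the null hypothesis is rejected and eta squared is close to 1.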

Point Biserial Test

The point biserial test is also used to calculate the correlation between categorical and continuous variables in the dataset, provided that the categorical variable is binary. This method is also a statistical parametric method having certain assumptions.

The data is distributed normally.
There are no drastic outliers in the data.
Equal variance is present in the data.

The values obtained from the point biserial test lie between -1 and 1, where a value close to 1 means a strong positive correlation and a value close to -1 means a strong negative correlation. A value of 0 implies that there is no correlation present.

Example

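import numpy as np
from scipy import stats
# binary categorical variable encoded as two numeric levels
a = np.array([1,1,1,2,2,2])
# continuous variable
b = np.arange(6)
# point biserial correlation coefficient and its p-value
print(stats.pointbiserialr(a, b))
# Pearson correlation matrix; its off-diagonal entry equals the point biserial value
print(np.corrcoef(a, b))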

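We can use the scipy.stats module to calculate the point biserial correlation between such variables. The np.corrcoef call returns a 2x2 correlation matrix whose off-diagonal entries give the Pearson correlation between a and b, ranging from -1 to 1. This matches the point biserial value, because the point biserial correlation is the Pearson correlation computed with a binary variable.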
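To show what the coefficient measures, the sketch below computes it by hand from the textbook formula r = (M1 - M0) / s * sqrt(n1 * n0 / n^2), where M1 and M0 are the group means, s is the population standard deviation of all values, and n1, n0, n are the group and total sizes. The data and variable names are hypothetical, and the result is compared with scipy's value.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a binary category (0/1) and a continuous measurement.
group = np.array([0, 0, 0, 1, 1, 1])
values = np.array([2.0, 3.0, 4.0, 6.0, 7.0, 9.0])

# Means and sizes of the two groups.
m1 = values[group == 1].mean()
m0 = values[group == 0].mean()
n1 = int((group == 1).sum())
n0 = int((group == 0).sum())
n = n1 + n0

# Population standard deviation (ddof=0) of all values.
s = values.std(ddof=0)

# Point biserial correlation from the formula.
r_manual = (m1 - m0) / s * np.sqrt(n1 * n0 / n ** 2)

# The same quantity from scipy, for comparison.
r_scipy, p_value = stats.pointbiserialr(group, values)

print(f"manual: {r_manual:.4f}")
print(f"scipy:  {r_scipy:.4f}")
```

The two numbers agree, and the sign is positive because the group coded as 1 has the larger mean.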

Key Takeaways

The ANOVA and Point Biserial tests can be used to calculate the correlations between categorical and continuous variables.
Both methods assume that the data is normally distributed and has equal variance across groups.
The point biserial method returns a correlation value between -1 and 1, where 0 represents no correlation between the variables.

Conclusion

In this article, we discussed the correlation between continuous and categorical variables, the core intuition behind it, and the methods for calculating it with code examples. This will help one to understand the concept better and carry out such analyses efficiently.

Parth Shukla

Updated on: 16-Jan-2023
