How to Convert Categorical Features to Numerical Features in Python?


In machine learning, data comes in different types, including numerical, categorical, and text data. Categorical features are features that take on a limited set of values, such as colors, genders, or countries. However, most machine learning algorithms require numerical features as inputs, which means we need to convert categorical features to numerical features before training our models.

In this article, we will explore various techniques to convert categorical features to numerical features in Python. We will discuss one-hot encoding, label encoding, binary encoding, count encoding, and target encoding, and provide examples of how to implement these techniques using the category_encoders library. By the end of this article, you will have a good understanding of how to handle categorical features in your machine learning projects.

Label Encoding

Label encoding is a technique used to convert categorical data to numerical data by assigning each category a unique integer value. For instance, a categorical feature like "color" with categories "red", "green", and "blue" can be assigned values 0, 1, and 2, respectively.

Label encoding is easy to implement and memory-efficient, requiring only a single column to store the encoded values. However, the assigned integers impose an arbitrary order on the categories, and some machine learning algorithms may interpret the encoded values as continuous, ordered variables, leading to incorrect results for nominal features.

To implement label encoding in Python, we can use the LabelEncoder class from the scikit-learn library. Here is an example:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['color_encoded'] = le.fit_transform(data['color'])

In this code, we first create an instance of the LabelEncoder class. We then fit the encoder to the "color" column of our dataset and transform the column to its encoded values.
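As a runnable illustration, the sketch below uses a small made-up DataFrame in place of the article's dataset. Note that LabelEncoder assigns integers to the categories in alphabetical order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical sample data standing in for the article's dataset
data = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

le = LabelEncoder()
data["color_encoded"] = le.fit_transform(data["color"])

# LabelEncoder numbers the sorted unique categories:
# blue -> 0, green -> 1, red -> 2
print(data["color_encoded"].tolist())  # [2, 1, 0, 1]
```

After fitting, le.classes_ holds the learned categories, and le.inverse_transform() maps the integers back to the original labels.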

One-Hot Encoding

One-hot encoding turns categories into numbers by creating a new binary feature for each category: if a row belongs to that category, its feature is set to 1 and all the others to 0. This technique suits nominal categorical features because it implies no order among the categories, but it can consume a lot of memory and slow down training when there are many categories.

To implement one-hot encoding in Python, we can use the get_dummies() function from the pandas library. Here is an example:

import pandas as pd

data = pd.read_csv('data.csv')
encoded_data = pd.get_dummies(data, columns=['color'])

In this code, we first read our dataset from a CSV file. We then use the get_dummies() function to create new binary features for each category in the "color" column.
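For a self-contained sketch (using a small made-up DataFrame instead of the CSV file), get_dummies() creates one indicator column per category, prefixed with the original column name:

```python
import pandas as pd

# hypothetical sample data standing in for the article's dataset
data = pd.DataFrame({"color": ["red", "green", "blue"]})

# one new binary column per category, in sorted category order
encoded = pd.get_dummies(data, columns=["color"])
print(list(encoded.columns))  # ['color_blue', 'color_green', 'color_red']
```

Passing drop_first=True drops one of the indicator columns, which avoids redundancy for linear models.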

Binary Encoding

Binary encoding is a technique that converts categorical features into a binary representation. For instance, we can assign the values 0, 1, and 2 to the categories of a feature called "color" and then convert them to binary representation: 0 becomes 00, 1 becomes 01, and 2 becomes 10. This technique combines the advantages of label encoding and one-hot encoding.

Binary encoding reduces memory usage, needing only about log2(N) columns for N categories instead of the N columns of one-hot encoding. However, the individual binary digits have no direct interpretation, the encoding may not accurately represent nominal categorical features, and it can become hard to reason about with many categories.

To implement binary encoding in Python, we can use the category_encoders library. Here is an example:

import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['color'])
encoded_data = encoder.fit_transform(data)

In this code, we first import the category_encoders library. We then create an instance of the BinaryEncoder class and specify the "color" column as the column to encode. We fit the encoder to our dataset and transform the column to its binary encoded values.
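To make the idea concrete without depending on the library, here is a hand-rolled sketch that derives the binary columns with plain pandas and NumPy. Note that category_encoders' BinaryEncoder numbers categories starting from 1 and handles unseen values, so its exact output differs; the sample data and column names here are made up:

```python
import pandas as pd

data = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# step 1: assign each category an integer code (alphabetical: blue=0, green=1, red=2)
codes = data["color"].astype("category").cat.codes.to_numpy()

# step 2: split each code into its binary digits, one column per bit
n_bits = max(int(codes.max()).bit_length(), 1)
for bit in range(n_bits - 1, -1, -1):
    data[f"color_bin_{bit}"] = (codes >> bit) & 1

# red -> 10, green -> 01, blue -> 00
print(data[["color_bin_1", "color_bin_0"]].values.tolist())
```

Two columns suffice here because two bits can represent up to four distinct codes.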

Count Encoding

Count encoding is a technique that replaces each category with the number of times it appears in the dataset. For instance, if a categorical feature called "color" has three categories, and "red" appears 10 times, "green" appears 5 times, and "blue" appears 3 times, we can replace "red" with 10, "green" with 5, and "blue" with 3.

Count encoding is useful for high-cardinality categorical features because it reduces the number of columns created through one-hot encoding. It also captures the frequency of the categories, but may not be ideal for ordinal categorical features where frequency does not necessarily indicate the order or ranking of the categories.

To implement count encoding in Python, we can use the category_encoders library. Here is an example:

import category_encoders as ce

encoder = ce.CountEncoder(cols=['color'])
encoded_data = encoder.fit_transform(data)

In this code, we first import the category_encoders library. We then create an instance of the CountEncoder class and specify the "color" column as the column to encode. We fit the encoder to our dataset and transform the column to its count encoded values.
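The same result can be reproduced without the library using value_counts() and map(), which is effectively what CountEncoder computes (the sample data below is made up):

```python
import pandas as pd

data = pd.DataFrame({"color": ["red", "red", "red", "green", "green", "blue"]})

# replace each category with the number of times it appears in the dataset
counts = data["color"].value_counts()
data["color_count"] = data["color"].map(counts)
print(data["color_count"].tolist())  # [3, 3, 3, 2, 2, 1]
```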

Target Encoding

Target encoding is a method that replaces each category with the average target value for that category. For example, if we have a categorical feature called "color" and a binary target variable, we can replace "red" with its average target value of 0.3, "green" with 0.6, and "blue" with 0.4. Target encoding works well for high-cardinality categorical features and can capture the relationship between categories and the target variable. However, it leaks information from the target into the features, so it may overfit when categories are rare or the target variable is imbalanced; in practice the encoder should be fitted on the training data only.

To implement target encoding in Python, we can use the category_encoders library. Here is an example:

import category_encoders as ce

encoder = ce.TargetEncoder(cols=['color'])
encoded_data = encoder.fit_transform(data, target)

In this code, we first import the category_encoders library. We then create an instance of the TargetEncoder class and specify the "color" column as the column to encode. We fit the encoder by passing both the features and the target variable to fit_transform(), which transforms the column to its target encoded values.
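To illustrate the core computation without the library, the sketch below replaces each category with the mean target of its rows using groupby(). Note that category_encoders' TargetEncoder additionally smooths rare categories toward the global target mean, so its exact numbers differ; the sample data here is made up:

```python
import pandas as pd

data = pd.DataFrame({"color": ["red", "red", "green", "blue", "green"]})
target = pd.Series([1, 0, 1, 0, 1])

# mean target per category: red=0.5, green=1.0, blue=0.0
means = target.groupby(data["color"]).mean()
data["color_target"] = data["color"].map(means)
print(data["color_target"].tolist())  # [0.5, 0.5, 1.0, 0.0, 1.0]
```

The smoothing done by the real encoder pulls categories with few rows toward the overall mean, which is exactly the safeguard against the overfitting risk mentioned above.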

Conclusion

To sum up, in this article we have covered different methods for converting categorical features to numerical features in Python: one-hot encoding, label encoding, binary encoding, count encoding, and target encoding. The choice of method depends on the type of categorical feature and the machine learning algorithm used. Converting categorical features to numerical features lets machine learning algorithms process categorical data, and choosing a suitable encoding can result in better models.

Updated on: 21-Jul-2023
