Handling Categorical Data in Python


Categorical data is data that takes only a limited number of distinct values, known as categories or levels. It comes in two kinds: nominal and ordinal. Nominal data has no intrinsic order (for example colors, genders, or animal species), while ordinal data is naturally ranked or ordered (for example customer satisfaction levels or educational attainment). This tutorial shows how to handle categorical data in Python.

Setup

pip install pandas
pip install scikit-learn
pip install category_encoders

Categorical data is often represented as text labels, while many machine learning algorithms require numerical input. Real-world datasets are full of such labels: customer demographics, product classifications, and geographic areas are just a few examples. These values must therefore be converted into a numerical format before being fed to a machine learning algorithm. This process is known as encoding. There are various techniques for encoding categorical data, including one-hot encoding, ordinal encoding, and target encoding.

Ways to Handle Categorical Data

Example 1 - One-Hot Encoding

One-Hot Encoding is a technique used to convert categorical data into numerical format. It creates one binary column for each category in the dataset. Each row contains a 1 in the column of the category it belongs to and 0s in all other columns.

The pandas and scikit-learn libraries provide functions to perform One-Hot Encoding. The following code snippet shows how to perform One-Hot Encoding using pandas and scikit-learn.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a pandas DataFrame with categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the DataFrame using the encoder
encoded_data = encoder.fit_transform(df)

# Convert the encoded data into a pandas DataFrame
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out())
print(encoded_df)

Output

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
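
The text above mentions that pandas can also perform one-hot encoding on its own. A minimal sketch using pd.get_dummies on the same sample data follows; the variable name dummies is chosen only for illustration, and dtype=int is passed so the result contains 0/1 integers rather than booleans.

```python
import pandas as pd

# Same sample data as in the scikit-learn example
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(df, columns=['color'], dtype=int)
print(dummies)
```

This is convenient for quick exploration. A fitted scikit-learn encoder is usually the better fit when the same encoding has to be applied later to new data.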

Example 2 - Ordinal Encoding

Ordinal encoding is a popular technique for encoding categorical data in which each category is assigned an integer according to its rank or order: the lowest-ranked category receives the smallest integer and the highest-ranked category the largest. This strategy is especially useful when the categories have a natural order, such as ratings (poor, fair, good, outstanding) or educational attainment (high school, college, graduate school). Let us perform ordinal encoding using pandas and the category_encoders package:

import pandas as pd
import category_encoders as ce

# create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# initialize the encoder
encoder = ce.OrdinalEncoder()

# encode the categorical feature
df['category_encoded'] = encoder.fit_transform(df['category'])

# print the encoded dataframe
print(df)

Output

  category  category_encoded
0      red                 1
1    green                 2
2     blue                 3
3      red                 1
4    green                 2

As you can see, red has been given the value 1, green the value 2, and blue the value 3. This encoding is based on the order in which the categories first appeared in the dataset, not on any intrinsic ranking. For truly ordinal data, the intended order should be specified explicitly.
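
When the categories do have a natural rank, the order can be stated explicitly instead of relying on the order of appearance. The following is a minimal sketch in plain pandas, using the ratings mentioned earlier; the column name rating and the list rating_order are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical sample data with naturally ordered categories
df = pd.DataFrame({'rating': ['good', 'poor', 'outstanding', 'fair', 'good']})

# Explicit rank order, from lowest to highest
rating_order = ['poor', 'fair', 'good', 'outstanding']

# An ordered categorical maps each label to its position in rating_order
df['rating'] = pd.Categorical(df['rating'], categories=rating_order, ordered=True)
df['rating_encoded'] = df['rating'].cat.codes
print(df)
```

Here the codes start at 0 for the lowest rank (poor) and increase with the rank, so comparisons between encoded values reflect the intended order.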

Example 3 - Target Encoding

Target Encoding is another technique used for encoding categorical data, particularly when dealing with high-cardinality features. It replaces each category with the average target value for that category, so a single numeric column is produced no matter how many categories there are. Target Encoding is useful when there is a strong relationship between the categorical feature and the target variable.

import pandas as pd
import category_encoders as ce

# create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green'], 'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# initialize the encoder
encoder = ce.TargetEncoder()

# encode the categorical feature
df['category_encoded'] = encoder.fit_transform(df['category'], df['target'])

# print the encoded dataframe
print(df)

In this example, we create a sample dataset with a single categorical feature called "category" and a corresponding target variable called "target". We import the category_encoders library and initialize a TargetEncoder object. We use the fit_transform() method to encode the categorical feature based on the target variable and add the encoded feature to the original dataframe.

Output

  category  target  category_encoded
0      red       1          0.585815
1    green       0          0.585815
2     blue       1          0.652043
3      red       0          0.585815
4    green       1          0.585815

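As the output shows, each value in the category column has been replaced by a number derived from the target. The values are not the raw per-category means (0.5 for red and green, 1.0 for blue): TargetEncoder blends each category mean with the overall target mean (0.6), so categories with few samples are pulled towards the overall mean. The exact numbers depend on the encoder's smoothing settings and may differ between versions of category_encoders. The fit_transform function takes two arguments: the column to be encoded and the target variable. When a whole DataFrame is passed instead of a single column, the cols option can be used to specify which columns to encode.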
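
The numbers in the output above can be reproduced in plain pandas, which helps to see what the encoder is doing. The sketch below computes the raw per-category means and then a sigmoid-weighted blend with the overall mean. The settings min_samples_leaf = 20 and smoothing = 10.0 are assumptions chosen because they reproduce the values shown above; this is an illustration of the idea, not a statement of the library's internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['red', 'green', 'blue', 'red', 'green'],
    'target': [1, 0, 1, 0, 1],
})

# Assumed smoothing settings (see the note above)
min_samples_leaf = 20
smoothing = 10.0

# Overall target mean and per-category mean and sample count
prior = df['target'].mean()
stats = df.groupby('category')['target'].agg(['mean', 'count'])

# Weight given to the category mean grows with the sample count
weight = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))

# Blend of overall mean and category mean
smoothed = prior * (1 - weight) + stats['mean'] * weight

df['category_encoded'] = df['category'].map(smoothed)
print(stats['mean'])
print(df)
```

With only one or two rows per category the weight is small, which is why all encoded values stay close to the overall mean of 0.6.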

Conclusion

This article covered the significance of handling categorical data properly in machine learning applications. It looked at three distinct methods for encoding categorical data in Python: one-hot encoding, ordinal encoding, and target encoding. One-hot encoding is simple and makes no assumption about order, but it can add a large number of features when a column has many categories. Ordinal encoding is a reasonable option when the order of the categories is known, but it does not capture the connection between the categories and the target variable. Target encoding captures that connection in a single column, but because it uses the target itself it should be fitted on training data only to avoid leaking information.

Hence, managing categorical data is a crucial component of machine learning systems, and selecting the proper encoding method is key for producing accurate and trustworthy results.

Updated on: 18-Apr-2023
