How to Convert Categorical Features to Numerical Features in Python?
In machine learning, data comes in different types, including numerical, categorical, and text data. Categorical features are features that take on a limited set of values, such as colors, genders, or countries. However, most machine learning algorithms require numerical features as inputs, which means we need to convert categorical features to numerical features before training our models.
In this article, we will explore various techniques to convert categorical features to numerical features in Python. We will discuss label encoding, one-hot encoding, binary encoding, count encoding, and target encoding, with complete working examples.
Label Encoding
Label encoding converts categorical data to numerical data by assigning each category a unique integer value. For instance, a categorical feature like "color" with categories "red", "green", and "blue" can be assigned values 0, 1, and 2, respectively.
Label encoding is memory-efficient, requiring only a single column. However, it may introduce an artificial ordering that doesn't exist in the data, which can mislead algorithms that interpret larger values as "greater".
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']
})
print("Original data:")
print(data)
# Apply label encoding
le = LabelEncoder()
data['color_encoded'] = le.fit_transform(data['color'])
print("\nAfter label encoding:")
print(data)
Original data:
   color
0    red
1  green
2   blue
3    red
4  green
5   blue
6    red

After label encoding:
   color  color_encoded
0    red              2
1  green              1
2   blue              0
3    red              2
4  green              1
5   blue              0
6    red              2
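Because LabelEncoder stores the learned mapping, the transformation can be reversed, which is handy for mapping model predictions back to the original labels. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are assigned integers in sorted order: blue=0, green=1, red=2
codes = le.fit_transform(['red', 'green', 'blue', 'red'])
print(list(codes))                          # [2, 1, 0, 2]
# inverse_transform recovers the original category labels
print(list(le.inverse_transform(codes)))    # ['red', 'green', 'blue', 'red']
```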
One-Hot Encoding
One-hot encoding creates a new binary feature for each category. If a row belongs to that category, its feature gets a 1 while the others get a 0. This technique preserves the categorical nature of the data without introducing artificial ordering.
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green']
})
print("Original data:")
print(data)
# Apply one-hot encoding; newer pandas versions return boolean dummy
# columns, so cast to int for 0/1 output
data_encoded = pd.get_dummies(data, columns=['color'], prefix='color').astype(int)
print("\nAfter one-hot encoding:")
print(data_encoded)
Original data:
   color
0    red
1  green
2   blue
3    red
4  green

After one-hot encoding:
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
Binary Encoding
Binary encoding first assigns integer values to categories (like label encoding), then converts these integers to their binary representation, spreading the bits across separate columns. This reduces memory usage compared to one-hot encoding.
# Install category_encoders: pip install category-encoders
import pandas as pd
import category_encoders as ce
# Create sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'yellow', 'red', 'green']
})
print("Original data:")
print(data)
# Apply binary encoding
encoder = ce.BinaryEncoder(cols=['color'])
data_encoded = encoder.fit_transform(data)
print("\nAfter binary encoding:")
print(data_encoded)
Original data:
    color
0     red
1   green
2    blue
3  yellow
4     red
5   green

After binary encoding:
   color_0  color_1  color_2
0        0        0        1
1        0        1        0
2        0        1        1
3        1        0        0
4        0        0        1
5        0        1        0
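The memory saving comes from the column count: one-hot encoding needs one column per category, while binary encoding needs only enough bits to represent the ordinals, roughly log2 of the number of categories. A quick back-of-the-envelope check (assuming ordinals start at 1, as category_encoders assigns them):

```python
import math

def binary_columns(n_categories):
    # Bits needed to represent ordinals 1..n_categories in binary
    return math.ceil(math.log2(n_categories + 1))

for n in (4, 100, 1000):
    print(f"{n} categories -> one-hot: {n} columns, binary: {binary_columns(n)} columns")
```

For the four colors above this gives 3 columns (color_0 through color_2), versus 4 for one-hot; at 1000 categories the gap grows to 10 columns versus 1000.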
Count Encoding
Count encoding replaces each category with the number of times it appears in the dataset. This is useful for high-cardinality categorical features, where one-hot encoding would produce an unmanageable number of columns.
import pandas as pd
import category_encoders as ce
# Create sample data with different frequencies
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'red', 'green', 'blue', 'red']
})
print("Original data:")
print(data)
print("\nValue counts:")
print(data['color'].value_counts())
# Apply count encoding
encoder = ce.CountEncoder(cols=['color'])
data_encoded = encoder.fit_transform(data)
print("\nAfter count encoding:")
print(data_encoded)
Original data:
   color
0    red
1  green
2   blue
3    red
4    red
5  green
6   blue
7    red

Value counts:
red      4
green    2
blue     2
Name: color, dtype: int64

After count encoding:
   color
0      4
1      2
2      2
3      4
4      4
5      2
6      2
7      4
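If you prefer not to add the category_encoders dependency, the same result can be obtained with plain pandas by mapping each category to its frequency:

```python
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'red', 'green', 'blue', 'red']
})
# Map each category to how often it appears in the column
data['color_count'] = data['color'].map(data['color'].value_counts())
print(data)
```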
Target Encoding
Target encoding replaces each category with the average target value for that category. This method captures the relationship between categories and the target variable, but it must be used carefully, since it injects information about the target into the features.
import pandas as pd
import category_encoders as ce
# Create sample data with target variable
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green']
})
target = pd.Series([1, 0, 1, 1, 0, 0, 1, 1]) # Binary target
print("Original data with target:")
combined = pd.concat([data, target.rename('target')], axis=1)
print(combined)
print("\nTarget mean by color:")
print(combined.groupby('color')['target'].mean())
# Apply target encoding
# Note: ce.TargetEncoder smooths each category mean toward the global
# target mean by default, so real output can differ slightly from the
# raw per-category means
encoder = ce.TargetEncoder(cols=['color'])
data_encoded = encoder.fit_transform(data, target)
print("\nAfter target encoding:")
print(data_encoded)
Original data with target:
color target
0 red 1
1 green 0
2 blue 1
3 red 1
4 green 0
5 blue 0
6 red 1
7 green 1
Target mean by color:
color
blue 0.5
green 0.333333
red 1.0
Name: target, dtype: float64
After target encoding:
color
0 1.000000
1 0.333333
2 0.500000
3 1.000000
4 0.333333
5 0.500000
6 1.000000
7 0.333333
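Because target encoding uses the label itself, it can leak target information and overfit, especially on rare categories. A common mitigation is out-of-fold encoding: each row is encoded with category means computed only from the other folds. A hand-rolled sketch (the function oof_target_encode is our own illustration, not a library API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(series, target, n_splits=4, seed=0):
    # Each row gets the category mean computed on the *other* folds,
    # falling back to the global mean for categories unseen in a fold
    encoded = pd.Series(np.nan, index=series.index, dtype=float)
    global_mean = target.mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(series):
        fold_means = target.iloc[train_idx].groupby(series.iloc[train_idx]).mean()
        vals = series.iloc[val_idx].map(fold_means).fillna(global_mean)
        encoded.iloc[val_idx] = vals.to_numpy()
    return encoded

colors = pd.Series(['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green'])
y = pd.Series([1, 0, 1, 1, 0, 0, 1, 1])
print(oof_target_encode(colors, y))
```

The exact encoded values depend on the fold assignment, but no row ever sees its own label, which is the point of the technique.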
Comparison
| Method | Memory Usage | Best For | Limitations |
|---|---|---|---|
| Label Encoding | Low | Ordinal features | Introduces artificial ordering |
| One-Hot Encoding | High | Nominal features | Many columns with high cardinality |
| Binary Encoding | Medium | High cardinality features | Less interpretable |
| Count Encoding | Low | Frequency matters | Loses categorical meaning |
| Target Encoding | Low | High cardinality with target | Risk of overfitting |
Conclusion
Converting categorical features to numerical features is essential for machine learning. Choose label encoding for ordinal data, one-hot encoding for nominal data with low cardinality, and binary/target encoding for high-cardinality features. The choice depends on your data characteristics and algorithm requirements.
