How to Convert Categorical Features to Numerical Features in Python?

In machine learning, data comes in different types, including numerical, categorical, and text data. Categorical features are features that take on a limited set of values, such as colors, genders, or countries. However, most machine learning algorithms require numerical features as inputs, which means we need to convert categorical features to numerical features before training our models.

In this article, we will explore various techniques to convert categorical features to numerical features in Python. We will discuss label encoding, one-hot encoding, binary encoding, count encoding, and target encoding, with complete working examples.

Label Encoding

Label encoding converts categorical data to numerical data by assigning each category a unique integer value. For instance, a categorical feature like "color" with categories "red", "green", and "blue" can be assigned values 0, 1, and 2, respectively.

Label encoding is memory-efficient, requiring only a single column. However, it introduces an artificial ordering (here blue < green < red, since classes are sorted alphabetically) that may not exist in the data and can mislead algorithms that interpret the integers as magnitudes.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Create sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue', 'red']
})

print("Original data:")
print(data)

# Apply label encoding
le = LabelEncoder()
data['color_encoded'] = le.fit_transform(data['color'])

print("\nAfter label encoding:")
print(data)
Original data:
   color
0    red
1  green
2   blue
3    red
4  green
5   blue
6    red

After label encoding:
   color  color_encoded
0    red              2
1  green              1
2   blue              0
3    red              2
4  green              1
5   blue              0
6    red              2
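LabelEncoder also supports the reverse mapping. A small sketch showing that the codes are assigned alphabetically and can be decoded back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['red', 'green', 'blue', 'red'])

# Classes are sorted alphabetically, so blue=0, green=1, red=2
print(le.classes_.tolist())  # ['blue', 'green', 'red']
print(codes.tolist())        # [2, 1, 0, 2]

# inverse_transform recovers the original labels from the codes
print(le.inverse_transform(codes).tolist())  # ['red', 'green', 'blue', 'red']
```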

One-Hot Encoding

One-hot encoding creates a new binary feature for each category: a row gets a 1 in the column for its category and a 0 in every other column. This preserves the categorical nature of the data without introducing an artificial ordering.

import pandas as pd

# Create sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green']
})

print("Original data:")
print(data)

# Apply one-hot encoding
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
data_encoded = pd.get_dummies(data, columns=['color'], prefix='color', dtype=int)

print("\nAfter one-hot encoding:")
print(data_encoded)
Original data:
   color
0    red
1  green
2   blue
3    red
4  green

After one-hot encoding:
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0

Binary Encoding

Binary encoding first assigns an integer code to each category (as in label encoding), then represents that integer in binary, one column per bit. For n categories this needs only about log2(n) columns, so it uses far less memory than one-hot encoding's n columns.

# Install category_encoders: pip install category-encoders
import pandas as pd
import category_encoders as ce

# Create sample data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'yellow', 'red', 'green']
})

print("Original data:")
print(data)

# Apply binary encoding
encoder = ce.BinaryEncoder(cols=['color'])
data_encoded = encoder.fit_transform(data)

print("\nAfter binary encoding:")
print(data_encoded)
Original data:
   color
0    red
1  green
2   blue
3 yellow
4    red
5  green

After binary encoding:
   color_0  color_1  color_2
0        0        0        1
1        0        1        0
2        0        1        1
3        1        0        0
4        0        0        1
5        0        1        0
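To see what BinaryEncoder is doing under the hood, the same result can be reproduced with plain pandas. This sketch assumes, as the output above suggests, that categories receive integer codes in order of first appearance, starting at 1:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'yellow', 'red', 'green'])

# Step 1: integer codes in order of first appearance, starting at 1
# (red=1, green=2, blue=3, yellow=4)
categories = colors.drop_duplicates().tolist()
codes = colors.map({c: i + 1 for i, c in enumerate(categories)}).to_numpy()

# Step 2: spell each code in binary, one column per bit (codes 1..4 -> 3 bits)
n_bits = int(codes.max()).bit_length()
binary = pd.DataFrame(
    {f'color_{b}': (codes >> (n_bits - 1 - b)) & 1 for b in range(n_bits)}
)
print(binary)
```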

Count Encoding

Count encoding replaces each category with the number of times it appears in the dataset. This is useful for high-cardinality categorical features where frequency is informative, though categories with equal counts become indistinguishable.

import pandas as pd
import category_encoders as ce

# Create sample data with different frequencies
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'red', 'green', 'blue', 'red']
})

print("Original data:")
print(data)
print("\nValue counts:")
print(data['color'].value_counts())

# Apply count encoding
encoder = ce.CountEncoder(cols=['color'])
data_encoded = encoder.fit_transform(data)

print("\nAfter count encoding:")
print(data_encoded)
Original data:
   color
0    red
1  green
2   blue
3    red
4    red
5  green
6   blue
7    red

Value counts:
red      4
green    2
blue     2
Name: color, dtype: int64

After count encoding:
   color
0      4
1      2
2      2
3      4
4      4
5      2
6      2
7      4
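Count encoding needs no extra library; value_counts and map reproduce the same result in plain pandas:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'red', 'red', 'green', 'blue', 'red'])

# Map every category to its frequency in the column
encoded = colors.map(colors.value_counts())
print(encoded.tolist())  # [4, 2, 2, 4, 4, 2, 2, 4]
```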

Target Encoding

Target encoding replaces each category with the average target value for that category, which captures the relationship between the categories and the target variable. Note that category_encoders' TargetEncoder applies smoothing toward the global target mean by default, so its output can differ slightly from the raw per-category means, and the method can overfit because it uses the target itself.

import pandas as pd
import category_encoders as ce

# Create sample data with target variable
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green']
})
target = pd.Series([1, 0, 1, 1, 0, 0, 1, 1])  # Binary target

print("Original data with target:")
combined = pd.concat([data, target.rename('target')], axis=1)
print(combined)

print("\nTarget mean by color:")
print(combined.groupby('color')['target'].mean())

# Apply target encoding
encoder = ce.TargetEncoder(cols=['color'])
data_encoded = encoder.fit_transform(data, target)

print("\nAfter target encoding:")
print(data_encoded)
Original data with target:
   color  target
0    red       1
1  green       0
2   blue       1
3    red       1
4  green       0
5   blue       0
6    red       1
7  green       1

Target mean by color:
color
blue     0.5
green    0.333333
red      1.0
Name: target, dtype: float64

After target encoding:
      color
0  1.000000
1  0.333333
2  0.500000
3  1.000000
4  0.333333
5  0.500000
6  1.000000
7  0.333333
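Because target encoding uses the labels themselves, fitting and transforming on the same rows can leak target information into the features. A common mitigation is out-of-fold encoding, where each row is encoded with means computed on the other folds. A hypothetical helper sketching the idea (kfold_target_encode is not part of category_encoders):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(feature, target, n_splits=4, seed=0):
    """Encode each row with per-category target means from the other folds."""
    encoded = pd.Series(np.nan, index=feature.index, dtype=float)
    global_mean = target.mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(feature):
        # Means computed only on the training folds...
        fold_means = target.iloc[train_idx].groupby(feature.iloc[train_idx]).mean()
        # ...applied to the held-out fold; unseen categories get the global mean
        encoded.iloc[valid_idx] = (
            feature.iloc[valid_idx].map(fold_means).fillna(global_mean).to_numpy()
        )
    return encoded

feature = pd.Series(['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green'])
target = pd.Series([1, 0, 1, 1, 0, 0, 1, 1])
print(kfold_target_encode(feature, target))
```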

Comparison

Method           | Memory Usage | Best For                       | Limitations
Label Encoding   | Low          | Ordinal features               | Introduces artificial ordering
One-Hot Encoding | High         | Nominal features               | Many columns with high cardinality
Binary Encoding  | Medium       | High-cardinality features      | Less interpretable
Count Encoding   | Low          | Frequency-informative features | Loses categorical meaning
Target Encoding  | Low          | High cardinality with a target | Risk of overfitting

Conclusion

Converting categorical features to numerical features is essential for machine learning. Choose label encoding for ordinal data, one-hot encoding for nominal data with low cardinality, and binary/target encoding for high-cardinality features. The choice depends on your data characteristics and algorithm requirements.

Updated on: 2026-03-27T09:17:27+05:30
