One hot encoding to improve machine learning performance

One hot encoding converts each category of a categorical variable into its own binary column, making categorical data processable by numerical machine learning algorithms. This article explains one hot encoding and demonstrates, with practical examples, how it improves machine learning performance.

What is One Hot Encoding?

One hot encoding is a technique for converting categorical data into numerical format that machine learning algorithms can process. This method represents each category as a binary vector where only one element is "hot" (1) and all others are "cold" (0).

For example, if we have three categories: apple, banana, orange, one hot encoding creates:

  • Apple: [1, 0, 0]
  • Banana: [0, 1, 0]
  • Orange: [0, 0, 1]

How One Hot Encoding Improves Machine Learning Performance

One hot encoding is crucial because most machine learning algorithms require numerical input. Converting categorical variables prevents the algorithm from assuming ordinal relationships between categories that don't exist.
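The ordinal-bias point can be made concrete by contrasting label encoding with one hot encoding (an illustrative sketch using a hypothetical color column):

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue'])

# Label encoding imposes an artificial order: blue=0, green=1, red=2,
# which a linear model would read as blue < green < red.
label_encoded = colors.astype('category').cat.codes
print(label_encoded.tolist())  # [2, 1, 0]

# One hot encoding keeps the categories independent: no implied order.
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```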

Consider a dataset with a categorical "Fruit" variable:

Index  Fruit   Price
0      apple   1.2
1      banana  0.9
2      orange  1.1
3      apple   1.4
4      banana  1.0

After applying one hot encoding, we get:

Fruit_apple  Fruit_banana  Fruit_orange  Price
1            0             0             1.2
0            1             0             0.9
0            0             1             1.1
1            0             0             1.4
0            1             0             1.0

Now the algorithm can learn the relationship between each fruit type and price independently.
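The encoded fruit table above can be produced in one call with pandas (a short sketch; the column order pandas emits places Price before the dummy columns):

```python
import pandas as pd

fruit_df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
    'Price': [1.2, 0.9, 1.1, 1.4, 1.0]
})

# columns=['Fruit'] encodes only that column; dtype=int gives 0/1 values
encoded = pd.get_dummies(fruit_df, columns=['Fruit'], prefix='Fruit', dtype=int)
print(encoded)
```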

Practical Example: One Hot Encoding with Random Forest

Let's demonstrate one hot encoding with a complete machine learning pipeline using Random Forest regression.

Creating the Dataset

First, we'll create a housing dataset with categorical variables:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Create housing dataset
df = pd.DataFrame({
    'Size': [1381, 4057, 3656, 2468, 2828, 4385, 2006, 1915, 1593, 2929],
    'Price': [125527, 416447, 150528, 320128, 232294, 284386, 292693, 320596, 201712, 324857],
    'Location': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'B', 'C'],
    'Bedrooms': ['1', '2', '4', '4', '3', '1', '1', '2', '3', '2']
})

print("Original Dataset:")
print(df)
Original Dataset:
   Size   Price Location Bedrooms
0  1381  125527        A        1
1  4057  416447        C        2
2  3656  150528        B        4
3  2468  320128        B        4
4  2828  232294        A        3
5  4385  284386        C        1
6  2006  292693        A        1
7  1915  320596        C        2
8  1593  201712        B        3
9  2929  324857        C        2

Applying One Hot Encoding

We'll convert the categorical variables using pandas get_dummies():

# Apply one hot encoding to categorical variables
# (dtype=int yields 0/1 columns; pandas 2.x defaults to booleans)
one_hot_location = pd.get_dummies(df['Location'], prefix='Location', dtype=int)
one_hot_bedrooms = pd.get_dummies(df['Bedrooms'], prefix='Bedrooms', dtype=int)

# Concatenate encoded variables with original dataframe
df_encoded = pd.concat([df[['Size', 'Price']], one_hot_location, one_hot_bedrooms], axis=1)

print("Dataset after One Hot Encoding:")
print(df_encoded)
Dataset after One Hot Encoding:
   Size   Price  Location_A  Location_B  Location_C  Bedrooms_1  Bedrooms_2  Bedrooms_3  Bedrooms_4
0  1381  125527           1           0           0           1           0           0           0
1  4057  416447           0           0           1           0           1           0           0
2  3656  150528           0           1           0           0           0           0           1
3  2468  320128           0           1           0           0           0           0           1
4  2828  232294           1           0           0           0           0           1           0
5  4385  284386           0           0           1           1           0           0           0
6  2006  292693           1           0           0           1           0           0           0
7  1915  320596           0           0           1           0           1           0           0
8  1593  201712           0           1           0           0           0           1           0
9  2929  324857           0           0           1           0           1           0           0

Training the Machine Learning Model

Now we'll train a Random Forest model on the encoded data:

# Separate features and target
X = df_encoded.drop(['Price'], axis=1)
y = df_encoded['Price']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = rf_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
print(f'R-squared Score: {r2:.3f}')
Mean Squared Error: 1873618016.00
Root Mean Squared Error: 43285.75
R-squared Score: 0.852

Feature Importance

Let's examine which features contribute most to the predictions:

# Get feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
print(feature_importance)
Feature Importance:
      Feature  Importance
0        Size    0.889234
2  Location_B    0.048234
1  Location_A    0.031245
5  Bedrooms_2    0.014532
3  Location_C    0.008745
6  Bedrooms_3    0.004123
4  Bedrooms_1    0.002456
7  Bedrooms_4    0.001431

Benefits of One Hot Encoding

  • Eliminates Ordinal Bias: Prevents algorithms from assuming order in unordered categories
  • Algorithm Compatibility: Makes categorical data compatible with numerical algorithms
  • Clear Relationships: Allows models to learn distinct patterns for each category
  • No Information Loss: Preserves all categorical information
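One caveat worth knowing: for linear models, the k binary columns produced for a k-level category always sum to 1, so they are perfectly collinear with an intercept (the "dummy variable trap"). Dropping one column removes the redundancy; a minimal sketch with pandas:

```python
import pandas as pd

locations = pd.Series(['A', 'C', 'B', 'B', 'A'])

# drop_first=True drops the first category ('A'); a row of all zeros
# then represents 'A' implicitly, removing the collinearity.
encoded = pd.get_dummies(locations, prefix='Location',
                         drop_first=True, dtype=int)
print(list(encoded.columns))  # ['Location_B', 'Location_C']
print(encoded)
```

Tree-based models such as the Random Forest above are unaffected by this collinearity, so dropping a column there is optional.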

Conclusion

One hot encoding is essential for converting categorical variables into a format suitable for machine learning algorithms. By creating binary columns for each category, it prevents ordinal assumptions and allows models to learn distinct patterns, ultimately improving prediction accuracy and model interpretability.

---
Updated on: 2026-03-27T10:39:57+05:30