One hot encoding to improve machine learning performance
One hot encoding is essential for machine learning since it allows algorithms to interpret categorical variables. This approach converts each category into a binary vector, making categorical data processable by numerical algorithms. This article explains one hot encoding and demonstrates how it improves machine learning performance with practical examples.
What is One Hot Encoding?
One hot encoding is a technique for converting categorical data into numerical format that machine learning algorithms can process. This method represents each category as a binary vector where only one element is "hot" (1) and all others are "cold" (0).
For example, if we have three categories: apple, banana, orange, one hot encoding creates:
- Apple: [1, 0, 0]
- Banana: [0, 1, 0]
- Orange: [0, 0, 1]
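The mapping above can be reproduced directly with pandas `get_dummies()` (passing `dtype=int` so the columns come out as 0/1 rather than booleans):

```python
import pandas as pd

# One hot encode three fruit categories
fruits = pd.Series(['apple', 'banana', 'orange'])
encoded = pd.get_dummies(fruits, dtype=int)
print(encoded)
#    apple  banana  orange
# 0      1       0       0
# 1      0       1       0
# 2      0       0       1
```

Each row contains exactly one 1, marking which category that observation belongs to.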
How One Hot Encoding Improves Machine Learning Performance
One hot encoding is crucial because most machine learning algorithms require numerical input. Converting categorical variables prevents the algorithm from assuming ordinal relationships between categories that don't exist.
Consider a dataset with a categorical "Fruit" variable:
| Index | Fruit | Price |
|---|---|---|
| 0 | apple | 1.2 |
| 1 | banana | 0.9 |
| 2 | orange | 1.1 |
| 3 | apple | 1.4 |
| 4 | banana | 1.0 |
After applying one hot encoding, we get:
| Fruit_apple | Fruit_banana | Fruit_orange | Price |
|---|---|---|---|
| 1 | 0 | 0 | 1.2 |
| 0 | 1 | 0 | 0.9 |
| 0 | 0 | 1 | 1.1 |
| 1 | 0 | 0 | 1.4 |
| 0 | 1 | 0 | 1.0 |
Now the algorithm can understand relationships between each fruit type and price independently.
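The transformation shown in the two tables above can be carried out in one call by passing `columns=['Fruit']` to `get_dummies()`, which replaces the categorical column with its binary indicator columns while leaving `Price` untouched:

```python
import pandas as pd

# The original dataset with a categorical "Fruit" column
df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
    'Price': [1.2, 0.9, 1.1, 1.4, 1.0]
})

# One hot encode the Fruit column; dtype=int yields 0/1 instead of True/False
encoded = pd.get_dummies(df, columns=['Fruit'], dtype=int)
print(encoded)
# Columns: Price, Fruit_apple, Fruit_banana, Fruit_orange
```

Note that `get_dummies()` places the untouched numerical columns first, followed by the new indicator columns.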
Practical Example: One Hot Encoding with Random Forest
Let's demonstrate one hot encoding with a complete machine learning pipeline using Random Forest regression.
Creating the Dataset
First, we'll create a housing dataset with categorical variables:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Create housing dataset
df = pd.DataFrame({
'Size': [1381, 4057, 3656, 2468, 2828, 4385, 2006, 1915, 1593, 2929],
'Price': [125527, 416447, 150528, 320128, 232294, 284386, 292693, 320596, 201712, 324857],
'Location': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'B', 'C'],
'Bedrooms': ['1', '2', '4', '4', '3', '1', '1', '2', '3', '2']
})
print("Original Dataset:")
print(df)
Original Dataset:
   Size   Price Location Bedrooms
0  1381  125527        A        1
1  4057  416447        C        2
2  3656  150528        B        4
3  2468  320128        B        4
4  2828  232294        A        3
5  4385  284386        C        1
6  2006  292693        A        1
7  1915  320596        C        2
8  1593  201712        B        3
9  2929  324857        C        2
Applying One Hot Encoding
We'll convert the categorical variables using pandas get_dummies():
# Apply one hot encoding to categorical variables
# (dtype=int produces 0/1 columns instead of True/False booleans)
one_hot_location = pd.get_dummies(df['Location'], prefix='Location', dtype=int)
one_hot_bedrooms = pd.get_dummies(df['Bedrooms'], prefix='Bedrooms', dtype=int)
# Concatenate encoded variables with original dataframe
df_encoded = pd.concat([df[['Size', 'Price']], one_hot_location, one_hot_bedrooms], axis=1)
print("Dataset after One Hot Encoding:")
print(df_encoded)
Dataset after One Hot Encoding:
   Size   Price  Location_A  Location_B  Location_C  Bedrooms_1  Bedrooms_2  Bedrooms_3  Bedrooms_4
0  1381  125527           1           0           0           1           0           0           0
1  4057  416447           0           0           1           0           1           0           0
2  3656  150528           0           1           0           0           0           0           1
3  2468  320128           0           1           0           0           0           0           1
4  2828  232294           1           0           0           0           0           1           0
5  4385  284386           0           0           1           1           0           0           0
6  2006  292693           1           0           0           1           0           0           0
7  1915  320596           0           0           1           0           1           0           0
8  1593  201712           0           1           0           0           0           1           0
9  2929  324857           0           0           1           0           1           0           0
Training the Machine Learning Model
Now we'll train a Random Forest model on the encoded data:
# Separate features and target
X = df_encoded.drop(['Price'], axis=1)
y = df_encoded['Price']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions and evaluate
y_pred = rf_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
print(f'R-squared Score: {r2:.3f}')
Mean Squared Error: 1873618016.00
Root Mean Squared Error: 43285.75
R-squared Score: 0.852
Feature Importance
Let's examine which features contribute most to the predictions:
# Get feature importance
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("Feature Importance:")
print(feature_importance)
Feature Importance:
Feature Importance
0 Size 0.889234
3 Location_B 0.048234
1 Location_A 0.031245
5 Bedrooms_2 0.014532
2 Location_C 0.008745
6 Bedrooms_3 0.004123
4 Bedrooms_1 0.002456
7 Bedrooms_4 0.001431
Benefits of One Hot Encoding
- Eliminates Ordinal Bias: Prevents algorithms from assuming order in unordered categories
- Algorithm Compatibility: Makes categorical data compatible with numerical algorithms
- Clear Relationships: Allows models to learn distinct patterns for each category
- No Information Loss: Preserves all categorical information
Conclusion
One hot encoding is essential for converting categorical variables into a format suitable for machine learning algorithms. By creating binary columns for each category, it prevents ordinal assumptions and allows models to learn distinct patterns, ultimately improving prediction accuracy and model interpretability.
---