One-Hot Encoding to Improve Machine Learning Performance


One-hot encoding is essential in machine learning because it lets algorithms work with categorical variables: each category is represented as a binary vector, which is simple for a model to process. This article explains one-hot encoding and walks through a practical project, with sample data and code, showing how it can improve machine learning performance.

What is One-Hot Encoding?

One-hot encoding is a technique for representing categorical data in a form that machine learning algorithms can process efficiently. It converts each category into a binary vector whose length equals the number of categories, with a 1 in the position corresponding to that category and 0s everywhere else.
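
As a minimal sketch of the idea (plain Python, our own illustration rather than library code), each category maps to a vector with a single 1 at that category's position −

# a minimal sketch: one binary slot per distinct category
categories = ['apple', 'banana', 'orange']

def one_hot(value, categories):
   """Return a binary vector with a 1 at the category's position."""
   return [1 if value == c else 0 for c in categories]

print(one_hot('banana', categories))   # [0, 1, 0]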

How Can One-Hot Encoding Improve Machine Learning Performance?

One-hot encoding is a pre-processing technique for categorical variables in machine learning models. Categorical information must be converted into numerical variables before a model can use it. This matters because the majority of machine learning algorithms cannot interpret categorical data directly and instead require numerical input variables.
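
To see this concretely, here is a small sketch (our own example, not part of the project below): fitting a scikit-learn model directly on string features raises an error −

# scikit-learn estimators reject raw string features
from sklearn.linear_model import LinearRegression

X = [['apple'], ['banana'], ['orange']]   # categorical feature as strings
y = [1.2, 0.9, 1.1]

try:
   LinearRegression().fit(X, y)
except ValueError as err:
   print('Strings are rejected:', err)   # could not convert string to float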

To illustrate, let's look at a dataset where the categorical variable "Fruits" can take the values "apple," "banana," or "orange." The following table displays this dataset −

Index   Fruits   Price
0       apple    1.2
1       banana   0.9
2       orange   1.1
3       apple    1.4
4       banana   1.0

To apply one-hot encoding to the "Fruits" variable, we first create three new binary variables: "Fruits_apple," "Fruits_banana," and "Fruits_orange." Then, for each row of the original dataset, a binary variable is set to 1 if the row belongs to the corresponding category and to 0 otherwise. The table after one-hot encoding looks like this −

Fruits_apple   Fruits_banana   Fruits_orange   Price
1              0               0               1.2
0              1               0               0.9
0              0               1               1.1
1              0               0               1.4
0              1               0               1.0

As we can see, the categorical variable "Fruits" has been split into three binary variables that machine learning algorithms can interpret directly.

Thanks to one-hot encoding, a machine learning model can now learn the relationship between each fruit variety and its price and produce more accurate predictions.
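
For reference, this exact encoding can be reproduced with the pandas get_dummies function; the following is a minimal sketch mirroring the table above (the dtype=int argument keeps the output as 0/1, since newer pandas versions otherwise return booleans) −

import pandas as pd

fruits = pd.DataFrame({
   'Fruits': ['apple', 'banana', 'orange', 'apple', 'banana'],
   'Price': [1.2, 0.9, 1.1, 1.4, 1.0]
})

# expand "Fruits" into one binary column per category
encoded = pd.get_dummies(fruits, columns=['Fruits'], dtype=int)
print(encoded)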

One-Hot Encoding to Improve Machine Learning Performance with the Random Forest Algorithm

We will now look at how to employ one-hot encoding to improve machine learning performance when dealing with categorical data. We will create a new dataset, transform its categorical variables with one-hot encoding, and train a machine learning model with the Random Forest algorithm. The project is implemented in Python using the scikit-learn library.

Importing libraries and creating datasets

Let's begin by building a new dataset with the four variables "Size," "Price," "Location," and "Bedrooms." The categorical variable "Location" has three possible values: "A," "B," and "C," while the categorical variable "Bedrooms" has four possible values: "1," "2," "3," and "4".

import pandas as pd
import numpy as np

# create new dataset
df = pd.DataFrame({
   'Size': [1381, 4057, 3656, 2468, 2828, 4385, 2006, 1915, 1593, 2929],
   'Price': [125527, 416447, 150528, 320128, 232294, 284386, 292693, 320596, 201712, 324857],
   'Location': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'B', 'C'],
   'Bedrooms': ['1', '2', '4', '4', '3', '1', '1', '2', '3', '2']
})

# display dataset
print(df)

Output

Size   Price Location Bedrooms
0  1381  125527        A        1
1  4057  416447        C        2
2  3656  150528        B        4
3  2468  320128        B        4
4  2828  232294        A        3
5  4385  284386        C        1
6  2006  292693        A        1
7  1915  320596        C        2
8  1593  201712        B        3
9  2929  324857        C        2

Applying One-Hot Encoding

Next, we will transform the categorical variables "Location" and "Bedrooms" with one-hot encoding. We will use the pandas get_dummies function for this step.

# performing one hot encoding
one_hot_location = pd.get_dummies(df['Location'], prefix='Location')
one_hot_bedrooms = pd.get_dummies(df['Bedrooms'], prefix='Bedrooms')

# concatenating one hot encoding with original dataframe
df = pd.concat([df, one_hot_location, one_hot_bedrooms], axis=1)

# dropping original categorical variables
df = df.drop(['Location', 'Bedrooms'], axis=1)

# displaying updated dataset
print(df)

Output

  Size   Price  Location_A  Location_B  Location_C  Bedrooms_1  Bedrooms_2  \
0  1381  125527           1           0           0           1           0   
1  4057  416447           0           0           1           0           1   
2  3656  150528           0           1           0           0           0   
3  2468  320128           0           1           0           0           0   
4  2828  232294           1           0           0           0           0   
5  4385  284386           0           0           1           1           0   
6  2006  292693           1           0           0           1           0   
7  1915  320596           0           0           1           0           1   
8  1593  201712           0           1           0           0           0   
9  2929  324857           0           0           1           0           1   

   Bedrooms_3  Bedrooms_4  
0           0           0  
1           0           0  
2           0           1  
3           0           1  
4           1           0  
5           0           0  
6           0           0  
7           0           0  
8           1           0  
9           0           0  
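
As a side note, the two get_dummies calls plus the concat and drop steps can be collapsed into a single call on the original dataframe; the sketch below also passes drop_first=True, which removes one redundant dummy per variable. That mainly helps linear models, while tree-based models such as Random Forest work fine either way −

# one-step equivalent, applied to the original dataframe (before the
# transformation above); drop_first=True drops one dummy per variable
df_encoded = pd.get_dummies(df, columns=['Location', 'Bedrooms'],
                            drop_first=True, dtype=int)
print(df_encoded)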

Machine Learning Model

With our categorical data encoded, we can now build a machine learning model using the Random Forest algorithm. We will split our dataset into a training set and a test set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# splitting dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(df.drop(['Price'], axis=1), df['Price'], test_size=0.3, random_state=0)

# training Random Forest model on training set
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# evaluating Random Forest model on test set
y_pred = rf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # root mean squared error (square root of MSE)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('R-squared score:', r2)

Output

Mean Squared Error: 12664984402.161505
Root Mean Squared Error: 112538.81286987838
R-squared score: -10.130530314227844

Here, we can see that our model actually performs poorly: the root mean squared error is roughly 112,539 and the R-squared score is roughly -10.1, which means the model predicts worse than simply guessing the mean price. That is not surprising on a synthetic dataset of only ten rows. We can experiment with the Random Forest algorithm's hyperparameters, feature engineering, more training data, and other methods to improve the model's performance.
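
As one example of such experimentation, scikit-learn's GridSearchCV can search over Random Forest hyperparameters with cross-validation; the grid values below are purely illustrative assumptions, not tuned recommendations −

from sklearn.model_selection import GridSearchCV

# a small, purely illustrative hyperparameter grid
param_grid = {
   'n_estimators': [50, 100, 200],
   'max_depth': [None, 3, 5],
}

search = GridSearchCV(
   RandomForestRegressor(random_state=0),
   param_grid,
   cv=3,   # 3-fold cross-validation on the small training set
   scoring='neg_root_mean_squared_error',
)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Best CV RMSE:', -search.best_score_)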

Conclusion

In conclusion, machine learning practitioners must be aware of the benefits, limitations, and best practices of one-hot encoding in order to apply it successfully. By combining one-hot encoding with other techniques, we can create accurate and dependable machine-learning models that help us solve a variety of practical problems.
