One hot encoding to improve machine learning performance
One-hot encoding is essential in machine learning because it lets algorithms work with categorical variables. By representing each category as a binary vector, it makes categorical data straightforward to process. This article explains one-hot encoding and walks through a practical project, with sample data and code, showing how it can improve machine learning performance.
What is One hot encoding?
One-hot encoding is a technique for representing categorical data in a form that machine learning algorithms can process directly. It converts each category into a binary vector whose length equals the number of categories, with a 1 in the position of that category and 0 everywhere else. For example, with the three categories "apple," "banana," and "orange," the value "apple" becomes [1, 0, 0].
How can one-hot encoding improve machine learning performance?
One-hot encoding is a pre-processing technique for categorical variables in machine learning models. Categorical information must be converted into numerical variables before a model can use it, because the majority of machine learning algorithms cannot interpret categorical data directly and instead require numerical input variables.
For example, consider a dataset in which the categorical variable "fruit" can take the values "apple," "banana," or "orange." The dataset can be displayed in the following table −
Index | Fruits | Price
---|---|---
0 | apple | 1.2
1 | banana | 0.9
2 | orange | 1.1
3 | apple | 1.4
4 | banana | 1.0
To apply one-hot encoding to the "fruit" variable, we first create three new binary variables: "Fruits_apple," "Fruits_banana," and "Fruits_orange." Then, for each row of the original dataset, the binary variable matching that row's category is set to 1 and the other two are set to 0. After one-hot encoding, the table looks like this −
Fruits_apple | Fruits_banana | Fruits_orange | Price
---|---|---|---
1 | 0 | 0 | 1.2
0 | 1 | 0 | 0.9
0 | 0 | 1 | 1.1
1 | 0 | 0 | 1.4
0 | 1 | 0 | 1.0
As we can see, the categorical variable "fruit" has been split into three binary variables that machine learning algorithms can work with directly.
Thanks to one-hot encoding, a machine learning model can now relate each fruit variety to its price and produce more accurate predictions.
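In practice there is no need to build these binary columns by hand. The short sketch below reproduces the encoded fruit table with pandas (the dtype=int argument is used only to get 1/0 output matching the table above; the column order may differ slightly) −
import pandas as pd

# recreate the small fruit dataset from the table above
fruits = pd.DataFrame({
   'Fruits': ['apple', 'banana', 'orange', 'apple', 'banana'],
   'Price': [1.2, 0.9, 1.1, 1.4, 1.0]
})

# one-hot encode the "Fruits" column into Fruits_apple/_banana/_orange
encoded = pd.get_dummies(fruits, columns=['Fruits'], prefix='Fruits', dtype=int)
print(encoded)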
One Hot Encoding to Improve Machine Learning Performance with the Random Forest Algorithm
In this project we use one-hot encoding to improve machine learning performance when dealing with categorical data. We will create a new dataset, transform its categorical variables with one-hot encoding, and train a machine learning model with the Random Forest algorithm. The project is implemented in Python with the scikit-learn library.
Importing libraries and creating datasets
Let's begin by building a new dataset with the four variables "Size," "Price," "Location," and "Bedrooms." The categorical variable "Location" has three possible values: "A," "B," and "C," while the categorical variable "Bedrooms" has four possible values: "1," "2," "3," and "4".
import pandas as pd
import numpy as np

# create new dataset
df = pd.DataFrame({
   'Size': [1381, 4057, 3656, 2468, 2828, 4385, 2006, 1915, 1593, 2929],
   'Price': [125527, 416447, 150528, 320128, 232294, 284386, 292693, 320596, 201712, 324857],
   'Location': ['A', 'C', 'B', 'B', 'A', 'C', 'A', 'C', 'B', 'C'],
   'Bedrooms': ['1', '2', '4', '4', '3', '1', '1', '2', '3', '2']
})

# display dataset
print(df)
Output
   Size   Price Location Bedrooms
0  1381  125527        A        1
1  4057  416447        C        2
2  3656  150528        B        4
3  2468  320128        B        4
4  2828  232294        A        3
5  4385  284386        C        1
6  2006  292693        A        1
7  1915  320596        C        2
8  1593  201712        B        3
9  2929  324857        C        2
Applying One-hot encoding
Next, the categorical variables "Location" and "Bedrooms" are transformed using one-hot encoding. We use the pandas library in Python for this step.
# performing one hot encoding
one_hot_location = pd.get_dummies(df['Location'], prefix='Location')
one_hot_bedrooms = pd.get_dummies(df['Bedrooms'], prefix='Bedrooms')

# concatenating one hot encoding with original dataframe
df = pd.concat([df, one_hot_location, one_hot_bedrooms], axis=1)

# dropping original categorical variables
df = df.drop(['Location', 'Bedrooms'], axis=1)

# displaying updated dataset
print(df)
Output
   Size   Price  Location_A  Location_B  Location_C  Bedrooms_1  Bedrooms_2  \
0  1381  125527           1           0           0           1           0
1  4057  416447           0           0           1           0           1
2  3656  150528           0           1           0           0           0
3  2468  320128           0           1           0           0           0
4  2828  232294           1           0           0           0           0
5  4385  284386           0           0           1           1           0
6  2006  292693           1           0           0           1           0
7  1915  320596           0           0           1           0           1
8  1593  201712           0           1           0           0           0
9  2929  324857           0           0           1           0           1

   Bedrooms_3  Bedrooms_4
0           0           0
1           0           0
2           0           1
3           0           1
4           1           0
5           0           0
6           0           0
7           0           0
8           1           0
9           0           0
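As a side note, pandas' get_dummies is not the only option: scikit-learn provides its own OneHotEncoder. The sketch below shows the equivalent transform on a small sample of the pre-encoding categorical columns (on scikit-learn versions before 1.2 the argument is named sparse instead of sparse_output) −
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# a small sample of the original (pre-encoding) categorical columns
cats = pd.DataFrame({'Location': ['A', 'C', 'B'], 'Bedrooms': ['1', '2', '4']})

# sparse_output=False returns a dense array; handle_unknown='ignore'
# zero-fills categories never seen during fitting
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(cats)

print(encoder.get_feature_names_out())  # e.g. ['Location_A' 'Location_B' ...]
print(encoded)
Unlike get_dummies, a fitted OneHotEncoder can be reused on new data with a consistent column layout, which matters when the training and test sets are encoded separately.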
Machine Learning Model
With our categorical data encoded, we can now build a machine learning model using the Random Forest algorithm. The dataset is split into a training set and a test set.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# splitting dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
   df.drop(['Price'], axis=1), df['Price'], test_size=0.3, random_state=0)

# training Random Forest model on training set
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# evaluating Random Forest model on test set
y_pred = rf.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # root of the MSE; works on all scikit-learn versions
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('R-squared score:', r2)
Output
Mean Squared Error: 12664984402.161505
Root Mean Squared Error: 112538.81286987838
R-squared score: -10.130530314227844
Here the root mean squared error is roughly 112539 and the R-squared score is roughly -10.1. A negative R-squared means the model currently predicts worse than simply guessing the mean price, which is not surprising with only ten samples. There is clearly room for improvement: we can experiment with the Random Forest algorithm's hyperparameters, feature engineering, and more data to enhance the model's performance.
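As one concrete direction, a small hyperparameter search can be wired in with scikit-learn's GridSearchCV. This is a minimal sketch that reuses X_train and y_train from the split above; the parameter grid is an arbitrary illustration, not a recommendation −
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# hypothetical small grid; values chosen only for illustration
param_grid = {
   'n_estimators': [50, 100, 200],
   'max_depth': [None, 3, 5],
}

search = GridSearchCV(
   RandomForestRegressor(random_state=0),
   param_grid,
   cv=3,                                    # 3-fold cross-validation
   scoring='neg_root_mean_squared_error',   # higher (less negative) is better
)
search.fit(X_train, y_train)

print(search.best_params_)
print('CV RMSE:', -search.best_score_)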
Conclusion
In conclusion, machine learning practitioners must be aware of the benefits, limitations, and best practices of one-hot encoding in order to apply it successfully. By combining one-hot encoding with other techniques, we can create accurate and dependable machine-learning models that help us solve a variety of practical problems.