One Hot Encoding and Label Encoding Explained

Introduction

Categorical variables are widely used in data analysis and machine learning. Many algorithms cannot process these variables directly, so they must be encoded as numerical data before they can be used. One-hot encoding and label encoding are two popular methods for encoding categorical data. One-hot encoding produces a binary vector for each value of a categorical variable, indicating which category is present, while label encoding assigns each category a unique integer.

We will discuss the ideas of one hot encoding and label encoding, as well as their advantages and disadvantages, and present examples of when and how to utilize each approach.

One Hot Encoding

Categorical variables are variables with a limited number of distinct values, such as colors, gender, or age groups. Most machine learning algorithms require numerical inputs, and categorical variables are not numerical. As a result, we must encode these categorical variables into a numerical representation.

One-hot encoding is a common approach for transforming categorical variables into numeric values. It converts categorical data into binary data, where each category is represented by its own binary value. For every observation, a binary vector whose length equals the number of categories is generated. For example, suppose your categorical variable has three categories: 'red', 'green', and 'blue'. The one-hot representation of that variable then has three dimensions, each representing one of the categories.

In one-hot encoding, the dimension that reflects the category to which an observation belongs is given a binary value of 1, and all other dimensions are given a binary value of 0. For instance, if an observation falls into the "red" category, its one-hot encoded vector's first dimension will be 1 and all other dimensions will be 0. If it falls into the "green" category, the second dimension will be 1 and all other dimensions will be 0.
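The rule described above can be sketched in a few lines of plain Python. The helper name `one_hot` and the fixed category order are assumptions made for this illustration and are not part of any library.

```python
# Hypothetical helper: build a one-hot vector by hand.
CATEGORIES = ["red", "green", "blue"]  # assumed, fixed category order


def one_hot(value, categories=CATEGORIES):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


print(one_hot("red"))    # [1, 0, 0]
print(one_hot("green"))  # [0, 1, 0]
```

In practice a library function usually does this for a whole column at once, but the underlying idea is the same.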

One advantage of one-hot encoding is that it lets machine learning algorithms use categorical input without imposing any order or magnitude on the categories, which can improve model accuracy. It also avoids assigning arbitrary numerical values to categories, which can lead a model to incorrect conclusions.

However, one-hot encoding has several limitations. The main concern is that it can produce a large number of features, particularly if the categorical variable has many categories. When the number of features is large relative to the size of the dataset, the "curse of dimensionality" sets in, which can lead to overfitting and poor performance. One-hot encoding may also result in collinearity, which happens when two or more features are strongly related: the one-hot columns of a variable always sum to 1, so any one of them can be derived from the others. This can harm the performance of some models.
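One common way to reduce the collinearity among one-hot columns is to drop one of them. The sketch below uses the `drop_first` option of pandas' `get_dummies()`; the `color` column and its values are made-up illustration data.

```python
import pandas as pd

# Made-up sample data for illustration.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one column per category, and each row sums to 1.
full = pd.get_dummies(colors, columns=["color"], dtype=int)

# Dropping the first category (in sorted order) removes the redundancy:
# a row of all zeros now stands for the dropped category.
reduced = pd.get_dummies(colors, columns=["color"], drop_first=True, dtype=int)

print(list(full.columns))     # ['color_blue', 'color_green', 'color_red']
print(list(reduced.columns))  # ['color_green', 'color_red']
```

Dropping a column does not lose information, because "blue" is still identifiable as the row where both remaining columns are 0. Whether this is worthwhile depends on the model being used.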

Example

import pandas as pd

# create a sample dataframe with a categorical variable
df = pd.DataFrame({'Fruit': ['Apple', 'Orange', 'Banana', 'Banana', 'Orange', 'Apple']})

# perform one hot encoding (dtype=int gives 1/0 instead of True/False)
one_hot_df = pd.get_dummies(df, columns=['Fruit'], dtype=int)

# print the encoded dataframe
print(one_hot_df)

Output

   Fruit_Apple  Fruit_Banana  Fruit_Orange
0            1             0             0
1            0             0             1
2            0             1             0
3            0             1             0
4            0             0             1
5            1             0             0

This example creates a pandas DataFrame with a categorical column named "Fruit" that holds six rows and three distinct values. Next, we one-hot encode the "Fruit" column using pandas' get_dummies() function, which creates one new column for each distinct fruit. Every row holds a binary value of 1 or 0 in each of these columns: 1 if the fruit in that row matches the column, and 0 otherwise. Finally, we print the encoded DataFrame to display the results.

Label Encoding

Another method for encoding categorical data is label encoding. It works by assigning a unique integer to each category, which in effect turns the variable into ordinal data. For example, if the categorical variable "color" includes the categories "red," "green," and "blue," label encoding could assign the values 1, 2, and 3 to "red," "green," and "blue," respectively.

Label encoding is simple to apply and keeps the variable in a single column. When the categories have a natural order and the integers are assigned in that order, it also preserves the order, which may be advantageous in some cases. Label encoding, on the other hand, has limitations. By giving each category a distinct integer, it creates an implicit ordinal relationship between them. If there is no natural order between the categories, this may cause issues. Moreover, because the numerical values assigned to the categories are arbitrary, label encoding may not accurately reflect the underlying relationships between the categories.
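When the categories do have a natural order, one option is to assign the integers explicitly rather than relying on an encoder's default assignment. A minimal sketch in plain Python follows; the size labels and their ranking are made-up illustration data.

```python
# Hypothetical ordered categories, listed from smallest to largest.
SIZE_ORDER = ["small", "medium", "large"]

# Map each category to its rank so the integers follow the real order.
size_to_code = {size: rank for rank, size in enumerate(SIZE_ORDER)}

sizes = ["medium", "small", "large", "small"]
encoded_sizes = [size_to_code[size] for size in sizes]

print(size_to_code)   # {'small': 0, 'medium': 1, 'large': 2}
print(encoded_sizes)  # [1, 0, 2, 0]
```

An alphabetical assignment would instead give "large" the value 0 and "small" the value 2, which reverses the intended order.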

Another drawback of label encoding is that it can cause problems when used with some machine learning techniques. Algorithms such as linear regression treat the integers as quantities with magnitude, so a category encoded as 4 is handled as if it were twice a category encoded as 2, which can result in inaccurate predictions. Decision trees also split on the numeric order of the codes, although they are generally less sensitive to it. Therefore, it is essential to use label encoding judiciously and to check the model's performance carefully before using it in a production environment.

Example

from sklearn.preprocessing import LabelEncoder

# Sample data
data = ['apple', 'banana', 'banana', 'orange', 'kiwi', 'mango', 'apple']

# Create label encoder object
le = LabelEncoder()

# Fit and transform the data
encoded_data = le.fit_transform(data)

print(encoded_data)

Output

[0 1 1 4 2 3 0]

This example imports the LabelEncoder class from the sklearn.preprocessing module. We then create an instance of LabelEncoder and call its fit_transform() method to fit the encoder and transform the sample data in one step. The output is the integer-encoded version of the original data.

Note that LabelEncoder assigns a unique integer to each category in the data, starting from 0 and following the sorted (here alphabetical) order of the labels. In this example, 'apple' is assigned 0, 'banana' is assigned 1, 'kiwi' is assigned 2, 'mango' is assigned 3, and 'orange' is assigned 4.
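The mapping can also be read directly from the fitted encoder instead of being worked out by hand. The sketch below repeats the sample data and uses the encoder's `classes_` attribute and `inverse_transform()` method; the variable names `mapping` and `decoded` are chosen for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

data = ['apple', 'banana', 'banana', 'orange', 'kiwi', 'mango', 'apple']

le = LabelEncoder()
encoded_data = le.fit_transform(data)

# classes_ holds the sorted unique labels; each label's index is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = [str(label) for label in le.inverse_transform(encoded_data)]
print(decoded == data)
```

Checking the mapping this way is a quick safeguard against assuming that codes follow the order in which the categories first appear in the data.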

Conclusion

In summary, one-hot encoding and label encoding are two widely used strategies in machine learning for encoding categorical data. One-hot encoding generates a binary column for each category, whereas label encoding gives each category a unique integer label. The choice between them depends on the nature of the data and the machine learning algorithm used. To ensure accurate and effective analysis, it is important to understand the strengths and limitations of each approach.

Premansh Sharma

Updated on: 24-Jul-2023
