Guided Ordinal Encoding Techniques

Data preparation is a necessary step before modeling in data science, and it involves several tasks. One of the most important is encoding categorical data. Much real-world data contains categorical string values, while most machine learning models operate only on numeric values. Some models can accept more complex values, as long as they are in a form the model can interpret. In essence, though, every model carries out mathematical operations, and mathematics depends on numbers. Most models therefore require numbers, whether integers or floats, rather than words or other types of input. In this article, we'll discuss encoding, the ordinal encoding method, and one-hot encoding.

What is encoding?

Encoding is the process of transforming categorical data into a numeric format so that models can use the transformed values to produce and improve predictions.
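
As a minimal sketch of that idea, the snippet below replaces each distinct string with an integer code. The helper names (`fit_codes`, `encode`) and the small `colors` list are hypothetical, made up for illustration only. Codes are assigned in alphabetical order here, which is just one possible choice.

```python
# Hypothetical helpers for illustration; not part of any library.
def fit_codes(values):
    """Assign an integer code to each distinct category, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, codes):
    """Replace each category with its integer code."""
    return [codes[value] for value in values]


# Made-up sample column of categorical strings
colors = ["red", "green", "blue", "green"]

codes = fit_codes(colors)
encoded = encode(colors, codes)

print(codes)    # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 1, 0, 1]
```

The model then sees only the integers, while the `codes` dictionary keeps the link back to the original labels.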

Implementing Ordinal Encoding

Ordinal encoding is used when the categorical variables are ordinal, that is, when their labels have a natural order. It converts each label into an integer and preserves the order of the labels in the encoded data.


Below is an illustration of how to do this in Python.

# Installation (notebook syntax)
!pip install pandas scikit-learn category_encoders

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'height': ['tall', 'medium', 'short', 'tall', 'medium',
                              'short', 'tall', 'medium', 'short']})

# Create the ordinal encoder with an explicit label-to-integer mapping
encoder = ce.OrdinalEncoder(
    cols=['height'],
    return_df=True,
    mapping=[{'col': 'height',
              'mapping': {'None': 0, 'tall': 1, 'medium': 2, 'short': 3}}])

# Original data
print(df)

# Fit and transform; keep the encoded values next to the original labels
df['transformed'] = encoder.fit_transform(df)['height']
print(df)


   height
0    tall
1  medium
2   short
3    tall
4  medium
5   short
6    tall
7  medium
8   short

   height  transformed
0    tall            1
1  medium            2
2   short            3
3    tall            1
4  medium            2
5   short            3
6    tall            1
7  medium            2
8   short            3
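
To make the mechanics concrete, here is a rough, library-free sketch of an ordinal mapping. It reuses the tall/medium/short ranks from the example above. The function name `ordinal_encode` and the `-1` sentinel for unrecognized labels are assumptions chosen for this illustration, not the behavior of any particular package.

```python
# Ranks taken from the mapping used in the article: tall -> 1, medium -> 2, short -> 3.
HEIGHT_ORDER = {"tall": 1, "medium": 2, "short": 3}
UNKNOWN = -1  # hypothetical sentinel for labels missing from the mapping


def ordinal_encode(values, mapping, unknown=UNKNOWN):
    """Map each label to its rank; labels not in the mapping get `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# The last label carries a stray leading space, so it does not match 'short'.
heights = ["tall", "medium", "short", "tall", " short"]

encoded = ordinal_encode(heights, HEIGHT_ORDER)
print(encoded)  # [1, 2, 3, 1, -1]

# Stripping whitespace before encoding recovers the intended rank.
cleaned = ordinal_encode([h.strip() for h in heights], HEIGHT_ORDER)
print(cleaned)  # [1, 2, 3, 1, 3]
```

The sketch also shows why labels should be cleaned before encoding: a label that differs only by whitespace is treated as a different, unknown category.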

One Hot Encoding

In one-hot encoding, each category of a categorical variable receives its own new variable, and each category is represented with binary values (0 or 1). This kind of encoding is used when the data is nominal, that is, when the categories have no inherent order. The newly created binary features are known as dummy variables, and the number of dummy variables produced by one-hot encoding equals the number of categories in the data. Below is an illustration of how to do this in Python.


# Installation (notebook syntax)
!pip install pandas scikit-learn category_encoders

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'name': ['rahul', 'jay', 'aman', 'devesh', 'ashok', 'shubham', 'amit']})

# Create the one-hot encoder; use_cat_names=True names each new column after its category
encoder = ce.OneHotEncoder(cols=['name'], handle_unknown='return_nan',
                           return_df=True, use_cat_names=True)

# Original data
print(df)

# Fit and transform the data
df_encoded = encoder.fit_transform(df)
print(df_encoded)


      name
0    rahul
1      jay
2     aman
3   devesh
4    ashok
5  shubham
6     amit

   name_rahul  name_jay  name_aman  name_devesh  name_ashok  name_shubham  \
0         1.0       0.0        0.0          0.0         0.0           0.0   
1         0.0       1.0        0.0          0.0         0.0           0.0   
2         0.0       0.0        1.0          0.0         0.0           0.0   
3         0.0       0.0        0.0          1.0         0.0           0.0   
4         0.0       0.0        0.0          0.0         1.0           0.0   
5         0.0       0.0        0.0          0.0         0.0           1.0   
6         0.0       0.0        0.0          0.0         0.0           0.0   

   name_amit  
0        0.0  
1        0.0  
2        0.0  
3        0.0  
4        0.0  
5        0.0  
6        1.0  
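
The idea behind one-hot encoding can also be sketched in plain Python, without any library. The function name `one_hot_encode` is hypothetical, and the shortened `names` list is sample data for illustration. Columns are created in order of first appearance, which is an assumption of this sketch.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct category."""
    # Distinct categories, kept in the order they first appear
    categories = list(dict.fromkeys(values))
    columns = [f"{prefix}_{category}" for category in categories]
    # Each row has a 1 in the column of its own category and 0 elsewhere
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


names = ["rahul", "jay", "aman", "jay"]
columns, rows = one_hot_encode(names, prefix="name")

print(columns)  # ['name_rahul', 'name_jay', 'name_aman']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```

Three distinct names produce three dummy columns, and every row contains exactly one 1.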


In conclusion, encoding plays a significant role in machine learning. Real-world problems usually require us to choose the encoding technique that suits the data in order for the model to function properly, and working with different encoders can alter the model's output. In this post, we've seen ordinal and one-hot encoding and how to apply them in Python with the category_encoders package.