Handling Categorical Data in Python
Categorical data is data that takes only a limited set of values, known as categories or levels. It comes in two kinds: nominal and ordinal. Nominal categorical data has no intrinsic order, such as colors, genders, or animal species. Ordinal categorical data is naturally ranked or ordered, such as customer satisfaction levels or educational attainment. In this tutorial, we will go through how to handle categorical data in Python.
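The difference between nominal and ordinal data can be made concrete with pandas. The following is a minimal sketch; the color and satisfaction values are made up for illustration and do not come from a real dataset.

```python
import pandas as pd

# Nominal data: the categories have no inherent order
colors = pd.Categorical(["red", "blue", "green", "red"])

# Ordinal data: the order of the categories is declared explicitly
# (these satisfaction levels are illustrative assumptions)
levels = ["low", "medium", "high"]
satisfaction = pd.Categorical(
    ["medium", "low", "high", "medium"],
    categories=levels,
    ordered=True,
)

print(colors.ordered)               # whether pandas treats colors as ordered
print(satisfaction.ordered)         # whether pandas treats satisfaction as ordered
print(satisfaction.codes.tolist())  # integer position of each value in `levels`
print(satisfaction.min(), satisfaction.max())  # only meaningful for ordered data
```

Because the order is declared, operations such as min() and max() are meaningful for the satisfaction column, while they would not be for the colors.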
Setup
pip install pandas
pip install scikit-learn
pip install category_encoders
Categorical data is often represented as text labels, but many machine learning algorithms require numerical input. Customer demographics, product classifications, and geographic areas are just a few examples of categorical data found in real-world datasets. Such data must be converted into a numerical format before it is fed to a machine learning algorithm. This process is known as encoding. There are various techniques for encoding categorical data, including one-hot encoding, ordinal encoding, and target encoding.
Ways to Handle Categorical Data
Example 1 - One-Hot Encoding
One-hot encoding is a technique used to convert categorical data into numerical format. It creates a binary vector for each category in the dataset. The vector contains a 1 in the position of the category it represents and 0s in all other positions.
The pandas and scikit-learn libraries provide functions to perform One-Hot Encoding. The following code snippet shows how to perform One-Hot Encoding using pandas and scikit-learn.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a pandas DataFrame with categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the DataFrame using the encoder (returns a sparse matrix)
encoded_data = encoder.fit_transform(df)

# Convert the encoded data into a pandas DataFrame;
# get_feature_names_out() replaces get_feature_names(), which recent scikit-learn versions no longer provide
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out())
print(encoded_df)
Output
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
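One-hot encoding can also be done with pandas alone. The sketch below uses pd.get_dummies on the same color column; the dtype=int argument is a choice made here so that the result contains 0/1 integers rather than booleans.

```python
import pandas as pd

# Same sample data as in the scikit-learn example
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(df, columns=['color'], dtype=int)
print(dummies)
```

This form is convenient for quick exploration. An encoder object such as OneHotEncoder is the better fit when the same encoding has to be applied again to new data, because it remembers the categories it was fitted on.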
Example 2 - Ordinal Encoding
Ordinal encoding is a popular technique for encoding categorical data in which each category is given a different integer based on its rank or order. The lowest-ranked categories receive the smallest integers, while the highest-ranked categories receive the largest. This strategy is especially useful when the categories are naturally ordered, as with ratings (poor, fair, good, outstanding) or educational attainment (high school, college, graduate school). Let us do ordinal encoding using pandas and the category_encoders package:
import pandas as pd
import category_encoders as ce

# Create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# Initialize the encoder
encoder = ce.OrdinalEncoder()

# Encode the categorical feature; fit_transform returns a DataFrame,
# so select the encoded column before assigning it
df['category_encoded'] = encoder.fit_transform(df[['category']])['category']

# Print the encoded DataFrame
print(df)
Output
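   category  category_encoded
0       red                 1
1     green                 2
2      blue                 3
3       red                 1
4     green                 2
As you can see, red has been given the value 1, green the value 2, and blue the value 3. With no explicit ordering supplied, the encoder numbers the categories in the order in which they first occur in the dataset. Colors have no natural rank, so these integers are arbitrary labels here. For genuinely ordinal data, the intended order should be specified explicitly rather than inferred from row order.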
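When the categories do have a natural rank, the order can be written down explicitly instead of relying on the order of appearance. The following is a minimal pandas-only sketch using the rating levels mentioned earlier; the column name and the rating_order dictionary are illustrative assumptions, not part of any library.

```python
import pandas as pd

# Sample ratings (illustrative data)
df = pd.DataFrame({'rating': ['good', 'poor', 'outstanding', 'fair', 'good']})

# Hypothetical rank for each level, from lowest to highest
rating_order = {'poor': 1, 'fair': 2, 'good': 3, 'outstanding': 4}

# Replace each label with its rank
df['rating_encoded'] = df['rating'].map(rating_order)
print(df)
```

Any label missing from the dictionary would become NaN, which makes unexpected categories easy to spot.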
Example 3 - Target Encoding using Category Encoders
Target Encoding is another technique used for encoding categorical data, particularly when dealing with high cardinality features. It replaces each category with the average target value for that category. Target Encoding is useful when there is a strong relationship between the categorical feature and the target variable.
import pandas as pd
import category_encoders as ce

# Create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green'],
        'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# Initialize the encoder
encoder = ce.TargetEncoder()

# Encode the categorical feature against the target; fit_transform returns
# a DataFrame, so select the encoded column before assigning it
df['category_encoded'] = encoder.fit_transform(df[['category']], df['target'])['category']

# Print the encoded DataFrame
print(df)
In this example, we create a sample dataset with a single categorical feature called "category" and a corresponding target variable called "target". We import the category_encoders library and initialize a TargetEncoder object. We use the fit_transform() method to encode the categorical feature based on the target variable and add the encoded feature to the original dataframe.
Output
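   category  target  category_encoded
0       red       1          0.585815
1     green       0          0.585815
2      blue       1          0.652043
3       red       0          0.585815
4     green       1          0.585815
As the output shows, the category column has been encoded against the target. The values are not the raw per-category means (0.5 for red and green, 1.0 for blue) because TargetEncoder smooths each category's mean toward the overall target mean (0.6), and with so few rows the overall mean dominates. The exact values depend on the library version and its smoothing settings. The fit_transform() method takes two arguments: the feature to encode and the target variable. When a DataFrame with several columns is passed, the cols option can be used to select which columns to encode.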
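The idea of replacing each category with its average target value can be sketched by hand with pandas. The version below computes the plain per-category means, then a shrunk variant that pulls small categories toward the overall mean. The weight m is a hypothetical value chosen for illustration, and this simple formula is not the one TargetEncoder uses, so the numbers will not match the library output.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['red', 'green', 'blue', 'red', 'green'],
    'target': [1, 0, 1, 0, 1],
})

# Per-category mean and row count, plus the overall mean of the target
stats = df.groupby('category')['target'].agg(['mean', 'count'])
global_mean = df['target'].mean()

# Plain target encoding: each category becomes its own target mean
df['raw_mean'] = df['category'].map(stats['mean'])

# Shrunk version: categories with few rows are pulled toward the global mean.
# m is a hypothetical smoothing weight, chosen for illustration only.
m = 2
shrunk = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
df['shrunk_mean'] = df['category'].map(shrunk)

print(df)
```

Shrinking matters most for rare categories: blue appears only once, so its raw mean of 1.0 rests on a single row, and the shrunk value moves it closer to the overall mean.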
Conclusion
This article covered why categorical data must be handled properly in machine learning applications. It looked at three methods for encoding categorical data in Python: one-hot encoding, ordinal encoding, and target encoding. One-hot encoding is quick and simple, but it adds one new column per category, which can greatly increase the number of features. Ordinal encoding is a reasonable option when the order of the categories is known, but it ignores the connection between the categories and the target variable. Target encoding captures that connection by replacing each category with its average target value, which makes it a useful option for high-cardinality features.
Hence, managing categorical data is a crucial component of machine learning systems, and selecting the proper encoding method is key for producing accurate and trustworthy results.