- Trending Categories
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
Physics
Chemistry
Biology
Mathematics
English
Economics
Psychology
Social Studies
Fashion Studies
Legal Studies
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
How to convert categorical data to binary data in Python?
Categorical data, also known as nominal data, is a type of data that is divided into discrete categories or groups. These categories have no inherent order or numerical value, and they are usually represented by words, labels, or symbols. Categorical data is commonly used to describe characteristics or attributes of objects, people, or events, and it can be found in various fields such as social sciences, marketing, and medical research.
In Python, categorical data can be represented using various data structures, such as lists, tuples, dictionaries, and arrays. The most commonly used data structure for categorical data in Python is the pandas DataFrame, which is a two-dimensional table-like data structure that can store and manipulate large amounts of data.
Here's a simple example to illustrate categorical data in Python
Suppose you have a dataset containing information about the type of vehicles people own. The dataset includes the following categorical variables −
Vehicle Type − Car, Truck, SUV, Van, Motorcycle
Fuel Type − Gasoline, Diesel, Electric, Hybrid
Color − Red, Blue, Green, Black, White
Example
You can represent this dataset in Python using a pandas DataFrame as follows
import pandas as pd data = {'Vehicle Type': ['Car', 'Truck', 'SUV', 'Van', 'Motorcycle'], 'Fuel Type': ['Gasoline', 'Diesel', 'Electric', 'Hybrid', 'Gasoline'], 'Color': ['Red', 'Blue', 'Green', 'Black', 'White']} df = pd.DataFrame(data) print(df)
To run the above code, we first need to install the Pandas library in our machine, and for that we can make use of the command shown below −
pip3 install pandas
Once Pandas is installed successfully, we can run the command shown below
python3 main.py
Output
The output of the above command is shown below.
Vehicle Type Fuel Type Color 0 Car Gasoline Red 1 Truck Diesel Blue 2 SUV Electric Green 3 Van Hybrid Black 4 Motorcycle Gasoline White
As you can see, the categorical variables are represented as columns in the DataFrame, and each category is represented as a string value in the corresponding column. You can use various Pandas functions and methods to manipulate and analyse this data, such as groupby, count, value_counts, and crosstab. These functions can help you summarize and visualize the distribution and relationships between the categories, which can provide valuable insights into the dataset.
Now that we know a little about categorical data, let's see what characteristics they possess.
Characteristics of Categorical Data
Below are some of the characteristics mentioned of categorical data.
Categorical data has a limited number of categories.
The categories have no inherent order or ranking.
Categorical data can be measured on a nominal or ordinal scale.
Categorical data is often summarised using count or frequency distributions.
Categorical data has limited statistical analysis compared to numerical data.
Conversion of Categorical Data into Binary Data
Conversion of categorical data into binary data involves transforming categorical variables into binary (0 or 1) values that can be used for analysis or modeling purposes. This transformation is useful because many machine learning algorithms and statistical methods require numerical inputs, rather than categorical inputs.
Binary encoding is a common approach that converts each unique category in a categorical variable into a separate binary column, where a value of 1 indicates the presence of the category and 0 indicates its absence.
This technique is easy to implement in Python using the pandas get_dummies() function or other similar libraries. Binary encoding can help to improve the accuracy of predictive models, reduce data storage requirements, and simplify data analysis.
Example
Consider the code shown below in which we will convert a categorical data into binary data with the help of Pandas.
import pandas as pd # create a sample DataFrame with categorical data data = {'Gender': ['Male', 'Female', 'Male', 'Female'], 'City': ['New York', 'Chicago', 'Chicago', 'Los Angeles'], 'Marital Status': ['Single', 'Married', 'Single', 'Divorced']} df = pd.DataFrame(data) # use get_dummies() to encode categorical variables as binary values encoded_df = pd.get_dummies(df) print(encoded_df)
Explanation
The first line imports the Pandas library as pd.
A sample DataFrame with categorical data is created in the data dictionary. The DataFrame contains three categorical variables: Gender, City, and Marital Status.
The pd.DataFrame() function is used to create a pandas DataFrame from the data dictionary. This DataFrame is assigned to the variable df.
The pd.get_dummies() function is called on the df DataFrame to convert the categorical variables into binary values. This function creates a new DataFrame with a binary encoding for each unique category in the categorical variables.
The resulting binary encoded DataFrame is assigned to the variable encoded_df.
Finally, the print() function is used to display the resulting binary encoded DataFrame.
To run the above code, we need to run the command shown below.
python3 main.py
Output
The output of the above command is shown below.
Gender_Female Gender_Male ... Marital Status_Married Marital Status_Single 0 0 1 ... 0 1 1 1 0 ... 1 0 2 0 1 ... 0 1 3 1 0 ... 0 0 [4 rows x 8 columns]
Conclusion
Converting categorical data into binary data is an important step in data preprocessing for machine learning and statistical analysis. In this tutorial, we learned how we explored what is categorical data and how we can convert it to binary data with the help of Pandas library.