How to convert categorical data to binary data in Python?

Categorical data is data whose values fall into discrete categories or groups. When the categories have no inherent order, the data is called nominal; when the categories can be ranked, it is called ordinal. The categories are usually represented by words, labels, or symbols rather than by numerical values. Categorical data is commonly used to describe characteristics or attributes of objects, people, or events, and it can be found in various fields such as social sciences, marketing, and medical research.

In Python, categorical data can be represented using various data structures, such as lists, tuples, dictionaries, and arrays. The most commonly used data structure for categorical data in Python is the pandas DataFrame, which is a two-dimensional table-like data structure that can store and manipulate large amounts of data.

Here's a simple example to illustrate categorical data in Python.

Suppose you have a dataset containing information about the type of vehicles people own. The dataset includes the following categorical variables −

  • Vehicle Type − Car, Truck, SUV, Van, Motorcycle

  • Fuel Type − Gasoline, Diesel, Electric, Hybrid

  • Color − Red, Blue, Green, Black, White


You can represent this dataset in Python using a pandas DataFrame as follows −

import pandas as pd

data = {'Vehicle Type': ['Car', 'Truck', 'SUV', 'Van', 'Motorcycle'],
   'Fuel Type': ['Gasoline', 'Diesel', 'Electric', 'Hybrid', 'Gasoline'],
   'Color': ['Red', 'Blue', 'Green', 'Black', 'White']}
df = pd.DataFrame(data)

# display the DataFrame
print(df)

To run the above code, we first need to install the Pandas library on our machine, which we can do with the command shown below −

pip3 install pandas

Once Pandas is installed successfully, running the code prints the DataFrame shown below.

  Vehicle Type Fuel Type  Color
0          Car  Gasoline    Red
1        Truck    Diesel   Blue
2          SUV  Electric  Green
3          Van    Hybrid  Black
4   Motorcycle  Gasoline  White

As you can see, the categorical variables are represented as columns in the DataFrame, and each category is represented as a string value in the corresponding column. You can use various Pandas functions and methods to manipulate and analyse this data, such as groupby(), count(), value_counts(), and crosstab(). These functions help you summarize the distribution of each variable and the relationships between categories, which can provide valuable insights into the dataset.
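As a minimal sketch of the summary functions mentioned above, the snippet below rebuilds the same vehicle DataFrame and applies value_counts(), groupby(), and crosstab() to it. The variable names fuel_counts, rows_per_fuel, and table are chosen here for illustration only.

```python
import pandas as pd

# Same vehicle data as in the example above
df = pd.DataFrame({
    'Vehicle Type': ['Car', 'Truck', 'SUV', 'Van', 'Motorcycle'],
    'Fuel Type': ['Gasoline', 'Diesel', 'Electric', 'Hybrid', 'Gasoline'],
    'Color': ['Red', 'Blue', 'Green', 'Black', 'White'],
})

# How often each fuel type occurs
fuel_counts = df['Fuel Type'].value_counts()
print(fuel_counts)

# Number of rows per fuel type, computed with groupby
rows_per_fuel = df.groupby('Fuel Type').size()
print(rows_per_fuel)

# Contingency table of vehicle type against fuel type
table = pd.crosstab(df['Vehicle Type'], df['Fuel Type'])
print(table)
```

With only five rows the counts are small, but the same calls work unchanged on larger datasets.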

Now that we know a little about categorical data, let's see what characteristics it possesses.

Characteristics of Categorical Data

Some of the characteristics of categorical data are listed below.

  • Categorical data has a limited number of categories.

  • Nominal categories have no inherent order or ranking, whereas ordinal categories can be ranked.

  • Categorical data can be measured on a nominal or ordinal scale.

  • Categorical data is often summarised using count or frequency distributions.

  • Fewer statistical operations apply to categorical data than to numerical data; for example, a mean is not meaningful for nominal categories.
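The difference between the nominal and ordinal scales, and the use of frequency counts, can be sketched with the pandas Categorical type. The color and size values below are hypothetical example data, not part of the datasets used elsewhere in this article.

```python
import pandas as pd

# Nominal: colors have no natural order
colors = pd.Categorical(['Red', 'Blue', 'Green', 'Blue'])
print(colors.categories.tolist())  # the unique categories found in the data
print(colors.ordered)              # whether an order is defined

# Ordinal: sizes (hypothetical data) with an explicitly declared order
sizes = pd.Categorical(
    ['Small', 'Large', 'Medium', 'Small'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
)
# min() and max() are meaningful only because the categories are ordered
print(sizes.min(), sizes.max())

# Frequency distribution of the size categories
size_counts = pd.Series(sizes).value_counts()
print(size_counts)
```

Declaring the order explicitly is a design choice: without ordered=True, pandas treats the categories as nominal and does not allow comparisons such as min() and max().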

Conversion of Categorical Data into Binary Data

Conversion of categorical data into binary data involves transforming categorical variables into binary (0 or 1) values that can be used for analysis or modeling purposes. This transformation is useful because many machine learning algorithms and statistical methods require numerical inputs, rather than categorical inputs.

A common approach, usually called one-hot encoding or dummy encoding, converts each unique category in a categorical variable into a separate binary column, where a value of 1 indicates the presence of the category and 0 indicates its absence.

This technique is easy to implement in Python using the pandas get_dummies() function or similar functions in other libraries. Encoding categories this way lets algorithms that require numerical inputs work with categorical variables. Note that it adds one column per category, so the encoded table is usually wider than the original.
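To make the one-column-per-category idea concrete, here is a minimal sketch that builds the 0/1 columns by hand for a single variable and then produces the same columns with get_dummies(). The Series city and the names manual and auto are assumptions made for this illustration.

```python
import pandas as pd

# A single categorical column (illustrative data)
city = pd.Series(['New York', 'Chicago', 'Chicago', 'Los Angeles'], name='City')

# Build one 0/1 indicator column per unique category by hand:
# the comparison gives True/False, and astype(int) turns that into 1/0
manual = pd.DataFrame({
    f'City_{category}': (city == category).astype(int)
    for category in sorted(city.unique())
})
print(manual)

# get_dummies() builds the same indicator columns in one call;
# dtype=int requests 0/1 values instead of booleans
auto = pd.get_dummies(city, prefix='City', dtype=int)
print(auto)
```

Each row contains exactly one 1, in the column that matches the row's original category.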


Consider the code shown below, in which we convert categorical data into binary data with the help of Pandas.

import pandas as pd

# create a sample DataFrame with categorical data
data = {'Gender': ['Male', 'Female', 'Male', 'Female'],
   'City': ['New York', 'Chicago', 'Chicago', 'Los Angeles'],
   'Marital Status': ['Single', 'Married', 'Single', 'Divorced']}
df = pd.DataFrame(data)

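# use get_dummies() to encode categorical variables as binary values
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
encoded_df = pd.get_dummies(df, dtype=int)

# display the encoded DataFrame
print(encoded_df)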



  • The first line imports the Pandas library as pd.

  • A dictionary named data holds the sample categorical data. It contains three categorical variables: Gender, City, and Marital Status.

  • The pd.DataFrame() function is used to create a pandas DataFrame from the data dictionary. This DataFrame is assigned to the variable df.

  • The pd.get_dummies() function is called on the df DataFrame to convert the categorical variables into binary values. This function creates a new DataFrame with a binary encoding for each unique category in the categorical variables.

  • The resulting binary encoded DataFrame is assigned to the variable encoded_df.

  • Finally, the print() function is used to display the resulting binary encoded DataFrame.

Running the above code produces the output shown below.

   Gender_Female  Gender_Male  ...  Marital Status_Married  Marital Status_Single
0              0            1  ...                       0                      1
1              1            0  ...                       1                      0
2              0            1  ...                       0                      1
3              1            0  ...                       0                      0

[4 rows x 8 columns]
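The printed table abbreviates the middle columns with "...". As a hedged sketch, the snippet below lists all eight column names and tries two optional get_dummies() parameters, columns and drop_first. The variable names partial and reduced are chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'City': ['New York', 'Chicago', 'Chicago', 'Los Angeles'],
    'Marital Status': ['Single', 'Married', 'Single', 'Divorced'],
})

# dtype=int requests 0/1 values rather than booleans
encoded_df = pd.get_dummies(df, dtype=int)

# List every column name, including those hidden by "..." in the printout
print(encoded_df.columns.tolist())

# Encode only the City column and leave the other columns as they are
partial = pd.get_dummies(df, columns=['City'], dtype=int)
print(partial.columns.tolist())

# drop_first=True drops one indicator per variable, so a variable with
# k categories is represented by k - 1 columns
reduced = pd.get_dummies(df, drop_first=True, dtype=int)
print(reduced.shape)
```

Dropping the first indicator loses no information, because the omitted category is implied when all remaining indicators for that variable are 0. This is often preferred for linear models, where the full set of indicators is redundant.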


Converting categorical data into binary data is an important step in data preprocessing for machine learning and statistical analysis. In this tutorial, we explored what categorical data is and how to convert it to binary data with the help of the Pandas library.

Updated on: 18-Apr-2023
