Exploring Categorical Data


Introduction

Categorical data is data that takes one of a fixed, limited set of values (categories). A categorical variable is nominal when its categories have no logical order, such as blood group, gender, or a yes/no answer, and ordinal when they do, such as a ranking (first, second, third). Most machine learning models need numeric input, so categorical variables are usually encoded first, for example with one-hot encoding (one binary column per category) or integer (label) encoding (one integer per category).
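As a rough illustration of the two encodings mentioned above, the sketch below encodes a small, made-up list of blood groups using plain Python. The sample values and the variable names are assumptions for illustration only; in practice a library such as pandas or scikit-learn would usually do this step.

```python
# Hypothetical sample: blood groups recorded for six people.
blood_groups = ['A', 'B', 'O', 'AB', 'O', 'A']

# Fix an order for the distinct categories so the encoding is reproducible.
categories = sorted(set(blood_groups))  # ['A', 'AB', 'B', 'O']

# Integer (label) encoding: map each category to one integer.
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[value] for value in blood_groups]

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot_encoded = [
    [1 if value == category else 0 for category in categories]
    for value in blood_groups
]

print(label_encoded)       # [0, 2, 3, 1, 3, 0]
print(one_hot_encoded[0])  # [1, 0, 0, 0]
```

Integer encoding implies an order between the codes, so it suits ordinal variables better; one-hot encoding avoids that implied order at the cost of more columns.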

Categorical Data and related terms

The mode is the measure of central tendency most commonly used with categorical variables. It is the value that occurs most frequently in a set of observations.

For example,

In the dataset [1,2,6,7,7,7,2,6,6,6,6], the mode is 6: it occurs 5 times, more often than any other value.
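The same result can be checked in code. This is a minimal sketch using Counter from the Python standard library; the variable names are chosen here for illustration.

```python
from collections import Counter

observations = [1, 2, 6, 7, 7, 7, 2, 6, 6, 6, 6]

# Count how many times each distinct value occurs.
counts = Counter(observations)

# most_common(1) returns a list holding the (value, count) pair with the highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 6 5
```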

Analysis of Categorical Data

  • Using Bar Charts − A bar chart draws one bar per category, with the height of the bar showing that category's frequency or another value associated with it.

The code below plots a bar chart of the marks obtained by five students in a test, with one bar per student. The plot is drawn with the matplotlib library.

import matplotlib.pyplot as plt
# Notebook magic (Jupyter/Colab); remove this line in a plain Python script
%matplotlib inline

# One label per bar, and the value that sets each bar's height
students = ['Saurav','Mohit','Rajan','Aditi','Sonal']
marks = [78,98,65,90,80]
plt.bar(students, marks)
plt.xlabel('Student', fontsize = 10)
plt.ylabel('Marks', fontsize = 10)
plt.title('Student marks distribution')
plt.show()

Output
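When the goal is to show how often each category occurs, the categories have to be counted before plotting. The sketch below does this for a made-up list of blood groups; the data and variable names are assumptions for illustration, not part of the example above.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical observations of one categorical variable (blood group).
blood_groups = ['A', 'B', 'O', 'AB', 'O', 'A', 'O', 'B', 'O', 'A']

# Count how often each category occurs.
counts = Counter(blood_groups)
categories = list(counts.keys())
frequencies = list(counts.values())

# One bar per category; the bar height is the category's frequency.
plt.bar(categories, frequencies)
plt.xlabel('Blood group', fontsize=10)
plt.ylabel('Frequency', fontsize=10)
plt.title('Blood group frequency')
plt.show()
```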

  • Pie chart − A pie chart shows each category as a share of the whole: the angle of every slice is proportional to that category's percentage of the total.

The code below plots a pie chart of the marks obtained by five students in a test; each slice shows one student's share of the total marks. The chart is again drawn with the matplotlib library.

import matplotlib.pyplot as plt
# Notebook magic (Jupyter/Colab); remove this line in a plain Python script
%matplotlib inline

students = ['Saurav','Mohit','Rajan','Aditi','Sonal']
marks = [78,98,65,90,80]
plt.figure(figsize =(5, 5))
# autopct prints each slice's percentage of the total with two decimals
plt.pie(marks, labels = students,
        startangle = 90, autopct ='%.2f %%')
plt.show()

Output
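The percentages printed on the slices follow directly from the data. As a small worked sketch, the code below computes each student's share of the total marks and the matching slice angle by hand, using the same students and marks lists; the other variable names are illustrative.

```python
students = ['Saurav', 'Mohit', 'Rajan', 'Aditi', 'Sonal']
marks = [78, 98, 65, 90, 80]

total = sum(marks)  # 411

# Share of the whole as a percentage, and the slice angle in degrees.
percentages = [100 * mark / total for mark in marks]
angles = [360 * mark / total for mark in marks]

for student, percentage, angle in zip(students, percentages, angles):
    print(f"{student}: {percentage:.2f} % of the total, {angle:.1f} degrees")
```

The percentages add up to 100 and the angles to 360, which is why a pie chart only makes sense when the categories together form a meaningful whole.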

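  • Box Plots − A box plot summarises the distribution of a numeric variable (median, quartiles, and outliers) and makes it easy to compare that distribution across the groups of a categorical variable.

The code below draws a box plot of the SalePrice column for each category of the Street column, using data read from train.csv. The plot is drawn with seaborn, which is built on top of matplotlib.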

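import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Notebook magic (Jupyter/Colab); remove this line in a plain Python script
%matplotlib inline

data = pd.read_csv("/content/train.csv")
# One box per Street category, showing the spread of SalePrice in that group
sns.boxplot(data = data, x='Street', y='SalePrice')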

Output
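To make clear what each box represents, the sketch below computes the quartiles and flags outliers for two groups by hand. The prices are made-up values rather than rows from train.csv, box_summary is a hypothetical helper, and the outlier rule assumed here is the common 1.5 × IQR convention.

```python
from statistics import quantiles

# Hypothetical sale prices, grouped by a categorical 'Street' value.
prices_by_street = {
    'Pave': [120000, 150000, 165000, 180000, 210000, 250000, 400000],
    'Grvl': [60000, 85000, 90000, 110000, 130000],
}

def box_summary(values):
    """Return the quartiles and the outliers under the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {'q1': q1, 'median': median, 'q3': q3, 'outliers': outliers}

for street, prices in prices_by_street.items():
    print(street, box_summary(prices))
```

The box spans q1 to q3, the line inside it marks the median, and values beyond the fences are drawn as individual points.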

  • Violin Plots − A violin plot shows the distribution of a numeric variable for each category by combining a box plot with a kernel density estimate of the data.

    The code below draws a violin plot of SalePrice for each category of Street, using data read from train.csv.

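    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Notebook magic (Jupyter/Colab); remove this line in a plain Python script
    %matplotlib inline

    data = pd.read_csv("/content/train.csv")
    # One violin per Street category; the width shows the estimated density of SalePrice
    sns.violinplot(data = data, x='Street', y='SalePrice')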
    

    Output
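The outline of a violin is a kernel density estimate. The sketch below is a minimal Gaussian kernel density estimate written in plain Python to show the idea; it is not seaborn's implementation. The make_kde helper is hypothetical, the marks are reused from the earlier examples, and the fixed bandwidth of 8.0 is an assumption, since plotting libraries normally choose the bandwidth automatically.

```python
import math

def make_kde(samples, bandwidth):
    """Return a function estimating the density with Gaussian kernels."""
    normaliser = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump centred on each sample.
        total = sum(
            math.exp(-0.5 * ((x - sample) / bandwidth) ** 2)
            for sample in samples
        )
        return total / normaliser

    return density

marks = [78, 98, 65, 90, 80]
density = make_kde(marks, bandwidth=8.0)  # bandwidth is an assumed value

# Evaluate the estimate on a coarse grid; a violin plot draws this curve mirrored.
for x in range(60, 101, 10):
    print(x, round(density(x), 4))
```

A smaller bandwidth gives a bumpier outline that follows individual points, while a larger one gives a smoother but less detailed shape.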

Conclusion

Categorical data can be represented and explored in several ways. Bar charts, pie charts, box plots, and violin plots are all useful for visualising categorical data and deriving insights from it.

Updated on: 09-Aug-2023
