Exploring Categorical Data
Introduction
Categorical data is data that takes one of a fixed number of possible values (categories). Nominal categorical variables, such as blood group, gender, or yes/no answers, have no logical order, while ordinal variables, such as rankings (first, second, third), do. Because most Machine Learning algorithms expect numeric input, categorical variables are usually encoded in binary or integer form, for example with one-hot encoding or integer (label) encoding.
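As a rough illustration of the two encodings mentioned above, the sketch below encodes a small, made-up blood group column with pandas. The column name and sample values are assumptions chosen for the example only.

```python
import pandas as pd

# Hypothetical sample: blood groups of five people.
df = pd.DataFrame({"blood_group": ["A", "B", "O", "AB", "O"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["blood_group"], prefix="blood_group")

# Integer encoding: each category is mapped to an integer code.
# pandas sorts string categories alphabetically, so A=0, AB=1, B=2, O=3.
codes = df["blood_group"].astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```

Integer codes imply an order between categories, so for nominal variables one-hot encoding is usually the safer choice.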
Categorical Data and related terms
The mode is the measure of central tendency most commonly used with categorical variables. It is the value that occurs most frequently in a set of observations.
For example, in the dataset [1,2,6,7,7,7,2,6,6,6,6], the mode is 6 because it occurs 5 times, more often than any other value.
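The mode of this dataset can be checked with a few lines of Python. This is a minimal sketch using the standard library's collections.Counter. The variable names are illustrative.

```python
from collections import Counter

observations = [1, 2, 6, 7, 7, 7, 2, 6, 6, 6, 6]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(observations)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 6 5
```

The same approach works unchanged for non-numeric categories such as strings.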
Analysis of Categorical Data
Using Bar Charts − A bar chart shows one bar per category, with the bar height giving that category's frequency or another value associated with it.
The code below plots a bar chart of five students and the marks they obtained in a test, using the matplotlib library.
import matplotlib.pyplot as plt

# %matplotlib inline  (uncomment when running in a Jupyter notebook)

# Names of the students and the marks each one obtained
students = ['Saurav', 'Mohit', 'Rajan', 'Aditi', 'Sonal']
marks = [78, 98, 65, 90, 80]

# One bar per student; the bar height is the marks obtained
plt.bar(students, marks)
plt.xlabel('Student', fontsize=10)
plt.ylabel('Marks', fontsize=10)
plt.title('Student marks distribution')
plt.show()
Output: a bar chart with one bar per student, where the bar height is the marks obtained.
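The student example plots a numeric value per category rather than a frequency. To plot how often each category occurs, the categories have to be counted first. The sketch below does this for a made-up list of blood groups. The data and the output file name are assumptions for illustration.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical sample of a categorical variable (blood groups).
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]

# Count how often each category occurs.
counts = Counter(blood_groups)
categories = sorted(counts)                    # alphabetical bar order
frequencies = [counts[c] for c in categories]  # one frequency per bar

fig, ax = plt.subplots()
ax.bar(categories, frequencies)
ax.set_xlabel("Blood group")
ax.set_ylabel("Frequency")
ax.set_title("Blood group frequency")
fig.savefig("blood_group_frequency.png")  # or plt.show() in an interactive session
```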
Pie chart − A pie chart shows each category as a percentage of the whole, drawn as a slice of a circle whose angle is proportional to that percentage.
The code below plots a pie chart of the same five students and their marks, again using the matplotlib library.
import matplotlib.pyplot as plt

# %matplotlib inline  (uncomment when running in a Jupyter notebook)

students = ['Saurav', 'Mohit', 'Rajan', 'Aditi', 'Sonal']
marks = [78, 98, 65, 90, 80]

plt.figure(figsize=(5, 5))
# autopct labels each slice with its percentage of the total marks
plt.pie(marks, labels=students, startangle=90, autopct='%.2f %%')
plt.show()
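Output: a pie chart with one slice per student, each labelled with that student's percentage of the total marks.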
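The percentage labels that autopct prints, and the angle of each slice, follow directly from each value's share of the total. The short sketch below recomputes them by hand for the same marks. The names shares and angles are illustrative.

```python
students = ['Saurav', 'Mohit', 'Rajan', 'Aditi', 'Sonal']
marks = [78, 98, 65, 90, 80]

total = sum(marks)
shares = [100 * m / total for m in marks]  # percentage of the whole
angles = [360 * m / total for m in marks]  # slice angle in degrees

for name, share, angle in zip(students, shares, angles):
    print(f"{name}: {share:.2f} % of the total, {angle:.1f} degrees")
```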
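Box Plots − A box plot shows the distribution of a numeric variable (median, quartiles, and outliers) and makes it easy to compare that distribution across categories.
The code below draws a box plot of SalePrice for each value of the categorical column Street in a dataset read from train.csv. The plot is drawn with seaborn, which is built on top of matplotlib.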
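import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# %matplotlib inline  (uncomment when running in a Jupyter notebook)

# Load the dataset; it must contain the columns 'Street' and 'SalePrice'
data = pd.read_csv("/content/train.csv")

# One box per Street category, summarizing SalePrice
sns.boxplot(data=data, x='Street', y='SalePrice')
plt.show()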
Output: a box plot with one box per Street category, summarizing the SalePrice values in that category.
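To make clear what each box represents, the sketch below computes the quantities a box plot typically draws for one group: the quartiles, the whisker ends under the common 1.5 × IQR rule, and the outliers. The prices and street labels are made up, and quartile conventions differ slightly between libraries, so treat the numbers as an approximation of what a plotting library would show.

```python
import statistics

def box_summary(values):
    """Return the quantities a box plot draws for one group."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": inside[0], "whisker_high": inside[-1],
            "outliers": outliers}

# Hypothetical sale prices (in thousands) grouped by street type.
prices_by_street = {
    "Paved": [120, 150, 160, 180, 200, 210, 400],
    "Gravel": [90, 100, 110, 130, 140],
}

for street, prices in prices_by_street.items():
    print(street, box_summary(prices))
```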
Violin Plots − A violin plot shows the distribution of a numeric variable for each category by combining a box plot with a kernel density estimate of the data.
The code below draws a violin plot of SalePrice for each value of Street, using the same dataset.
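import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# %matplotlib inline  (uncomment when running in a Jupyter notebook)

# Load the dataset; it must contain the columns 'Street' and 'SalePrice'
data = pd.read_csv("/content/train.csv")

# One violin per Street category, showing the density of SalePrice
sns.violinplot(data=data, x='Street', y='SalePrice')
plt.show()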
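Output: a violin plot with one violin per Street category, whose width shows the estimated density of SalePrice.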
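The outline of a violin is a kernel density estimate. As a minimal sketch of the idea, the code below builds a Gaussian kernel density estimate by hand for the student marks used earlier. The bandwidth uses the normal reference rule of thumb (1.06 · σ · n^(-1/5)), which is an assumption made here. Plotting libraries may choose the bandwidth differently.

```python
import math
import statistics

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump centred on every sample.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm

    return density

marks = [78, 98, 65, 90, 80]

# Rule-of-thumb bandwidth (an assumption, not the only possible choice).
h = 1.06 * statistics.stdev(marks) * len(marks) ** (-1 / 5)
density = gaussian_kde(marks, h)

# Evaluate the density on a grid and check that it integrates to about 1.
lo, hi = min(marks) - 4 * h, max(marks) + 4 * h
steps = 200
xs = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
ys = [density(x) for x in xs]
area = sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(steps))

print(f"bandwidth = {h:.2f}, area under the curve = {area:.3f}")
```

A violin plot mirrors this curve around a central axis, so wider sections correspond to values where the data is denser.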
Conclusion
Categorical data can be represented and explored in several ways. Bar charts, pie charts, box plots, and violin plots are all useful for visualizing categorical data and deriving insights from it.