Exploring Data Distribution


Introduction

Understanding how data is distributed gives useful insight in any data science or machine learning use case. The distribution of a dataset describes how its values are spread: where they concentrate (the central tendencies), how specific parts of the data behave, and whether the data contains outliers.

There are several popular graphical methods for exploring a data distribution. In this article, let us explore these methods.

Know more about your data: The Graphical Way

Histograms & KDE Density Plots

Histograms are the most popular and common graphical tool for data exploration. In a histogram, rectangular bars represent the frequency of a variable within each category or bin. Binning groups the values of a continuous variable into intervals (buckets), and the height of each bar is the number of values that fall into that bucket.
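To make binning concrete, here is a minimal sketch that counts values per bin with NumPy, without drawing anything. The sample values, the 0 to 10 range, and the choice of 5 bins are assumptions picked only for illustration.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range 0-10 into 5 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=5, range=(0, 10))

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"bin {left:.0f} to {right:.0f}: {count} value(s)")
```

The counts are exactly the bar heights a histogram would draw for these bins. Changing the number of bins changes how coarse or fine the picture of the distribution is.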

Let us understand the histogram using the below code example on the house pricing dataset.

Dataset link − https://drive.google.com/file/d/1XbyBcw6OfE_w3ZeqPM1s_6jT8XeTCeOT/view?usp=sharing

The below code helps us to understand histograms more effectively. In this code example, we use the house price dataset. The left plot is a histogram of SalePrice against frequency. The right plot shows the distribution of SalePrice as a density, with a kernel density estimate (KDE) curve. The KDE is a smoothed estimate of the probability density function of the data, so the total area under the curve is 1.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv("/content/house_price_data.csv")

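# Two side-by-side axes that share the SalePrice x-axis
figure, ax = plt.subplots(1, 2, sharex=True, figsize=(12, 6))

# Left: histogram of raw counts per bin
sns.histplot(data=df, x="SalePrice", ax=ax[0])
ax[0].set_ylabel("Frequency")
ax[0].set_xlabel("SalePrice")
ax[0].set_title("Frequency (Histogram)")

# Right: density-normalized histogram with a KDE curve overlaid.
# sns.distplot is deprecated; histplot with kde=True replaces it.
sns.histplot(data=df, x="SalePrice", stat="density", kde=True, ax=ax[1])
ax[1].set_ylabel("Density")
ax[1].set_xlabel("SalePrice")
ax[1].set_title("Density (KDE)")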

Output

In the below code example, we set the number of bins explicitly. We use the penguins dataset to plot bill depth against count. With bins=15, the bill depth values are split into 15 equal-width intervals, which are plotted on the x axis with the count or frequency on the y axis.

# Using bins on penguins' dataset – seaborn

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data_pen = sns.load_dataset("penguins")
sns.histplot(data=data_pen, x="bill_depth_mm", bins=15)

Output
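The density curve described earlier in this section can also be approximated by hand, which shows what a KDE does: it places a small Gaussian bump on every data point and averages the bumps. The sketch below is illustrative only. The helper name gaussian_kde_1d, the synthetic sample, and the rule-of-thumb bandwidth are assumptions, and seaborn's own bandwidth choice may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a basic Gaussian KDE of `samples` at each point of `grid`."""
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, averaged over all samples.
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical data: 500 synthetic values standing in for a real column.
rng = np.random.default_rng(0)
samples = rng.normal(loc=50.0, scale=10.0, size=500)

# A common normal-reference rule of thumb for the bandwidth (an assumption here).
bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)

# Evaluate the density on a grid that extends a little beyond the data.
grid = np.linspace(samples.min() - 4 * bandwidth, samples.max() + 4 * bandwidth, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should enclose an area of about 1 (simple rectangle rule).
area = density.sum() * (grid[1] - grid[0])
print(f"area under the KDE curve: {area:.3f}")
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths more aggressively, much like choosing narrower or wider histogram bins.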

Boxplots

Boxplots are also known as box-and-whisker plots. A box plot summarizes the data through its percentiles, chiefly the quartiles: the 25th, 50th, and 75th percentiles. The 50th percentile is the median. The box spans the 25th to the 75th percentile, a range known as the interquartile range (IQR). The whiskers extend beyond the box toward the rest of the data, and points beyond the whiskers are drawn individually as possible outliers.

Let us understand the boxplot using the below code example on the house pricing dataset.
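The numbers behind a box plot can be computed directly. The following sketch uses a small hypothetical sample and the common 1.5 × IQR convention for flagging outliers. Both the sample and the 1.5 factor are assumptions made for illustration.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles: 25th, 50th (median), and 75th percentiles.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond 1.5 * IQR from the box are commonly treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

Here the value 30 lies above the upper fence, so a box plot would draw it as a separate point rather than extend the whisker to reach it.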

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv("/content/house_price_data.csv")
# Keep only the two columns needed for the plot
subset = df[["OverallQual", "SalePrice"]]
# Draw one box of SalePrice for each OverallQual rating
figure = sns.boxplot(x="OverallQual", y="SalePrice", data=subset)

Output

Violin Plot

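A violin plot looks similar to a box plot, but it also shows the estimated probability density of the variable, drawn as a mirrored density curve on either side of the center line. It is used to compare the probability distributions of the variables under observation, for example the same variable across several groups.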
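To see why the density shape matters, consider a sample with two separate clusters. The sketch below is a numerical illustration only: the data, the fixed bandwidth of 0.5, and the helper name gaussian_kde_1d are assumptions. It shows that the median can fall in a region that contains almost no data, something the box summary alone does not reveal but the density outline of a violin plot does.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a basic Gaussian KDE of `samples` at each point of `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical two-cluster sample: half the values near 2, half near 8.
values = np.concatenate([np.linspace(1.5, 2.5, 50), np.linspace(7.5, 8.5, 50)])

# Box-plot style summary of the sample.
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Density at the two cluster centers and at the median (fixed bandwidth assumed).
points = np.array([2.0, median, 8.0])
density = gaussian_kde_1d(values, points, bandwidth=0.5)

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
for point, value in zip(points, density):
    print(f"estimated density at {point:.1f}: {value:.4f}")
```

The density at the median is far lower than at either cluster center, which is the pinched "waist" a violin plot would draw between the two bulges.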

Let us understand the violin plot using the below code example on the house pricing dataset.

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv("/content/house_price_data.csv")
# Keep only the two columns needed for the plot
subset = df[["MSSubClass", "SalePrice"]]
# Draw one violin of SalePrice for each MSSubClass category
figure = sns.violinplot(x="MSSubClass", y="SalePrice", data=subset)

Output

Conclusion

Histograms, boxplots, density plots, and violin plots are among the most popular and common methods for exploring data distributions. They are widely used by machine learning engineers and data scientists. These plots give us a sense of the data and how it is distributed. Basic information such as skewness and sparsity can also be read from them. Boxplots and violin plots can additionally indicate outlier points.

Mithilesh Pradhan

Updated on: 09-Aug-2023
