Exploratory Data Analysis in Python

Exploratory Data Analysis (EDA) is the critical first step in any data analysis project. It helps us understand our dataset's structure, identify patterns, and uncover relationships between variables before applying machine learning algorithms.

What EDA Helps Us Achieve

EDA provides valuable insights by helping us to:

  • Gain insight into the dataset's characteristics

  • Understand the underlying data structure

  • Extract important parameters and relationships between variables

  • Test underlying assumptions about the data

Loading and Exploring the Dataset

Let's perform EDA using the red wine subset of the Wine Quality dataset from the UCI Machine Learning Repository. We'll start by loading the data and examining its basic structure:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
print(df.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5  
3      9.8        6  
4      9.4        5  

Dataset Dimensions

print("Dataset shape:", df.shape)
Dataset shape: (1599, 12)

Dataset Information

# info() prints its report directly and returns None, so it needs no print()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Key observations from the dataset info:

  • Dataset contains only numeric values (float and integer)

  • No missing values present in any column

  • 1,599 wine samples with 11 features plus quality rating

Statistical Summary

print(df.describe())
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

        chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000              6.000000     0.990070   
25%       0.070000             7.000000             22.000000     0.995600   
50%       0.079000            14.000000             38.000000     0.996750   
75%       0.090000            21.000000             62.000000     0.997835   
max       0.611000            72.000000            289.000000     1.003690   

               pH    sulphates      alcohol      quality  
count  1599.000000  1599.000000  1599.000000  1599.000000  
mean      3.311113     0.658149    10.422983     5.636023  
std       0.154386     0.169507     1.065668     0.807569  
min       2.740000     0.330000     8.400000     3.000000  
25%       3.210000     0.550000     9.500000     5.000000  
50%       3.310000     0.620000    10.200000     6.000000  
75%       3.400000     0.730000    11.100000     6.000000  
max       4.010000     2.000000    14.900000     8.000000  

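Key insights from the statistical summary:

  • A large gap between the 75th percentile and the maximum for residual sugar (2.6 vs 15.5), free sulfur dioxide (21 vs 72), and total sulfur dioxide (62 vs 289) indicates right-skewed distributions with potential outliers

  • Most wine quality ratings fall between 5 and 6 (the 25th percentile is 5; the median and 75th percentile are both 6)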
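The gap between the 75th percentile and the maximum can be turned into a per-column count with the 1.5 × IQR rule. This is a minimal sketch, not part of the original analysis: the helper name `iqr_outlier_counts`, the multiplier `k`, and the small `toy` frame are illustrative assumptions, and the IQR rule is only one of several common outlier heuristics.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    below = frame.lt(q1 - k * iqr, axis="columns")  # compare each column to its own lower fence
    above = frame.gt(q3 + k * iqr, axis="columns")  # compare each column to its own upper fence
    return (below | above).sum()

# Toy data (hypothetical, not the wine dataset): one extreme residual sugar value
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 2.2, 15.5],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.40],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Calling `iqr_outlier_counts(df)` on the loaded wine data would give one count per feature, which makes it easy to see which columns drive the skew.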

Target Variable Analysis

print("Unique quality ratings:", df.quality.unique())
print("\nQuality distribution:")
print(df.quality.value_counts().sort_index())
Unique quality ratings: [5 6 7 4 8 3]

Quality distribution:
3     10
4     53
5    681
6    638
7    199
8     18

Quality insights:

  • Wine quality ranges from 3 to 8 (no ratings of 0-2 or 9-10 on the dataset's 0-10 scale)

  • Most wines rated 5-6 (over 80% of dataset)

  • Very few wines received extreme ratings (3 or 8)
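The share of each rating can be checked directly from the counts. The sketch below copies the counts from the output above into a Series so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
# On the loaded data, df.quality.value_counts(normalize=True) gives the same shares as fractions.
quality_counts = pd.Series({3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18})

share_pct = 100 * quality_counts / quality_counts.sum()
print(share_pct.round(1))

mid_pct = share_pct.loc[[5, 6]].sum()      # wines rated 5 or 6
extreme_pct = share_pct.loc[[3, 8]].sum()  # wines rated 3 or 8
print(f"Rated 5-6: {mid_pct:.1f}%  |  Rated 3 or 8: {extreme_pct:.1f}%")
```

Ratings 5 and 6 together account for 1,319 of the 1,599 wines, while ratings 3 and 8 together account for only 28.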

Data Visualization

Checking for Missing Values

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
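When a dataset has no missing values, the heatmap is a single uniform block, so a per-column count is a useful numeric complement. The following is a small sketch; the helper name `missing_summary` and the `toy` frame (which deliberately contains one missing value) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for each column."""
    counts = frame.isnull().sum()
    percent = 100 * counts / len(frame)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Toy data (hypothetical, not the wine dataset) with one missing alcohol value
toy = pd.DataFrame({"alcohol": [9.4, np.nan, 9.8], "quality": [5, 5, 6]})
summary = missing_summary(toy)
print(summary)
```

For the wine data, `missing_summary(df)` should report zero in every row, in line with the non-null counts printed by `df.info()`.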

Correlation Analysis

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='RdBu_r', annot=True, center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

Quality Correlation Focus

# Features ranked by the absolute value of their correlation with quality
quality_corr = df.corr()['quality'].drop('quality').abs().sort_values(ascending=False)
print("Absolute correlation with quality:")
print(quality_corr)

plt.figure(figsize=(10, 6))
quality_corr.plot(kind='bar')
plt.title('Absolute Feature Correlation with Wine Quality')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Absolute correlation with quality:
alcohol                 0.476166
volatile acidity        0.390558
sulphates              0.251397
citric acid            0.226373
total sulfur dioxide    0.185100
density                0.174919
chlorides              0.128907
fixed acidity          0.124052
pH                     0.057731
free sulfur dioxide    0.050656
residual sugar         0.013732

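Key correlation findings:

  • Alcohol content shows the strongest positive correlation with quality (0.48)

  • Volatile acidity shows a moderate negative correlation with quality (-0.39); the ranking above lists absolute values, so the sign comes from the correlation matrix

  • pH, free sulfur dioxide, and residual sugar show almost no linear correlation with quality (all below 0.06 in absolute value)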
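Because the ranking above is computed from absolute values, the direction of each relationship is hidden. One way to keep the sign while still ordering by strength is sketched below; the helper name `signed_corr_with_target` and the `toy` frame are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def signed_corr_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every feature with `target`, sign kept, strongest first."""
    corr = frame.corr()[target].drop(target)                 # drop the target's self-correlation
    order = corr.abs().sort_values(ascending=False).index    # rank by magnitude only
    return corr.loc[order]                                   # return the signed values in that order

# Toy data (hypothetical, not the wine dataset)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "volatile acidity": [0.70, 0.88, 0.60, 0.40, 0.30],
    "quality": [5, 5, 6, 6, 7],
})
result = signed_corr_with_target(toy, "quality")
print(result)
```

Applied to the wine data as `signed_corr_with_target(df, "quality")`, this would list the same features in the same order as the ranking above, with negative relationships shown as negative numbers.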

Conclusion

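EDA shows that, among the features examined, alcohol content and volatile acidity have the strongest linear association with wine quality, although neither correlation is strong (0.48 and -0.39), and correlation alone does not establish influence. The dataset has no missing values, so it needs no imputation before modeling. Most wines cluster around ratings of 5-6, which makes the classes imbalanced; classification tasks will likely need stratified or balanced sampling techniques.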
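As one possible starting point for the sampling concern, a stratified hold-out split keeps each rating's share roughly the same in the training and test sets. Note that this is stratification, not rebalancing: it preserves the imbalance rather than correcting it. The sketch assumes a unique index, and the helper name `stratified_holdout`, its parameters, and the `toy` frame are hypothetical.

```python
import pandas as pd

def stratified_holdout(frame: pd.DataFrame, target: str, frac: float = 0.2, seed: int = 0):
    """Split into (train, test), sampling the same fraction from every class.

    Assumes the frame's index is unique, since test rows are removed by index label.
    """
    test = frame.groupby(target).sample(frac=frac, random_state=seed)
    train = frame.drop(test.index)
    return train, test

# Toy data (hypothetical, not the wine dataset): two classes of ten rows each
toy = pd.DataFrame({
    "alcohol": [9.0 + 0.1 * i for i in range(20)],
    "quality": [5] * 10 + [6] * 10,
})
train, test = stratified_holdout(toy, "quality", frac=0.2)
print(test["quality"].value_counts())
```

With the wine data, rare ratings such as 3 and 8 would contribute only a handful of rows to the test set, so evaluation on those classes would remain noisy.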

Updated on: 2026-03-25T05:16:07+05:30
