Exploratory Data Analysis in Python
Exploratory Data Analysis (EDA) is the critical first step in any data analysis project. It helps us understand our dataset's structure, identify patterns, and uncover relationships between variables before applying machine learning algorithms.
What EDA Helps Us Achieve
EDA provides valuable insights by helping us to:
Gain insight into the dataset's characteristics
Understand the underlying data structure
Extract important parameters and relationships between variables
Test underlying assumptions about the data
Loading and Exploring the Dataset
Let's perform EDA using the red wine subset of the Wine Quality dataset from the UCI Machine Learning Repository. We'll start by loading the data and examining its basic structure:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
print(df.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076
   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56
   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5
Dataset Dimensions
print("Dataset shape:", df.shape)
Dataset shape: (1599, 12)
Dataset Information
df.info()  # info() prints its report directly and returns None, so no print() is needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
Key observations from the dataset info:
Dataset contains only numeric values (float and integer)
No missing values present in any column
1,599 wine samples with 11 features plus quality rating
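The observations above can also be checked numerically rather than read off the `info()` report. The following is a minimal sketch: the helper name `summarize_columns` and the toy frame are hypothetical and stand in for `df`, so the example runs without downloading the dataset.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype, missing-value count, and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical toy data with one deliberately missing value (not the real dataset).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 10.5],
    "quality": [5, 5, 6, 7],
})

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the wine data should report zero missing values in every column, assuming the frame was loaded as shown earlier.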
Statistical Summary
print(df.describe())
fixed acidity volatile acidity citric acid residual sugar \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806
std 1.741096 0.179060 0.194801 1.409928
min 4.600000 0.120000 0.000000 0.900000
25% 7.100000 0.390000 0.090000 1.900000
50% 7.900000 0.520000 0.260000 2.200000
75% 9.200000 0.640000 0.420000 2.600000
max 15.900000 1.580000 1.000000 15.500000
chlorides free sulfur dioxide total sulfur dioxide density \
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 0.087467 15.874922 46.467792 0.996747
std 0.047065 10.460157 32.895324 0.001887
min 0.012000 1.000000 6.000000 0.990070
25% 0.070000 7.000000 22.000000 0.995600
50% 0.079000 14.000000 38.000000 0.996750
75% 0.090000 21.000000 62.000000 0.997835
max 0.611000 72.000000 289.000000 1.003690
pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000
mean 3.311113 0.658149 10.422983 5.636023
std 0.154386 0.169507 1.065668 0.807569
min 2.740000 0.330000 8.400000 3.000000
25% 3.210000 0.550000 9.500000 5.000000
50% 3.310000 0.620000 10.200000 6.000000
75% 3.400000 0.730000 11.100000 6.000000
max 4.010000 2.000000 14.900000 8.000000
Key insights from the statistical summary:
For residual sugar, free sulfur dioxide, and total sulfur dioxide, the maximum lies far above the 75th percentile (for example, 15.5 versus 2.6 for residual sugar), which points to potential outliers
Most quality ratings are 5 or 6: the 25th percentile is 5, and both the median and the 75th percentile are 6
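One common way to put a number on the "potential outliers" noted above is the interquartile range (IQR) rule. This is a sketch of that rule, not a method the article itself applies: the function name `iqr_outlier_counts`, the multiplier `k=1.5`, and the toy values are all assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per numeric column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < lower) | (numeric > upper)
    return outside.sum()


# Hypothetical toy data: one extreme residual sugar value, no extreme pH values.
toy = pd.DataFrame({
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 2.2, 15.5],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51, 3.30],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

Running `iqr_outlier_counts(df)` on the wine data gives a per-feature count that can be compared against the gaps visible in `describe()`. A flagged value is only a candidate outlier; whether to keep it depends on the domain.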
Target Variable Analysis
print("Unique quality ratings:", df.quality.unique())
print("\nQuality distribution:")
print(df.quality.value_counts().sort_index())
Unique quality ratings: [5 6 7 4 8 3]

Quality distribution:
3     10
4     53
5    681
6    638
7    199
8     18
Quality insights:
Wine quality ranges from 3 to 8 (no wine is rated below 3 or above 8)
Most wines are rated 5 or 6 (1,319 of 1,599, about 82% of the dataset)
Very few wines received extreme ratings (3 or 8)
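The share of each rating can be computed directly with `value_counts(normalize=True)`. The sketch below rebuilds the label column from the counts printed above so it runs without the dataset; the helper name `class_shares` is hypothetical.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Return the percentage of rows per class, sorted by class label."""
    return labels.value_counts(normalize=True).sort_index().mul(100).round(1)


# Rating counts copied from the quality distribution printed above.
rating_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}
quality = pd.Series(
    [rating for rating, n in rating_counts.items() for _ in range(n)],
    name="quality",
)

shares = class_shares(quality)
print(shares)
print("Share of ratings 5 and 6:", shares.loc[[5, 6]].sum())
```

With the real frame, `class_shares(df["quality"])` gives the same breakdown.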
Data Visualization
Checking for Missing Values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
Correlation Analysis
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='RdBu_r', annot=True, center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Quality Correlation Focus
# Rank features by the absolute value of their correlation with quality (the sign is discarded)
quality_corr = df.corr()['quality'].abs().sort_values(ascending=False)
print("Features correlation with quality:")
print(quality_corr[1:]) # Exclude quality itself
plt.figure(figsize=(10, 6))
quality_corr[1:].plot(kind='bar')
plt.title('Feature Correlation with Wine Quality')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Features correlation with quality:
alcohol                 0.476166
volatile acidity        0.390558
sulphates               0.251397
citric acid             0.226373
total sulfur dioxide    0.185100
density                 0.174919
chlorides               0.128907
fixed acidity           0.124052
pH                      0.057731
free sulfur dioxide     0.050656
residual sugar          0.013732
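Key correlation findings:
Alcohol content shows the strongest positive correlation with quality (0.48)
Volatile acidity has a moderate negative correlation with quality (-0.39); the printed ranking uses absolute values, so the sign has to be read from the correlation matrix above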
Free sulfur dioxide and residual sugar show minimal correlation with quality
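Because the ranking above is built from absolute values, the direction of each relationship is lost in the printed output. The following sketch keeps the sign while still ordering by strength; the helper name `signed_target_corr` and the toy values are hypothetical.

```python
import pandas as pd


def signed_target_corr(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return signed correlations with `target`, ordered by absolute strength."""
    corr = frame.corr()[target].drop(target)  # drop the target's self-correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Hypothetical toy data chosen so one feature rises and one falls with quality.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "quality": [4, 5, 5, 6, 7],
})

result = signed_target_corr(toy, "quality")
print(result)
```

Dropping the target by name, rather than slicing with `[1:]`, also avoids relying on the target being sorted into first position.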
Conclusion
EDA shows that wine quality is most strongly associated with alcohol content (positively) and volatile acidity (negatively). These are correlations and do not by themselves establish that either feature drives quality. The dataset has no missing values, so no imputation is needed before modeling. Most wines cluster around ratings of 5 and 6, so the classes are imbalanced, and classification tasks may need stratified or balanced sampling.
