How to create Correlation Matrix in Python by traversing through each line?
A correlation matrix is a table containing correlation coefficients between multiple variables. Each cell represents the correlation between two variables, with values ranging from -1 to 1. It's essential for data analysis, helping identify relationships between variables and selecting significant features for machine learning models.
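Correlation values have specific meanings:
Positive values (0 to 1): the variables tend to increase together; the closer the value is to 1, the stronger the relationship
Negative values (-1 to 0): one variable tends to decrease as the other increases; the closer the value is to -1, the stronger the relationship
Zero (0): no linear relationship between the variables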
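To make these ranges concrete, here is a minimal sketch that computes the Pearson coefficient (the default used by the Pandas corr() method) by hand with NumPy. The helper name pearson_r and the toy lists are illustrative assumptions, not part of the Starbucks dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the mean of x
    dy = y - y.mean()          # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # values rise together: coefficient near +1
print(pearson_r(x, [10, 8, 6, 4, 2]))  # one rises as the other falls: near -1
print(pearson_r(x, [4, 1, 0, 1, 4]))   # symmetric U-shape: no linear trend, near 0
```

The last call shows why zero only rules out a linear relationship: the second list is fully determined by the first, yet the coefficient is still zero.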
Benefits of Correlation Matrix
The correlation matrix provides several advantages:
Identifies relationships between independent variables
Helps select significant and non-redundant variables
Enables data visualization through heatmaps
Note that a correlation matrix works only with numeric (continuous) variables, so text columns such as item names must be excluded first.
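Because only numeric columns can be correlated, a common first step is to filter them out of the DataFrame. The sketch below assumes a small hypothetical frame with one text column; select_dtypes keeps just the numeric columns before corr() is called.

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric columns
menu = pd.DataFrame({
    "Item Name": ["Cool Lime Refresher", "Ginger Limeade", "Iced Coffee"],
    "Calories": [45, 80, 60],
    "Carb": [11, 18, 14],
})

# Keep only numeric columns so corr() is not handed any text
numeric_only = menu.select_dtypes(include="number")
print(numeric_only.columns.tolist())
print(numeric_only.corr().round(3))
```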
Sample Dataset
We'll use a sample dataset starbucksMenu.csv with nutritional information:
| Item Name | Calories | Fat | Carb | Fiber | Protein | Sodium |
|---|---|---|---|---|---|---|
| Cool Lime Refresher | 45 | 0 | 11 | 0 | 0 | 10 |
| Ginger Limeade | 80 | 0 | 18 | 1 | 0 | 10 |
| Iced Coffee | 60 | 0 | 14 | 1 | 0 | 10 |
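If the CSV file is available locally, the rows can be traversed one line at a time and collected into columns before building the DataFrame. The following is a sketch under assumptions: the in-memory text stands in for starbucksMenu.csv and contains only the three rows shown above, and the variable names are illustrative.

```python
import csv
import io

import pandas as pd

# Stand-in for starbucksMenu.csv: only the three rows from the table above
csv_text = """Item Name,Calories,Fat,Carb,Fiber,Protein,Sodium
Cool Lime Refresher,45,0,11,0,0,10
Ginger Limeade,80,0,18,1,0,10
Iced Coffee,60,0,14,1,0,10
"""

numeric_fields = ["Calories", "Fat", "Carb", "Fiber", "Protein", "Sodium"]
columns = {name: [] for name in numeric_fields}

# With a real file, replace io.StringIO(csv_text) with
# open("starbucksMenu.csv", newline="")
with io.StringIO(csv_text) as handle:
    for row in csv.DictReader(handle):      # one dict per line of the file
        for name in numeric_fields:
            columns[name].append(float(row[name]))

menu_df = pd.DataFrame(columns)
print(menu_df)
```

With only three rows, and several columns that never change, the resulting coefficients would not be meaningful, which is why the examples that follow use a slightly larger sample.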
Creating Correlation Matrix with Sample Data
Since we can't access external files in an online environment, let's create a correlation matrix using sample data:
import pandas as pd

# Create sample data similar to Starbucks menu
data = {
    'Calories': [45, 80, 60, 0, 130, 140, 130, 80, 60, 150],
    'Fat': [0, 0, 0, 0, 2.5, 2.5, 2.5, 0, 0, 0],
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}
df = pd.DataFrame(data)
print("Sample Dataset:")
print(df.head())
Sample Dataset:
   Calories  Fat  Carb  Protein  Sodium
0        45  0.0    11        0      10
1        80  0.0    18        0      10
2        60  0.0    14        0      10
3         0  0.0     0        0       0
4       130  2.5    21        5      65
Computing Correlation Matrix
Now let's create a correlation matrix for specific numeric columns:
import pandas as pd

# Create sample data
data = {
    'Calories': [45, 80, 60, 0, 130, 140, 130, 80, 60, 150],
    'Fat': [0, 0, 0, 0, 2.5, 2.5, 2.5, 0, 0, 0],
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}
df = pd.DataFrame(data)

# Select specific columns for correlation
numeric_columns = ['Carb', 'Protein', 'Sodium']

# Create correlation matrix
correlation_matrix = df[numeric_columns].corr()
print("Correlation Matrix:")
print(correlation_matrix)
print("\nCorrelation Matrix (rounded to 2 decimal places):")
print(correlation_matrix.round(2))
Correlation Matrix:
             Carb   Protein    Sodium
Carb     1.000000  0.261594  0.363827
Protein  0.261594  1.000000  0.970159
Sodium   0.363827  0.970159  1.000000

Correlation Matrix (rounded to 2 decimal places):
         Carb  Protein  Sodium
Carb     1.00     0.26    0.36
Protein  0.26     1.00    0.97
Sodium   0.36     0.97    1.00
Visualizing with Heatmap
For a quick look that needs no plotting library, we can print a simple text-based version of the heatmap by looping over every cell of the matrix:
import pandas as pd

# Create sample data
data = {
    'Calories': [45, 80, 60, 0, 130, 140, 130, 80, 60, 150],
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}
df = pd.DataFrame(data)
numeric_columns = ['Carb', 'Protein', 'Sodium']
correlation_matrix = df[numeric_columns].corr()

# Create a simple text-based heatmap visualization
print("Correlation Heatmap (Text-based):")
print("=" * 35)
for i, row in enumerate(correlation_matrix.index):
    # Print every column pairing for this row on one line
    for j, col in enumerate(correlation_matrix.columns):
        value = correlation_matrix.iloc[i, j]
        print(f"{row}-{col}: {value:.2f}", end=" ")
    print()
Correlation Heatmap (Text-based):
===================================
Carb-Carb: 1.00 Carb-Protein: 0.26 Carb-Sodium: 0.36
Protein-Carb: 0.26 Protein-Protein: 1.00 Protein-Sodium: 0.97
Sodium-Carb: 0.36 Sodium-Protein: 0.97 Sodium-Sodium: 1.00
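For a graphical heatmap, the same matrix can be drawn with matplotlib alone. This is a minimal sketch rather than a definitive recipe: the colormap, figure size, and output file name correlation_heatmap.png are assumptions chosen for illustration, and the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Carb": [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    "Protein": [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    "Sodium": [10, 10, 10, 0, 65, 90, 65, 10, 10, 15],
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(4, 3.5))
# Fix the color scale to the full -1..1 range so colors are comparable across plots
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson r")
fig.tight_layout()
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```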
Interpreting Results
For the sample data used above, the coefficients work out as follows:
Protein-Sodium (0.97): very strong positive correlation
Carb-Sodium (0.36): weak positive correlation
Carb-Protein (0.26): weak positive correlation
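Reading pairs off the matrix by eye becomes tedious as the number of columns grows. The sketch below lists every distinct pair whose absolute coefficient reaches a chosen cutoff; the helper name strong_pairs and the 0.7 threshold are assumptions for illustration, not a fixed rule.

```python
import pandas as pd

def strong_pairs(corr, threshold=0.7):
    """Return (var_a, var_b, r) for each distinct pair with |r| >= threshold."""
    pairs = []
    cols = list(corr.columns)
    for i in range(len(cols)):
        # Start at i + 1 to skip the diagonal and the mirrored lower triangle
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

df = pd.DataFrame({
    "Carb": [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    "Protein": [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    "Sodium": [10, 10, 10, 0, 65, 90, 65, 10, 10, 15],
})
print(strong_pairs(df.corr()))
```

Pairs flagged this way are candidates for redundancy: when two features are almost perfectly correlated, keeping both rarely adds information to a model.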
Complete Implementation
import pandas as pd
import numpy as np

def create_correlation_matrix(data, columns=None):
    """
    Create and display correlation matrix for given data
    """
    df = pd.DataFrame(data)
    if columns is None:
        # Default to every numeric column in the data
        columns = df.select_dtypes(include=[np.number]).columns.tolist()
    # Calculate correlation matrix
    corr_matrix = df[columns].corr()
    print("Dataset Shape:", df.shape)
    print("\nSelected Columns:", columns)
    print("\nCorrelation Matrix:")
    print(corr_matrix.round(3))
    return corr_matrix

# Example usage
sample_data = {
    'Calories': [45, 80, 60, 130, 140, 130, 80, 60, 150, 200],
    'Carb': [11, 18, 14, 21, 23, 21, 19, 15, 38, 45],
    'Protein': [0, 0, 0, 5, 5, 5, 0, 0, 0, 8],
    'Sodium': [10, 10, 10, 65, 90, 65, 10, 10, 15, 120]
}
correlation_matrix = create_correlation_matrix(sample_data)
Dataset Shape: (10, 4)

Selected Columns: ['Calories', 'Carb', 'Protein', 'Sodium']

Correlation Matrix:
          Calories   Carb  Protein  Sodium
Calories     1.000  0.908    0.813   0.837
Carb         0.908  1.000    0.559   0.608
Protein      0.813  0.559    1.000   0.984
Sodium       0.837  0.608    0.984   1.000
Conclusion
Correlation matrices help identify relationships between numeric variables, and the Pandas corr() method computes them in a single call. As a rule of thumb, coefficients with an absolute value above about 0.7 indicate variables that move closely together, which is useful for spotting redundant features during feature selection in machine learning and data analysis.
