How to create Correlation Matrix in Python by traversing through each line?

A correlation matrix is a table containing correlation coefficients between multiple variables. Each cell represents the correlation between two variables, with values ranging from -1 to 1. It's essential for data analysis, helping identify relationships between variables and selecting significant features for machine learning models.

Correlation values have specific meanings:

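Positive values (0 to 1): the variables tend to increase together; the closer to 1, the stronger the positive correlation
Negative values (-1 to 0): one variable tends to decrease as the other increases; the closer to -1, the stronger the negative correlation
Zero (0): no linear relationship between the variables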
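
To make this scale concrete, the Pearson coefficient can be computed by hand. The following is a minimal sketch for illustration, not a description of how pandas computes it internally; the helper name pearson_r and the small value lists are made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (hypothetical helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable never changes
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # grows with x: coefficient close to 1
falling = [10, 8, 6, 4, 2]  # shrinks as x grows: coefficient close to -1
peaked = [1, 3, 5, 3, 1]    # clearly related to x, but not linearly: close to 0

for name, ys in [("rising", rising), ("falling", falling), ("peaked", peaked)]:
    print(f"{name}: {pearson_r(x, ys):.2f}")
```

The last series shows why a value near zero only rules out a linear relationship: the values clearly depend on x, yet the coefficient is approximately zero.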

Benefits of Correlation Matrix

The correlation matrix provides several advantages:

Identifies relationships between independent variables
Helps select significant and non-redundant variables
Summarizes all numeric variables in one table (non-numeric columns must be excluded or encoded first)
Enables data visualization through heatmaps

Sample Dataset

We'll use a sample dataset starbucksMenu.csv with nutritional information:

Item Name	Calories	Carb	Fiber	Sodium
Cool Lime Refresher	45	11	0	10
Ginger Limeade	80	18	1	10
Iced Coffee	60	14	1	10
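
The three rows shown above can also be read by traversing the file one line at a time. The sketch below assumes the file is comma-separated and uses an in-memory string as a stand-in for starbucksMenu.csv, since the real file's delimiter and full contents are not shown here; the variable names are illustrative.

```python
import csv
import io

import pandas as pd

# Stand-in for starbucksMenu.csv (assumption: comma-separated, same header names)
csv_text = """Item Name,Calories,Carb,Fiber,Sodium
Cool Lime Refresher,45,11,0,10
Ginger Limeade,80,18,1,10
Iced Coffee,60,14,1,10
"""

# Sodium is identical in all three rows, so its correlation is undefined; leave it out
wanted = ["Calories", "Carb", "Fiber"]

rows = []
reader = csv.DictReader(io.StringIO(csv_text))
for line in reader:  # one record per line of the file
    rows.append({name: float(line[name]) for name in wanted})

menu_df = pd.DataFrame(rows, columns=wanted)
menu_corr = menu_df.corr()
print(menu_corr.round(2))
```

With a real file, `open("starbucksMenu.csv", newline="")` would replace the `io.StringIO` object. Three rows are far too few for a meaningful analysis; they only demonstrate the mechanics.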

Creating Correlation Matrix with Sample Data

Since we can't access external files in an online environment, let's first build a small DataFrame of sample data to work with:

import pandas as pd
import numpy as np

# Create sample data similar to Starbucks menu
data = {
    'Calories': [45, 80, 60, 0, 130, 140, 130, 80, 60, 150],
    'Fat': [0, 0, 0, 0, 2.5, 2.5, 2.5, 0, 0, 0],
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}

df = pd.DataFrame(data)
print("Sample Dataset:")
print(df.head())

Sample Dataset:
   Calories  Fat  Carb  Protein  Sodium
0        45  0.0    11        0      10
1        80  0.0    18        0      10
2        60  0.0    14        0      10
3         0  0.0     0        0       0
4       130  2.5    21        5      65

Computing Correlation Matrix

Now let's create a correlation matrix for specific numeric columns:

import pandas as pd
import numpy as np

# Create sample data
data = {
    'Calories': [45, 80, 60, 0, 130, 140, 130, 80, 60, 150],
    'Fat': [0, 0, 0, 0, 2.5, 2.5, 2.5, 0, 0, 0],
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}

df = pd.DataFrame(data)

# Select specific columns for correlation
numeric_columns = ['Carb', 'Protein', 'Sodium']

# Create correlation matrix
correlation_matrix = df[numeric_columns].corr()

print("Correlation Matrix:")
print(correlation_matrix)
print("\nCorrelation Matrix (rounded to 2 decimal places):")
print(correlation_matrix.round(2))

Correlation Matrix:
             Carb   Protein    Sodium
Carb     1.000000  0.261594  0.363827
Protein  0.261594  1.000000  0.970159
Sodium   0.363827  0.970159  1.000000

Correlation Matrix (rounded to 2 decimal places):
         Carb  Protein  Sodium
Carb     1.00     0.26    0.36
Protein  0.26     1.00    0.97
Sodium   0.36     0.97    1.00
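
As a sanity check, the same coefficients can be recomputed with NumPy and compared against the pandas result. This is a sketch under one stated assumption about the data layout: np.corrcoef treats each row as a variable by default, so rowvar=False is passed to make it treat each column as a variable instead.

```python
import numpy as np
import pandas as pd

# Same three columns as in the example above
data = {
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}
df = pd.DataFrame(data)

pandas_corr = df.corr()                                 # labelled DataFrame
numpy_corr = np.corrcoef(df.to_numpy(), rowvar=False)   # plain 3x3 array

# True when both libraries agree to within floating-point tolerance
print(np.allclose(pandas_corr.to_numpy(), numpy_corr))
```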

Visualizing with Heatmap

A graphical heatmap needs a plotting library such as matplotlib or seaborn. Where none is available, a plain-text view gives the same information by looping over every cell of the matrix:

import pandas as pd

# Create sample data
data = {
    'Calories': [45, 80, 60, 0, 130, 140, 130, 80, 60, 150],
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}

df = pd.DataFrame(data)
numeric_columns = ['Carb', 'Protein', 'Sodium']
correlation_matrix = df[numeric_columns].corr()

# Create a simple text-based heatmap visualization
print("Correlation Heatmap (Text-based):")
print("=" * 35)
for i, row in enumerate(correlation_matrix.index):
    for j, col in enumerate(correlation_matrix.columns):
        value = correlation_matrix.iloc[i, j]
        print(f"{row}-{col}: {value:.2f}", end="  ")
    print()

Correlation Heatmap (Text-based):
===================================
Carb-Carb: 1.00  Carb-Protein: 0.26  Carb-Sodium: 0.36  
Protein-Carb: 0.26  Protein-Protein: 1.00  Protein-Sodium: 0.97  
Sodium-Carb: 0.36  Sodium-Protein: 0.97  Sodium-Sodium: 1.00  
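
Where matplotlib is installed, the same matrix can be drawn as a colored grid. The following is a minimal sketch, assuming a headless environment (hence the Agg backend); the color map, figure size, and title are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = {
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}
corr = pd.DataFrame(data).corr()

fig, ax = plt.subplots(figsize=(4, 3.5))
# Fix the color scale to the full range of a correlation coefficient
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label both axes with the variable names
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson r")
ax.set_title("Correlation heatmap")
fig.tight_layout()
```

Calling plt.show() in an interactive session, or fig.savefig() with a file name of your choice, would display or save the figure.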

Interpreting Results

From our correlation matrix, we can observe:

Protein-Sodium (0.97): very strong positive correlation
Carb-Sodium (0.36): weak to moderate positive correlation
Carb-Protein (0.26): weak positive correlation
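
This kind of reading can be automated by listing each pair of variables once, ordered by the strength of the relationship. The sketch below is one possible approach; the helper describe_strength is hypothetical, and apart from the 0.7 "strong" cutoff used in this article, the 0.3 cutoff is an assumed convention rather than a fixed rule.

```python
from itertools import combinations

import pandas as pd

data = {
    'Carb': [11, 18, 14, 0, 21, 23, 21, 19, 15, 38],
    'Protein': [0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
    'Sodium': [10, 10, 10, 0, 65, 90, 65, 10, 10, 15]
}
corr = pd.DataFrame(data).corr()


def describe_strength(r):
    """Rough verbal label for a coefficient (cutoffs are assumptions)."""
    if r == 0:
        return "no linear relationship"
    size = abs(r)
    if size >= 0.7:
        label = "strong"
    elif size >= 0.3:
        label = "moderate"
    else:
        label = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"


# combinations() yields each unordered pair once and skips the diagonal
pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)

for a, b, r in pairs:
    print(f"{a}-{b}: {r:.2f} ({describe_strength(r)})")
```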

Complete Implementation

import pandas as pd
import numpy as np

def create_correlation_matrix(data, columns=None):
    """
    Create and display correlation matrix for given data
    """
    df = pd.DataFrame(data)
    
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns.tolist()
    
    # Calculate correlation matrix
    corr_matrix = df[columns].corr()
    
    print("Dataset Shape:", df.shape)
    print("\nSelected Columns:", columns)
    print("\nCorrelation Matrix:")
    print(corr_matrix.round(3))
    
    return corr_matrix

# Example usage
sample_data = {
    'Calories': [45, 80, 60, 130, 140, 130, 80, 60, 150, 200],
    'Carb': [11, 18, 14, 21, 23, 21, 19, 15, 38, 45],
    'Protein': [0, 0, 0, 5, 5, 5, 0, 0, 0, 8],
    'Sodium': [10, 10, 10, 65, 90, 65, 10, 10, 15, 120]
}

correlation_matrix = create_correlation_matrix(sample_data)

Dataset Shape: (10, 4)

Selected Columns: ['Calories', 'Carb', 'Protein', 'Sodium']

Correlation Matrix:
          Calories   Carb  Protein  Sodium
Calories     1.000  0.908    0.813   0.837
Carb         0.908  1.000    0.559   0.608
Protein      0.813  0.559    1.000   0.984
Sodium       0.837  0.608    0.984   1.000
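
One way such a matrix can feed into feature selection is to drop one variable from every highly correlated pair. The following is a rough sketch of that idea, not a standard library routine: the function drop_redundant is hypothetical, the 0.9 cutoff is an arbitrary assumption, and keeping the first column of each pair is a simplification.

```python
import pandas as pd


def drop_redundant(df, threshold=0.9):
    """Greedily drop the later column of any pair whose |r| exceeds threshold."""
    corr = df.corr().abs()
    columns = list(corr.columns)
    dropped = []
    for i, kept in enumerate(columns):
        if kept in dropped:
            continue  # already removed, so it cannot justify removing others
        for other in columns[i + 1:]:
            if other not in dropped and corr.loc[kept, other] > threshold:
                dropped.append(other)
    return df.drop(columns=dropped), dropped


# Same values as the example usage above
sample_data = {
    'Calories': [45, 80, 60, 130, 140, 130, 80, 60, 150, 200],
    'Carb': [11, 18, 14, 21, 23, 21, 19, 15, 38, 45],
    'Protein': [0, 0, 0, 5, 5, 5, 0, 0, 0, 8],
    'Sodium': [10, 10, 10, 65, 90, 65, 10, 10, 15, 120]
}

reduced_df, dropped = drop_redundant(pd.DataFrame(sample_data), threshold=0.9)
print("Dropped:", dropped)
print("Kept:", list(reduced_df.columns))
```

Which column of a pair to keep is a modeling decision; in practice it usually depends on domain knowledge or on how each variable relates to the target.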

Conclusion

Correlation matrices help identify relationships between numeric variables using the pandas corr() method. Strong correlations (absolute value above 0.7) indicate variables that move together, either in the same or in opposite directions, which is crucial for feature selection in machine learning and data analysis.

Vikram Chiluka

Updated on: 2026-03-26T21:19:41+05:30