How to create a Correlation Matrix in Python by traversing through each line?


A correlation matrix is a table of correlation coefficients for a set of variables. Each cell in the table holds the correlation between two variables, and its value ranges from -1 to 1. A correlation matrix is used to summarize data, as a diagnostic for more advanced analyses, and as an input to more complicated studies.

A correlation matrix represents the relationships between the variables in a data set. It helps programmers analyze how the data components relate to one another. Each entry is a correlation coefficient between -1 and 1.

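A positive value means the two variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value of zero (0) indicates no linear dependency between the two variables. The closer the value is to -1 or 1, the stronger the relationship.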
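To make the sign of the coefficient concrete, here is a minimal sketch using a small made-up dataframe (the column names and values are hypothetical and are not part of the article's dataset). One column rises with x, one falls with x, and one has no linear trend against x.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the sign of the coefficient
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [2, 4, 6, 8, 10],  # increases in step with x
    "falls_with_x": [10, 8, 6, 4, 2],  # decreases in step with x
    "unrelated": [1, -1, 0, -1, 1],    # no linear trend against x
})

# Pairwise correlation of every column with every other column
toy_corr = toy.corr()
print(toy_corr.round(2))
```

In the x row of the printed matrix, the entry for rises_with_x is 1, the entry for falls_with_x is -1, and the entry for unrelated is 0 (up to floating-point rounding).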

In regression analysis, a correlation matrix serves the following purposes −

  • It reveals the relationships between the independent variables in the data set.

  • It helps in selecting significant and non-redundant variables from a data set.

  • It applies only to variables that are numeric or continuous.
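As a rough illustration of the second point, the sketch below lists column pairs whose absolute correlation reaches a threshold, which marks them as candidates for redundancy. The helper name highly_correlated_pairs, the 0.9 threshold, and the toy data are all assumptions made for this example, not part of pandas or of the article's dataset.

```python
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Visit each pair once by scanning only the part above the diagonal
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Hypothetical toy data: b is an exact multiple of a, c is not
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
redundant = highly_correlated_pairs(demo)
print(redundant)
```

Here only the pair (a, b) is reported, so one of the two columns could be dropped without losing information. The right threshold depends on the data and the model, so 0.9 is only a starting point.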

In this article, we will show you how to create a correlation matrix using Python.

Assume we have a CSV file named starbucksMenu.csv containing some sample data. We need to create a correlation matrix for the specified columns of the dataset and plot it.

Input File Data

starbucksMenu.csv

Item Name                                Calories  Fat  Carb  Fiber  Protein  Sodium
Cool Lime Starbucks Refreshers™          45        0    11    0      0        10
Evolution Fresh™ Organic Ginger Limeade  80        0    18    1      0        10
Iced Coffee                              60        0    14    1      0        10
Tazo® Bottled Berry Blossom White        0         0    0     0      0        0
Tazo® Bottled Brambleberry               130       2.5  21    0      5        65
Tazo® Bottled Giant Peach                140       2.5  23    0      5        90
Tazo® Bottled Iced Passion               130       2.5  21    0      5        65
Tazo® Bottled Plum Pomegranate           80        0    19    0      0        10
Tazo® Bottled Tazoberry                  60        0    15    0      0        10
Tazo® Bottled White Cranberry            150       0    38    0      0        15
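If you do not have the file at hand, the rows listed above can be loaded into a dataframe directly. This is a sketch that assumes the file is comma-separated with the header names shown above; the actual layout of starbucksMenu.csv is not shown in this article, and the variable names csv_text and menu are ours.

```python
import io

import pandas as pd

# Assumed comma-separated layout, typed from the listing above
csv_text = """Item Name,Calories,Fat,Carb,Fiber,Protein,Sodium
Cool Lime Starbucks Refreshers™,45,0,11,0,0,10
Evolution Fresh™ Organic Ginger Limeade,80,0,18,1,0,10
Iced Coffee,60,0,14,1,0,10
Tazo® Bottled Berry Blossom White,0,0,0,0,0,0
Tazo® Bottled Brambleberry,130,2.5,21,0,5,65
Tazo® Bottled Giant Peach,140,2.5,23,0,5,90
Tazo® Bottled Iced Passion,130,2.5,21,0,5,65
Tazo® Bottled Plum Pomegranate,80,0,19,0,0,10
Tazo® Bottled Tazoberry,60,0,15,0,0,10
Tazo® Bottled White Cranberry,150,0,38,0,0,15
"""

# read_csv accepts any file-like object, so no file on disk is needed
menu = pd.read_csv(io.StringIO(csv_text))
print(menu.dtypes)

# Optional: write the file so that code reading 'starbucksMenu.csv' can find it
# menu.to_csv("starbucksMenu.csv", index=False)
```

Printing the dtypes is a quick check that every column except Item Name was parsed as a number, which is what a correlation calculation needs.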

Creating a Correlation Matrix

We will create and plot the correlation matrix for the following three columns of the dataset, which are independent continuous variables −

  • Carb
  • Protein
  • Sodium

Algorithm (Steps)

Following are the steps to perform the desired task −

  • Import the os, pandas, NumPy, and seaborn libraries.

  • Read the given CSV file using the read_csv() function (it loads a CSV file as a pandas dataframe).

  • Create the list of columns from the given dataset for which the correlation matrix must be created.

  • Create a correlation matrix using the corr() function (it calculates the pairwise correlation of the columns in a dataframe; NA/null values are excluded automatically, and only numeric columns can take part in the calculation).

  • Print the correlation matrix of the specified columns of the dataset.

  • Plot the correlation matrix using the heatmap() function of the seaborn library (a heatmap draws each value as a colored cell, with the shade indicating the magnitude of the value; which shades represent higher values depends on the color map in use).
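The notes on corr() in the steps above can be checked with a small sketch. The dataframe below is hypothetical and contains a text column and a missing value; selecting the numeric columns first with select_dtypes() is one way, assumed here, to keep the text column out of the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a text column and one missing value
drinks = pd.DataFrame({
    "name": ["a", "b", "c", "d"],        # text column, not usable for correlation
    "carb": [10.0, 20.0, np.nan, 40.0],  # third value is missing
    "sodium": [5.0, 10.0, 15.0, 20.0],
})

# Keep only the numeric columns before computing the correlation
numeric_part = drinks.select_dtypes(include="number")

# corr() skips the row where carb is missing when pairing carb with sodium
corr_with_gaps = numeric_part.corr()
print(corr_with_gaps)
```

The remaining three rows lie on a straight line, so the carb/sodium entry comes out as 1 even though one carb value is missing.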

Importing the Dataset into a Pandas Dataframe

First, we import a sample dataset (here, starbucksMenu.csv) into a pandas dataframe and print it.

Example 1

# Import the pandas module as pd
import pandas as pd

# Read the dataset
givenDataset = pd.read_csv('starbucksMenu.csv')
print(givenDataset)

Output

Item Name                                Calories  Fat  Carb  Fiber  Protein  Sodium
Cool Lime Starbucks Refreshers™          45        0    11    0      0        10
Evolution Fresh™ Organic Ginger Limeade  80        0    18    1      0        10
Iced Coffee                              60        0    14    1      0        10
Tazo® Bottled Berry Blossom White        0         0    0     0      0        0
Tazo® Bottled Brambleberry               130       2.5  21    0      5        65
Tazo® Bottled Giant Peach                140       2.5  23    0      5        90
Tazo® Bottled Iced Passion               130       2.5  21    0      5        65
Tazo® Bottled Plum Pomegranate           80        0    19    0      0        10
Tazo® Bottled Tazoberry                  60        0    15    0      0        10
Tazo® Bottled White Cranberry            150       0    38    0      0        15

Creating correlation matrix after importing the dataset

The following program creates a correlation matrix for the given dataset, prints it, and plots it −

Example 2

import os
# Importing the pandas, NumPy, and seaborn modules
import pandas as pd
import numpy as np
import seaborn
# matplotlib is used only to display the plot when running as a script
import matplotlib.pyplot as plt

# Reading a dataset
givenDataset = pd.read_csv('starbucksMenu.csv')

# Assigning the list of columns from the dataset
numericColumns = ['Carb', 'Protein', 'Sodium']

# Creating a correlation matrix
correlationMatrix = givenDataset.loc[:, numericColumns].corr()

# Printing the correlation matrix
print(correlationMatrix)

# Displaying the correlation matrix as an annotated heatmap
seaborn.heatmap(correlationMatrix, annot=True)
plt.show()

Output

On execution, the above program prints the 3 x 3 correlation matrix for the Carb, Protein, and Sodium columns and displays it as an annotated heatmap.
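A correlation matrix is symmetric, so every pair appears twice. The sketch below hides the repeated upper half so that each pair appears once. It uses hypothetical toy data, and the names upper and lower_only are ours.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to get a small correlation matrix
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
corr = demo.corr()

# Boolean array that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Replace the repeated upper half with NaN so each pair appears once
lower_only = corr.mask(upper)
print(lower_only.round(2))

# The same boolean array can be passed to seaborn's heatmap through its
# mask parameter to leave those cells blank in the plot:
# seaborn.heatmap(corr, annot=True, mask=upper)
```

Hiding half of the cells changes only how the matrix is presented. The underlying corr dataframe is left untouched.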

In this tutorial, you learned how to compute a correlation matrix in Python with the pandas corr() method and how to display it with the seaborn library's heatmap() function, which makes the relationships in the data easier to see at a glance.

Updated on: 10-Aug-2022
