How to Create a Correlation Matrix using Pandas?


Correlation analysis is a core technique in data analysis for identifying relationships between the variables in a dataset. A correlation matrix is a table of the correlation coefficients for every pair of variables. It summarizes the underlying patterns in the data at a glance and is widely used in fields such as finance, economics, the social sciences, and engineering.

In this tutorial, we will explore how to create a correlation matrix using Pandas, a popular data manipulation library in Python.

To generate a correlation matrix with pandas, follow these steps −

  • Acquire the data

  • Construct a pandas DataFrame

  • Produce a correlation matrix using pandas
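The three steps above can be sketched end to end as follows. This is a minimal illustration, assuming the data arrives as CSV text; the column names and values are made up for the example, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Step 1: acquire the data. The CSV text below is hypothetical sample data;
# in practice this could be a file path passed to pd.read_csv().
csv_text = """height_cm,weight_kg,age
150,52,31
160,58,45
170,69,27
180,77,52
"""

# Step 2: construct a pandas DataFrame from the data.
measurements = pd.read_csv(io.StringIO(csv_text))

# Step 3: produce the correlation matrix (Pearson by default).
matrix = measurements.corr()
print(matrix)
```

The result is a square table with one row and one column per variable in the DataFrame.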

Example

Let's work through a couple of examples to see how to create correlation matrices with pandas.

The first example builds a correlation matrix from a small dataset containing three variables, Sales, Expenses, and Profit, each recorded for three time periods. The code creates a pandas DataFrame from the data and then calls the DataFrame's .corr() method to compute the correlation matrix.

The correlation coefficients between Sales and Expenses and between Sales and Profit are then extracted and displayed along with the matrix. By default, .corr() computes the Pearson correlation coefficient, which measures the strength of the linear relationship between two variables: a value of "1" represents a perfect positive correlation, "-1" a perfect negative correlation, and "0" no linear correlation.

Consider the code shown below.

# Import the pandas library
import pandas as pd

# Create a dictionary containing the data to be used in the correlation analysis 
data = {
   'Sales': [25, 36, 12], # Values for sales in three different time periods
   'Expenses': [30, 25, 20], # Values for expenses in the same time periods
   'Profit': [15, 20, 10] # Values for profit in the same time periods
}

# Create a pandas DataFrame using the dictionary
sales_data = pd.DataFrame(data)

# Use the DataFrame to create a correlation matrix
correlation_matrix = sales_data.corr()

# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

# Get the correlation coefficient between Sales and Expenses
sales_expenses_correlation = correlation_matrix.loc['Sales', 'Expenses']

# Get the correlation coefficient between Sales and Profit
sales_profit_correlation = correlation_matrix.loc['Sales', 'Profit']

# Display the correlation coefficients
print("Correlation Coefficients:")
print(f"Sales and Expenses: {sales_expenses_correlation:.2f}")
print(f"Sales and Profit: {sales_profit_correlation:.2f}") 

Output

On execution, you will get the following output −

Correlation Matrix:
             Sales  Expenses    Profit
Sales     1.000000  0.541041  0.998845
Expenses  0.541041  1.000000  0.500000
Profit    0.998845  0.500000  1.000000
Correlation Coefficients:
Sales and Expenses: 0.54
Sales and Profit: 1.00

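The values on the diagonal represent the correlation of each variable with itself, so they are always 1. Note that the Sales and Profit coefficient, 0.998845, is printed as 1.00 only because it is rounded to two decimal places; the correlation is very strong but not perfect.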
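To see where a value such as 0.541041 comes from, the coefficient can be recomputed by hand. The sketch below assumes the default Pearson method, which divides the sum of the products of the deviations from the mean by the square root of the product of the summed squared deviations. The helper name `pearson` is illustrative and not part of pandas.

```python
import math

import pandas as pd


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of the deviations from each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


sales = [25, 36, 12]
expenses = [30, 25, 20]

manual = pearson(sales, expenses)
from_pandas = (
    pd.DataFrame({'Sales': sales, 'Expenses': expenses})
    .corr()
    .loc['Sales', 'Expenses']
)

print(f"Manual: {manual:.6f}")
print(f"pandas: {from_pandas:.6f}")
```

Both lines should print 0.541041, the same Sales and Expenses value that appears in the matrix.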

Example

Let's explore one more example. Here we create a simple DataFrame with three columns and three rows, call the .corr() method on it to calculate the correlation matrix, and print the result to the console. Because each column increases by exactly 1 from row to row, every pair of columns is perfectly linearly related, so every coefficient in the output is 1.0.

Consider the code shown below.

# Import the pandas library
import pandas as pd

# Create a sample data frame
data = {
   'A': [1, 2, 3],
   'B': [4, 5, 6],
   'C': [7, 8, 9]
}
df = pd.DataFrame(data)

# Create the correlation matrix
corr_matrix = df.corr()

# Display the correlation matrix
print(corr_matrix) 

Output

On execution, you will get the following output −

     A    B    C
A  1.0  1.0  1.0
B  1.0  1.0  1.0
C  1.0  1.0  1.0 
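The example above produces only 1.0 because every column rises in lockstep with the others. As a contrast, here is a small sketch with made-up columns that shows the other parts of the scale: a perfectly negative relationship, and a nonlinear relationship whose Pearson correlation is zero.

```python
import pandas as pd

# Hypothetical data chosen to illustrate the range of the coefficient
df = pd.DataFrame({
    'x': [-2, -1, 0, 1, 2],
    'neg_x': [2, 1, 0, -1, -2],      # falls exactly as x rises
    'x_squared': [4, 1, 0, 1, 4],    # depends on x, but not linearly
})

corr = df.corr()

# round() keeps the printout compact
print(corr.round(2))
```

Here x and neg_x correlate at -1, while x and x_squared correlate at 0 even though one is fully determined by the other. This is why a coefficient of 0 is best read as "no linear relationship" rather than "no relationship at all".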

Conclusion

Creating a correlation matrix with pandas is straightforward: build a DataFrame from the data, then call its .corr() method. The resulting matrix shows how each pair of variables is related, with the diagonal holding the correlation of each variable with itself, which is always 1.

The correlation coefficients range from -1 to 1: values closer to -1 or 1 indicate a stronger linear relationship, while values closer to 0 indicate a weaker one or none at all. Correlation matrices are useful in a wide range of applications, such as data analysis, finance, and machine learning.
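Because strength is judged by distance from 0, it can help to rank variable pairs by the absolute value of their coefficient. The following is one possible sketch, reusing the Sales, Expenses, and Profit data from the first example; the variable names are illustrative.

```python
from itertools import combinations

import pandas as pd

sales_data = pd.DataFrame({
    'Sales': [25, 36, 12],
    'Expenses': [30, 25, 20],
    'Profit': [15, 20, 10],
})
correlation_matrix = sales_data.corr()

# Collect each unordered pair of distinct columns with its coefficient
pairs = [
    (a, b, correlation_matrix.loc[a, b])
    for a, b in combinations(correlation_matrix.columns, 2)
]

# Sort by absolute value so strong negative correlations
# rank alongside strong positive ones
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for a, b, r in pairs:
    print(f"{a} and {b}: {r:.2f}")
```

With this data, Sales and Profit rank first, followed by Sales and Expenses, then Expenses and Profit.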

Updated on: 20-Apr-2023
