How to find the correlation matrix for a data frame that contains missing values in R?


To find the correlation matrix for a data frame, we can pass the data frame to the cor function. If the data frame contains missing values, however, this is not so straightforward: with its default setting (use="everything"), cor returns NA for every coefficient that involves a column containing an NA. In such situations, we can pass use="complete.obs" to cor, which drops every row that has at least one missing value before the correlation coefficients are calculated.
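As a minimal sketch of this behaviour, the snippet below uses a small hypothetical data frame (`demo`, with made-up values that are not part of the examples in this article) to compare the default call with `use = "complete.obs"`, and to check that the latter matches removing incomplete rows by hand with `na.omit()`.

```r
# Hypothetical data frame with missing values (illustrative values only)
demo <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(2, 4, 6, NA, 10),
  c = c(5, 3, 4, 2, 1)
)

# Default (use = "everything"): off-diagonal entries that involve
# a column containing NA come out as NA
default_cor <- cor(demo)

# use = "complete.obs": rows containing any NA are dropped first
complete_cor <- cor(demo, use = "complete.obs")

# The same result obtained by removing incomplete rows manually
manual_cor <- cor(na.omit(demo))

print(default_cor)
print(complete_cor)
print(all.equal(complete_cor, manual_cor))
```

Note that use="complete.obs" raises an error if no complete rows remain, so it is worth checking how many rows survive before relying on the result.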

Example 1

Consider the following data frame:

> x1<-sample(c(NA,24,5,7,8),20,replace=TRUE)
> x2<-sample(c(NA,2,3,1,4,7),20,replace=TRUE)
> x3<-sample(c(NA,512,520,530),20,replace=TRUE)
> df1<-data.frame(x1,x2,x3)
> df1

Output

   x1 x2  x3
1  NA  3 512
2   8  7 512
3   5  2 520
4  NA  1  NA
5  NA  2 512
6  NA  4  NA
7   5 NA 530
8  NA NA 530
9  24  3  NA
10 NA  1 512
11  5  2 530
12 NA  7 520
13  5  1  NA
14  8  3 530
15  7  1  NA
16  7  4 530
17  7  3 512
18  5  2 530
19  7  3 530
20 NA  1 512

Finding the correlation matrix for df1:

Example

> cor(df1,use="complete.obs",method="pearson")

Output

           x1         x2         x3
x1  1.0000000  0.7190925 -0.2756960
x2  0.7190925  1.0000000 -0.5200868
x3 -0.2756960 -0.5200868  1.0000000
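The matrix above is based only on the rows of df1 that have no NA in any column. Because the sample() calls are random and no seed is set, the sketch below types in the values from the printed df1 output (under the assumed name `df1_printed`, which is not in the original) and uses complete.cases() to count how many rows are actually used.

```r
# Values copied from the printed df1 output above; the original
# sample() calls are random, so they are entered by hand here
df1_printed <- data.frame(
  x1 = c(NA, 8, 5, NA, NA, NA, 5, NA, 24, NA, 5, NA, 5, 8, 7, 7, 7, 5, 7, NA),
  x2 = c(3, 7, 2, 1, 2, 4, NA, NA, 3, 1, 2, 7, 1, 3, 1, 4, 3, 2, 3, 1),
  x3 = c(512, 512, 520, NA, 512, NA, 530, 530, NA, 512,
         530, 520, NA, 530, NA, 530, 512, 530, 530, 512)
)

# complete.cases() flags rows with no missing value in any column
n_complete <- sum(complete.cases(df1_printed))
print(n_complete)

# Correlation matrix computed from those complete rows only;
# it should agree with the matrix printed above
cm <- cor(df1_printed, use = "complete.obs", method = "pearson")
print(cm)
```

Only 8 of the 20 rows are complete here, so each coefficient rests on a much smaller sample than the size of the data frame suggests.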

Example 2

> y1<-sample(c(NA,rnorm(5,5,1)),20,replace=TRUE)
> y2<-sample(c(NA,rnorm(5,2,1)),20,replace=TRUE)
> y3<-sample(c(NA,rnorm(10,10,1)),20,replace=TRUE)
> y4<-sample(c(NA,rnorm(10,5,2.5)),20,replace=TRUE)
> df2<-data.frame(y1,y2,y3,y4)
> df2

Output

         y1       y2        y3        y4
1        NA 2.955947        NA 2.8623715
2        NA 3.087940  9.099791 4.5996351
3        NA 3.087940  9.589898 5.6097088
4  3.500343 1.150117 10.985979        NA
5  4.831364 3.087940 10.107124        NA
6  7.041597 1.840461  9.416738 2.8601661
7        NA 2.212388 10.453622 5.0717510
8  4.831364 3.087940 10.928925 6.3030777
9  7.041597       NA  9.099791 5.2709332
10 4.831364 2.212388        NA 2.6219274
11 4.831364 2.212388 10.928925 6.3030777
12 3.500343       NA  8.779948 6.3030777
13 4.772150 1.840461  9.589898 5.2709332
14 7.041597 2.955947 10.453622 5.5989568
15       NA 2.955947  9.827149 5.5989568
16 7.041597 1.840461  9.099791 5.5989568
17 3.500343 2.212388  8.779948 4.5996351
18 4.772150 2.212388 10.985979        NA
19       NA 2.955947 10.453622 0.3151969
20 4.772150 1.150117  9.099791 6.3030777

Finding the correlation matrix for df2:

Example

> cor(df2,use="complete.obs",method="pearson")

Output

            y1         y2         y3         y4
y1  1.00000000 0.07343574 0.06408734 -0.3103069
y2  0.07343574 1.00000000 0.70344970  0.1674528
y3  0.06408734 0.70344970 1.00000000  0.4544444
y4 -0.31030689 0.16745277 0.45444435  1.0000000
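When many rows contain an NA, use="complete.obs" can discard a large share of the data. The cor function also accepts use="pairwise.complete.obs", which computes each coefficient from all rows that are complete for that particular pair of columns. The sketch below compares the two on a hypothetical data frame (`demo`, with illustrative values not taken from this article); which option is appropriate depends on the analysis, and a pairwise matrix is not guaranteed to be positive semi-definite.

```r
# Hypothetical data frame; values are illustrative only
demo <- data.frame(
  p = c(1, 2, 3, 4, NA, 6),
  q = c(2, 1, 4, 3, 6, NA),
  r = c(NA, 5, 4, 6, 2, 1)
)

# Listwise deletion: only rows with no NA in any column are used
listwise <- cor(demo, use = "complete.obs")

# Pairwise deletion: each coefficient uses every row that is
# complete for the two columns involved
pairwise <- cor(demo, use = "pairwise.complete.obs")

# Number of rows available for each pair of columns
n_pairs <- crossprod(!is.na(demo))

print(listwise)
print(pairwise)
print(n_pairs)
```

In this sketch the listwise matrix uses 3 rows for every coefficient, while each off-diagonal pairwise coefficient uses 4.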

Updated on: 19-Nov-2020
