How to find the correlation matrix for a data frame that contains missing values in R?
To find the correlation matrix for a data frame, we can pass the data frame to the cor function. If the data frame contains missing values, however, this is not so straightforward: by default, cor returns NA for every pair of columns in which a missing value occurs. In such situations, we can pass use="complete.obs" to cor, so that rows containing any missing value are dropped before the correlation coefficients are calculated.
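The difference between the default behaviour and use="complete.obs" can be seen on a tiny data frame. This is only a minimal sketch: the data frame d and its columns a, b and c are hypothetical and were chosen so that the result is easy to check by hand.

```r
# Hypothetical toy data (not from the article): a and b each have one NA
d <- data.frame(
  a = c(1, 2, NA, 4, 5, 6),
  b = c(2, 4, 6, NA, 10, 12),
  c = c(6, 5, 4, 3, 2, 1)
)

# Default (use = "everything"): a coefficient between a column that
# contains NA and another column comes back as NA
m_default <- cor(d)

# use = "complete.obs": rows 3 and 4 are dropped, and the coefficients
# are computed from the four remaining complete rows
m_complete <- cor(d, use = "complete.obs")

print(m_default)
print(m_complete)
```

In the complete rows, b is exactly twice a and c decreases as a increases by the same step, so the second matrix should contain only 1 and -1.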
Example 1
Consider the below data frame (the values are drawn at random with sample, so your output will differ):
> x1<-sample(c(NA,24,5,7,8),20,replace=TRUE)
> x2<-sample(c(NA,2,3,1,4,7),20,replace=TRUE)
> x3<-sample(c(NA,512,520,530),20,replace=TRUE)
> df1<-data.frame(x1,x2,x3)
> df1
Output
   x1 x2  x3
1  NA  3 512
2   8  7 512
3   5  2 520
4  NA  1  NA
5  NA  2 512
6  NA  4  NA
7   5 NA 530
8  NA NA 530
9  24  3  NA
10 NA  1 512
11  5  2 530
12 NA  7 520
13  5  1  NA
14  8  3 530
15  7  1  NA
16  7  4 530
17  7  3 512
18  5  2 530
19  7  3 530
20 NA  1 512
Finding the correlation matrix for df1:
Example
> cor(df1,use="complete.obs",method="pearson")
Output
           x1         x2         x3
x1  1.0000000  0.7190925 -0.2756960
x2  0.7190925  1.0000000 -0.5200868
x3 -0.2756960 -0.5200868  1.0000000
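As a cross-check on the result above, the sketch below retypes df1 from the printed output (the original call used sample, so it cannot be regenerated) and compares use="complete.obs" with dropping incomplete rows explicitly via na.omit. The variable names n_complete, m_complete and m_na_omit are introduced here only for illustration.

```r
# df1 retyped from the printed output above
x1 <- c(NA, 8, 5, NA, NA, NA, 5, NA, 24, NA, 5, NA, 5, 8, 7, 7, 7, 5, 7, NA)
x2 <- c(3, 7, 2, 1, 2, 4, NA, NA, 3, 1, 2, 7, 1, 3, 1, 4, 3, 2, 3, 1)
x3 <- c(512, 512, 520, NA, 512, NA, 530, 530, NA, 512,
        530, 520, NA, 530, NA, 530, 512, 530, 530, 512)
df1 <- data.frame(x1, x2, x3)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(df1))

# Correlation on complete rows, computed in two ways
m_complete <- cor(df1, use = "complete.obs", method = "pearson")
m_na_omit  <- cor(na.omit(df1), method = "pearson")

print(n_complete)
print(all.equal(m_complete, m_na_omit))
```

Only 8 of the 20 rows are complete, so every coefficient in the matrix rests on those 8 observations. That is worth keeping in mind when interpreting the values.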
Example 2
> y1<-sample(c(NA,rnorm(5,5,1)),20,replace=TRUE)
> y2<-sample(c(NA,rnorm(5,2,1)),20,replace=TRUE)
> y3<-sample(c(NA,rnorm(10,10,1)),20,replace=TRUE)
> y4<-sample(c(NA,rnorm(10,5,2.5)),20,replace=TRUE)
> df2<-data.frame(y1,y2,y3,y4)
> df2
Output
         y1       y2        y3        y4
1        NA 2.955947        NA 2.8623715
2        NA 3.087940  9.099791 4.5996351
3        NA 3.087940  9.589898 5.6097088
4  3.500343 1.150117 10.985979        NA
5  4.831364 3.087940 10.107124        NA
6  7.041597 1.840461  9.416738 2.8601661
7        NA 2.212388 10.453622 5.0717510
8  4.831364 3.087940 10.928925 6.3030777
9  7.041597       NA  9.099791 5.2709332
10 4.831364 2.212388        NA 2.6219274
11 4.831364 2.212388 10.928925 6.3030777
12 3.500343       NA  8.779948 6.3030777
13 4.772150 1.840461  9.589898 5.2709332
14 7.041597 2.955947 10.453622 5.5989568
15       NA 2.955947  9.827149 5.5989568
16 7.041597 1.840461  9.099791 5.5989568
17 3.500343 2.212388  8.779948 4.5996351
18 4.772150 2.212388 10.985979        NA
19       NA 2.955947 10.453622 0.3151969
20 4.772150 1.150117  9.099791 6.3030777
Finding the correlation matrix for df2:
Example
> cor(df2,use="complete.obs",method="pearson")
Output
            y1         y2         y3         y4
y1  1.00000000 0.07343574 0.06408734 -0.3103069
y2  0.07343574 1.00000000 0.70344970  0.1674528
y3  0.06408734 0.70344970 1.00000000  0.4544444
y4 -0.31030689 0.16745277 0.45444435  1.0000000
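Because use="complete.obs" discards a row as soon as any one column is missing, it can throw away a lot of data when the NA values are scattered across columns, as they are in df2. The cor function also accepts use="pairwise.complete.obs", which keeps, for each pair of columns, every row where both values are present. The sketch below compares the two options on a hypothetical data frame d (not taken from the example above); n_pairs is an illustrative helper that counts the rows available to each pair.

```r
# Hypothetical toy data: each column is missing a different row
d <- data.frame(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, 1, 4, 3, 6, NA),
  c = c(NA, 5, 4, 3, 2, 1)
)

# Listwise deletion: only rows 2, 3 and 4 are complete in all columns
m_listwise <- cor(d, use = "complete.obs")

# Pairwise deletion: each coefficient uses the rows complete for that pair
m_pairwise <- cor(d, use = "pairwise.complete.obs")

# Rows available per pair: 1 marks a present value, 0 a missing one
present <- 1 - is.na(d)
n_pairs <- crossprod(present)

print(m_listwise)
print(m_pairwise)
print(n_pairs)
```

The pairwise version uses more of the data, but each coefficient may then be based on a different subset of rows, so the two matrices generally differ. Which one is appropriate depends on the analysis.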