To find the percentage of missing values in each column of an R data frame, we can combine the colMeans function with the is.na function. is.na returns a logical matrix that is TRUE wherever a value is missing, and colMeans treats TRUE as 1 and FALSE as 0, so each column mean is the proportion of missing values in that column. Multiplying the result by 100 gives the percentage.
The examples below show how this is done.
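As a quick illustration of this mechanism, here is a minimal sketch on a small hand-made data frame. The name toy and its values are made up for illustration and are not part of the examples that follow.

```r
# Hypothetical toy data frame with known missing values
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, 2, 3, 4))

# is.na() returns a logical matrix: TRUE where a value is missing
na_flags <- is.na(toy)

# Column means of the logical matrix are the proportions of NA values;
# multiplying by 100 turns them into percentages
na_pct <- colMeans(na_flags) * 100
print(na_pct)
```

Column a has 2 missing values out of 4 and column b has 1 out of 4, so the result is 50 for a and 25 for b.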
The following snippet creates a sample data frame −
x1 <- sample(c(NA, 1, 2), 20, replace = TRUE)
x2 <- sample(c(NA, 5), 20, replace = TRUE)
x3 <- sample(c(NA, 10, 12), 20, replace = TRUE)
df1 <- data.frame(x1, x2, x3)
df1
The following data frame is created −
   x1 x2 x3
1  NA NA 12
2   2  5 10
3   2  5 12
4   1  5 12
5   1  5 NA
6  NA  5 10
7   1 NA 10
8  NA  5 10
9   2 NA 12
10  2 NA NA
11 NA NA NA
12 NA  5 12
13 NA NA 10
14  1 NA NA
15  2 NA 12
16  1  5 NA
17 NA  5 10
18  2  5 10
19 NA  5 12
20 NA  5 12
To find the percentage of NA values in each column of df1, apply colMeans to is.na(df1) and multiply by 100, as in the final statement of the following code −
x1 <- sample(c(NA, 1, 2), 20, replace = TRUE)
x2 <- sample(c(NA, 5), 20, replace = TRUE)
x3 <- sample(c(NA, 10, 12), 20, replace = TRUE)
df1 <- data.frame(x1, x2, x3)
colMeans(is.na(df1)) * 100
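Running the above code as a single program produces output like the following. Because sample draws at random and no seed is set, the exact percentages vary between runs; the values below correspond to the data frame shown above −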
x1 x2 x3
45 40 25
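Since the data come from sample, rerunning the program gives a different data frame and different percentages each time. One way to make a run repeatable is to fix the random seed with set.seed before sampling. The sketch below assumes a hypothetical wrapper make_df1 and an arbitrary seed value of 42; neither comes from the example above.

```r
# Build the same kind of data frame as df1, but from a fixed seed.
# make_df1 and the seed value 42 are illustrative choices.
make_df1 <- function(seed) {
  set.seed(seed)
  x1 <- sample(c(NA, 1, 2), 20, replace = TRUE)
  x2 <- sample(c(NA, 5), 20, replace = TRUE)
  x3 <- sample(c(NA, 10, 12), 20, replace = TRUE)
  data.frame(x1, x2, x3)
}

first_run  <- colMeans(is.na(make_df1(42))) * 100
second_run <- colMeans(is.na(make_df1(42))) * 100

# The same seed yields the same data, hence the same percentages
print(identical(first_run, second_run))
```

With the seed fixed, both runs build identical data frames, so the comparison prints TRUE.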
The following snippet creates a second sample data frame, this time with values drawn from rnorm −
y1 <- sample(c(NA, rnorm(2)), 20, replace = TRUE)
y2 <- sample(c(NA, rnorm(2)), 20, replace = TRUE)
df2 <- data.frame(y1, y2)
df2
The following data frame is created −
          y1          y2
1  -1.407410          NA
2  -1.771819          NA
3  -1.771819          NA
4         NA -0.05582021
5         NA          NA
6  -1.407410 -0.05582021
7         NA          NA
8         NA -0.05582021
9  -1.407410  1.19697209
10 -1.407410          NA
11 -1.771819 -0.05582021
12        NA          NA
13 -1.771819          NA
14 -1.771819 -0.05582021
15        NA -0.05582021
16 -1.407410  1.19697209
17 -1.771819 -0.05582021
18        NA          NA
19 -1.407410 -0.05582021
20 -1.407410  1.19697209
To find the percentage of NA values in each column of df2, use the same expression, as in the final statement of the following code −
y1 <- sample(c(NA, rnorm(2)), 20, replace = TRUE)
y2 <- sample(c(NA, rnorm(2)), 20, replace = TRUE)
df2 <- data.frame(y1, y2)
colMeans(is.na(df2)) * 100
Running the above code as a single program produces output like the following (the values vary between runs because the data are random) −
y1 y2
35 45
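The same percentages can also be computed one column at a time, which some readers find easier to follow. The sketch below compares the matrix approach used above with an sapply version on a hypothetical data frame named people (made up for illustration) that mixes a numeric and a character column.

```r
# Hypothetical data frame with a numeric and a character column
people <- data.frame(score = c(0.5, NA, 1.2, NA, NA),
                     label = c("a", NA, "c", "d", "e"))

# Matrix approach: column means of the logical matrix from is.na()
pct_matrix <- colMeans(is.na(people)) * 100

# Column-wise approach: mean of is.na() for each column separately
pct_column <- sapply(people, function(col) mean(is.na(col)) * 100)

print(pct_column)
print(all.equal(pct_matrix, pct_column))
```

Both approaches report 60 for score (3 of 5 values missing) and 20 for label (1 of 5 missing). Because is.na works on any column type, neither version requires the columns to be numeric.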
The following snippet creates a third sample data frame, with rounded values drawn from runif −
z1 <- sample(c(NA, round(runif(2, 1, 5), 2)), 20, replace = TRUE)
z2 <- sample(c(NA, round(runif(2, 2, 10), 2)), 20, replace = TRUE)
z3 <- sample(c(NA, round(runif(2, 5, 10), 2)), 20, replace = TRUE)
df3 <- data.frame(z1, z2, z3)
df3
The following data frame is created −
     z1   z2   z3
1  1.69 2.76   NA
2    NA 7.59   NA
3    NA 2.76 9.13
4  4.24   NA 9.13
5  1.69   NA 9.13
6    NA 2.76 8.85
7    NA 7.59   NA
8    NA   NA 9.13
9    NA 7.59   NA
10 1.69 2.76   NA
11 4.24 7.59 8.85
12 1.69   NA 8.85
13 4.24   NA   NA
14   NA   NA 8.85
15 4.24 7.59 9.13
16 4.24 7.59   NA
17 1.69 2.76 9.13
18   NA   NA 9.13
19 4.24 2.76 8.85
20 4.24   NA   NA
To find the percentage of NA values in each column of df3, use the same expression, as in the final statement of the following code −
z1 <- sample(c(NA, round(runif(2, 1, 5), 2)), 20, replace = TRUE)
z2 <- sample(c(NA, round(runif(2, 2, 10), 2)), 20, replace = TRUE)
z3 <- sample(c(NA, round(runif(2, 5, 10), 2)), 20, replace = TRUE)
df3 <- data.frame(z1, z2, z3)
colMeans(is.na(df3)) * 100
Running the above code as a single program produces output like the following (the values vary between runs because the data are random) −
z1 z2 z3
40 40 40
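When the same calculation is needed for several data frames, as in the three examples above, it can be wrapped in a small function. The following is a minimal sketch, assuming a hypothetical helper named na_summary (not a base R function) that reports the NA count next to the rounded percentage. The data frame named example is a hand-made illustration, not df3.

```r
# na_summary is a hypothetical helper, not part of base R
na_summary <- function(df, digits = 1) {
  flags <- is.na(df)  # logical matrix of missing-value flags
  data.frame(column      = colnames(flags),
             n_missing   = unname(colSums(flags)),
             pct_missing = unname(round(colMeans(flags) * 100, digits)))
}

# Small hand-made data frame with known missing values
example <- data.frame(z1 = c(1.69, NA, NA, 4.24),
                      z2 = c(2.76, 7.59, NA, NA),
                      z3 = c(NA, NA, 9.13, NA))

print(na_summary(example))
```

For this input, z1 and z2 each have 2 of 4 values missing (50 percent) and z3 has 3 of 4 missing (75 percent).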