How to fill the missing values of an R data frame from the mean of columns?


Dealing with missing values is one of the first steps in data analysis, and it is also one of the most difficult: if we don't fill the missing values with an appropriate method, the result of the whole analysis might become meaningless. Therefore, we must be very careful when dealing with missing values. For learning purposes, people mostly use the mean to fill missing values, but other values, such as the median, can be used depending on the characteristics of the data. To fill the missing values with the column means, we can use the na.aggregate function of the zoo package.

Example

Consider the data frame below:

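# x3 and x4 are drawn with sample(), so their values change on every run
x1 <- c(1:5, NA, 17:30)
x2 <- c(1:2, NA, 4:20)
x3 <- sample(c(1, 5, 8, NA, 6, 3), 20, replace = TRUE)
x4 <- sample(c(45, 75, 68, NA, 36, 43), 20, replace = TRUE)
x5 <- rep(c(23, 45, 55, 78, NA), times = 4)
df <- data.frame(x1, x2, x3, x4, x5)
df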

Output

   x1 x2 x3 x4 x5
1   1  1  6 36 23
2   2  2  5 36 45
3   3 NA  1 68 55
4   4  4  1 43 78
5   5  5  3 45 NA
6  NA  6  3 75 23
7  17  7  5 68 45
8  18  8  6 43 55
9  19  9 NA 75 78
10 20 10  8 75 NA
11 21 11  3 43 23
12 22 12  1 68 45
13 23 13  8 45 55
14 24 14  5 36 78
15 25 15  5 36 NA
16 26 16  5 75 23
17 27 17  5 75 45
18 28 18  6 43 55
19 29 19  8 NA 78
20 30 20  6 75 NA

Example

library(zoo)
na.aggregate(df)

Output

         x1       x2       x3       x4    x5
1   1.00000  1.00000 6.000000 36.00000 23.00
2   2.00000  2.00000 5.000000 36.00000 45.00
3   3.00000 10.89474 1.000000 68.00000 55.00
4   4.00000  4.00000 1.000000 43.00000 78.00
5   5.00000  5.00000 3.000000 45.00000 50.25
6  18.10526  6.00000 3.000000 75.00000 23.00
7  17.00000  7.00000 5.000000 68.00000 45.00
8  18.00000  8.00000 6.000000 43.00000 55.00
9  19.00000  9.00000 4.736842 75.00000 78.00
10 20.00000 10.00000 8.000000 75.00000 50.25
11 21.00000 11.00000 3.000000 43.00000 23.00
12 22.00000 12.00000 1.000000 68.00000 45.00
13 23.00000 13.00000 8.000000 45.00000 55.00
14 24.00000 14.00000 5.000000 36.00000 78.00
15 25.00000 15.00000 5.000000 36.00000 50.25
16 26.00000 16.00000 5.000000 75.00000 23.00
17 27.00000 17.00000 5.000000 75.00000 45.00
18 28.00000 18.00000 6.000000 43.00000 55.00
19 29.00000 19.00000 8.000000 55.78947 78.00
20 30.00000 20.00000 6.000000 75.00000 50.25
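To see where the filled values come from, the same result can be reproduced in base R without zoo. The sketch below is only an illustration, not the zoo implementation: it rebuilds the deterministic columns x1, x2 and x5 (x3 and x4 are random, so they are left out), and the names df_check, col_means and df_filled are introduced here for the example.

```r
# Deterministic columns from the example data frame
x1 <- c(1:5, NA, 17:30)
x2 <- c(1:2, NA, 4:20)
x5 <- rep(c(23, 45, 55, 78, NA), times = 4)
df_check <- data.frame(x1, x2, x5)

# Column means computed over the non-missing values only
col_means <- colMeans(df_check, na.rm = TRUE)

# Replace each NA with the mean of its own column
df_filled <- df_check
for (name in names(df_filled)) {
  missing <- is.na(df_filled[[name]])
  df_filled[[name]][missing] <- col_means[[name]]
}

col_means   # x1 is about 18.10526, x2 about 10.89474, x5 is 50.25
df_filled
```

The three means agree with the values that na.aggregate placed in rows 6, 3 and 5 of the output above.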

Let's have a look at another example:

Example

# sample() and rnorm() are random, so the values change on every run
var1 <- sample(c(1, 2, NA), 20, replace = TRUE)
var2 <- sample(c(2, NA), 20, replace = TRUE)
var3 <- c(rnorm(10), rep(NA, 10))
var_data <- data.frame(var1, var2, var3)
var_data

Output

   var1 var2        var3
1     1   NA  0.15883062
2    NA    2  0.65976414
3    NA    2  2.22051966
4    NA   NA -1.18394507
5     1   NA -0.07395583
6    NA    2 -0.41635467
7    NA   NA -0.19148234
8    NA   NA  0.06954478
9     1    2  1.15534832
10    2    2  0.59495735
11    2   NA          NA
12    1    2          NA
13   NA    2          NA
14   NA   NA          NA
15    1   NA          NA
16    1   NA          NA
17    1    2          NA
18   NA    2          NA
19    1    2          NA
20    2   NA          NA

Example

na.aggregate(var_data)

Output

       var1 var2        var3
1  1.000000    2  0.15883062
2  1.272727    2  0.65976414
3  1.272727    2  2.22051966
4  1.272727    2 -1.18394507
5  1.000000    2 -0.07395583
6  1.272727    2 -0.41635467
7  1.272727    2 -0.19148234
8  1.272727    2  0.06954478
9  1.000000    2  1.15534832
10 2.000000    2  0.59495735
11 2.000000    2  0.29932269
12 1.000000    2  0.29932269
13 1.272727    2  0.29932269
14 1.272727    2  0.29932269
15 1.000000    2  0.29932269
16 1.000000    2  0.29932269
17 1.000000    2  0.29932269
18 1.272727    2  0.29932269
19 1.000000    2  0.29932269
20 2.000000    2  0.29932269
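The mean is not always the best fill value. When a column contains outliers, the median may be a more reasonable choice. The sketch below shows one possible way to do this in base R with a hypothetical helper, fill_na, and a small made-up data frame, scores (both names are introduced only for this illustration). na.aggregate also has a FUN argument, so the last lines call it with FUN = median if zoo is installed.

```r
# Small made-up data frame; column a contains an outlier (100)
scores <- data.frame(a = c(1, 2, NA, 100),
                     b = c(NA, 4, 6, 8))

# Hypothetical helper: fill NAs in numeric columns with FUN of the non-NA values
fill_na <- function(data, FUN = mean) {
  data[] <- lapply(data, function(col) {
    if (is.numeric(col) && anyNA(col)) {
      col[is.na(col)] <- FUN(col[!is.na(col)])
    }
    col
  })
  data
}

by_mean   <- fill_na(scores)          # a[3] becomes 103/3, pulled up by the outlier
by_median <- fill_na(scores, median)  # a[3] becomes 2, b[1] becomes 6

by_mean
by_median

# The same idea with zoo, run only if the package is available
if (requireNamespace("zoo", quietly = TRUE)) {
  print(zoo::na.aggregate(scores, FUN = median))
}
```

Here the mean fill for column a is about 34.33, far from the typical values 1 and 2, while the median fill is 2.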

Updated on: 24-Aug-2020
