Dealing with Missing Data in R


In data science, one of the most common tasks is dealing with missing data. If a dataset contains missing data, there are several ways to handle it in R. One way is to simply remove any rows or columns that contain missing data. Another way is to impute the missing values using a statistical method, that is, to replace them with estimates based on the other values in the dataset. For example, we can replace missing values with the mean or median of the variable in which they are found.

Missing Data

In R, the NA symbol ("not available") marks a missing value. The NaN symbol ("not a number") marks the result of an undefined arithmetic operation such as 0 / 0. Note that dividing a non-zero number by zero gives Inf or -Inf, not NaN. The is.na() function treats both NA and NaN as missing, so in simple words we can say that both symbols represent missing values in R.
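A quick console check can make the distinction concrete. The sketch below uses a made-up vector x; it shows that only an undefined operation such as 0 / 0 yields NaN, while dividing a non-zero number by zero yields Inf.

```r
# NA marks a value that is simply not available
x <- c(10, NA, 30)
is.na(x)        # FALSE  TRUE FALSE

# Undefined arithmetic yields NaN; a non-zero number divided by zero yields Inf
0 / 0           # NaN
1 / 0           # Inf

# is.na() treats NaN as missing, but Inf is an ordinary, non-missing value
is.na(0 / 0)    # TRUE
is.na(1 / 0)    # FALSE
```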

Let us consider a scenario in which a teacher is entering the marks of all her students in a spreadsheet, but by mistake forgets to enter the data for one student. Missing values like this arise naturally in practice.

Finding Missing Data in R

R provides built-in functions for finding missing values. These functions are explained in detail below −

Using the is.na() Function

We can use the built-in is.na() function in R to check for NA values. This function returns a logical vector of the same length as its input: TRUE at each position that holds an NA, and FALSE everywhere else.

Example

# vector with some data (mixing types coerces every value to character)
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
myVector

Output

[1] NA    "TP"  "4"   "6.7" "c"   NA    "12" 

Let’s find the NAs

# finding NAs
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
is.na(myVector)

Output

[1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

Let’s identify the positions of the NAs in the vector

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
which(is.na(myVector))

Output

[1] 1 6

Let’s count the total number of NAs −

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
sum(is.na(myVector))

Output

[1] 2

As the outputs show, is.na() returns TRUE at the positions where myVector holds an NA value. Wrapping it in which() gives the indices of those positions, and wrapping it in sum() gives their count.

Using the is.nan() Function

We can apply the is.nan() function to check for NaN values. This function returns a logical vector: TRUE at each position that holds a NaN value, and FALSE everywhere else, including positions that hold NA.

Example

myVector <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)

is.nan(myVector)

Output

[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE

As you can see in the output, this function returns TRUE only at the positions where myVector holds a NaN value. The NA values at positions 1 and 4 give FALSE.

Some of the traits of missing values are listed below −

  • Multiple NA or NaN values can exist in a vector.

  • To find NA values in a vector, we can pass the vector as an argument to the is.na() function.

  • To find NaN values in a vector, we can pass the vector as an argument to the is.nan() function.

  • is.na() also returns TRUE for NaN values, but the reverse is not true: is.nan() returns FALSE for NA values.
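The relationship between the two functions can be checked directly. The following is a minimal sketch on a hypothetical vector v that holds one NA and one NaN; the name onlyNA is an illustrative assumption.

```r
# Hypothetical example vector containing both kinds of missing value
v <- c(5, NA, 0 / 0, 8)

is.na(v)     # TRUE for both the NA and the NaN
is.nan(v)    # TRUE only for the NaN

# Positions that are NA but not NaN
onlyNA <- which(is.na(v) & !is.nan(v))
onlyNA
```

Combining the two tests in this way is useful when NA (value not recorded) and NaN (undefined calculation) need to be treated differently.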

Removing Missing Data/Values

Let us consider a scenario in which we want to keep all values except the missing ones. In R, we have two ways to remove missing values. These methods are explained below −

Remove Values Using Filter functions

The first way to remove missing values from a dataset is to use R's modeling functions. These functions accept an na.action parameter that tells the function what to do when an NA value is encountered. The modeling function then invokes the corresponding missing value filter function.

A filter function takes the original data set and returns a new one in which the NA values have been handled. The default setting is na.omit, which completely removes a row if it contains any missing value. An alternative to this setting is na.fail, which simply terminates whenever it encounters a missing value. The following are the filter functions −

  • na.omit − It drops every row that contains a missing value.

  • na.exclude − Like na.omit, it drops rows having at least one missing value, but functions such as residuals() and fitted() pad their results with NA so that they match the original number of rows.

  • na.pass − It takes no action and returns the data unchanged.

  • na.fail − It terminates the execution with an error if any missing value is found.

Example

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.exclude(myVector)

Output

[1] "TP"  "4"   "6.7" "c"   "12" 
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "exclude"

Example

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.omit(myVector)

Output

[1] "TP"  "4"   "6.7" "c"   "12" 
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "omit"

Example

myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.fail(myVector)

Output

Error in na.fail.default(myVector) : missing values in object

As you can see in the output, na.fail() stopped with an error because the vector contains at least one missing value.
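The examples above apply the filter functions directly to a vector. To illustrate how na.action behaves inside a modeling function, here is a minimal sketch using lm() on a small made-up data frame (marks, hours and score are hypothetical names). It compares the number of residuals returned under na.omit and na.exclude.

```r
# Hypothetical data: one missing predictor value and one missing response value
marks <- data.frame(
   hours = c(1, 2, 3, NA, 5, 6),
   score = c(52, 58, NA, 70, 75, 81)
)

# na.omit drops incomplete rows, so residuals cover only the complete rows
fitOmit <- lm(score ~ hours, data = marks, na.action = na.omit)
length(residuals(fitOmit))

# na.exclude also drops them for fitting, but pads the residuals with NA
# so that they line up with the rows of the original data frame
fitExclude <- lm(score ~ hours, data = marks, na.action = na.exclude)
length(residuals(fitExclude))
```

Both models are fitted on the same four complete rows; the difference only shows up in the length of the residuals and fitted values.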

Selecting values that are not NA or NaN

To select only the values that are not missing, we first produce a logical vector that is TRUE for each NA or NaN value and FALSE for every other value in the given vector. We then negate it with the ! operator and use the result as an index.

Example

Let logicalVector1 be such a vector (we can easily get it by applying the is.na() function).

myVector1 <- c(200, 112, NA, NA, NA, 49, NA, 190)
# TRUE where myVector1 holds an NA
logicalVector1 <- is.na(myVector1)
# keep only the non-missing values
newVector1 <- myVector1[!logicalVector1]
print(newVector1)

Output

[1] 200 112  49 190

Applying the is.nan() function

myVector2 <- c(100, 121, 0 / 0, 123, 0 / 0, 49, 0 / 0, 290)
# TRUE where myVector2 holds a NaN
logicalVector2 <- is.nan(myVector2)
# keep only the values that are not NaN
newVector2 <- myVector2[!logicalVector2]
print(newVector2)

Output

[1] 100 121 123  49 290

As you can see in the output, the missing values of type NA and NaN have been successfully removed from myVector1 and myVector2 respectively.
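Because is.na() is TRUE for both NA and NaN, a single filter is enough when a vector contains both. The sketch below also shows complete.cases(), which applies the same idea to the rows of a data frame; the names mixed and scores are hypothetical.

```r
# is.na() is TRUE for both NA and NaN, so one filter removes both
mixed <- c(100, NA, 0 / 0, 49)
cleaned <- mixed[!is.na(mixed)]
cleaned

# For a data frame, complete.cases() flags the rows with no missing values
scores <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
completeRows <- scores[complete.cases(scores), ]
completeRows
```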

Filling Missing Values with Mean or Median

In this section, we will see how to fill missing values in a dataset using the mean and the median. We will use the apply() function to get the mean and median of the columns that contain missing values.

Step 1 − The very first step is to get the list of columns that contain at least one missing (NA) value.

Example

# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
   Physics = c(98, 87, 91, 94),
   Chemistry = c(NA, 84, 93, 87),
   Mathematics = c(91, 86, NA, NA) )
#Print dataframe
print(dataframe)

Output

       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        NA          91
2      Anil      87        84          86
3       Jai      91        93          NA
4    Naveen      94        87          NA

Let’s print the column names having at least one NA value.

listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2, anyNA)]
print(listMissingColumns)

Output

[1] "Chemistry"   "Mathematics"

In our dataframe, we have two columns with NA values.
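As an alternative way to inspect the columns, the following sketch counts the NA values per column with colSums(). It rebuilds only the numeric columns of the example data frame under the hypothetical name marksData.

```r
# Numeric columns of the example data frame (hypothetical name marksData)
marksData <- data.frame(
   Physics = c(98, 87, 91, 94),
   Chemistry = c(NA, 84, 93, 87),
   Mathematics = c(91, 86, NA, NA)
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUE per column
naCounts <- colSums(is.na(marksData))
naCounts

# Names of the columns with at least one NA
names(naCounts)[naCounts > 0]
```

Knowing how many values are missing in each column helps in deciding whether imputation is reasonable for that column.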

Step 2 − Now we compute the mean and median of those columns. Since the NA values must be omitted from the calculation, we pass the na.rm = TRUE argument through apply() to mean() and median().

meanMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns], 
   2, mean, na.rm =  TRUE)
print(meanMissing)

Output

  Chemistry Mathematics 
       88.0        88.5 

The mean of Column Chemistry is 88.0 and that of Mathematics is 88.5.

Now let’s find the median of the columns −

medianMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns], 
   2, median, na.rm =  TRUE)

print(medianMissing)

Output

  Chemistry Mathematics 
       87.0        88.5 

The median of Column Chemistry is 87.0 and that of Mathematics is 88.5.
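To see why na.rm = TRUE is needed in this step, consider the Chemistry column on its own. This is a minimal sketch using a standalone vector named chemistry for illustration.

```r
# Values of the Chemistry column, including the missing one
chemistry <- c(NA, 84, 93, 87)

# Without na.rm, a single NA makes the result NA
mean(chemistry)                    # NA

# With na.rm = TRUE, the NA is dropped before the calculation
mean(chemistry, na.rm = TRUE)      # 88
median(chemistry, na.rm = TRUE)    # 87
```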

Step 3 − The mean and median values of the corresponding columns are now ready. In this step, we will replace the NA values with the mean and the median using the mutate() function from the “dplyr” package.

Example

# Importing library
library(dplyr)

# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
   Physics = c(98, 87, 91, 94),
   Chemistry = c(NA, 84, 93, 87),
   Mathematics = c(91, 86, NA, NA) )

listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2, anyNA)]

meanMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns], 
   2, mean, na.rm =  TRUE)

medianMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns], 
   2, median, na.rm =  TRUE)

newDataFrameMean <- dataframe %>% mutate(
   Chemistry = ifelse(is.na(Chemistry), meanMissing[1], Chemistry),
   Mathematics = ifelse(is.na(Mathematics), meanMissing[2], Mathematics))

newDataFrameMean

Output

       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        88        91.0
2      Anil      87        84        86.0
3       Jai      91        93        88.5
4    Naveen      94        87        88.5

Notice the missing values are filled with the mean of the corresponding column.

Example

Now let’s fill the NA values with the median of the corresponding column.

# Importing library
library(dplyr)

# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
   Physics = c(98, 87, 91, 94),
   Chemistry = c(NA, 84, 93, 87),
   Mathematics = c(91, 86, NA, NA) )

listMissingColumns <- colnames(dataframe)[ apply(dataframe, 2, anyNA)]

meanMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns], 
   2, mean, na.rm =  TRUE)

medianMissing <- apply(dataframe[,colnames(dataframe) %in% listMissingColumns], 
   2, median, na.rm =  TRUE)

newDataFrameMedian <- dataframe %>% mutate( 
   Chemistry = ifelse(is.na(Chemistry), medianMissing[1], Chemistry),
   Mathematics =  ifelse(is.na(Mathematics), medianMissing[2],Mathematics))

print(newDataFrameMedian)

Output

       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        87        91.0
2      Anil      87        84        86.0
3       Jai      91        93        88.5
4    Naveen      94        87        88.5

The missing values are filled with the median of the corresponding column.
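The three steps can also be folded into one reusable function. The following is a sketch in base R only, not the approach used in the examples above; imputeNumeric is a hypothetical helper name, and it assumes that only numeric columns should be imputed.

```r
# Hypothetical helper: fill the NAs of every numeric column using `fun`
imputeNumeric <- function(df, fun = mean) {
   for (col in names(df)) {
      # skip non-numeric columns and columns without missing values
      if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
         fillValue <- fun(df[[col]], na.rm = TRUE)
         df[[col]][is.na(df[[col]])] <- fillValue
      }
   }
   df
}

marksData <- data.frame(
   Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
   Physics = c(98, 87, 91, 94),
   Chemistry = c(NA, 84, 93, 87),
   Mathematics = c(91, 86, NA, NA)
)

filledMean <- imputeNumeric(marksData, mean)
filledMedian <- imputeNumeric(marksData, median)
print(filledMean)
print(filledMedian)
```

Passing the summary function as an argument avoids repeating one ifelse() call per column, at the cost of applying the same rule to every numeric column.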

Conclusion

In this tutorial, we discussed how to deal with missing data in R. We started with what missing values are, then saw how to find them and how to remove them, and lastly how to fill them with the mean or median. We hope this tutorial helps you enhance your knowledge in the field of data science.

Updated on: 17-Jan-2023
