Dealing with Missing Data in R
In data science, one of the most common tasks is dealing with missing data. If a dataset contains missing data, there are several ways to handle it in R. One way is to simply remove any rows or columns that contain missing data. Another way is to impute the missing values using a statistical method, that is, to replace them with estimates based on the other values in the dataset. For example, we can replace missing values with the mean or median of the variable in which they occur.
Missing Data
In R, the NA symbol ("not available") marks a missing value, while the NaN symbol ("not a number") is the result of an undefined arithmetic operation such as 0 / 0. Note that dividing a non-zero number by zero gives Inf or -Inf rather than NaN. R treats NaN as a kind of missing value (is.na() returns TRUE for it), so in practice both symbols usually have to be handled when cleaning data.
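A minimal sketch of how these special values arise in ordinary arithmetic. The variable names are illustrative only and do not appear elsewhere in this tutorial.

```r
# NA marks an unknown value and propagates through arithmetic
unknown <- NA
shifted <- unknown + 1   # still NA

# NaN is produced by an undefined numeric operation
undefined <- 0 / 0

# Dividing a non-zero number by zero gives Inf, not NaN
infinite <- 1 / 0

# A bare NA is a logical constant, while NaN is always a double
typeof(NA)    # "logical"
typeof(NaN)   # "double"
```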
Consider a teacher entering the marks of all the students in a spreadsheet who, by mistake, forgets to enter the data for one student in her class. Missing values like this arise routinely in real-world data.
Finding Missing Data in R
R provides us with inbuilt functions using which we can find the missing values. Such inbuilt functions are explained in detail below −
Using the is.na() Function
We can use the built-in is.na() function to check for NA values. This function returns a vector of logical values (TRUE or FALSE) of the same length as its input. Each position that holds an NA in the original data is TRUE in the result; every other position is FALSE.
Example
# vector with some data
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
myVector
Output
[1] NA "TP" "4" "6.7" "c" NA "12"
Let’s find the NAs
# finding NAs
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
is.na(myVector)
Output
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE
Let’s find the positions of the NAs in the vector −
# positions of the NAs
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
which(is.na(myVector))
Output
[1] 1 6
Let’s identify total number of NAs −
# total number of NAs
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
sum(is.na(myVector))
Output
[1] 2
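As the outputs show, is.na() returns TRUE at the positions where myVector holds an NA; which() converts that result into positions and sum() into a count. Note that mixing strings and numbers in c() coerces every element to character, which is why the numbers are printed in quotes.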
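The same functions also work on a data frame. The sketch below uses a small hypothetical data frame (scores) that is not part of the tutorial's data, and counts the NAs per column and the rows affected.

```r
# Hypothetical data frame used only for illustration
scores <- data.frame(
  Physics   = c(98, NA, 91),
  Chemistry = c(NA, NA, 93)
)

# is.na() on a data frame returns a logical matrix of the same shape
naMatrix <- is.na(scores)

# Number of NAs in each column
naPerColumn <- colSums(naMatrix)
print(naPerColumn)

# Row numbers that contain at least one NA
rowsWithNA <- which(rowSums(naMatrix) > 0)
print(rowsWithNA)
```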
Using the is.nan() Function
We can apply the is.nan() function to check for NaN values. This function returns a vector of logical values (TRUE or FALSE). Each position that holds a NaN in the input is TRUE in the result; every other position, including those holding NA, is FALSE.
Example
# 0 / 0 produces NaN
myVector <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)
is.nan(myVector)
Output
[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
As you can see in the output, this function produces a vector that is TRUE only at the positions where myVector holds a NaN value; the NA positions are FALSE.
Some of the traits of missing values are listed below −
Multiple NA or NaN values can exist in a vector.
To detect NA values in a vector, we can pass the vector to the is.na() function.
To detect NaN values in a vector, we can pass the vector to the is.nan() function.
NaN values are treated as a kind of NA, so is.na() returns TRUE for them, but the reverse is not true: is.nan() returns FALSE for NA.
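The relationship between the two checks can be seen by running both on the same vector. The vector below is illustrative and is not used elsewhere in this tutorial.

```r
# Illustrative numeric vector holding one NA and one NaN
values <- c(1, NA, 0 / 0, 4)

# is.na() flags both the NA and the NaN
naFlags <- is.na(values)
print(naFlags)

# is.nan() flags only the NaN
nanFlags <- is.nan(values)
print(nanFlags)

# Positions that are NA but not NaN
naOnly <- which(naFlags & !nanFlags)
print(naOnly)
```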
Removing Missing Data/ Values
Let us consider a scenario in which we want to keep all values except the missing ones. In R, we have two ways to remove missing values. These methods are explained below −
Remove Values Using Filter functions
The first way to remove missing values from a dataset is to rely on R's modeling functions. These functions accept an na.action parameter that tells the function what to do when an NA value is encountered. The modeling function then invokes the corresponding missing-value filter function.
A filter function returns a new data set in which the NA values have been handled; the original data set is not modified. The default setting is na.omit, which completely removes a row if it contains any missing value. An alternative to this setting is na.fail, which simply terminates whenever it encounters a missing value. The following are the filter functions −
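na.omit − Drops every row that contains a missing value; the result records which rows were dropped in its na.action attribute.
na.exclude − Also drops rows having at least one missing value, but residuals and predictions computed from the fitted model are padded with NA so that they line up with the original rows.
na.pass − Takes no action and returns the data unchanged.
na.fail − Terminates execution with an error if any missing value is found.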
Example
# drop the NAs and mark the result with class "exclude"
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.exclude(myVector)
Output
[1] "TP"  "4"   "6.7" "c"   "12"
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "exclude"
Example
# drop the NAs and mark the result with class "omit"
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.omit(myVector)
Output
[1] "TP"  "4"   "6.7" "c"   "12"
attr(,"na.action")
[1] 1 6
attr(,"class")
[1] "omit"
Example
# stop with an error if any NA is present
myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
na.fail(myVector)
Output
Error in na.fail.default(myVector) : missing values in object
As you can see in the output, na.fail() halts execution with an error because the vector contains at least one missing value.
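The examples above call the filter functions directly. The following sketch shows how the na.action parameter behaves inside a modeling function, using lm() on a small hypothetical data frame (study) invented for illustration.

```r
# Hypothetical data with one missing response value
study <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  marks = c(52, 58, NA, 71, 75)
)

# na.omit: the incomplete row is dropped, so residuals cover 4 rows
fitOmit <- lm(marks ~ hours, data = study, na.action = na.omit)
residualsOmit <- residuals(fitOmit)

# na.exclude: the row is still dropped for fitting, but residuals
# are padded with NA so they line up with the original 5 rows
fitExclude <- lm(marks ~ hours, data = study, na.action = na.exclude)
residualsExclude <- residuals(fitExclude)

# na.fail: fitting stops with an error because an NA is present
failResult <- tryCatch(
  lm(marks ~ hours, data = study, na.action = na.fail),
  error = function(e) "stopped"
)

print(length(residualsOmit))
print(length(residualsExclude))
print(failResult)
```

Choosing na.exclude is mainly useful when the residuals or fitted values need to be attached back to the original data frame as a new column.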
Selecting values that are not NA or NaN
To select only the values that are not missing, we first produce a logical vector that is TRUE for each NA or NaN value and FALSE for every other value in the given vector. Negating this vector with ! and using it as an index keeps only the non-missing values.
Example
Let logicalVector1 be such a vector (we can get it by applying the is.na() function).
myVector1 <- c(200, 112, NA, NA, NA, 49, NA, 190)
# TRUE where the value is missing
logicalVector1 <- is.na(myVector1)
# keep only the positions that are not missing
newVector1 <- myVector1[!logicalVector1]
print(newVector1)
Output
[1] 200 112 49 190
Applying the is.nan() function
myVector2 <- c(100, 121, 0 / 0, 123, 0 / 0, 49, 0 / 0, 290)
# TRUE where the value is NaN
logicalVector2 <- is.nan(myVector2)
# keep only the positions that are not NaN
newVector2 <- myVector2[!logicalVector2]
print(newVector2)
Output
[1] 100 121 123 49 290
As you can see in the output, the NA values have been removed from myVector1 and the NaN values from myVector2.
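The same idea extends to data frames, where we may want to drop whole rows or whole columns that contain missing values. This is a minimal sketch on a hypothetical data frame (marks) invented for illustration; complete.cases() is a base R function that returns TRUE for rows without any missing value.

```r
# Hypothetical data frame used only for illustration
marks <- data.frame(
  Name      = c("A", "B", "C"),
  Physics   = c(98, 87, 91),
  Chemistry = c(NA, 84, 93)
)

# Keep only the rows with no missing values
completeRows <- marks[complete.cases(marks), ]
print(completeRows)

# Keep only the columns with no missing values
completeCols <- marks[, colSums(is.na(marks)) == 0, drop = FALSE]
print(completeCols)
```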
Filling Missing Values with Mean or Median
In this section, we will see how to fill missing values in a dataset using the mean and median. We will use the apply() function to get the mean and median of the columns that contain missing values.
Step 1 − The very first step is to get the list of columns that contain at least one missing (NA) value.
Example
# Create a data frame
dataframe <- data.frame(
  Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
  Physics = c(98, 87, 91, 94),
  Chemistry = c(NA, 84, 93, 87),
  Mathematics = c(91, 86, NA, NA)
)
# Print dataframe
print(dataframe)
Output
       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        NA          91
2      Anil      87        84          86
3       Jai      91        93          NA
4    Naveen      94        87          NA
Let’s print the column names having at least one NA value.
# names of the columns for which anyNA() is TRUE
listMissingColumns <- colnames(dataframe)[apply(dataframe, 2, anyNA)]
print(listMissingColumns)
Output
[1] "Chemistry" "Mathematics"
In our dataframe, we have two columns with NA values.
Step 2 − Now we compute the mean and median of those columns. Because the NA values must be left out of the calculation, we pass the argument na.rm = TRUE to apply(), which forwards it to mean() and median().
# column-wise mean, ignoring NAs
meanMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, mean, na.rm = TRUE)
print(meanMissing)
Output
  Chemistry Mathematics
       88.0        88.5
The mean of Column Chemistry is 88.0 and that of Mathematics is 88.5.
Now let’s find the median of the columns −
# column-wise median, ignoring NAs
medianMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, median, na.rm = TRUE)
print(medianMissing)
Output
  Chemistry Mathematics
       87.0        88.5
The median of Column Chemistry is 87.0 and that of Mathematics is 88.5.
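To see why na.rm = TRUE matters, the sketch below computes the same statistics on the Chemistry marks with and without it. The variable names are illustrative only.

```r
# Chemistry marks from the data frame above
chemistry <- c(NA, 84, 93, 87)

# Without na.rm the single unknown value makes the result unknown
meanWithNA <- mean(chemistry)

# With na.rm = TRUE the NA is dropped before the calculation
meanNoNA   <- mean(chemistry, na.rm = TRUE)
medianNoNA <- median(chemistry, na.rm = TRUE)

print(meanWithNA)
print(meanNoNA)
print(medianNoNA)
```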
Step 3 − The mean and median values of the corresponding columns are now ready. In this step, we replace the NA values with the mean or median using the mutate() function from the dplyr package.
Example
# Importing library
library(dplyr)
# Create a data frame
dataframe <- data.frame(
  Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
  Physics = c(98, 87, 91, 94),
  Chemistry = c(NA, 84, 93, 87),
  Mathematics = c(91, 86, NA, NA)
)
listMissingColumns <- colnames(dataframe)[apply(dataframe, 2, anyNA)]
meanMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, mean, na.rm = TRUE)
medianMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, median, na.rm = TRUE)
newDataFrameMean <- dataframe %>% mutate(
  Chemistry = ifelse(is.na(Chemistry), meanMissing[1], Chemistry),
  Mathematics = ifelse(is.na(Mathematics), meanMissing[2], Mathematics))
newDataFrameMean
Output
       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        88        91.0
2      Anil      87        84        86.0
3       Jai      91        93        88.5
4    Naveen      94        87        88.5
Notice the missing values are filled with the mean of the corresponding column.
Example
Now let’s fill the NA values with the median of the corresponding column.
# Importing library
library(dplyr)
# Create a data frame
dataframe <- data.frame(
  Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
  Physics = c(98, 87, 91, 94),
  Chemistry = c(NA, 84, 93, 87),
  Mathematics = c(91, 86, NA, NA)
)
listMissingColumns <- colnames(dataframe)[apply(dataframe, 2, anyNA)]
meanMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, mean, na.rm = TRUE)
medianMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, median, na.rm = TRUE)
newDataFrameMedian <- dataframe %>% mutate(
  Chemistry = ifelse(is.na(Chemistry), medianMissing[1], Chemistry),
  Mathematics = ifelse(is.na(Mathematics), medianMissing[2], Mathematics))
print(newDataFrameMedian)
Output
       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        87        91.0
2      Anil      87        84        86.0
3       Jai      91        93        88.5
4    Naveen      94        87        88.5
The missing values are filled with the median of the corresponding column.
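The mutate() calls above name each column explicitly. As an alternative sketch in base R, the hypothetical helper fillMissing() below (not part of R or dplyr) fills every numeric column in one pass, so no column names need to be hard-coded.

```r
# Same data frame as in the examples above
dataframe <- data.frame(
  Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
  Physics = c(98, 87, 91, 94),
  Chemistry = c(NA, 84, 93, 87),
  Mathematics = c(91, 86, NA, NA)
)

# Hypothetical helper: replace the NAs of a numeric vector with
# the value returned by `fun` (mean by default)
fillMissing <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Apply the helper to every numeric column
imputedMean <- dataframe
numericColumns <- sapply(imputedMean, is.numeric)
imputedMean[numericColumns] <- lapply(imputedMean[numericColumns], fillMissing)
print(imputedMean)

# The same helper with the median instead of the mean
imputedMedian <- dataframe
imputedMedian[numericColumns] <- lapply(
  imputedMedian[numericColumns], fillMissing, fun = median
)
print(imputedMedian)
```

Columns without missing values pass through unchanged, because the assignment inside the helper has nothing to replace.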
Conclusion
In this tutorial, we discussed how we can deal with missing data in R. We started the tutorial with a discussion on missing values, finding missing values, removing missing values and lastly we saw ways to populate missing values by mean and median. We hope this tutorial will help you to enhance your knowledge in the field of data science.