# Dealing with Missing Data in R

Dealing with missing data is one of the most common tasks in data science. If a dataset contains **missing data**, **R** offers several ways to handle it. One way is to remove the rows or columns that contain missing values. Another is to impute them, that is, to replace each missing value with an estimate based on the other values in the dataset. For example, we can replace missing values with the mean or median of the variable in which they occur.

## Missing Data

In R, the **NA** symbol ("not available") represents a missing value, while the **NaN** symbol ("not a number") represents the result of an undefined arithmetic operation such as 0 / 0. Note that dividing a non-zero number by zero gives Inf or -Inf, not NaN. Most R functions treat both NA and **NaN** as missing, and is.na() returns TRUE for either of them.
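As a minimal sketch of how these values arise, the snippet below builds a small vector of marks (made-up numbers, used only for illustration) and shows that a missing value propagates through arithmetic unless it is explicitly dropped.

```r
# Illustrative values only; none of these numbers come from a real dataset
marks <- c(90, NA, 75)     # NA marks a value that was never recorded
undefined <- 0 / 0         # NaN: the result is mathematically undefined
infinite <- 1 / 0          # Inf: a non-zero number divided by zero is not NaN

print(undefined)
print(infinite)

# A single NA makes the whole summary NA ...
print(mean(marks))
# ... unless missing values are removed first
print(mean(marks, na.rm = TRUE))
```

Most summary functions in base R, such as mean(), sum() and median(), accept the same na.rm argument.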

Consider a teacher who is entering the marks of all her students in a spreadsheet and forgets to enter the marks of one student. Missing values like this arise routinely in real-world data.

## Finding Missing Data in R

R provides built-in functions for finding missing values. They are explained in detail below −

### Using the is.na() Function

We can use the built-in **is.na()** function to check for NA values. It returns a logical vector of the same length as its input: TRUE at each position where the input holds NA, and FALSE everywhere else.

### Example

    # vector with some data
    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    myVector

### Output

    [1] NA "TP" "4" "6.7" "c" NA "12"

Let’s find the NAs −

    # finding NAs
    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    is.na(myVector)

### Output

    [1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE

Let’s identify the positions of the NAs in the vector −

    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    which(is.na(myVector))

### Output

    [1] 1 6

Let’s count the total number of NAs −

    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    sum(is.na(myVector))

### Output

    [1] 2

As the outputs show, is.na() returns TRUE at exactly those positions where myVector holds an NA value. Wrapping the result in which() gives the positions of the NA values, and wrapping it in sum() counts them.
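The same function also works on a whole data frame, where it returns a logical matrix. The sketch below uses a hypothetical `scores` data frame (not part of the examples above) to count the missing values per column and per row.

```r
# Hypothetical data frame, used only for illustration
scores <- data.frame(
  student = c("A", "B", "C"),
  physics = c(98, NA, 91),
  chemistry = c(NA, NA, 93)
)

# is.na() on a data frame gives a logical matrix of the same shape
naMatrix <- is.na(scores)

# TRUE counts as 1, so column and row sums give the number of NAs
naPerColumn <- colSums(naMatrix)
naPerRow <- rowSums(naMatrix)

print(naPerColumn)
print(naPerRow)
```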

### Using the is.nan() Function

We can apply the **is.nan()** function to check for **NaN** values. It also returns a logical vector: TRUE at each position where the input holds NaN, and FALSE everywhere else, including positions that hold NA.

### Example

    myVector <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)
    is.nan(myVector)

### Output

    [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE

As the output shows, is.nan() returns TRUE only at those positions where myVector holds a NaN value. The NA values at positions 1 and 4 give FALSE.

Some of the traits of missing values are listed below −

- Multiple NA or NaN values can exist in a vector.
- To find NA values in a vector, pass the vector to the is.na() function.
- To find NaN values in a vector, pass the vector to the is.nan() function.
- is.na() treats NaN as missing and returns TRUE for it, but the reverse does not hold: is.nan() returns FALSE for NA.
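The relationship between the two functions can be checked directly. This is a small sketch on a made-up numeric vector; the name `onlyNA` is introduced here purely for illustration.

```r
# One ordinary value, one NA and one NaN (illustrative values)
values <- c(1, NA, 0 / 0)

# is.na() flags both NA and NaN
print(is.na(values))

# is.nan() flags only NaN
print(is.nan(values))

# Positions that are NA but not NaN
onlyNA <- is.na(values) & !is.nan(values)
print(which(onlyNA))
```

Note that is.nan() is meant for numeric input, so it is best applied to numeric vectors like the one above.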

## Removing Missing Data/Values

Suppose we want to keep every value except the missing ones. R offers two ways to remove missing values, which are explained below −

### Remove Values Using Filter Functions

The first way is to use the missing-value filter functions that R's modeling functions rely on. Modeling functions accept an na.action parameter that tells the function what to do when it encounters an NA value, and the function then invokes the corresponding filter.

Each filter takes a data set and returns a new one in which the missing values have been handled. The default setting is na.omit, which removes a row completely if it contains any missing value. An alternative to this setting is na.fail, which simply terminates whenever it encounters a missing value. The following are the filter functions −

- **na.omit** − Drops every row that contains a missing value.
- **na.exclude** − Also drops rows with at least one missing value, but records their positions so that residuals and predictions from a model are padded with NA back to the original length.
- **na.pass** − Takes no action and returns the data unchanged.
- **na.fail** − Terminates execution with an error if any missing value is found.

### Example

    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    na.exclude(myVector)

### Output

    [1] "TP" "4" "6.7" "c" "12"
    attr(,"na.action")
    [1] 1 6
    attr(,"class")
    [1] "exclude"

### Example

    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    na.omit(myVector)

### Output

    [1] "TP" "4" "6.7" "c" "12"
    attr(,"na.action")
    [1] 1 6
    attr(,"class")
    [1] "omit"

### Example

    myVector <- c(NA, "TP", 4, 6.7, 'c', NA, 12)
    na.fail(myVector)

### Output

    Error in na.fail.default(myVector) : missing values in object

As the output shows, na.fail() stops with an error because the vector contains at least one missing value.
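To see the na.action parameter at work inside a modeling function, here is a minimal sketch using lm() on a hypothetical `study` data frame (the column names and numbers are assumptions made for this example). It compares na.omit with na.exclude and shows that na.fail stops the fit.

```r
# Hypothetical data: the third student's marks are missing
study <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  marks = c(52, 58, NA, 71, 75)
)

# na.omit drops the incomplete row before fitting
fitOmit <- lm(marks ~ hours, data = study, na.action = na.omit)

# na.exclude drops it too, but remembers where it was
fitExclude <- lm(marks ~ hours, data = study, na.action = na.exclude)

# Residuals: 4 values with na.omit, 5 (one of them NA) with na.exclude
print(length(residuals(fitOmit)))
print(length(residuals(fitExclude)))
print(residuals(fitExclude))

# na.fail refuses to fit; capture the error message instead of halting
failed <- tryCatch(
  lm(marks ~ hours, data = study, na.action = na.fail),
  error = function(e) conditionMessage(e)
)
print(failed)
```

Padding the residuals back to the original length is useful when they need to be attached to the original data frame as a new column.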

### Selecting Values That Are Not NA or NaN

To select only the values that are not missing, we first produce a logical vector that is TRUE for each NA or NaN value and FALSE for every other value in the given vector. We then negate it with the ! operator and use the result as an index.

### Example

Let logicalVector1 be such a vector (we can easily get it by applying the is.na() function).

    myVector1 <- c(200, 112, NA, NA, NA, 49, NA, 190)
    logicalVector1 <- is.na(myVector1)
    newVector1 <- myVector1[!logicalVector1]
    print(newVector1)

### Output

    [1] 200 112 49 190

Applying the is.nan() function −

    myVector2 <- c(100, 121, 0 / 0, 123, 0 / 0, 49, 0 / 0, 290)
    logicalVector2 <- is.nan(myVector2)
    newVector2 <- myVector2[!logicalVector2]
    print(newVector2)

### Output

    [1] 100 121 123 49 290

As the outputs show, the NA values have been removed from myVector1 and the NaN values from myVector2.
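The same indexing idea extends to mixed vectors and to data frames. The sketch below uses made-up data: because is.na() is TRUE for both NA and NaN, one call removes both kinds from a vector, and complete.cases() from base R keeps only the rows of a data frame that have no missing value.

```r
# A vector holding both kinds of missing value (illustrative values)
readings <- c(100, NA, 0 / 0, 49)
cleanReadings <- readings[!is.na(readings)]
print(cleanReadings)

# Hypothetical data frame with missing entries in two rows
scores <- data.frame(
  physics = c(98, NA, 91),
  chemistry = c(80, 84, NA)
)

# complete.cases() returns TRUE for rows without any missing value
completeRows <- scores[complete.cases(scores), ]
print(completeRows)
```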

## Filling Missing Values with Mean or Median

In this section, we will see how to fill missing values in a dataset with the mean or median. We will use the apply() function to compute the mean and median of the columns that contain missing values.

**Step 1** − The first step is to get the list of columns that contain at least one missing (NA) value.

### Example

    # Create a data frame
    dataframe <- data.frame(
       Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
       Physics = c(98, 87, 91, 94),
       Chemistry = c(NA, 84, 93, 87),
       Mathematics = c(91, 86, NA, NA)
    )

    # Print dataframe
    print(dataframe)

### Output

           Name Physics Chemistry Mathematics
    1 Bhuwanesh      98        NA          91
    2      Anil      87        84          86
    3       Jai      91        93          NA
    4    Naveen      94        87          NA

Let’s print the column names having at least one NA value.

    listMissingColumns <- colnames(dataframe)[apply(dataframe, 2, anyNA)]
    print(listMissingColumns)

### Output

    [1] "Chemistry" "Mathematics"

In our dataframe, we have two columns with NA values.
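One detail worth knowing is that apply() first converts the data frame to a matrix, and since the Name column is text, every value becomes a character string. That is harmless for anyNA(), but a column-wise alternative avoids the conversion. The following is a sketch of that alternative using sapply(), offered as an option rather than a replacement for the approach above.

```r
# Same data frame as in the example above
dataframe <- data.frame(
  Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
  Physics = c(98, 87, 91, 94),
  Chemistry = c(NA, 84, 93, 87),
  Mathematics = c(91, 86, NA, NA)
)

# sapply() visits each column as-is, keeping its original type
hasMissing <- sapply(dataframe, anyNA)
listMissingColumns <- names(dataframe)[hasMissing]
print(listMissingColumns)
```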

**Step 2** − Next we compute the mean and median of those columns. Because the NA values must be ignored in the calculation, we pass the **na.rm = TRUE** argument through the **apply()** function.

    meanMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, mean, na.rm = TRUE)
    print(meanMissing)

### Output

      Chemistry Mathematics
           88.0        88.5

The mean of Column Chemistry is 88.0 and that of Mathematics is 88.5.

Now let’s find the median of the columns −

    medianMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, median, na.rm = TRUE)
    print(medianMissing)

### Output

      Chemistry Mathematics
           87.0        88.5

The median of Column Chemistry is 87.0 and that of Mathematics is 88.5.

**Step 3** − The mean and median of the affected columns are now ready. In this step, we replace the NA values with the mean or the median using the mutate() function from the “dplyr” package.

### Example

    # Importing library
    library(dplyr)

    # Create a data frame
    dataframe <- data.frame(
       Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
       Physics = c(98, 87, 91, 94),
       Chemistry = c(NA, 84, 93, 87),
       Mathematics = c(91, 86, NA, NA)
    )

    # Columns with missing values and their mean and median
    listMissingColumns <- colnames(dataframe)[apply(dataframe, 2, anyNA)]
    meanMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, mean, na.rm = TRUE)
    medianMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, median, na.rm = TRUE)

    # Replace each NA with the mean of its column
    newDataFrameMean <- dataframe %>% mutate(
       Chemistry = ifelse(is.na(Chemistry), meanMissing[1], Chemistry),
       Mathematics = ifelse(is.na(Mathematics), meanMissing[2], Mathematics))
    newDataFrameMean

### Output

           Name Physics Chemistry Mathematics
    1 Bhuwanesh      98        88        91.0
    2      Anil      87        84        86.0
    3       Jai      91        93        88.5
    4    Naveen      94        87        88.5

Notice the missing values are filled with the mean of the corresponding column.

### Example

Now let’s fill the NA values with the median of the corresponding column.

    # Importing library
    library(dplyr)

    # Create a data frame
    dataframe <- data.frame(
       Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
       Physics = c(98, 87, 91, 94),
       Chemistry = c(NA, 84, 93, 87),
       Mathematics = c(91, 86, NA, NA)
    )

    # Columns with missing values and their mean and median
    listMissingColumns <- colnames(dataframe)[apply(dataframe, 2, anyNA)]
    meanMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, mean, na.rm = TRUE)
    medianMissing <- apply(dataframe[, colnames(dataframe) %in% listMissingColumns], 2, median, na.rm = TRUE)

    # Replace each NA with the median of its column
    newDataFrameMedian <- dataframe %>% mutate(
       Chemistry = ifelse(is.na(Chemistry), medianMissing[1], Chemistry),
       Mathematics = ifelse(is.na(Mathematics), medianMissing[2], Mathematics))
    print(newDataFrameMedian)

### Output

           Name Physics Chemistry Mathematics
    1 Bhuwanesh      98        87        91.0
    2      Anil      87        84        86.0
    3       Jai      91        93        88.5
    4    Naveen      94        87        88.5

The missing values are filled with the median of the corresponding column.
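When many columns need filling, naming each one inside mutate() becomes tedious. As a rough sketch of a more general approach in base R, the hypothetical helper `imputeWith()` below (a name invented for this example, not a function from any package) fills every numeric column that has missing values with a chosen statistic.

```r
# Hypothetical helper: fill NAs in numeric columns with a summary statistic
imputeWith <- function(df, statistic = mean) {
  for (column in names(df)) {
    values <- df[[column]]
    # Only numeric columns that actually contain missing values are changed
    if (is.numeric(values) && anyNA(values)) {
      values[is.na(values)] <- statistic(values, na.rm = TRUE)
      df[[column]] <- values
    }
  }
  df
}

# Same data frame as in the examples above
dataframe <- data.frame(
  Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
  Physics = c(98, 87, 91, 94),
  Chemistry = c(NA, 84, 93, 87),
  Mathematics = c(91, 86, NA, NA)
)

filledMean <- imputeWith(dataframe, mean)
filledMedian <- imputeWith(dataframe, median)
print(filledMean)
print(filledMedian)
```

The statistic is passed in as a function, so the same helper covers both the mean and the median cases shown earlier.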

## Conclusion

In this tutorial, we discussed how to deal with missing data in R. We began with what missing values are, then covered how to find them and how to remove them, and finally saw how to fill them with the mean or median. We hope this tutorial helps you enhance your knowledge in the field of data science.