Introduction to the Tidyverse

R Programming Data Science Server Side Programming

The R package collection known as tidyverse was created with the goal of collaborating and handling data effectively. The Tidyverse package is open-source and constantly improved by the data science community. A data scientist must have a fundamental understanding of every package included under the tidyverse umbrella. All eight packages?purr, ggplot2, dplyr, tidyr, stringr, tibble, readr, and forcats ?will be covered in depth.

Tidyverse Packages

Tidyverse groups several packages in R. It consists of the following packages ?

Package Name	Usage
purrr	Used for function programming
ggplot2	Used for creating graphics
dplyr	Used for data manipulation
tidyr	Provide functions to create tidy data
stringr	Provide functions to work with character data
tibble	Provides robust table system
readr	Provides a fast method to import data
forcats	Provides tools that solve common problems with factors

Installing Tidyverse

Before moving further, we need to install the tidyverse package in R. You can install this package by using the following command in CRAN ?

install.packages("tidyverse")

All the tidyverse packages mentioned above are installed. No need to install these packages separately.

Importing the tidyverse

To import the tidyverse into your R script, you can use the library() function and pass in the tidyverse package as an argument ?

library("tidyverse")

Reading Data in the Tidyverse

The "readr" package in R allows us to read from and write into different file formats with the help of functions that start with read% and write%. These functions work very fast and deal smoothly with problematic header names.

These functions are listed below ?

Function	Working
read_csv()	Works with a semicolon or comma
read_csv2()	Works with a semicolon or comma
read_delim()	Works with general separator
read_table()	Works with white space containing data

Example

Let us see an example illustrating the working of read_csv() function in R ?

# Import library
library("tidyverse")

# Import a CSV file
myFile <- read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv")

# Print the file
myFile

Output

  John                      Doe        `120 jefferson st.`    River?Â¹  NJ     `08075`
  <chr>                     <chr>      <chr>                  <chr>    <chr>  <chr>  
1 "Jack"                    McGinnis   "220 hobo Av."         Phila    PA     09119  
2 "John "Da Man""           Repici     "120 Jefferson St."    Rivers?  NJ     08075  
3 "Stephen"                 Tyler      "7452 Terrace "At t?   SomeTo?  SD     91234  
4  NA                       Blankman   NA                     SomeTo?  SD     00298  
5 "Joan "the bone", Anne"   Jet        "9th, at Terrace plc"  Desert?  CO     00123

Data Wrangling in the Tidyverse

dplyr package

The dplyr package allows us to deal with tabular data efficiently. It provides us with verb functions like select() which is used to extract particular columns based on some conditions passed as start_with() or contains() function.

Example

Consider the following program that illustrates the working of these functions ?


library(dplyr)
	
# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
   Physics = c(98, 87, 91, 94),
   Chemistry = c(93, 84, 93, 87),
   Mathematics = c(91, 86, 92, 83) )
# Create a data frame
print(dataframe)

Output

       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        93          91
2      Anil      87        84          86
3       Jai      91        93          92
4    Naveen      94        87          83

Print only "Ph" data using the starts_with() function ?


library(dplyr)

# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(93, 84, 93, 87),
Mathematics = c(91, 86, 92, 83) )
print(select(dataframe, starts_with("Ph")))

Output

Let's print everything that contains "mist" ?


library(dplyr)

# Create a data frame
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(93, 84, 93, 87),
Mathematics = c(91, 86, 92, 83) )
print(select(dataframe, contains("mist")))

Output

  Chemistry
1        93
2        84
3        93
4        87

As you can see in the output, column names that start with "Ph" and contain "mist" have been extracted.

summary() function

This function produces the summary of a dataset.


print(summary(iris))

Output

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

The summary() function has produced the summary of the iris model.

filter() function

This function is used to pick data that fulfill specific criteria. For example, consider the following program that shows data in which mathematics marks lie between 90 and 93 ?

Example


library(dplyr)

# Create dataframe
dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(93, 84, 93, 87),
Mathematics = c(91, 86, 92, 83) )

# Display
print(dataframe %>% filter(Mathematics > 90 & Mathematics < 94))

Output

       Name Physics Chemistry Mathematics
1 Bhuwanesh      98        93          91
2       Jai      91        93          92

arrange() function

This function is used to arrange the dataset on the basis of a specific column. For example, consider the following program that displays the dataset based on sorted order of physics marks ?

Example


library(dplyr)

dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(93, 84, 93, 87),
Mathematics = c(91, 86, 92, 83) )

# Display the data
print(arrange(dataframe, Physics))

Output

       Name Physics Chemistry Mathematics
1      Anil      87        84          86
2       Jai      91        93          92
3    Naveen      94        87          83
4 Bhuwanesh      98        93          91

rename() function

This function is used to rename a column in the dataframe. The first argument corresponds to the new name and the second argument after the equal sign corresponds to the old name. For example, consider the following program that renames the Chemistry column to CHEMISTRY ?

Example


library(dplyr)

dataframe <- data.frame( Name = c("Bhuwanesh", "Anil", "Jai", "Naveen"),
Physics = c(98, 87, 91, 94),
Chemistry = c(93, 84, 93, 87),
Mathematics = c(91, 86, 92, 83) )

# Display the dataset
# after renaming a column
print(rename(dataframe, CHEMISTRY = Chemistry))

Output

       Name Physics CHEMISTRY Mathematics
1 Bhuwanesh      98        93          91
2      Anil      87        84          86
3       Jai      91        93          92
4    Naveen      94        87          83

Data Visualization in the Tidyverse

ggplot2 package

The ggplot2 package is an open-source package specially used for data visualization. This is a powerful package designed by Hardley Wickham. This package provides us with various functions. One important function is ggplot(). This function displays the data visualization of a dataset

Example

Consider the following program ?


library(ggplot2)

# Plot the data visualization
ggplot(iris, 
aes(x = Sepal.Length, y = Sepal.Width, col = Petal.Length)
) + geom_point()

Output

As you can see in the output, the ggplot() function has plotted its visualization.

Function Programming with purrr package

The purrr package is used to achieve functional programming in R. The purrr package provides us map_() family functions, using which we can achieve functional programming to get the same outcome as for and while loops.

Let us discuss the map() function in it. This is the most fundamental function. It accepts a vector and a function as parameters and then calls the function for each element in the vector.

Example

Consider the following program ?


library("purrr")

myVector <- c(8, 3, 7, 2, 11, 20)

addByThree <- function(x) + 3

print(map(myVector, addByThree))

Output

[[1]]
[1] 3

[[2]]
[1] 3

[[3]]
[1] 3

[[4]]
[1] 3

[[5]]
[1] 3

[[6]]
[1] 3

As you can see in the above output, a list is produced after adding three to all the elements of the given vector.

String Manipulation with stringr package

The Stringr package is used for string manipulation in R. It provides functions that start with string%. Its most used functions are str_replace() and str_length(). The str_replace() function replaces a pattern or string with another string. Let us consider the following program that illustrates the working of the str_replace() function ?

Example


library("stringr")

# Create a list of strings
myList <- c("tutorialspoint", "pointer", "pointable")

# Replace the string equal to "point"
# with "Point"
print(str_replace(myList, "point", "Point"))

Output

[1] "tutorialsPoint" "Pointer"        "Pointable"

As you can see in the output, the strings containing "point" have been replaced with "Point".

Example

Let's consider another example


library("stringr")

myString = "Bhuwanesh Nainwal"

# Replace if string starts with "Bhuwanesh"
print(str_replace(myString, "^Bhuwanesh", "Harshit"))

# Replace if string ends with "Nainwal"
print(str_replace(myString, "Nainwal$", ""))

str_length(myString)

Output

[1] "Harshit Nainwal"
[1] "Bhuwanesh "
17

As you can see in the output, the string that starts with "Bhuwanesh" has been replaced by "Harshit" and the string that ends with "Nainwal" has been replaced with "".

The str_length() function is used to replace a pattern or string with another string.

Let us consider the following program that illustrates the working of the str_length() function ?

Example


library("stringr")
myList <- c("tutorialspoint", "pointer", "pointable")
print(str_length(myList))

Output

[1] 14  7  9

Conclusion

In this tutorial, we discussed the universe of packages, the tidyverse. We discussed how these packages work and illustrated them. This tutorial has definitely helped you to enhance your knowledge in the field of data science.

Bhuwanesh Nainwal

Updated on: 2023-01-17T15:38:22+05:30

765 Views

Kickstart Your Career

Get certified by completing the course

Get Started