Data Manipulation in R with data.table


Data manipulation is a crucial step in the data analysis process, as it allows us to prepare and organize our data in a way that is suitable for the specific analysis or visualization. There are many different tools and techniques for data manipulation, depending on the type and structure of the data, as well as the specific goals of the manipulation.

The data.table package provides an enhanced version of R’s data.frame class. Its syntax and features make it easier and faster to manipulate and work with large datasets.

data.table is one of the most downloaded R packages and a popular choice among data scientists.

Installing the data.table package

Installing the data.table package is as simple as installing any other package. Run one of the commands below in the R console to install it −

Installing ‘data.table’ package using CRAN

install.packages('data.table')

Installing the development version from GitLab

install.packages("data.table",
repos="https://Rdatatable.gitlab.io/data.table")
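
As a small sketch of how the installation step might be made repeatable in a script, the snippet below installs the package only when it is missing and then reports the installed version. The conditional check is an addition for illustration and is not part of the original instructions.

```r
# Install data.table only if it is not already available.
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table")
}

# Load the package and confirm which version is installed.
library(data.table)
print(packageVersion("data.table"))
```

Running this twice is harmless: the second run skips the installation and only loads the package.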

Importing Datasets

In R programming language, we have tons of built-in datasets that one may use as demo data to demonstrate how the R functions work.

One such popular built-in dataset is the “Iris” dataset. It contains measurements of four attributes for 150 flowers, 50 from each of three species.

The way we work with datasets in data.table is quite different from the way we work with them in a data.frame. Let’s look at this in more detail.

The data.table package provides the fread() function (short for "fast read"), which is data.table’s counterpart to read.csv(). Like read.csv(), it can read a file stored locally as well as a file hosted on a website.
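
To illustrate the local-file case without relying on a network connection, here is a minimal sketch that first writes R’s built-in iris data to a temporary CSV file and then reads it back with fread(). The temporary file and the variable names csv_path and local_iris are assumptions made for this illustration.

```r
library(data.table)

# Write the built-in iris data to a temporary CSV file,
# so that there is a local file to read (path is illustrative).
csv_path <- tempfile(fileext = ".csv")
fwrite(iris, csv_path)

# fread() detects the separator, header, and column types automatically.
local_iris <- fread(csv_path)

# The result is a data.table with 150 rows and 5 columns.
print(dim(local_iris))
print(class(local_iris))
```

Passing a URL instead of csv_path works the same way, as the next example shows.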

Example

Consider the program below, which imports the iris data stored as a CSV file on the internet and prints the class of the imported object −

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# Print the class of the imported dataset
print(class(myDataset))

Output

[1] "data.table" "data.frame"

As you see from the above output, the imported data is directly stored as a data.table.

A data.table inherits from the data.frame class, so it is also a data.frame. As a result, functions that accept a data.frame generally work with a data.table as well.
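
The inheritance can be checked directly. The sketch below converts R’s built-in iris data frame with as.data.table() (an assumption made to keep the example offline; in the built-in data, Species is a factor rather than text) and then applies a few base R functions written for data frames.

```r
library(data.table)

# Convert the built-in iris data.frame into a data.table.
iris_dt <- as.data.table(iris)

# A data.table is also a data.frame.
print(is.data.table(iris_dt))
print(is.data.frame(iris_dt))

# Base R functions written for data frames work unchanged.
print(nrow(iris_dt))
print(colnames(iris_dt))
print(summary(iris_dt$Sepal.Length))
```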

Displaying IRIS Dataset

Example

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# Print the iris dataset
print(myDataset)

Output

   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:        5.1         3.5          1.4         0.2    setosa
  2:        4.9         3.0          1.4         0.2    setosa
  3:        4.7         3.2          1.3         0.2    setosa
  4:        4.6         3.1          1.5         0.2    setosa
  5:        5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:        6.7         3.0          5.2         2.3 virginica
147:        6.3         2.5          5.0         1.9 virginica
148:        6.5         3.0          5.2         2.0 virginica
149:        6.2         3.4          5.4         2.3 virginica
150:        5.9         3.0          5.1         1.8 virginica

There are 150 rows and 5 columns in the Iris data set.

Let’s print the first six rows of the iris dataset −

head(myDataset)

Output

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa
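
As a short sketch of related ways to inspect a table, the snippet below uses the built-in iris data (an assumption, so that the code runs offline) to show a custom row count with head() and tail(), and row selection by position. Note that with a single index a data.table selects rows, whereas a data.frame would select columns.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# Dimensions: number of rows, then number of columns.
print(dim(iris_dt))

# First and last three rows.
print(head(iris_dt, 3))
print(tail(iris_dt, 3))

# A single index inside [ ] selects rows in a data.table.
first_three <- iris_dt[1:3]
print(first_three)
```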

Filtering Rows Based on a Condition

With a plain data.frame, column names are not visible inside the square brackets, so every condition has to repeat the name of the data frame (for example, df[df$Sepal.Length == 5.1, ]). This makes selecting or filtering rows on column conditions verbose and error-prone.

A data.table evaluates the expression inside the square brackets within the scope of the table itself, so columns can be referred to directly by name. We can therefore filter rows simply by passing the column conditions inside the square brackets −

myDataset[column_condition]

Here column_condition specifies the column conditions on the basis of which certain rows will be selected.
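
To make the contrast with a plain data.frame concrete, the following sketch applies the same condition in both styles. It uses the built-in iris data rather than the CSV file, and the names iris_df, iris_dt, df_rows and dt_rows are chosen only for this illustration.

```r
library(data.table)

iris_df <- iris                  # plain data.frame
iris_dt <- as.data.table(iris)   # data.table copy of the same data

# data.frame: every column must be prefixed with the data frame name.
df_rows <- iris_df[iris_df$Sepal.Length == 5.1 & iris_df$Petal.Length == 1.4, ]

# data.table: columns are referenced directly inside the brackets.
dt_rows <- iris_dt[Sepal.Length == 5.1 & Petal.Length == 1.4]

# Both approaches select the same number of rows.
print(nrow(df_rows))
print(nrow(dt_rows))
```

The data.table form also does not need the trailing comma, because a single argument inside the brackets is treated as a row filter.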

Let us consider an example to filter the dataset with the condition "Sepal.Length==5.1 & Petal.Length==1.4".

Example

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# data.table syntax to filter rows
# based on a column condition
myDataset[Sepal.Length == 5.1 & Petal.Length == 1.4, ]

Output

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          5.1         3.5          1.4         0.3  setosa

As you can see in the output above, two rows match the column condition provided inside the square brackets, and only those rows are returned.

Selecting Columns

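We will now see how to select columns of a dataset using the data.table package. The basic syntax for selecting columns is given below −

myDataset[, column_number, with = FALSE]

Here column_number is the position of the column that you want to select (columns are 1-based). Setting with = FALSE tells data.table to treat the value as a column position or name rather than as an expression to evaluate.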

Example

Let’s consider an example in which we want to select the second column of the iris dataset −

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
# data.table syntax to subset the second column
myDataset[, 2, with = FALSE]

Output

     Sepal.Width
  1:         3.5
  2:         3.0
  3:         3.2
  4:         3.1
  5:         3.6
 ---            
146:         3.0
147:         2.5
148:         3.0
149:         3.4
150:         3.0

As you can see above in the output, the second column of the iris dataset is selected.
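
The result above is a one-column data.table. When a plain vector is needed instead, the column can be extracted rather than subset. The sketch below, which assumes the built-in iris data and illustrative variable names, shows the difference.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# Subsetting by position keeps the data.table structure.
as_table <- iris_dt[, 2, with = FALSE]
print(class(as_table))

# Extracting with [[ ]] or by bare column name returns a plain vector.
as_vector <- iris_dt[[2]]
by_name   <- iris_dt[, Sepal.Width]
print(class(as_vector))
print(head(by_name))
```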

Example

Now let’s select multiple columns. In the below example, we select two columns, i.e., 'Petal.Length' and 'Species'.

# Importing library
library(data.table)
# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")
columns <- c('Petal.Length', 'Species')
# Selecting two columns, 'Petal.Length' and 'Species'
myDataset[, columns, with = FALSE]

Output

     Petal.Length   Species
  1:          1.4    setosa
  2:          1.4    setosa
  3:          1.3    setosa
  4:          1.5    setosa
  5:          1.4    setosa
 ---                       
146:          5.2 virginica
147:          5.0 virginica
148:          5.2 virginica
149:          5.4 virginica
150:          5.1 virginica

Here we selected two columns, 'Petal.Length' and 'Species'.
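
data.table offers a few other spellings for the same selection. The sketch below, again assuming the built-in iris data and illustrative variable names, compares with = FALSE against the .. prefix, the .() list form, and .SDcols. All four should return the same two columns.

```r
library(data.table)

iris_dt <- as.data.table(iris)
columns <- c("Petal.Length", "Species")

# 1. Character vector of names with with = FALSE (as in the example above).
sel_with <- iris_dt[, columns, with = FALSE]

# 2. The .. prefix looks up 'columns' in the calling scope.
sel_dots <- iris_dt[, ..columns]

# 3. .() lists the columns directly by name.
sel_list <- iris_dt[, .(Petal.Length, Species)]

# 4. .SD restricted to the chosen columns via .SDcols.
sel_sd <- iris_dt[, .SD, .SDcols = columns]

# Row filtering and column selection can be combined in one call.
setosa_only <- iris_dt[Species == "setosa", .(Petal.Length, Species)]
print(head(setosa_only, 3))
```

Which form to use is mostly a matter of taste; the .() form is convenient for interactive work, while the character-vector forms are easier to use when column names are stored in a variable.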

Conclusion

In this tutorial, we covered basic data manipulation techniques with data.table: importing datasets, filtering rows on the basis of column conditions, and selecting columns. I hope this tutorial helps you strengthen your knowledge in the field of data science.

Updated on: 17-Jan-2023
