Data Manipulation in R with data.table
Data manipulation is a crucial step in the data analysis process, as it allows us to prepare and organize our data in a way that is suitable for the specific analysis or visualization. There are many different tools and techniques for data manipulation, depending on the type and structure of the data, as well as the specific goals of the manipulation.
The data.table package is an R package that provides an enhanced version of R’s data.frame class. Its syntax and features make it easier and faster to manipulate large datasets.
The data.table package is one of the most downloaded R packages and a popular choice among data scientists.
Installing the data.table package
Installing the data.table package is as simple as installing any other package. You can run the commands below in the R console to install it −
Installing the data.table package from CRAN
Installing the development version from GitLab
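The two installation routes named above can be sketched as follows. The CRAN mirror URL is one common choice, and the GitLab repository URL for the development build is an assumption based on the data.table project README, so it is left commented out; check the project page before relying on it.

```r
# Install the released version from CRAN only if it is not already available
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table", repos = "https://cloud.r-project.org")
}

# Development build (assumed repository URL; verify against the project README):
# install.packages("data.table", repos = "https://rdatatable.gitlab.io/data.table")

# Load the package and report the installed version
library(data.table)
packageVersion("data.table")
```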
The R programming language ships with many built-in datasets that can be used as demo data to show how R functions work.
One popular built-in dataset is the “Iris” dataset. It provides measurements of four attributes for 150 flowers, 50 from each of three species.
Working with datasets in data.table is quite different from working with them in data.frame. Let’s look at this in more detail.
The data.table package provides the fread() function (fast read), which is data.table’s counterpart to read.csv(). Like read.csv(), it can read a file stored locally as well as a file hosted on a website.
Consider the below program that imports iris data stored as a CSV file on the internet −
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# Check the class of the imported object
class(myDataset)

[1] "data.table" "data.frame"
As you see from the above output, the imported data is directly stored as a data.table.
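fread() reads local files in the same way. The following is a minimal sketch that avoids the network: it writes R’s built-in iris data to a temporary CSV file with fwrite() and reads it back. The names csv_path and localDataset are illustrative only and do not come from the tutorial.

```r
library(data.table)

# Write the built-in iris data to a temporary CSV file (illustrative local file)
csv_path <- tempfile(fileext = ".csv")
fwrite(iris, csv_path)

# Read the file back with fread(); the result is a data.table
localDataset <- fread(csv_path)
class(localDataset)
dim(localDataset)

# Remove the temporary file
unlink(csv_path)
```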
A data.table inherits from the data.frame class, so it is also a data.frame. Functions that accept a data.frame will therefore generally work on a data.table as well.
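The inheritance can be checked directly. This sketch converts the built-in iris data.frame with as.data.table() instead of downloading the CSV; irisDT is an illustrative name.

```r
library(data.table)

# Convert the built-in data.frame to a data.table
irisDT <- as.data.table(iris)

class(irisDT)          # lists both "data.table" and "data.frame"
is.data.frame(irisDT)  # a data.table passes the data.frame check

# Base R functions written for data.frame also accept the data.table
nrow(irisDT)
summary(irisDT)
```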
Displaying IRIS Dataset
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# print the iris dataset
print(myDataset)
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica
There are 150 rows and 5 columns in the Iris data set.
Let’s print the first six rows of the iris dataset, for example with head(myDataset) −
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa
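The dimensions and a preview like the one above can be obtained with base R helpers. This is a small sketch on the built-in iris data rather than the downloaded CSV; irisDT and firstSix are illustrative names.

```r
library(data.table)

irisDT <- as.data.table(iris)

# Number of rows and columns
dim(irisDT)

# head() returns the first six rows by default
firstSix <- head(irisDT)
print(firstSix)

# tail() does the same from the end; here the last three rows
tail(irisDT, 3)
```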
Filtering Rows Based on a Condition
The main inconvenience of a plain data.frame is that column names are not visible inside the square brackets. To filter rows on column conditions, each column has to be written with the dataset prefix, as in myDataset$Sepal.Length.
A data.table evaluates the expression inside the square brackets within the table itself, so columns can be referred to directly by name. This makes it easy to filter rows by passing column conditions inside the square brackets.
In the general form myDataset[column_condition, ], column_condition specifies the conditions on the columns that decide which rows are selected.
Let us consider an example to filter the dataset with the condition "Sepal.Length==5.1 & Petal.Length==1.4".
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# data.table syntax to filter rows
# based on column condition
myDataset[Sepal.Length==5.1 & Petal.Length==1.4,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          5.1         3.5          1.4         0.3  setosa
As you can see in the output above, two rows match the column condition provided inside the square brackets and are returned.
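To make the contrast with a plain data.frame concrete, here is a minimal sketch that applies the same condition both ways to the built-in iris data. The names irisDT, filteredDT and filteredDF are illustrative.

```r
library(data.table)

irisDT <- as.data.table(iris)

# data.table: columns are referenced directly inside the brackets
filteredDT <- irisDT[Sepal.Length == 5.1 & Petal.Length == 1.4]

# Plain data.frame: every column needs the iris$ prefix
filteredDF <- iris[iris$Sepal.Length == 5.1 & iris$Petal.Length == 1.4, ]

# Both approaches select the same rows
nrow(filteredDT)
nrow(filteredDF)
```

The trailing comma after the condition is optional in data.table, so irisDT[condition] and irisDT[condition, ] return the same rows.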
We will now see how to select columns of a dataset using the data.table package. The basic syntax for selecting columns is given below −
myDataset[, column_number, with = FALSE]
Here column_number is the position of the column that you want to select (column positions are 1-based).
Let’s consider an example in which we want to select the second column of the iris dataset −
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# data.table syntax to subset second column
myDataset[, 2, with = FALSE]
     Sepal.Width
  1:         3.5
  2:         3.0
  3:         3.2
  4:         3.1
  5:         3.6
 ---
146:         3.0
147:         2.5
148:         3.0
149:         3.4
150:         3.0
As you can see above in the output, the second column of the iris dataset is selected.
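Selecting by position with with = FALSE returns a one-column data.table rather than a bare vector. The sketch below, on the built-in iris data, shows the difference; the [[ ]] extraction is standard list-style indexing added here for comparison and is not part of the original example.

```r
library(data.table)

irisDT <- as.data.table(iris)

# Column selected by position: the result is still a data.table
secondCol <- irisDT[, 2, with = FALSE]
names(secondCol)
class(secondCol)

# List-style extraction returns the underlying numeric vector instead
secondVec <- irisDT[[2]]
class(secondVec)
```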
Now let’s select multiple columns. In the below example, we select two columns, i.e., 'Petal.Length' and 'Species'.
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# selecting two columns - 'Petal.Length' and 'Species'
columns <- c('Petal.Length', 'Species')
myDataset[, columns, with = FALSE]
     Petal.Length   Species
  1:          1.4    setosa
  2:          1.4    setosa
  3:          1.3    setosa
  4:          1.5    setosa
  5:          1.4    setosa
 ---
146:          5.2 virginica
147:          5.0 virginica
148:          5.2 virginica
149:          5.4 virginica
150:          5.1 virginica
Here we selected two columns, 'Petal.Length' and 'Species'.
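data.table offers a few equivalent spellings for selecting columns by name. The sketch below compares them on the built-in iris data; the .. prefix and the .() shorthand are data.table features that the original example does not use, shown here only as alternatives.

```r
library(data.table)

irisDT <- as.data.table(iris)
columns <- c("Petal.Length", "Species")

# Column names held in a character vector, as in the example above
byVector <- irisDT[, columns, with = FALSE]

# Same selection using the .. prefix to look up the variable outside the table
byPrefix <- irisDT[, ..columns]

# Same selection with unquoted column names inside .()
byList <- irisDT[, .(Petal.Length, Species)]

names(byVector)
names(byPrefix)
names(byList)
```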
In this tutorial, we covered several data manipulation techniques with data.table: importing a dataset with fread(), filtering rows on column conditions, and selecting one or more columns. I hope this tutorial helps you strengthen your knowledge in the field of data science.