Data Manipulation in R with data.table
Data manipulation is a crucial step in the data analysis process, as it allows us to prepare and organize our data in a way that is suitable for the specific analysis or visualization. There are many different tools and techniques for data manipulation, depending on the type and structure of the data, as well as the specific goals of the manipulation.
The data.table package is an R package that provides an enhanced version of R's data.frame class. Its syntax and features make it easier and faster to manipulate and work with large datasets.
The data.table package is one of the most downloaded R packages and a popular choice among data scientists.
Installing the data.table package
Installing the data.table package is as simple as installing any other package. You can run the commands below in the R console to install it −
Installing the 'data.table' package from CRAN
install.packages('data.table')
Installing the development version from GitLab
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
Importing Datasets
In R programming language, we have tons of built-in datasets that one may use as demo data to demonstrate how the R functions work.
One popular built-in dataset is the "Iris" dataset. It provides measurements of four different attributes for 150 flowers (50 from each of three species).
Working with datasets in data.table is quite different from working with them in data.frame. Let's look at this in more detail.
The data.table package provides the fread() function (fast read), which is essentially data.table's version of read.csv(). Like read.csv(), it can read a file stored locally as well as a file hosted on a website.
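As a minimal offline sketch of reading a local file, the snippet below writes R's built-in iris data to a temporary CSV file with fwrite() and reads it back with fread(). The temporary file path and the variable names (csv_path, local_dataset) are illustrative assumptions and are not part of the original tutorial.

```r
library(data.table)

# Hypothetical local file: a temporary CSV created just for this sketch
csv_path <- tempfile(fileext = ".csv")

# Write the built-in iris data frame to disk as CSV
fwrite(iris, csv_path)

# Read the local file back; fread() detects the separator and column types
local_dataset <- fread(csv_path)

# Dimensions of the imported table (rows, columns)
dim(local_dataset)
```

The same call works with a URL in place of csv_path, provided the machine has network access.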
Example
Consider the program below, which imports iris data stored as a CSV file on the internet and checks the class of the result −
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# Print the class of the imported dataset
class(myDataset)
Output
[1] "data.table" "data.frame"
As you see from the above output, the imported data is directly stored as a data.table.
A data.table inherits from the data.frame class and is therefore a data.frame itself. As a result, functions that accept a data.frame will generally work on a data.table as well.
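The inheritance can be checked directly. The following is a small sketch that assumes the built-in iris data frame rather than the CSV file used elsewhere in this tutorial; the names iris_dt, n_rows, and col_names are chosen here for illustration.

```r
library(data.table)

# Convert the built-in iris data frame into a data.table
iris_dt <- as.data.table(iris)

# A data.table is still a data.frame
is.data.frame(iris_dt)

# Base R functions written for data frames work unchanged
n_rows <- nrow(iris_dt)
col_names <- colnames(iris_dt)
summary(iris_dt$Sepal.Length)
```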
Displaying the Iris Dataset
Example
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# Print the iris dataset
print(myDataset)
Output
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica
There are 150 rows and 5 columns in the Iris data set.
Let's print the first six rows of the iris dataset −
head(myDataset)
Output
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa
Filtering Rows Based on a Condition
With a plain data.frame, column names are not visible inside the square brackets, so every condition has to refer to a column through the data frame itself (for example, df$column). This makes selecting or filtering rows by column conditions verbose.
A data.table evaluates the expression inside the square brackets in the scope of the table, so columns can be referred to directly by name. Using data.table, we can filter rows simply by passing column conditions inside the square brackets.
myDataset[column_condition]
Here column_condition specifies the column conditions on the basis of which certain rows will be selected.
Let us consider an example to filter the dataset with the condition "Sepal.Length==5.1 & Petal.Length==1.4".
Example
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# data.table syntax to filter rows
# based on column condition
myDataset[Sepal.Length==5.1 & Petal.Length==1.4,]
Output
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          5.1         3.5          1.4         0.3  setosa
As you can see in the output above, two rows match the column condition provided inside the square brackets.
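To make the contrast with data.frame concrete, here is a short sketch that applies the same condition both ways. It assumes the built-in iris data frame instead of the CSV file, and the names iris_df, iris_dt, df_rows, and dt_rows are illustrative.

```r
library(data.table)

iris_df <- iris                  # plain data.frame
iris_dt <- as.data.table(iris)   # data.table copy of the same data

# data.frame: columns must be qualified with the data frame name
df_rows <- iris_df[iris_df$Sepal.Length == 5.1 & iris_df$Petal.Length == 1.4, ]

# data.table: column names are visible inside the brackets
dt_rows <- iris_dt[Sepal.Length == 5.1 & Petal.Length == 1.4]

# Both approaches select the same number of rows
nrow(df_rows) == nrow(dt_rows)
```

Note that in data.table the trailing comma after the condition is optional when only rows are being filtered.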
Selecting Columns
We will now see how to select columns of a dataset using the data.table package. The basic syntax for selecting columns is given below −
myDataset[, column_number, with = FALSE]
Here column_number is the position of the column that you want to subset (columns are 1-based).
Example
Let's consider an example in which we want to select the second column of the iris dataset −
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

# data.table syntax to subset second column
myDataset[, 2, with = FALSE]
Output
     Sepal.Width
  1:         3.5
  2:         3.0
  3:         3.2
  4:         3.1
  5:         3.6
 ---
146:         3.0
147:         2.5
148:         3.0
149:         3.4
150:         3.0
As you can see above in the output, the second column of the iris dataset is selected.
Example
Now let's select multiple columns. In the example below, we select two columns, i.e., 'Petal.Length' and 'Species' −
# Importing library
library(data.table)

# Creating a dataset
myDataset <- fread("https://raw.githubusercontent.com/gexijin/learnR/master/datasets/iris.csv")

columns <- c('Petal.Length', 'Species')

# selecting two columns - 'Petal.Length' and 'Species'
myDataset[, columns, with = FALSE]
Output
     Petal.Length   Species
  1:          1.4    setosa
  2:          1.4    setosa
  3:          1.3    setosa
  4:          1.5    setosa
  5:          1.4    setosa
 ---
146:          5.2 virginica
147:          5.0 virginica
148:          5.2 virginica
149:          5.4 virginica
150:          5.1 virginica
Here we selected two columns, 'Petal.Length' and 'Species'.
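data.table also offers other ways to express the same selection. The sketch below compares three of them on the built-in iris data, which is used here as an assumption in place of the CSV file; the names iris_dt, by_vector, by_prefix, and by_list are illustrative.

```r
library(data.table)

iris_dt <- as.data.table(iris)
columns <- c("Petal.Length", "Species")

# Character vector of names combined with with = FALSE
by_vector <- iris_dt[, columns, with = FALSE]

# Same selection using the .. prefix to look up 'columns' outside the table
by_prefix <- iris_dt[, ..columns]

# Same selection by naming the columns directly inside .()
by_list <- iris_dt[, .(Petal.Length, Species)]

# All three results contain the same two columns
names(by_list)
```

The .() form is convenient when the column names are known in advance, while the character-vector forms suit cases where the names are computed at run time.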
Conclusion
In this tutorial, we covered several data manipulation techniques with data.table: importing datasets, filtering rows on the basis of column conditions, and selecting columns. I hope this tutorial helps you strengthen your knowledge of data science.