Joining Data in R with data.table


In this article, we will discuss joining data in R using data.table package. By the term “joining data” we mean to say that performing different types of joins operations like INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, AND FULL OUTER JOIN between two or more tables. The main purpose of doing join operations between tables is to access data from multiple tables on the basis of some attribute (or column) condition.

R provides us data.table package with the help of which we can handle tabular data (having rows and columns) very efficiently. This package was launched as an alternative to R’s default data.frame to manipulate data. The data.table is very efficient in terms of speed and code size as compared to data.frame. The syntax of data.table is also very intuitive and terse which makes it an ideal choice for developers. This package provides us functionalities using which we can efficiently perform join operations between tables.

In SQL, a join clause is used to join rows from two or more relations or tables based on the column condition between them.

Before proceeding further, we need to define two tables on which we can perform different join operations.

Creating tables with data.table

The data.table package provides us data.table() function using which we can easily construct these tables in R.

Syntax

This function has the following syntax −

data.table(column_name1 = values, column_name2 = values, column_name3 = values, …)

It Returns a table having data stored as rows and columns.

Example

Let’s create the Employees table −

# Importing data.table package library("data.table") # Creating table1 employeesTable <- data.table(Employee_ID = 1:5, Employee_Name = c("Bhuwanesh", "Jai", "Piyush", "Hitesh", "Anil"), Department_ID = c(1, 11, 4, 11, 8), Salary = c(30000, 27000, 19000, 21000, 25000)) # Print table1 print(employeesTable)

Output

   Employee_ID Employee_Name Department_ID Salary
1:           1     Bhuwanesh             1  30000
2:           2           Jai            11  27000
3:           3        Piyush             4  19000
4:           4        Hitesh            11  21000
5:           5          Anil             8  25000

Creating Departments Table −

# Importing data.table package library("data.table") # Creating table2 departmentsTable <- data.table(Department_ID = c(4, 11, 5, 17, 3), Department_Name = c("IT", "Finance", "Sales", "HR", "Marketing")) # Print table2 print(departmentsTable)

Output

   Department_ID Department_Name
1:             4              IT
2:            11         Finance
3:             5           Sales
4:            17              HR
5:             3       Marketing

Joining Data with data.table

In R, we have the following types of joins −

  • Inner join

  • Left outer join

  • Right outer join

  • Full outer join

Note that to perform a join between two tables using data.table you need to set a key on which we can set the ON clause.

The data.table package provides us setKey() function which has the following syntax −

setKey(table1, ON = column_name)
setKey(table2, ON = column_name)

Inner Join with data.table

This type of Join returns records that have a matching value in the respective tables.

For example, consider a pair of tables table1 and table2 having some data. Then, we can represent the inner join of both the tables in terms of the Venn diagram as −

Syntax

The inner join is the basic join under data.table and follows the below-given syntax −

innerJoin <- table1[table2, nomatch=0]

Example

Now let’s perform the inner join operation between tables Employees and Departments using data.table.

# Import data.table library library("data.table") # Create table1 employeesTable <- data.table(Employee_ID = 1:5, Employee_Name = c("Bhuwanesh", "Jai", "Piyush", "Hitesh", "Anil"), Department_ID = c(1, 11, 4, 11, 8), Salary = c(30000, 27000, 19000, 21000, 25000)) # Create table2 departmentsTable <- data.table(Department_ID = c(4, 11, 5, 17, 3), Department_Name = c("IT", "Finance", "Sales", "HR", "Marketing")) # Set the ON clause on Department_ID as the key of the tables setkey(employeesTable, ON = Department_ID) setkey(departmentsTable, ON = Department_ID) # Perform the inner join by eliminating not matched rows innerJoin <- employeesTable[departmentsTable, nomatch=0] # Display the result of inner join print(innerJoin)

Output

   Employee_ID Employee_Name Department_ID Salary Department_Name
1:           3        Piyush             4  19000              IT
2:           2           Jai            11  27000         Finance
3:           4        Hitesh            11  21000         Finance

Left Outer Join with data.table

This type of Join returns the matching records as well as the remaining records from the left table.

For example, consider a pair of tables table1 and table2 having some data. Then, we can represent left outer join of both the tables in terms of venn diagram as −

The left outer join is the basic join under data.table and follows the below-given syntax −

Syntax

table2[table1]

Example

Now let’s perform the left outer join operation between tables Employees and Departments using data.table −

# Import data.table library library("data.table") # Create table1 employeesTable <- data.table(Employee_ID = 1:5, Employee_Name = c("Bhuwanesh", "Jai", "Piyush", "Hitesh", "Anil"), Department_ID = c(1, 11, 4, 11, 8), Salary = c(30000, 27000, 19000, 21000, 25000)) # Create table2 departmentsTable <- data.table(Department_ID = c(4, 11, 5, 17, 3), Department_Name = c("IT", "Finance", "Sales", "HR", "Marketing")) # Set the ON clause on Department_ID as the key of the tables setkey(employeesTable,Department_ID) setkey(departmentsTable,Department_ID) # Perform the left outer join by including matched rows # and remaining rows from the left leftOuterJoin <- departmentsTable[employeesTable] # Display the result of left outer join print(leftOuterJoin)

Output

   Department_ID Department_Name Employee_ID Employee_Name Salary
1:             1            <NA>           1     Bhuwanesh  30000
2:             4              IT           3        Piyush  19000
3:             8            <NA>           5          Anil  25000
4:            11         Finance           2           Jai  27000
5:            11         Finance           4        Hitesh  21000

Right Outer Join with data.table

The right outer join returns the matching records as well as the remaining records from the right table. For example, consider a pair of tables table 1 and table 2 having some data. Then, we can represent right outer join of both the tables in terms of Venn diagram as −

Syntax

The right outer join is the basic join under data.table and follows the below-given syntax −

table1[table2]

Example

Now let’s perform the right outer join operation between tables Employees and Departments using data.table −

# Import data.table library library("data.table") # Create table1 employeesTable <- data.table(Employee_ID = 1:5, Employee_Name = c("Bhuwanesh", "Jai", "Piyush", "Hitesh", "Anil"), Department_ID = c(1, 11, 4, 11, 8), Salary = c(30000, 27000, 19000, 21000, 25000)) # Create table2 departmentsTable <- data.table(Department_ID = c(4, 11, 5, 17, 3), Department_Name = c("IT", "Finance", "Sales", "HR", "Marketing")) # Set the ON clause on Department_ID as the key of the tables setkey(employeesTable,Department_ID) setkey(departmentsTable,Department_ID) # Perform the right join by including matched rows # and remaining rows from the right rightOuterJoin <- employeesTable[departmentsTable] # Print the result of right join result print(rightOuterJoin)

Output

   Employee_ID Employee_Name Department_ID Salary Department_Name
1:          NA          <NA>             3     NA       Marketing
2:           3        Piyush             4  19000              IT
3:          NA          <NA>             5     NA           Sales
4:           2           Jai            11  27000         Finance
5:           4        Hitesh            11  21000         Finance
6:          NA          <NA>            17     NA              HR

Full Outer Join

A full outer join returns the matching records as well as the remaining records from the left table and the right table.

For example, consider a pair of tables table1 and table2 having some data. Then, we can represent the full outer join of both the tables in terms of the Venn diagram as −

Syntax

The full outer join is the basic join under data.table and follows the below-given syntax −

merge(table1, table2, all=TRUE)

Example

Now let’s perform the full outer join operation between tables Employees and Departments using data.table −

# Perform the full outer join fullOuterJoin <- merge(employeesTable, departmentsTable, all=TRUE) # Print the full outer join result print(fullOuterJoin)

Output

   Department_ID Employee_ID Employee_Name Salary Department_Name
1:             1           1     Bhuwanesh  30000            <NA>
2:             3          NA          <NA>     NA       Marketing
3:             4           3        Piyush  19000              IT
4:             5          NA          <NA>     NA           Sales
5:             8           5          Anil  25000            <NA>
6:            11           2           Jai  27000         Finance
7:            11           4        Hitesh  21000         Finance
8:            17          NA          <NA>     NA              HR

Conclusion

In this tutorial, we have discussed about how we can Join data in R with data.table. We discussed in detail different types of joins: Inner join, Left outer join, Right outer join, and Full outer join. I believe that this tutorial will strengthen your knowledge in the field of data science.

Updated on: 17-Jan-2023

1K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements