Parallel Programming in R

Parallel programming is a software development practice that involves dividing a computation or task into smaller parts that can be executed concurrently. It can improve the performance and efficiency of your R code by utilizing multiple processors or cores in a computer or cluster. The main idea is that if an operation takes S seconds on a single processor, then ideally it should take about S / N seconds when N processors are involved. In practice the gain is usually smaller, because splitting the work and collecting the results has a cost of its own.
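The idealized S / N estimate above can be tabulated for a few processor counts. This is only a sketch: the value of S and the processor counts are assumptions chosen for illustration, and real timings will be higher because of overhead.

```r
# Assumed values, chosen only for illustration
S <- 120                 # seconds on a single processor
N <- c(1, 2, 4, 8)       # number of processors

# Ideal running time if the work splits perfectly and overhead is zero
idealSeconds <- S / N
print(data.frame(processors = N, idealSeconds = idealSeconds))
```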

Need for Parallel Programming in R

Most of the time, R code runs fast enough on a single core. But sometimes operations can −

Consume too much CPU time.
Occupy too much memory space.
Take too much time to read from or write to a disk.
Take a lot of time to transfer data.

Hidden Parallelism

R has robust library support. Sometimes we do parallel programming without even knowing it, because some of the libraries R uses have built-in parallelism that runs in the background. This kind of hidden parallelism improves performance without any extra effort on our part. Still, it is good to know what is actually happening behind the scenes.

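Let us consider an example of hidden parallelism −

Parallel BLAS

The Basic Linear Algebra Subprograms (BLAS) library is what R calls to perform linear algebra operations such as matrix multiplication. It is not written in R. Optimized BLAS implementations are tuned for a particular type of CPU in order to take advantage of the architecture of the chipset, and some of them are multithreaded. It is usually beneficial to have an optimized BLAS, as it speeds up linear algebra computations.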
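As a rough way to see this on your own machine, the sketch below reports which BLAS library the current R session is linked against and times a matrix multiplication. The matrix size is an assumption chosen only for illustration, and whether more than one core is used depends entirely on the BLAS that is installed.

```r
# Report which BLAS and LAPACK libraries this R session is linked against.
# The output depends entirely on the local installation.
print(extSoftVersion()["BLAS"])
print(La_library())

# Matrix multiplication is delegated to BLAS. With a multithreaded BLAS this
# call may use several cores even though the R code has no parallel call.
set.seed(1)
n <- 500  # assumed matrix size, chosen only for illustration
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)
timing <- system.time(C <- A %*% B)
print(timing)
```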

Embarrassing Parallelism

Embarrassing parallelism is a common pattern in statistics and data science, and many problems in these fields fit it. In this type of parallelism, the problem is divided into multiple independent parts that are all executed simultaneously, because they don’t depend on each other.

Syntax

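The starting point for embarrassingly parallel work in R is the lapply() function. lapply() itself runs sequentially, but because it applies a function to each element independently, it is easy to replace with a parallel equivalent. This function has the following syntax −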

lapply(list, function)

Example

It accepts a list and a function, and returns a list whose length is equal to that of the input list. Let us consider a program illustrating the working of this function −

# Creating a list
myList <- list(data1 = 1:5, data2 = 10:15)

# Use lapply() function and
# calculate the mean
lapply(myList, mean)

Output

$data1
[1] 3

$data2
[1] 12.5

As you can see in the output, mean values for list elements have been displayed.

The lapply() function works like a loop: it iterates over the elements of the list and applies the function to each of them in turn.
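To make the comparison concrete, here is a minimal sketch of an explicit loop that produces the same result as lapply(myList, mean). The variable names are chosen only for illustration.

```r
myList <- list(data1 = 1:5, data2 = 10:15)

# Pre-allocate the result list and carry over the element names
results <- vector("list", length(myList))
names(results) <- names(myList)

# Visit the elements one at a time, exactly as lapply() does
for (i in seq_along(myList)) {
  results[[i]] <- mean(myList[[i]])
}

print(results)
```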

Now let us get more insights into what is happening actually −

lapply() processes the elements one at a time, so the remaining elements just sit idle in memory while the function is applied to a single element. This work can be parallelized in R. The main idea is to divide the list among multiple processors and then apply the function to all the subsets of the list simultaneously.

So, we can achieve parallelism using the following steps −

Split the list into chunks, one for each processor.
Copy the supplied function to each processor.
Apply the function to the chunks on multiple cores simultaneously.
Combine the results from the cores into a single list.
Display the result.
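The steps above can be sketched by hand with the parallel package. This is only an illustration of the idea, not how any particular function is implemented internally: the four-element list and the worker count of 2 are assumptions, and a socket cluster is used so that the sketch is not tied to one operating system.

```r
library(parallel)

myList <- list(data1 = 1:5, data2 = 10:15, data3 = 20:30, data4 = 1:100)

# Step 1: split the list into one chunk per worker
nWorkers <- 2  # assumed worker count, chosen only for illustration
chunks <- lapply(splitIndices(length(myList), nWorkers),
                 function(idx) myList[idx])

# Steps 2 and 3: start the workers, then send the function and one chunk
# to each of them so the chunks are processed at the same time
cl <- makeCluster(nWorkers)
partial <- clusterApply(cl, chunks, function(chunk) lapply(chunk, mean))
stopCluster(cl)

# Step 4: combine the per-worker results into a single list
results <- do.call(c, partial)

# Step 5: display the result
print(results)
```

In everyday code these steps are handled for you by functions such as mclapply() and parLapply().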

Parallel Programming package in R

The parallel package is included with the standard installation of R. It builds on two earlier packages: snow and multicore.

The parallel package is used to distribute tasks to the available cores. One way to do this is the mclapply() function. The mclapply() function is analogous to lapply(), but it is capable of distributing the work to multiple processors. It also collects the results of the function calls, combines them, and returns them as a list of the same length as the original list. Note that the package also provides the detectCores() function, which reports the number of cores present in the system.

Let us consider the following program illustrating the working of mclapply() function −

Note − A value of “mc.cores” greater than one works only on non-Windows operating systems, because mclapply() relies on forking. So, the code below must be run on an operating system other than Windows.

Example

# Load the parallel package
library(parallel)

# Creating a list
myList <- list(data1 = 1:10000000, data2 = 1:100000000)

cat("The estimated time using lapply() function:\n")

# Calculate the time taken using lapply()
system.time(
   results <- lapply(myList, mean)
)

# Get the number of cores
numberOfCores <- detectCores()

cat("The estimated time using mclapply() function:\n")

# Calculate the time taken using mclapply()
system.time(
   results <- mclapply(myList, mean, mc.cores = numberOfCores)
)

Output

The estimated time using lapply() function:
   user  system elapsed 
   0.40    0.00    0.43 

The estimated time using mclapply() function:
   user  system elapsed 
   0.12    0.00    0.17

You can see in the output the difference in elapsed time between the lapply() and mclapply() functions. The exact numbers will vary from machine to machine.
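On Windows, where mc.cores greater than one is not supported, the same package offers parLapply(), which runs on a socket cluster instead of forked processes. The following is a minimal sketch, assuming two workers and a small list chosen only for illustration.

```r
library(parallel)

myList <- list(data1 = 1:10, data2 = 1:100)

# Assumption: two workers, chosen only for illustration
cl <- makeCluster(2)

# parLapply() takes the cluster as its first argument and otherwise
# mirrors lapply()
results <- parLapply(cl, myList, mean)

# Always shut the workers down when finished
stopCluster(cl)

print(results)
```

Starting a socket cluster has a noticeable cost, so for very small tasks like this one a plain lapply() is likely to be faster.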

Parallel Programming using foreach and doParallel packages

Now we will see how we can implement parallel programming using the foreach package in R. But before going into it, let us see how a basic for loop works in R −

Example

# Iterate using the for loop from 1 to 5
# And print the square of each number
for (data in 1:5) {
   print(data * data)
}

Output

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

As you can see in the output, the square of each number from 1 to 5 is displayed on the console.

Foreach Package

Now let us talk about the foreach package. It provides the foreach() function, with which we can easily achieve parallel programming.

Syntax

If you haven’t installed the foreach package on your system yet, run the following command in the R console to install it from CRAN −

install.packages("foreach")

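The foreach() function is similar to a basic for loop, but it is used with the %do% operator, which evaluates the expression that follows it once for each value. The two also differ in what they return: the value of a for loop is NULL, whereas foreach() returns the results of all iterations, as a list by default.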

Example

Consider the following program that illustrates the working of the foreach method −

# Import foreach library
library(foreach)

# Iterate using the foreach loop from 1 to 5
# And return the square of each number
foreach (data=1:5) %do%  {
   data * data
}

Output

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

As you can see in the output, foreach() returns the squares of the numbers from 1 to 5 as a list, which is then displayed on the console.
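The difference in return values can be seen in a small sketch. It also uses the .combine argument of foreach() to collect the results into a vector instead of a list. The variable names are chosen only for illustration.

```r
library(foreach)

# The value of a for loop is NULL, so nothing is collected
loopValue <- for (data in 1:5) {
  data * data
}
print(is.null(loopValue))

# foreach() returns the value of every iteration;
# .combine = c joins them into a vector instead of a list
squares <- foreach(data = 1:5, .combine = c) %do% {
  data * data
}
print(squares)
```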

doParallel Package

The %dopar% operator is defined by the foreach package, and the doParallel package provides the parallel backend that %dopar% needs. Once a backend is registered, using this operator with foreach lets the iterations run on different processing cores. You can install the “doParallel” package from CRAN using the following command −

install.packages("doParallel")

Example

Now let us consider the following program, which demonstrates the working of the foreach() function along with the %dopar% operator −

# Import the foreach and doParallel libraries
library(foreach)
library(doParallel)

# Get the total number of cores
numOfCores <- detectCores()

# Register all the cores
registerDoParallel(cores = numOfCores)

# Iterate using the foreach loop from 1 to 5
# And compute the square of each number
# Using parallelism
foreach (data=1:5) %dopar%  {
   data * data
}

# Release the workers registered above
stopImplicitCluster()

Output

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

The square of each number from 1 to 5 is displayed on the console.
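An alternative is to create the cluster explicitly, which makes it clear when the workers are started and stopped. The sketch below assumes two workers, chosen only for illustration, and uses getDoParWorkers() from foreach to confirm how many workers are registered.

```r
library(foreach)
library(doParallel)

# Assumption: two workers, chosen only for illustration
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Confirm how many workers the registered backend will use
workers <- getDoParWorkers()
print(workers)

# Results come back in iteration order by default
squares <- foreach(data = 1:5, .combine = c) %dopar% {
  data * data
}

# Shut the workers down when finished
parallel::stopCluster(cl)

print(squares)
```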

Conclusion

In this tutorial, we discussed parallel programming in R. We talked about packages like foreach and doParallel, with which parallel programming is achievable in R, and we saw how functions like mclapply() work. Parallel programming is an important concept in any programming language, and we hope this tutorial has helped you understand how to apply it in R.

Bhuwanesh Nainwal

Updated on: 17-Jan-2023
