Writing Efficient R Code

Writing efficient code is important because it shortens run time and development time, and well-structured code is easier to understand, debug, and maintain. We will discuss techniques such as benchmarking, vectorization, and parallel programming to make our R code faster. These techniques are well worth learning if you aspire to be a data scientist. So, let's get started.

Benchmarking

One of the easiest optimizations is to work with the latest version of R. A new version does not change our existing code, but it often ships with improved library functions that run faster.

The following command in R displays the version information of the running R installation:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">print(version)
</div>

Output

               _                                          
platform       x86_64-pc-linux-gnu                        
arch           x86_64                                     
os             linux-gnu                                  
system         x86_64, linux-gnu                          
status         Patched                                    
major          4                                          
minor          2.2                                        
year           2022                                       
month          11                                         
day            10                                         
svn rev        83330                                      
language       R                                          
version.string R version 4.2.2 Patched (2022-11-10 r83330)
nickname       Innocent and Trusting                              
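
The version can also be checked programmatically, which is handy at the top of a script. The following is a minimal sketch; the minimum version "4.0.0" is an assumption chosen only for illustration, not a recommendation from this article.

```r
# Installed R version as a comparable version object
current <- getRversion()
print(current)

# Hypothetical minimum version, chosen only for illustration
minimumWanted <- "4.0.0"

if (current < minimumWanted) {
  message("Consider upgrading: running R ", as.character(current))
} else {
  message("R ", as.character(current), " meets the minimum of ", minimumWanted)
}
```

getRversion() returns a version object, so the comparison is done component by component rather than as plain text.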

Reading a CSV file as RDS file

Loading a file with read.csv() can take a long time. A more efficient approach is to read the .csv file once, save the resulting data in .rds format, and read that binary file from then on. R provides the saveRDS() function to save an R object, such as the data frame returned by read.csv(), in .rds format, and the readRDS() function to load it again.

Example

Consider the following program that benchmarks the difference between the reading time of the same data stored in two different formats:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># URL of the CSV file
csvUrl <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"

# Display the time taken to read the file using read.csv()
print(system.time(myData <- read.csv(csvUrl)))

# Save the data frame (not the URL string) in .rds format
saveRDS(myData, "myFile.rds")

# Display the time taken to read the binary file
print(system.time(readRDS("myFile.rds")))
</div>

Output

 user   system  elapsed 
0.017    0.002    0.603 
 user   system  elapsed 
    0        0        0

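Notice the difference between the execution times of the two methods. The time taken to read the same data in .rds format is almost negligible, partly because read.csv() here also has to download and parse the file, while readRDS() loads a binary file from the local disk. Exact timings vary from run to run and between machines, but reading an RDS file is generally more efficient than reading a CSV file.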
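
To separate the cost of the file format from the cost of the network download, the comparison can be repeated on local files. The following is a minimal sketch, assuming a synthetic data frame (sampleData) and temporary file paths that are not part of the original example; the timings it prints will vary by machine.

```r
# Synthetic data, assumed only for illustration
set.seed(42)
sampleData <- data.frame(
  id    = 1:50000,
  value = sample.int(1000, 50000, replace = TRUE)
)

# Temporary files for the two formats
csvPath <- tempfile(fileext = ".csv")
rdsPath <- tempfile(fileext = ".rds")

# Store the same data frame in both formats
write.csv(sampleData, csvPath, row.names = FALSE)
saveRDS(sampleData, rdsPath)

# Time both reads; "<-" keeps the loaded data for later checks
csvTime <- system.time(fromCsv <- read.csv(csvPath))
rdsTime <- system.time(fromRds <- readRDS(rdsPath))

# Elapsed seconds for each format
print(c(csv = csvTime[["elapsed"]], rds = rdsTime[["elapsed"]]))

# Clean up the temporary files
unlink(c(csvPath, rdsPath))
```

Because both files sit on the local disk, any remaining gap reflects parsing text versus loading R's binary format.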

Assigning using "<-" and "=" operators

R provides several ways to assign values to objects. Two operators are widely used for this purpose: "<-" and "=". They differ inside a function call: there, "<-" creates a new object or overwrites an existing one in the calling environment, whereas "=" is interpreted as naming an argument of the function. Since we often want to keep the result of the expression being timed, "<-" is the operator to use inside system.time().
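
The difference can be seen in a short sketch. The variable names total, total2, and outcome are assumptions made for illustration; the error branch returns a fixed string rather than R's own message.

```r
# "<-" inside the call: the sum is computed, timed, and stored in total
timing <- system.time(total <- sum(as.numeric(1:1e6)))
print(total)

# "=" inside the call is parsed as an argument name, and system.time()
# has no argument called total2, so the call fails
outcome <- tryCatch(
  system.time(total2 = sum(as.numeric(1:1e6))),
  error = function(e) "failed: '=' was treated as an argument name"
)
print(outcome)

# total2 was never created
print(exists("total2"))
```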

Elapsed time microbenchmark function

The system.time() function is reliable for measuring the time taken by a single operation, but it runs the expression only once and cannot compare several operations in one call.

The microbenchmark package, available from CRAN via install.packages("microbenchmark"), provides a microbenchmark() function that runs each expression several times and lets us compare the time taken by two or more functions or operations.

Example

Consider the following program that uses the microbenchmark() function to compare reading the same data from two different formats, CSV and RDS:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">library(microbenchmark)

# URL of the CSV file
csvUrl <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"

# Read the data once and save the data frame in .rds format
saveRDS(read.csv(csvUrl), "myFile.rds")

# Compare both read methods using the microbenchmark() function
difference <- microbenchmark(
   read.csv(csvUrl),
   readRDS("myFile.rds"),
   times = 2)

# Display the timings
print(difference)
</div>

Output

        min         lq       mean     median         uq        max neval
 405062.028 405062.028 409947.146 409947.146 414832.264 414832.264     2
     41.151     41.151    102.355    102.355    163.559    163.559     2

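Notice the difference between the execution times of the two methods. The rows appear in the order in which the expressions were passed, so the first row is read.csv() and the second is readRDS(); the minimum time for readRDS() is several thousand times smaller.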

Efficient Vectorisation

Growing a vector step by step while the code runs is undesirable and should be avoided as much as possible. Each time the vector grows, R has to allocate a new, larger vector and copy the existing elements into it, which wastes time and makes the program inefficient.

Example

For example, the following source code grows a vector by one element in every iteration:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># expand() function
expand <- function(n) {
   myVector <- NULL
   for(currentNumber in 1:n)
      myVector <- c(myVector, currentNumber)

   myVector
}

# Using system.time() function
system.time(res_grow <- expand(1000))
</div>

Output

 user  system elapsed 
0.003   0.000   0.003

Even for just 1,000 elements the call takes measurable time, because every c() call allocates a new vector and copies the old contents into it. The cost rises steeply as n grows.

Example

We can optimize the above code by preallocating the vector. For example, consider the following program:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># expand() function with a preallocated vector
expand <- function(n) {
   myVector <- numeric(n)
   for(currentNumber in 1:n)
      myVector[currentNumber] <- currentNumber

   # Return the filled vector (a for loop on its own returns NULL)
   myVector
}

# Using system.time() function
system.time(res_prealloc <- expand(10000))
</div>

Output

  user  system elapsed 
 0.001   0.000   0.001

As you can see in the output, the execution time is lower than before, even though this call fills 10,000 elements rather than 1,000.
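
For a like-for-like comparison, both approaches can be timed on the same n. The following is a minimal sketch; the function names grow() and prealloc() and the value of n are assumptions made for illustration, and the timings will differ between machines.

```r
# Grow the vector by one element per iteration
grow <- function(n) {
  v <- NULL
  for (i in seq_len(n)) v <- c(v, i)
  v
}

# Allocate the full vector once, then fill it in place
prealloc <- function(n) {
  v <- integer(n)
  for (i in seq_len(n)) v[i] <- i
  v
}

n <- 20000
growTime <- system.time(grown <- grow(n))
preTime  <- system.time(filled <- prealloc(n))

# Elapsed seconds for each approach
print(c(grow = growTime[["elapsed"]], prealloc = preTime[["elapsed"]]))

# Both approaches build the same vector
print(identical(grown, filled))
```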

We should also vectorize our code whenever possible, that is, apply an operation to a whole vector at once instead of looping over its elements.

Example

For example, consider the following program that adds the values in a vector using the simple loop method:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># Initialize a vector with random values
myVector1 <- rnorm(20)

# Declare another vector
myVector2 <- numeric(length(myVector1))

# Compute the sum
for(index in 1:20)
   myVector2[index] <- myVector1[index] + myVector1[index]

# Display
print(myVector2)
</div>

Output

 [1]   1.31044581 -1.98035551  0.14009657 -1.62789103  1.23248277  0.49893302
 [7]  -0.53349928 -0.02553238 -0.06886832  1.16296981  0.90072271  0.20713797
[13]  -1.72293906  0.62083278  2.77900829  4.15732558  1.71227621  2.09531955
[19]  -0.06520153  0.62591177

Each output value is the corresponding element of myVector1 added to itself.

Example

The following program does the same thing, but this time it uses vectorization, which shortens the code and reduces the execution time:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">myVector1 <- rnorm(20)

# Add using vectorization (no preallocation or loop needed)
myVector2 <- myVector1 + myVector1

# Display
print(myVector2)
</div>

Output

 [1] -1.0100098  3.2932186 -3.5650312 -3.2800819  0.1513545 -1.5786916
 [7]  2.0485566  2.6009810 -0.8015987 -0.6965471 -1.4298714  1.1251865
[13]  1.2536663  2.6258258  1.1093443 -1.7895628  0.3472878 -1.4783578
[19] -0.7717328 -2.2734743

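Each output value is again an element of myVector1 added to itself, this time computed with a single vectorized expression. The numbers differ from the previous output only because rnorm() draws new random values on every run.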
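
With only 20 elements the speed difference is too small to notice. The following is a minimal sketch of how the two approaches could be timed on a larger input; the vector length of one million and the fixed seed are assumptions made for illustration, and the timings depend on the machine.

```r
set.seed(1)            # make the random input reproducible
x <- rnorm(1e6)

# Element-by-element loop with a preallocated result
loopTime <- system.time({
  loopResult <- numeric(length(x))
  for (i in seq_along(x)) loopResult[i] <- x[i] + x[i]
})

# Single vectorized expression
vecTime <- system.time(vecResult <- x + x)

# Elapsed seconds for each approach
print(c(loop = loopTime[["elapsed"]], vectorized = vecTime[["elapsed"]]))

# Both approaches give exactly the same values
print(identical(loopResult, vecResult))
```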

Note that many of R's built-in functions, such as log(), are themselves vectorized and can be applied to a whole vector in one call.

Example

For example, consider the following program that computes the log of the individual values present in a vector using a loop:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)

myVector2 <- numeric(length(myVector1))

# Compute the natural log of each element
for(index in 1:10)
   myVector2[index] <- log(myVector1[index])

# Display the vector
print(myVector2)
</div>

Output

[1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
[9] 4.605170 4.700480

As you can see in the output, the natural logarithm of each vector value has been displayed.

Example

Now let us achieve the same thing using the vectorization technique:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)

# log() is vectorized, so no preallocation or loop is needed
myVector2 <- log(myVector1)

# Display
print(myVector2)
</div>

Output

[1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
[9] 4.605170 4.700480

As you can see in the output, the natural logarithm of each vector value has been displayed again, but this time we have used the vectorization method.

Example

A matrix, whose elements all share one data type, offers faster column access than a data frame. For example, consider the following program:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">library(microbenchmark)

# Create a matrix
myMatrix <- matrix(c(1:12), nrow = 4, byrow = TRUE)

# Display
print(myMatrix)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Benchmark access to the first column of each structure
print(microbenchmark(myMatrix[,1], myDataframe[,1]))
</div>

Output

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Unit: nanoseconds
             expr  min     lq    mean median   uq   max neval
    myMatrix[, 1]  493  525.0  669.64  595.5  661  5038   100
 myDataframe[, 1] 6880 7110.5 8003.56 7247.0 7437 53752   100

In this run, the median column access time is about 0.6 microseconds for the matrix versus about 7.2 microseconds for the data frame, roughly a twelvefold difference.
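
The gap is plausibly explained by how the two structures are stored. The following is a minimal sketch that inspects both; building the data frame with as.data.frame() is an assumption made to keep the example short.

```r
myMatrix <- matrix(1:12, nrow = 4, byrow = TRUE)
myDataframe <- as.data.frame(myMatrix)

# A matrix is one atomic vector with a dim attribute
print(typeof(myMatrix))
print(dim(myMatrix))

# A data frame is a list with one element per column
print(is.list(myDataframe))
print(length(myDataframe))

# Both forms return the same values for the first column
print(identical(myMatrix[, 1], myDataframe[, 1]))

# For data of a single type, converting once allows matrix indexing later
asMatrix <- as.matrix(myDataframe)
print(asMatrix[, 1])
```

Converting to a matrix only makes sense when every column has the same type; mixed columns would be coerced to a common type.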

Parallel Programming for efficient R code

R ships with the parallel package, which lets us spread work across multiple CPU cores. Parallelism often finishes a job in less time and makes better use of system resources, although it adds overhead for starting workers and moving data. The package provides the parApply() function, a parallel counterpart of apply(). A typical parallel program follows these steps:

  • Make a cluster using the makeCluster() function.

  • Run the computation on the cluster, for example with parApply().

  • Finally, stop the cluster using the stopCluster() function.

Example

The following source code calculates the mean of all the columns using the parApply() function in R:

<div class="code-mirror  language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">library(parallel)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Create a cluster with two workers
cluster <- makeCluster(2)

# Apply mean() to each column (MARGIN = 2) on the cluster
print(parApply(cluster, myDataframe, 2, mean))

# Stop the cluster
stopCluster(cluster)
</div>

Output

data1 data2 data3 
  5.5   6.5   7.5

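As you can see in the output, the mean of each column has been computed on the cluster. For a data frame this small, the cost of starting the workers and sending them data most likely outweighs any gain; parallel execution pays off when each task is expensive enough to dwarf that overhead.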
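
If the parallel call fails, the cluster in the program above would never be stopped. One way to guard against this is to wrap the three steps in a function and register the cleanup with on.exit(). The following is a minimal sketch; the helper name parallelColMeans() and its workers argument are assumptions made for illustration, not part of the parallel package.

```r
library(parallel)

# Hypothetical helper: column means computed on a temporary cluster
parallelColMeans <- function(m, workers = 2) {
  cl <- makeCluster(workers)
  # Stop the cluster when the function exits, even after an error
  on.exit(stopCluster(cl), add = TRUE)
  parApply(cl, m, 2, mean)
}

m <- matrix(1:12, nrow = 4, byrow = TRUE)
result <- parallelColMeans(m)
print(result)

# The vectorized colMeans() gives the same answer without a cluster
print(isTRUE(all.equal(result, colMeans(m))))
```

For a task as simple as column means, the built-in colMeans() is the more natural choice; the helper only illustrates the cluster lifecycle.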

Conclusion

In this article, we briefly discussed how to write efficient code in R. We covered benchmarking, preallocation and vectorization techniques, and parallel programming. I hope this tutorial has helped you expand your knowledge of data science.

Updated on: 2023-01-17T16:05:04+05:30
