Writing Efficient R Code
Writing efficient code is important because it shortens development time and makes our programs easier to understand, debug, and maintain. We will discuss techniques such as benchmarking, vectorization, and parallel programming to make our R code faster. These techniques are worth learning if you are aspiring to be a data scientist. So, let's get started.
Benchmarking
One of the easiest optimizations is to run the latest version of R. Upgrading does not change our existing code, but new releases often include library functions with improved execution time.
The following command displays the version information of the running R installation:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">print(version)
</div>
Output
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status         Patched
major          4
minor          2.2
year           2022
month          11
day            10
svn rev        83330
language       R
version.string R version 4.2.2 Patched (2022-11-10 r83330)
nickname       Innocent and Trusting
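If a script relies on a particular release, the version can also be checked programmatically. The following is a minimal sketch using base R's getRversion(); the threshold "4.0.0" is an assumption chosen only for illustration.

```r
# Compare the running R version against a minimum version.
# The threshold below is illustrative, not a requirement from this article.
current <- getRversion()
minimum <- "4.0.0"

if (current < minimum) {
  warning("R ", as.character(current), " is older than ", minimum,
          "; consider upgrading.")
} else {
  message("Running R ", as.character(current))
}
```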
Reading a CSV file as RDS file
Loading large files with read.csv() can take a lot of time. A more efficient approach is to read the .csv file once, save the resulting data in .rds format, and read the binary file from then on. R provides the saveRDS() function to save an R object, such as the data frame read from a .csv file, in .rds format, and the readRDS() function to load it back.
Example
Consider the following program that benchmarks the difference between the reading time of the same data stored in two different formats:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># URL of the sample CSV file
csvUrl <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"

# Display the time taken to read the file using read.csv()
print(system.time(myData <- read.csv(csvUrl)))

# Save the data frame (not the URL string) in .rds format
saveRDS(myData, "myFile.rds")

# Display the time taken to read the binary file
print(system.time(readRDS("myFile.rds")))
</div>
Output
user system elapsed
0.017 0.002 0.603
user system elapsed
0 0 0
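Notice the difference between the execution times of the two methods. The time taken to read the file in .rds format is almost negligible. Keep in mind that read.csv() here also downloads the file, so part of its elapsed time is network latency; even so, reading a binary .rds file is generally faster than parsing the same data from CSV. Exact timings vary from machine to machine.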
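To take the download out of the comparison, both formats can be read from local files. The sketch below assumes a synthetic data frame; its size and column names are made up for illustration and the timings it prints will differ on every machine.

```r
# Build a sample data frame (size and column names are illustrative)
set.seed(42)
sampleData <- data.frame(id = 1:10000,
                         value = rnorm(10000),
                         group = sample(letters, 10000, replace = TRUE))

# Write the same data to temporary .csv and .rds files
csvFile <- tempfile(fileext = ".csv")
rdsFile <- tempfile(fileext = ".rds")
write.csv(sampleData, csvFile, row.names = FALSE)
saveRDS(sampleData, rdsFile)

# Time both readers on local files, so no download time is included
print(system.time(fromCsv <- read.csv(csvFile)))
print(system.time(fromRds <- readRDS(rdsFile)))

# Check that both readers returned the same data
print(all.equal(fromCsv, fromRds))

# Remove the temporary files
unlink(c(csvFile, rdsFile))
```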
Assigning using "<-" and "=" operators
R provides several ways to assign values to objects. Two operators are widely used for this purpose: "<-" and "=". At the top level they behave alike, but inside a function call they differ: "=" is read as naming an argument, whereas "<-" evaluates the expression and creates a new object (or overwrites an existing one) in the calling environment. Since we usually want to keep the result of the operation we are timing, "<-" is the operator to use inside the system.time() function.
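The difference can be seen in a small sketch. The function slowSquare() is a hypothetical placeholder introduced only to have something to time.

```r
# A placeholder function to time (hypothetical, for illustration only)
slowSquare <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# "<-" inside the call evaluates the expression and stores the result
timing <- system.time(result <- slowSquare(4))
print(result)

# "=" inside the call is parsed as a named argument of system.time(),
# which has no argument called result2, so the call fails
outcome <- tryCatch(
  system.time(result2 = slowSquare(4)),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

After running this, result holds the computed value, while result2 was never created.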
Measuring elapsed time with the microbenchmark() function
The system.time() function is reliable for timing a single operation, but it evaluates one expression once, so it is inconvenient for comparing several operations.
The microbenchmark package provides a microbenchmark() function that runs each expression a given number of times and reports summary statistics, which lets us compare two or more functions or operations side by side.
Example
Consider the following program that uses the microbenchmark() function to compare reading the same data in two different formats, CSV and RDS:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">library(microbenchmark)

csvUrl <- "https://people.sc.fsu.edu/~jburkardt/data/csv/snakes_count_10000.csv"

# Save the data (not the URL string) in .rds format
saveRDS(read.csv(csvUrl), "myFile.rds")

# Compare using microbenchmark() function
difference <- microbenchmark(read.csv(csvUrl),
                             readRDS("myFile.rds"),
                             times = 2)

# Display the time difference
print(difference)
</div>
Output
min lq mean median uq max neval
405062.028 405062.028 409947.146 409947.146 414832.264 414832.264 2
41.151 41.151 102.355 102.355 163.559 163.559 2
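The rows appear in the order in which the expressions were passed, so the first row corresponds to read.csv() and the second to readRDS(). Notice the difference between the execution times of the two methods.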
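The idea behind repeated benchmarking can be sketched in base R as well. The helper timeRepeated() below is hypothetical and much cruder than microbenchmark(); it only illustrates running an operation several times and summarising the elapsed time.

```r
# Hypothetical helper: run f() several times and summarise the elapsed seconds
timeRepeated <- function(f, times = 5) {
  elapsed <- vapply(seq_len(times),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  c(min = min(elapsed), median = median(elapsed), max = max(elapsed))
}

x <- runif(1e6)  # illustrative input

# Two expressions that compute the same values
results <- rbind(sqrt  = timeRepeated(function() sqrt(x)),
                 power = timeRepeated(function() x^0.5))
print(results)
```

Reporting the median rather than a single run makes the comparison less sensitive to one-off delays.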
Efficient Vectorization
Growing a vector step by step as the code runs is undesirable and should be avoided as much as possible. Each time the vector grows, R has to allocate a new, larger vector and copy the old contents into it, which consumes time and makes our program inefficient.
Example
For example, the following source code increases the size of the vector on every iteration:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># expand() function
expand <- function(n) {
  myVector <- NULL
  for(currentNumber in 1:n)
    myVector <- c(myVector, currentNumber)
  myVector
}

# Using system.time() function
system.time(res_grow <- expand(1000))
</div>
Output
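user system elapsed
0.003 0.000 0.003
Each call to c() copies the whole vector into a new, larger one, so the total work grows roughly quadratically with n. The cost is small for 1,000 elements but rises quickly for larger inputs.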
Example
We can optimize the above code by preallocating the vector. For example, consider the following program:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># expand() function
expand <- function(n) {
  # Allocate the full vector once
  myVector <- numeric(n)
  for(currentNumber in 1:n)
    myVector[currentNumber] <- currentNumber
  # Return the filled vector
  myVector
}

# Using system.time() function
system.time(res_prealloc <- expand(10000))
</div>
Output
user system elapsed
0.001 0.000 0.001
As you can see in the output, the execution time has dropped even though this run fills 10,000 elements rather than 1,000.
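For a like-for-like comparison, both approaches can be timed on the same input size. In the sketch below, the function names growVector() and fillVector() and the size n are assumptions made for illustration; the printed timings depend on the machine.

```r
# Grow the vector one element at a time (copies the vector on every iteration)
growVector <- function(n) {
  v <- NULL
  for (i in seq_len(n)) v <- c(v, i)
  v
}

# Preallocate the vector, then fill it in place
fillVector <- function(n) {
  v <- integer(n)
  for (i in seq_len(n)) v[i] <- i
  v
}

n <- 20000  # illustrative size

print(system.time(grown  <- growVector(n)))
print(system.time(filled <- fillVector(n)))
print(system.time(direct <- seq_len(n)))   # no loop at all

# All three approaches should produce the same vector
print(identical(grown, filled) && identical(filled, direct))
```

When the result can be produced by a single built-in call such as seq_len(), that is usually the simplest and fastest option.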
We should vectorize our code whenever possible, that is, operate on whole vectors at once instead of looping over individual elements.
Example
For example, consider the following program that adds the values in a vector using the simple loop method:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;"># Initialize a vector with random values
myVector1 <- rnorm(20)

# Declare another vector
myVector2 <- numeric(length(myVector1))

# Compute the sum
for(index in 1:20)
  myVector2[index] <- myVector1[index] + myVector1[index]

# Display
print(myVector2)
</div>
Output
 [1]  1.31044581 -1.98035551  0.14009657 -1.62789103  1.23248277  0.49893302
 [7] -0.53349928 -0.02553238 -0.06886832  1.16296981  0.90072271  0.20713797
[13] -1.72293906  0.62083278  2.77900829  4.15732558  1.71227621  2.09531955
[19] -0.06520153  0.62591177
The output shows each element of the vector added to itself.
Example
The following program does the same thing as above, but this time we use vectorization, which decreases both the code size and the execution time:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">myVector1 <- rnorm(20)

# Add using vectorization (no preallocation or loop is needed)
myVector2 <- myVector1 + myVector1

# Display
print(myVector2)
</div>
Output
 [1] -1.0100098  3.2932186 -3.5650312 -3.2800819  0.1513545 -1.5786916
 [7]  2.0485566  2.6009810 -0.8015987 -0.6965471 -1.4298714  1.1251865
[13]  1.2536663  2.6258258  1.1093443 -1.7895628  0.3472878 -1.4783578
[19] -0.7717328 -2.2734743
The output again shows each element added to itself, this time computed with a single vectorized expression. The values differ from the previous output only because rnorm() draws new random numbers on each run.
Note that many of R's built-in functions are themselves vectorized, so the same technique applies to them.
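With only 20 elements the speed difference is hard to see. The sketch below repeats the comparison on a larger vector; the vector length and the seed are arbitrary choices for illustration, and the seed ensures both versions work on the same random values.

```r
set.seed(1)          # fix the random values so the run is reproducible
x <- rnorm(1e6)      # illustrative size

# Loop version: add each element to itself, one index at a time
loopResult <- numeric(length(x))
loopTime <- system.time(
  for (i in seq_along(x)) loopResult[i] <- x[i] + x[i]
)

# Vectorized version: one expression over the whole vector
vecTime <- system.time(vecResult <- x + x)

print(loopTime)
print(vecTime)

# Both versions should give exactly the same values
print(identical(loopResult, vecResult))
```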
Example
For example, consider the following program that computes the log of individual values present in a vector using a loop:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)

myVector2 <- numeric(length(myVector1))

# Compute the log of each element
for(index in 1:10)
  myVector2[index] <- log(myVector1[index])

# Display the vector
print(myVector2)
</div>
Output
[1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
[9] 4.605170 4.700480
As you can see in the output, the natural logarithm of each vector value has been displayed.
Example
Now let us achieve the same thing using the vectorization technique:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">myVector1 <- c(8, 10, 13, 16, 32, 64, 57, 88, 100, 110)

# log() is vectorized, so it can be applied to the whole vector at once
myVector2 <- log(myVector1)

# Display
print(myVector2)
</div>
Output
[1] 2.079442 2.302585 2.564949 2.772589 3.465736 4.158883 4.043051 4.477337
[9] 4.605170 4.700480
As you can see, the output is identical to that of the loop, but this time we have used the vectorization method.
Example
A matrix, whose elements all share the same data type, offers faster column access than a dataframe. For example, consider the following program:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">library(microbenchmark)

# Create a matrix
myMatrix <- matrix(c(1:12), nrow = 4, byrow = TRUE)

# Display
print(myMatrix)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Benchmark column access on the matrix and the dataframe
print(microbenchmark(myMatrix[,1], myDataframe[,1]))
</div>
Output
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Unit: nanoseconds
             expr  min     lq    mean median   uq   max neval
    myMatrix[, 1]  493  525.0  669.64  595.5  661  5038   100
 myDataframe[, 1] 6880 7110.5 8003.56 7247.0 7437 53752   100
In this run, the median column access takes about 0.6 microseconds for the matrix and about 7.2 microseconds for the dataframe, which is roughly twelve times slower.
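The single-type requirement matters when converting a dataframe to a matrix. The sketch below uses made-up column names to show that mixed column types are coerced to one common type, so the matrix representation suits data that is already homogeneous.

```r
# Columns of different types (names and values are illustrative)
mixed <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Converting to a matrix forces a single type for every element,
# so the numeric ids become character strings
mixedMatrix <- as.matrix(mixed)
print(typeof(mixedMatrix))

# A dataframe with only numeric columns converts without losing its type
numericOnly <- data.frame(a = c(1, 2), b = c(3, 4))
numericMatrix <- as.matrix(numericOnly)
print(typeof(numericMatrix))
```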
Parallel Programming for efficient R code
R ships with a parallel package that helps us write efficient R code. Parallelism often gets work done in less time and makes better use of system resources, although for small inputs the overhead of starting worker processes can outweigh the gain. The parApply() function from the parallel package is used with the following steps:
Make a cluster using the makeCluster() function.
Run the parallel computation on the cluster, for example with parApply().
Finally, stop the cluster using the stopCluster() function.
Example
The following source code calculates the mean of all the columns using the parApply() function in R:
<div class="code-mirror language-java" contenteditable="plaintext-only" spellcheck="false" style="outline: none; overflow-wrap: break-word; overflow-y: auto; white-space: pre-wrap;">library(parallel)

# Create columns
data1 <- c(1, 4, 7, 10)
data2 <- c(2, 5, 8, 11)
data3 <- c(3, 6, 9, 12)

# Create a dataframe
myDataframe <- data.frame(data1, data2, data3)

# Create a cluster with two worker processes
cluster <- makeCluster(2)

# Apply parApply() function to compute the mean of each column
print(parApply(cluster, myDataframe, 2, mean))

# Stop the cluster
stopCluster(cluster)
</div>
Output
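data1 data2 data3
  5.5   6.5   7.5
As you can see in the output, the mean of each column has been computed in parallel. For a dataframe this small, the cost of starting the cluster outweighs any gain; parallel execution pays off when each task is expensive or the data is large.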
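A slightly more defensive variant of the same steps is sketched below. It assumes a synthetic matrix and two workers, both chosen for illustration, and uses tryCatch() with a finally clause so that the cluster is stopped even if the computation fails. The parallel result is then checked against the serial apply() result.

```r
library(parallel)

# Illustrative data: a 1000 x 4 matrix of random values
set.seed(7)
myMatrix <- matrix(rnorm(4000), ncol = 4)

# Two workers is an assumption; adjust to the cores available on the machine
cluster <- makeCluster(2)

# The finally clause stops the cluster whether or not parApply() succeeds
parallelMeans <- tryCatch(
  parApply(cluster, myMatrix, 2, mean),
  finally = stopCluster(cluster)
)

# Serial computation of the same column means, for comparison
serialMeans <- apply(myMatrix, 2, mean)

print(all.equal(parallelMeans, serialMeans))
```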
Conclusion
In this article, we briefly discussed how you can write efficient code in R. We covered benchmarking, different vectorization techniques, and parallel programming. I hope this tutorial has helped you expand your knowledge in the field of data science.
