Parallel Computing with Dask


Dask is a flexible, open-source Python library for parallel computing. In this article, we will learn what parallel computing is and why Dask is a good choice for it.

We will compare it with other libraries such as Spark, Ray, and Modin, and discuss typical use cases of Dask.

Parallel Computing

Parallel computing is a type of computation in which several computations or processes are carried out simultaneously. Large problems are typically divided into smaller pieces that can be solved independently.

The four categories of parallel computing are −

  • Bit-level

  • Instruction-level

  • Data-level

  • Task-level parallelism

Although parallelism has been used in high-performance computing for a long time, it has become more widespread only recently because of the physical limits on frequency scaling.

Need of Dask

A question that comes to mind is: why do we even need Dask?

Python libraries such as NumPy, scikit-learn, Seaborn, and others make data manipulation and machine learning tasks simple. For the majority of data analysis tasks, the Pandas module is enough: data can be manipulated in many different ways, and machine learning models can be built on top of it.

Pandas, however, becomes insufficient once your data grows larger than the available RAM, which is a very common problem. You could use Spark or Hadoop to get around this, but these are not native Python environments, so you lose easy access to NumPy, Pandas, scikit-learn, TensorFlow, and other well-known Python machine learning tools. Is there a way around this? Yes! This is where Dask comes into play.

Introduction of Dask

Dask is a framework for parallel computation that integrates seamlessly with Jupyter Notebook. It was initially created to scale NumPy, Pandas, and scikit-learn beyond the memory limits of a single machine by providing parallel analogues of their interfaces, but it soon grew into a general-purpose distributed computing system.

Dask has two primary strengths −

Scalability

Dask natively scales the familiar Python APIs of Pandas, NumPy, and Scikit-Learn and runs resiliently on clusters with many cores. It can also be scaled down to run on a single machine.

Scheduling

Like Airflow and Luigi, Dask's task schedulers are optimised for computation. They offer quick feedback, manage work through task graphs, and support both local and distributed diagnostics, making them dynamic and responsive.

Additionally, Dask offers a real-time, dynamic dashboard that updates every 100 milliseconds and displays various information like progress, memory utilisation, etc.
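To see the dashboard in action, here is a minimal sketch, assuming the distributed scheduler (installed, for example, via pip install "dask[distributed]") is available −

from dask.distributed import Client

# start a local cluster of worker processes on this machine
client = Client()

# URL of the live diagnostic dashboard (http://localhost:8787 by default)
print(client.dashboard_link)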

Depending on your preference, you can either clone the Git repository or install Dask with Conda or pip.

conda install dask

To install only the core components −

conda install dask-core

dask-core is a minimal version of Dask that installs only the essential components. The same applies to pip: if all you need is to scale up Pandas or NumPy with Dask DataFrames and Dask Arrays, you can install just those components.

python -m pip install dask

To install the requirements for Dask DataFrame −

python -m pip install "dask[dataframe]" #

To install the requirements for Dask Array −

python -m pip install "dask[list]"

Let's look at an example of using this library for parallel computation. Our code uses dask.delayed to achieve parallelism.

Note − The following code snippets should be run in separate cells of a Jupyter Notebook.

import time
import random

# each function sleeps for a random fraction of a second to simulate work
def calcprofit(a, b):
   time.sleep(random.random())
   return a + b

def calcloss(a, b):
   time.sleep(random.random())
   return a - b

def calctotal(a, b):
   time.sleep(random.random())
   return a + b

Now run the below code snippet −

%%time
profit = calcprofit(10, 22)
loss = calcloss(18, 3)
total = calctotal(profit, loss)
print(total)

Output

47
CPU times: user 4.13 ms, sys: 1.23 ms, total: 5.36 ms
Wall time: 1.35 s

Although calcprofit and calcloss are independent of one another, these calls run one after the other, in sequential order. We can therefore execute them concurrently to save time.

import dask

calcprofit = dask.delayed(calcprofit)
calcloss = dask.delayed(calcloss)
calctotal = dask.delayed(calctotal)

Now run the below code snippet −

%%time
profit = calcprofit(10, 22)
loss = calcloss(18, 3)
total = calctotal(profit, loss)
print(total)

Output

Delayed('calctotal-9e3e896e-b4de-400c-aeb8-9e4c0961fe11')
CPU times: user 3.3 ms, sys: 0 ns, total: 3.3 ms
Wall time: 10.2 ms

The wall time dropped because the delayed functions have not actually run yet − Dask has only built a task graph describing the work. We can visualise that task graph as follows −

total.visualize(rankdir='LR')
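To execute the graph and obtain the actual result (47), call compute() on the delayed object. Because calcprofit and calcloss are independent, Dask can run them in parallel, so the wall time is roughly the longer of their two sleeps plus the final calctotal step, rather than the sum of all three −

%%time
print(total.compute())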

Spark vs. Dask

Spark is a robust cluster-computing framework that divides data and processing into manageable pieces, distributes them over a cluster of any size, and executes them concurrently.

Although Spark is the de facto standard technology for Big Data analysis, Dask looks quite promising. Dask is lightweight and developed purely as a Python component, whereas Spark is a larger system, primarily developed in Scala, with additional Python/R support. If you want a mature, all-in-one solution or already have JVM infrastructure, Spark can be your first pick. However, Dask is a strong option if you want quick, lightweight parallel processing − it is at your disposal after a quick pip install.

Dask, Ray, and Modin

Ray and Dask have different scheduling strategies. Dask uses a central scheduler that manages all tasks for the cluster. Ray is decentralised: each machine has its own scheduler, so scheduling decisions can be made at the level of the individual machine rather than the entire cluster. Ray lacks the rich high-level collection APIs that Dask offers (such as dataframes, distributed arrays, etc.).

Modin, on the other hand, runs on top of Dask or Ray. With a single line of code, import modin.pandas as pd, we can quickly scale our existing Pandas workflow with Modin. While Modin tries to parallelize as much of the Pandas API as feasible, Dask DataFrame deliberately does not cover the full Pandas API.
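As a quick illustration of Modin's drop-in approach, here is a sketch assuming Modin is installed (e.g. with pip install "modin[dask]") and that sales.csv and the region column are placeholder names −

import modin.pandas as pd   # the only change from `import pandas as pd`

# the familiar Pandas API, now parallelised under the hood
df = pd.read_csv("sales.csv")          # placeholder file name
print(df.groupby("region").sum())      # placeholder column name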

Examples of Dask Use Cases

Dask use cases fall into two categories −

  • We can optimise our computations with the use of dynamic task scheduling.

  • Large datasets may be handled using "Big Data" collections, such as parallel arrays and dataframes.

Dask collections are used to build the task graph − a representation of how our data-processing jobs are organised. Dask schedulers then execute this task graph, carrying out the work with parallel programming. "Parallel programming" here simply means running multiple tasks at the same time, which lets us use our resources efficiently and complete several tasks at once.

Let's look at a few of the collections that Dask provides; a short sketch of each follows the list below.

  • Dask.array − Using the NumPy interface, dask.array divides a large array into smaller chunks, enabling us to perform computations on arrays that are larger than the system's memory.

  • Dask.bag − It offers operations such as filter, map, groupby, and fold on collections of standard Python objects.

  • Dask.dataframe − A distributed dataframe that resembles Pandas; it is one large parallel dataframe constructed from many smaller Pandas dataframes.
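Here is a minimal sketch of the three collections in use (the "data-*.csv" pattern is a placeholder for your own files) −

import dask.array as da
import dask.bag as db
import dask.dataframe as dd

# dask.array: a 10000 x 10000 array split into 1000 x 1000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())

# dask.bag: map/filter over a collection of plain Python objects
b = db.from_sequence(range(10), npartitions=2)
print(b.filter(lambda n: n % 2 == 0).map(lambda n: n * n).compute())

# dask.dataframe: one large parallel dataframe built from many Pandas partitions
df = dd.read_csv("data-*.csv")     # placeholder file pattern
print(df.head())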

Conclusion

In this article, we learned about Dask and parallel computing. We hope it helped you enhance your knowledge of Dask, why it is needed, and how it compares with other libraries.
