Machine learning and artificial intelligence are without doubt the buzzwords of the technology world these days. Organizations are adopting machine learning techniques faster than ever to analyse and improve their businesses and to increase customer satisfaction. Containerization, meanwhile, has made life a lot easier for many developers by helping them maintain, reuse and keep track of their projects more smoothly.
In this article, we will talk about how to build a full-fledged data science container. We will create a containerized environment with all the essential libraries and packages installed, which will help you get started with your machine learning and data science journey. We will see how to install the core packages used in most machine learning code from a Dockerfile to build our image. Without any further ado, let’s see how to do it.
To begin with, let’s look at the Python packages we are going to install, which you would need to build a basic machine learning Python script.
Pandas − It helps with data analysis by letting us load and manage our datasets. It also provides tools to reshape and organize your data so that you can extract important insights from it.
NumPy − It provides methods for computation on large matrices and multi-dimensional arrays. Its numerous precompiled routines make these computations much faster than equivalent loops written in plain Python.
Matplotlib − It helps you draw insights from your data frames by plotting features against one another, and lets you examine your data points carefully so that you can decide which machine learning model suits your problem statement.
Scikit-learn − This package is at the core of most machine learning Python scripts. It contains numerous machine learning algorithms, both supervised and unsupervised. It also helps you split your dataset, clean it by handling missing or invalid values, and much more.
SciPy − This library helps you perform scientific computations on your dataset easily. It provides methods for advanced operations such as working with probability distributions, transformations, etc.
NLTK − If you work in the NLP domain, you have surely heard of this library. It helps you perform stemming, lemmatization, POS tagging, semantic analysis, etc.
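As a quick way to check whether these libraries are present in a given Python environment, a small script along the following lines could be used. This is only a sketch: the names `CORE_MODULES` and `find_missing` are made up for illustration, and the script relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

# Import names of the libraries discussed above. Note that scikit-learn
# is imported as "sklearn", which differs from its package name on PyPI.
CORE_MODULES = ["pandas", "numpy", "matplotlib", "sklearn", "scipy", "nltk"]


def find_missing(modules):
    """Return the subset of `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(CORE_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All core modules are available.")
```

Running this inside the finished container should report that all core modules are available. Running it on a fresh machine is a quick way to see which ones are still missing.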
Now that we have seen the core libraries used for machine learning and data science, we will build our Dockerfile to install these packages.
We will use Alpine as the base image on which to set up our Python environment because the image is very small.
Check out the Dockerfile below.
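FROM alpine:latest
WORKDIR /usr/src/app
# Alpine's package manager is apk (apt-get is not available); install Python and pip
RUN apk add --no-cache python3 py3-pip
# Create a virtual environment for pip to install into and put it first on PATH
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
# Copy the requirements file into the image before installing from it
COPY requirements.txt .
RUN pip3 install -r requirements.txt
RUN [ "python3", "-c", "import nltk; nltk.download('all')" ]
CMD ["python3"]
The Dockerfile above pulls the Alpine image, sets the working directory, installs Python and pip with apk (Alpine's package manager), creates a virtual environment for pip to install into, copies the requirements file into the image, installs the libraries listed in it and downloads all the NLTK data packages (corpora and models) for use. Note that Alpine uses musl libc, so any library that does not publish a compatible prebuilt wheel is compiled from source during the build, which may require additional build tools such as a compiler installed with apk.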
The requirements file should contain the following content −
pandas
numpy
matplotlib
scikit-learn
scipy
nltk
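Because the requirements file above lists no versions, pip installs whatever the latest releases are at build time, so two builds made months apart can end up with different library versions. If reproducible builds matter to you, one option is to pin versions. The sketch below, which uses only the standard library's `importlib.metadata`, prints `name==version` lines for whichever of these packages are installed in the current environment; the names `REQUIREMENTS` and `pinned_lines` are illustrative, not part of any tool.

```python
from importlib import metadata

# Distribution names exactly as they appear in requirements.txt.
REQUIREMENTS = ["pandas", "numpy", "matplotlib", "scikit-learn", "scipy", "nltk"]


def pinned_lines(names):
    """Return 'name==version' for each installed distribution, skipping others."""
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment, so there is nothing to pin.
            continue
    return lines


if __name__ == "__main__":
    print("\n".join(pinned_lines(REQUIREMENTS)))
```

The printed lines can be pasted into requirements.txt in place of the bare package names once you are happy with a working set of versions.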
After you have created both the Dockerfile and the requirements.txt file, you can build your Docker image by running the docker build command from the directory that contains them.
sudo docker build -t <username>/<image-name>:<tag-name> .
The -t flag is used to tag the image. Tagging is not mandatory, but it is always advisable. The trailing dot sets the build context to the current directory.
After you have successfully built the image, you can run the docker container using the following docker run command.
sudo docker run -it <username>/<image-name>:<tag-name>
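If you would rather script the build and run steps than type them by hand, the two commands can be assembled in Python. The following is a minimal sketch under a few assumptions: the helper names are made up for illustration, and "alice", "ds-env" and "v1" are placeholder values, not names taken from this article. The code only builds the argument lists and does not invoke Docker itself.

```python
import shlex


def image_reference(username, image_name, tag="latest"):
    """Combine the parts into the <username>/<image-name>:<tag-name> form."""
    return f"{username}/{image_name}:{tag}"


def build_command(reference, context="."):
    """Argument list for `docker build -t <reference> <context>`."""
    return ["docker", "build", "-t", reference, context]


def run_command(reference):
    """Argument list for `docker run -it <reference>`."""
    return ["docker", "run", "-it", reference]


if __name__ == "__main__":
    # Placeholder values used purely for illustration.
    ref = image_reference("alice", "ds-env", "v1")
    print(shlex.join(build_command(ref)))
    print(shlex.join(run_command(ref)))
```

The returned lists can be handed to `subprocess.run` to execute the commands. Prepend "sudo" to each list if your Docker setup requires it, as in the commands shown above.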
To conclude, in this article, we have seen how to build a Docker container which contains all the basic Python libraries and packages that you would need to start your journey in machine learning and data science. You can always install additional libraries as and when required by launching a shell in the container (Alpine provides sh rather than bash by default) and running the appropriate command.