Dockerfile Basics and Best Practices

In this section, you will learn how to create custom Docker images for your team of data scientists.

To build Docker images that contain the tools and dependencies your team needs, you write instructions in a Dockerfile: a text file that lists, in order, all the commands that must be run to build the desired image. If you are not familiar with Dockerfiles, we recommend reading this Docker tutorial.

Before you begin, review the following best practices and warnings for building environments on the Platform with Dockerfiles.

Best Practices

  • If you reference context files in your instructions, use relative paths, not absolute paths.
  • Conda and R don’t play well together. If you intend to have R and Python cross dependencies, avoid using Conda. Instead, install Python, R, and their respective libraries via pip, R and apt-get commands.
  • Make sure that your Python and R interpreters are in your PATH. The version that is in your PATH will be executed. If you have installed multiple versions of the Python interpreter (e.g. Python 2.7, Python 3.6), make sure you activate the right one in your PATH. The same goes for R.
  • We recommend having separate environments for Python 2 and 3.
  • Take advantage of image “inheritance”. Build base images that can be reused by other Base or User environments. Instead of creating one image with a very intricate set of dependencies, break it into smaller images. This makes debugging easier.
  • Review the build logs very carefully. Sometimes installation errors will occur, yet the image build could still be successful.
  • Version lock packages and libraries via ==x.x.x whenever possible. This will maximize reproducibility and consistency.
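Two of the practices above, checking which interpreters are on the PATH and catching installations that failed silently, can be sketched as a single RUN instruction. This is only an illustration: it assumes both python and R are already installed in the parent image, and numpy is a placeholder for a package you expect to be present.

```dockerfile
# Record in the build log which interpreters come first on the PATH.
# `command -v` prints the resolved path and fails if nothing is found.
# The final import makes the build fail if the package is missing
# (numpy is a placeholder, not a requirement of the Platform).
RUN command -v python && python --version && \
    command -v R && R --version && \
    python -c "import numpy"
```

Because the commands are chained with &&, the first one that fails stops the build instead of leaving a warning buried in the log.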

Dockerfile Basics

This section outlines the basics of writing Dockerfiles. If you are already familiar with Dockerfiles, you may skip this section and proceed to the next one.

Below is a description of all Dockerfile instructions currently supported for Base and User environments.

Warning

Each time a Dockerfile instruction is executed, it creates a layer. At present, a maximum of 127 layers is allowed. You can minimize the number of layers by chaining shell commands and specifying dependencies using requirements files (see the following sections).

Dockerfile Supported Instructions

RUN Instruction

The RUN instruction executes shell commands (with /bin/sh -c by default on Linux systems). For example, the following installs the Python package gensim with the Conda package manager:

RUN conda install --yes -n python3 gensim

You can chain shell commands within a single RUN instruction by adding && between commands. To continue the instruction on a new line, end the line with && \. For example:

RUN conda install --yes -n python3 gensim && conda install --yes -n python2 gensim

OR

RUN conda install --yes -n python3 gensim && \
    conda install --yes -n python2 gensim

RUN is most often used with package managers (pip, conda, apt-get), but you can also run other commands, such as wget or curl.
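As a sketch of a typical package-manager RUN instruction, the following installs two system packages with apt-get in a single layer. It assumes a Debian- or Ubuntu-based parent image, and the package names are only examples.

```dockerfile
# Refresh the package index, install, and clean up in one layer.
RUN apt-get update && \
    apt-get install -y --no-install-recommends wget curl && \
    rm -rf /var/lib/apt/lists/*
```

Removing the package index in the same instruction keeps it out of the resulting layer. Images that inherit from this one then need to run apt-get update again before installing anything else.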

Warning

If you are using the Conda package manager, avoid creating a Conda environment. Instead, update the root environment with whatever dependencies you want to install.

For example, following the warning above, prefer this:

RUN conda env update -n root --file environment.yml

over this:

RUN conda env create -f environment.yml

SHELL Instruction

You can change the default shell with the SHELL instruction. The change applies to all subsequent RUN instructions. Simply add the following to your Dockerfile:

SHELL ["/bin/bash", "-c"]

In this example, the Bourne-Again shell (bash) is used instead of the default shell.
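One reason to switch shells is to use bash-only syntax in later RUN instructions. The sketch below relies on brace expansion, which bash supports but a strictly POSIX /bin/sh (for example dash on Debian and Ubuntu) does not. The directory names are hypothetical.

```dockerfile
SHELL ["/bin/bash", "-c"]
# With bash, this creates three directories. Under dash it would create
# a single directory literally named "{data,models,notebooks}".
RUN mkdir -p /opt/project/{data,models,notebooks}
```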

COPY Instruction

The COPY instruction is implicit in the Upload TAR button. After you select a local tarball, it is extracted and all of its files are implicitly copied to the Docker image. You do not need to run COPY manually. All of these supporting files are available for use in your Dockerfile instructions. For example, the following line uses a .yml file from the uploaded tarball:

RUN conda env update -n root --file environment.yml

ADD Instruction

The ADD instruction is similar to the COPY instruction. As with COPY, you do not need to run it manually: all files in the tarball uploaded with the Upload TAR button are copied to the Docker image automatically.

ENV Instruction

The ENV instruction allows you to set environment variables in the Docker image. For example:

ENV my_variable itsvalue
ENV my_variable="itsvalue"
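A variable set with ENV is visible to later instructions in the same Dockerfile. As a small sketch (the variable name and paths are hypothetical):

```dockerfile
# Define a variable once...
ENV PROJECT_HOME=/opt/project
# ...reuse it in a later ENV instruction, where Docker substitutes its value...
ENV PATH=$PROJECT_HOME/bin:$PATH
# ...and in a RUN instruction, where the shell expands it.
RUN mkdir -p "$PROJECT_HOME/bin"
```

The two ENV instructions are kept separate on purpose: a variable is only available for substitution in instructions that come after the one that defines it.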

Instructions Not Allowed

ARG Instruction

The ARG instruction is not allowed.

FROM Instruction

Although the FROM instruction is not allowed, it is generated implicitly when you select a base image to inherit from. Thus, every Dockerfile on the Platform actually starts with a FROM statement. To make it easier to trace the lineage of your environments, you can use comments to indicate the parent Base environment.

For example:

# this is a dockerfile
# FROM Default Base Environment

CMD Instruction

The CMD instruction is not allowed.

ENTRYPOINT Instruction

We do not allow custom ENTRYPOINT instructions at the present time.

EXPOSE Instruction

The EXPOSE instruction is not allowed for either Base or User environments.

VOLUME Instruction

We do not allow VOLUME instructions at the present time.

Context Files

When building environments, you may need to reference other files in your Dockerfile. Commonly, these files will contain lists of dependencies that you want to add to your environment. You can supply these context files in a tarball at build time using the Upload TAR button, which is located below the Enter Dockerfile field. During the environment build process this tarball will be decompressed, thus exposing to Docker the context files it contained.

For example, you may want to specify a set of pip packages and CRAN libraries to install. You could create requirements.txt for the former, and cran.txt and install.R for the latter. These filenames are only examples and can be changed. You then compress these files into a tarball, upload it to the environments build page, and reference the files in your Dockerfile as follows:

# Install Python packages
RUN pip install -r requirements.txt

# Install R packages listed in cran.txt, one Rscript call per line
RUN cat cran.txt | awk '{system("/usr/bin/Rscript ./install.R "$1)}'
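If you are close to the layer limit, the two installation steps can be combined into one RUN instruction. The variant below is a sketch under two assumptions: cran.txt lists one package name per line, and install.R takes the package name as its first command-line argument.

```dockerfile
# Install Python and R dependencies from the context files in one layer.
# xargs calls install.R once per whitespace-separated entry in cran.txt.
RUN pip install -r requirements.txt && \
    xargs -n 1 /usr/bin/Rscript ./install.R < cran.txt
```

One design difference from the awk form: xargs returns a non-zero exit status when any Rscript call fails, so the build stops, provided install.R itself exits with an error when an installation does not succeed.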

Putting It All Together

This section contains a few examples of Dockerfiles you can use in your workflow. The first example starts with a base image that installs the Conda package manager. This base image should be built from the default base image environment.

Example 1: Building a Conda Python 2.7 Environment with ML and Stats Dependencies

A Conda Base Dockerfile:

RUN wget --quiet https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
/bin/bash Miniconda3-latest-Linux-x86_64.sh -f -b -p /opt/conda && \
rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH /opt/conda/bin:$PATH

Note that Conda is installed in /opt/conda, and its bin directory, /opt/conda/bin, is added to the PATH on line 4. Also note that this installs Miniconda3, which ships with Python 3 (Python 3.6 when this was written; the -latest installer tracks newer releases). To install Python 2.7, see the user environment below. Alternatively, you can install Miniconda2.

Example User Environment Dockerfile:

This is an example of a user environment Dockerfile where a user would select the Conda base environment, activate the Python 2.7 kernel, and install a series of packages with both Conda and pip package managers.

RUN conda install --yes python=2.7 && conda install --yes numpy && \
    conda install --yes pandas && conda install --yes scipy && pip install scikit-learn

Note that in this particular example, we don’t specify version numbers for the libraries listed above. As a result, the most recent versions will be installed. Specifying version numbers is generally best practice.
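A version-pinned variant of the user environment could look like the sketch below. The version numbers are illustrative only: they are releases that still supported Python 2.7, but you should confirm that they are available in your Conda channels and that they work together before relying on them.

```dockerfile
# Versions are examples; choose ones you have tested together.
RUN conda install --yes python=2.7 numpy=1.16 pandas=0.24 scipy=1.2 && \
    pip install scikit-learn==0.20.4
```

Passing all Conda packages to a single conda install call also lets Conda resolve their dependencies together rather than one package at a time.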

The diagram below shows the inheritance structure of the different Docker images for this example:

[Figure: inheritance structure of the Docker images for this example]

In the User environment, do not forget to select the tools you want.

Example 2: Installing R Dependencies (rJava)

In this example, you create a base image that contains all the dependencies needed to install rJava. rJava itself is installed by the last command of this Dockerfile.

A Base Dockerfile for rJava Dependencies:

RUN apt-get update && \
apt-get -y install default-jre && \
apt-get -y install default-jdk && \
apt-get -y install r-base && \
apt-get -y install r-base-dev

RUN R CMD javareconf
RUN apt-get -y install r-cran-rjava
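Because an installation error does not always fail the build, it can help to end the Dockerfile with a quick check that the package actually loads. A minimal sketch, assuming Rscript is on the PATH after the r-base installation above:

```dockerfile
# Try to load rJava; Rscript exits with a non-zero status if the package
# is missing or cannot be loaded, which makes the build fail.
RUN Rscript -e 'library(rJava)'
```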