Using containers for reproducible and collaborative computation
Here, I work through how Docker can be used as a way of creating a customised computational platform for reproducibility and collaboration in my research.
I have previously used the Code Ocean site (Clyburne-Sherin, Fei, & Green, 2019) to create and host containers for reproducing analyses from research projects. For example, this capsule relates to the first experiment reported in Peterson, Kersten, & Mannion (2018). These containers allow for an operating system and all the software needed to run an analysis to be bundled together and stored for future execution on a given host system, which largely solves (or at least considerably changes) the problem of bit rot. I have found this to be a great way of establishing a reproducible record of an analysis.
I recently read an excellent tutorial article by Wiebels & Moreau (2021) on Docker containers, which are the technology underlying Code Ocean. This has motivated me to look more into the additional use of containers that they describe—as a way of establishing a common computational platform for research collaboration. In this way, containers can be used during a project’s implementation rather than being primarily used at its end-point.
Here, I will build on the tutorial article by Wiebels & Moreau (2021) and explore the use of Docker in the context of my research. I will begin by describing how I have created a Docker “image” containing an operating system and the software that is used to run a typical analysis in my research. Then, I will explain how this image can be shared with others and used to run analyses. Finally, I will discuss a couple of potential issues that have arisen and consider how I might use containers in the future.
- Creating the image
- Building and sharing the image
- Using the image
- Limitations and potential issues
- Future directions
Creating the image
We create a Docker image by writing a `Dockerfile`, which contains the instructions for building the image.
Specifying the base image
We build our image by adding functionality onto a base image. Here, I will use an image from the Arch Linux Docker project. I have used Arch Linux for over a decade now on a variety of desktops, laptops, and servers and I have found it to be a great distribution—and one that, by now, I know my way around.
FROM archlinux:base-devel
Installing the software
Now we want to install the additional software (beyond that which is installed in the base image) that we are likely to need when running our computations. For me, that mostly consists of Python packages:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
A few things to note:
- We install `xorg-server-xvfb` so that we can use a 'virtual' graphical server, which is necessary for creating figures with `veusz` in headless mode.
- We install `gsfonts` to give us "Nimbus Sans" as a substitute for "Arial".
- It feels a little wrong to be using `pip` to install to the system Python directory, but it doesn't seem worth the hassle to create a virtual environment within the image. Many of the packages are also present in the system repositories, so could also be installed that way rather than via `pip` (see the example below).
- We remove the installation files (via the `pacman -Scc` command and the `--no-cache-dir` flag to `pip`) because they are not required in the image and they just take up space.
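For instance, some of the `pip`-installed packages above could instead be added to the `pacman` invocation in the `RUN` instruction (package names assumed from the current Arch repositories):

pacman -S --noconfirm python-numpy python-scipy python-matplotlib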
Facilitating the use of custom Python packages
We want to be able to use our own custom Python packages while running a container based on the image. How can we make Python aware of the existence of those packages so that they can be easily imported?
The solution I have come up with is to use Python’s site-specific configuration hook.
If Python finds a file called `sitecustomize.py` within its system paths, the file is executed each time Python starts up. We can utilise that behaviour by creating our own `sitecustomize.py` file, which adds any directories within a `code` sub-directory of the home directory to the list of import paths:
import pathlib
import sys

# the directory that will hold custom Python packages within the container
base_code_dir = pathlib.Path("~/code").expanduser()

if base_code_dir.exists():
    # add each directory within ~/code to the Python import path
    code_dirs = [
        str(path.absolute())
        for path in base_code_dir.iterdir()
        if path.is_dir()
    ]
    sys.path.extend(code_dirs)
That means that any Python packages that are present in `~/code` (within the container) will automatically be available for importing.
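As a quick check once the image is built (using the `docker run` and `--volume` options that are described below, and assuming a local code directory at `/home/damien/code`), we can print Python's import path from inside the container and verify that the directories within `~/code` appear:

docker run \
    --rm \
    --volume /home/damien/code:/home/labmember/code \
    djmannion/labenv \
    python -c 'import sys; print(sys.path)'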
We can then copy this `sitecustomize.py` file into the Python system directory in the image:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
COPY sitecustomize.py /usr/lib/python3.9/site-packages/
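Note that the destination path here hardcodes the Python version (3.9, at the time of writing), so it would need updating whenever the Python version in the base image changes. One way of avoiding the hardcoding, sketched below but not used here, is to copy the file to a temporary location and then ask Python itself where its site-packages directory lives:

COPY sitecustomize.py /tmp/
RUN mv /tmp/sitecustomize.py "$(python -c 'import site; print(site.getsitepackages()[0])')"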
Initialising a user
All of the commands so far have been executed by the `root` user. However, when running regular commands it is better to use a user with fewer privileges. Here, we create a user called `labmember` who is part of the `labmembers` group (we will discuss these choices in the Limitations and potential issues section below). We then switch to this user and set the working directory to the user's home directory:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
COPY sitecustomize.py /usr/lib/python3.9/site-packages/
RUN groupadd -g 10001 labmembers \
&& useradd --create-home --groups labmembers --uid 10000 labmember \
&& passwd -l labmember
USER labmember
WORKDIR /home/labmember
Setting up the environment
Finally, we just do a small bit of environment cleanup. I have found that a system library used by `veusz` produces a warning if the directory named by `XDG_RUNTIME_DIR` doesn't exist, so we create it here to minimise any distracting output later on:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
COPY sitecustomize.py /usr/lib/python3.9/site-packages/
RUN groupadd -g 10001 labmembers \
&& useradd --create-home --groups labmembers --uid 10000 labmember \
&& passwd -l labmember
USER labmember
WORKDIR /home/labmember
ENV XDG_RUNTIME_DIR=/tmp/runtime-root/
RUN mkdir -m 0700 -p ${XDG_RUNTIME_DIR}
Building and sharing the image
Now that we have prepared our `Dockerfile`, we can use `docker` to build our image, which we will call `labenv`:
docker build --tag labenv .
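If the build completes without errors, a quick sanity check is to confirm that the image is listed and that Python runs inside it:

docker image ls labenv
docker run --rm labenv python --version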
After having set myself up on Docker Hub, I can then share the image publicly by tagging it with my username and pushing it to the Hub:
docker tag labenv djmannion/labenv
docker push djmannion/labenv
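If collaborators need a fixed snapshot of the environment, rather than whatever the default `latest` tag currently points to, the same approach works with an explicit tag (the tag name here is just an example):

docker tag labenv djmannion/labenv:2021-06
docker push djmannion/labenv:2021-06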
Using the image
Accessing the image
If we have built the image ourselves, we can go ahead and start using it. If not, we don't actually need to build it; we can just retrieve the image from Docker Hub:
docker pull djmannion/labenv
A simple example of command execution
We can execute a command inside an instance of the image (the container) using the `docker run` command. For example (split across multiple lines for clarity):
docker run \
--rm \
djmannion/labenv \
python -c 'print("Hello, world!")'
In the above example, the `--rm` flag tells Docker to remove the container after it has finished executing.
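It can also be useful to poke around inside the container interactively; the `--interactive` and `--tty` flags (abbreviated as `-it`) attach our terminal to a shell running in the container:

docker run \
    --rm \
    -it \
    djmannion/labenv \
    bash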
Running Jupyter Lab in the container
For practical use of the container, I think the Jupyter Lab project provides a great interface. We will launch Jupyter Lab within the container and access it via a web browser on our host system. We can do that via:
docker run \
--rm \
--publish 8888:8888 \
djmannion/labenv \
jupyter-lab --no-browser --ip=0.0.0.0
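When it starts up, Jupyter Lab prints a URL containing an access token to the terminal. Because the `--publish` option maps port 8888 in the container to port 8888 on the host, the interface can then be opened by visiting localhost:8888 in a browser on the host (supplying the printed token).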
Accessing local files within the container
So far, the container has given us a computational environment that we can use to execute code and process data. However, we don't yet have any of our own code or data within the container.
To allow the container to access local files, we can use the `--volume` argument to `docker run`. This will allow us to mount a local directory within the container. For example, a typical project might have directories like `code`, `data`, and `results` within a project directory. If using a Linux host, that project directory might be something like `/home/damien/myproject`. We can make this available to the container via:
docker run \
--rm \
--publish 8888:8888 \
--volume /home/damien/myproject:/home/labmember/myproject \
djmannion/labenv \
jupyter-lab --no-browser --ip=0.0.0.0
The files within `/home/damien/myproject` are then available in the `myproject` directory within the home directory of the container (and hence within the running Jupyter Lab).
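A quick way of confirming that the volume is behaving as expected is to list the directory contents from inside a throwaway container:

docker run \
    --rm \
    --volume /home/damien/myproject:/home/labmember/myproject \
    djmannion/labenv \
    ls myproject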
After that, we’re all set to go!
Limitations and potential issues
While Docker is relatively straightforward to use given the functionality that it affords, a couple of limitations and potential issues have arisen in my usage so far.
Awkwardness with shared volume permissions
As described above, we can use the `--volume` option to mount a local directory within the container. However, this can introduce some challenges on Linux (and probably Mac) systems regarding file permissions. For example, files in the local `myproject` directory might be only readable and/or writeable by the user `damien`. This may cause permission problems when the container user (`labmember`) tries to access and/or modify the files.
The best (though still clunky) solution that I have come up with thus far is to:
- Make all of the files in the `myproject` directory have a group ownership with an ID of `10001` (named `labmembers`) and have read and write permissions for the group.
- Set the setgid bit on the directory so that any new files created within it are owned by this group (e.g., `chmod g+s myproject`).
- Set the default Access Control List on the directory so that new files are created with group read/write permissions, overriding the active `umask` (which would normally not allow new files to be group-writeable); for example, `setfacl -d -m "g::rwx" myproject`. The full sequence of commands is collected below.
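Collected together, the sequence on the host might look something like the following sketch (assuming a Linux host with ACL support; note that `chgrp` accepts a numeric group ID directly, so a matching named group does not need to exist on the host):

# give the project files group ownership 10001 and group read/write access
# (X applies execute permission to directories, and to files that are already executable)
sudo chgrp -R 10001 myproject
chmod -R g+rwX myproject
# make new files and directories inherit the group ownership
chmod g+s myproject
# make new files group readable/writeable by default, overriding the umask
setfacl -d -m "g::rwx" myproject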
That way, the `labmember` user that we created in the container (and which we made part of a `labmembers` group) has appropriate access to the files, and any new files that it creates are appropriately accessible on the local system.
This approach does leave a bit of residual unpleasantness in that any files created within the container will be owned by the `labmember` user, who has a user ID of `10000`. However, that user ID is unlikely to map to a user on the local system, and so any such files will show up on the local system as being owned by `10000`.
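An alternative that I have not explored here would be to sidestep the group machinery by running the container with the host user's IDs, via the `--user` flag to `docker run`. Files created in the mounted volume are then owned by the host user, although the resulting container user has no matching entry in the container's `/etc/passwd` (and hence no home directory), which can cause its own problems:

docker run \
    --rm \
    --user "$(id -u):$(id -g)" \
    djmannion/labenv \
    id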
Docker installation
While making a computational environment able to be run on a local system is much less of a burden when using Docker relative to other alternatives, it is still a non-trivial requirement. Installing Docker involves some heavy-duty components and system processes, particularly on Windows. Furthermore, users may even need to dig into their BIOS to enable virtualisation settings before Docker will run—creating a barrier for less experienced users. This may also present somewhat of a risk and potential liability for those requiring a Docker installation from others (e.g., teachers requiring students to install Docker for a class) if something goes awry with these low-level system settings.
Future directions
Including optimised packages
The image we have prepared above has used standard software packages. However, the efficiency of some software can sometimes be improved by installing alternatives.
For example, a good candidate for optimisation in the current image is to install the Intel Math Kernel Library.
I have found this to offer fairly substantial speedups for packages such as `numpy`, `scipy`, and particularly `pymc3`. However, it would require compiling `numpy` and `scipy` and setting some runtime flags for `pymc3` when building the image, increasing the complexity of the `Dockerfile`.
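As an aside, one way of checking which BLAS/LAPACK implementation a particular `numpy` build is linked against is to ask `numpy` itself:

docker run \
    --rm \
    djmannion/labenv \
    python -c 'import numpy; numpy.show_config()'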
Containers for running experiments
Might these containers also be used for running experiments?
For example, could an image be created that would allow an experiment written in `psychopy` to be executed in a container?
This form of reproducibility—of the process used to collect the data—seems to receive a lot less attention than the reproducibility of an analysis given a dataset.
Unfortunately, it doesn’t seem like this sort of graphical or multimedia output is possible from a Docker container (at least not in a cross-platform way).
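On a Linux host specifically, there is a partial workaround: the host's X11 socket can be shared with the container, along the lines of the sketch below (untested with this image, which would also need `psychopy` itself installed; the host may need to permit the connection via `xhost`, `experiment.py` is a hypothetical script, and the timing precision required for stimulus presentation would need careful validation):

docker run \
    --rm \
    --env DISPLAY=${DISPLAY} \
    --volume /tmp/.X11-unix:/tmp/.X11-unix \
    djmannion/labenv \
    python experiment.py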
Hopefully there may be progress in this direction in the future.
References
- Clyburne-Sherin, A., Fei, X., & Green, S.A. (2019) Computational reproducibility via containers in psychology. Meta-Psychology, 3, 1–9.
- Peterson, L.M., Kersten, D.J., & Mannion, D.J. (2018) Surface curvature from kinetic depth can affect lightness. Journal of Experimental Psychology: Human Perception & Performance, 44(12), 1856–1864.
- Wiebels, K. & Moreau, D. (2021) Leveraging containers for reproducible psychological research. Advances in Methods and Practices in Psychological Science, 4(2), 1–18.