Using containers for reproducible and collaborative computation
Here, I work through how Docker can be used as a way of creating a customised computational platform for reproducibility and collaboration in my research.
I have previously used the Code Ocean site (Clyburne-Sherin, Fei, & Green, 2019) to create and host containers for reproducing analyses from research projects. For example, this capsule relates to the first experiment reported in Peterson, Kersten, & Mannion (2018). These containers allow for an operating system and all the software needed to run an analysis to be bundled together and stored for future execution on a given host system, which largely solves (or at least considerably changes) the problem of bit rot. I have found this to be a great way of establishing a reproducible record of an analysis.
I recently read an excellent tutorial article by Wiebels & Moreau (2021) on Docker containers, which are the technology underlying Code Ocean. This has motivated me to look more into the additional use of containers that they describe—as a way of establishing a common computational platform for research collaboration. In this way, containers can be used during a project’s implementation rather than being primarily used at its end-point.
Here, I will build on the tutorial article by Wiebels & Moreau (2021) and explore the use of Docker in the context of my research. I will begin by describing how I have created a Docker “image” containing an operating system and the software that is used to run a typical analysis in my research. Then, I will explain how this image can be shared with others and used to run analyses. Finally, I will discuss a couple of potential issues that have arisen and consider how I might use containers in the future.
- Creating the image
- Building and sharing the image
- Using the image
- Limitations and potential issues
- Future directions
Creating the image
We create a Docker image by writing a `Dockerfile`, which contains the instructions for building the image.
Specifying the base image
We build our image by adding functionality onto a base image. Here, I will use an image from the Arch Linux Docker project. I have used Arch Linux for over a decade now on a variety of desktops, laptops, and servers and I have found it to be a great distribution—and one that, by now, I know my way around.
FROM archlinux:base-devel
Installing the software
Now we want to install the additional software (beyond that which is installed in the base image) that we are likely to need when running our computations. For me, that mostly consists of Python packages:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
A few things to note:
- We install `xorg-server-xvfb` so that we can use a 'virtual' graphical server, which is necessary for creating figures with `veusz` in headless mode.
- We install `gsfonts` to give us "Nimbus Sans" as a substitute for "Arial".
- It feels a little wrong to be using `pip` to install to the system Python directory, but it doesn't seem worth the hassle to create a virtual environment within the image. Many of the packages are also present in the system repositories, so could also be installed that way rather than via `pip` (see the example below).
- We remove the installation files (via the `pacman -Scc` command and the `--no-cache-dir` flag to `pip`) because they are not required in the image and they just take up space.
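For instance, some of the `pip`-installed packages above could instead be added to the `pacman` invocation in the `RUN` instruction (package names assumed from the current Arch repositories):

pacman -S --noconfirm python-numpy python-scipy python-matplotlib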
Facilitating the use of custom Python packages
We want to be able to use our own custom Python packages while running a container based on the image. How can we make Python aware of the existence of those packages so that they can be easily imported?
The solution I have come up with is to use Python’s site-specific configuration hook.
If Python finds a file called `sitecustomize.py` within its system paths, the file is executed each time Python starts up. We can utilise that behaviour by creating our own `sitecustomize.py` file, which adds any directories within a `code` sub-directory of the home directory to the list of import paths:
import pathlib
import sys

# the directory that will hold custom Python packages within the container
base_code_dir = pathlib.Path("~/code").expanduser()

if base_code_dir.exists():
    # add each directory within ~/code to the Python import path
    code_dirs = [
        str(path.absolute())
        for path in base_code_dir.iterdir()
        if path.is_dir()
    ]
    sys.path.extend(code_dirs)
That means that any Python packages that are present in `~/code` (within the container) will automatically be available for importing.
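As a quick check once the image is built (using the `docker run` and `--volume` options that are described below, and assuming a local code directory at `/home/damien/code`), we can print Python's import path from inside the container and verify that the directories within `~/code` appear:

docker run \
    --rm \
    --volume /home/damien/code:/home/labmember/code \
    djmannion/labenv \
    python -c 'import sys; print(sys.path)'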
We can then copy this `sitecustomize.py` file into the Python system directory in the image:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
COPY sitecustomize.py /usr/lib/python3.9/site-packages/
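Note that the destination path here hardcodes the Python version (3.9, at the time of writing), so it would need updating whenever the Python version in the base image changes. One way of avoiding the hardcoding, sketched below but not used here, is to copy the file to a temporary location and then ask Python itself where its site-packages directory lives:

COPY sitecustomize.py /tmp/
RUN mv /tmp/sitecustomize.py "$(python -c 'import site; print(site.getsitepackages()[0])')"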
Initialising a user
All of the commands so far have been executed by the `root` user. However, when running regular commands it is better to use a user with fewer privileges. Here, we create a user called `labmember` who is part of the `labmembers` group (we will discuss these choices in the Limitations and potential issues section below). We then switch to this user and set the working directory to the user's home directory:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
COPY sitecustomize.py /usr/lib/python3.9/site-packages/
RUN groupadd -g 10001 labmembers \
&& useradd --create-home --groups labmembers --uid 10000 labmember \
&& passwd -l labmember
USER labmember
WORKDIR /home/labmember
Setting up the environment
Finally, we just do a small bit of environment cleanup. I have found that a system library used by `veusz` produces a warning if the directory named by `XDG_RUNTIME_DIR` doesn't exist, so we create it here to minimise any distracting output later on:
FROM archlinux:base-devel
RUN pacman -Syu --noconfirm \
&& pacman -S --noconfirm \
python \
python-pip \
python-wheel \
python-pyqt5 \
sip \
qt5-svg \
xorg-server-xvfb \
gsfonts \
vim \
&& yes | pacman -Scc
RUN pip --no-cache-dir install numpy scipy ipython matplotlib pymc3 statsmodels scikit-learn python-dateutil scikit-image jupyter jupyterlab librosa soundfile resampy seaborn black pylint user-agents tabulate openpyxl factor-analyzer \
&& pip --no-cache-dir install veusz
COPY sitecustomize.py /usr/lib/python3.9/site-packages/
RUN groupadd -g 10001 labmembers \
&& useradd --create-home --groups labmembers --uid 10000 labmember \
&& passwd -l labmember
USER labmember
WORKDIR /home/labmember
ENV XDG_RUNTIME_DIR=/tmp/runtime-root/
RUN mkdir -m 0700 -p ${XDG_RUNTIME_DIR}
Building and sharing the image
Now that we have prepared our `Dockerfile`, we can use `docker` to build our image, which we will call `labenv`:
docker build --tag labenv .
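If the build completes without errors, a quick sanity check is to confirm that the image is listed and that Python runs inside it:

docker image ls labenv
docker run --rm labenv python --version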
After having set myself up on Docker Hub, I can then share the image publicly by tagging it with my username and pushing it to the Hub:
docker tag labenv djmannion/labenv
docker push djmannion/labenv
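If collaborators need a fixed snapshot of the environment, rather than whatever the default `latest` tag currently points to, the same approach works with an explicit tag (the tag name here is just an example):

docker tag labenv djmannion/labenv:2021-06
docker push djmannion/labenv:2021-06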
Using the image
Accessing the image
If we have built the image ourselves, we can go ahead and start using it. If not, we don't actually need to build it; we can just retrieve the image from Docker Hub:
docker pull djmannion/labenv
A simple example of command execution
We can execute a command inside an instance of the image (the container) using the `docker run` command. For example (split across multiple lines for clarity):
docker run \
--rm \
djmannion/labenv \
python -c 'print("Hello, world!")'
In the above example, the `--rm` flag tells Docker to remove the container after it has finished executing.
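It can also be useful to poke around inside the container interactively; the `--interactive` and `--tty` flags (abbreviated as `-it`) attach our terminal to a shell running in the container:

docker run \
    --rm \
    -it \
    djmannion/labenv \
    bash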
Running Jupyter Lab in the container
For practical use of the container, I think the Jupyter Lab project provides a great interface. We will launch Jupyter Lab within the container and access it via a web browser on our host system. We can do that via:
docker run \
--rm \
--publish 8888:8888 \
djmannion/labenv \
jupyter-lab --no-browser --ip=0.0.0.0
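When it starts up, Jupyter Lab prints a URL containing an access token to the terminal. Because the `--publish` option maps port 8888 in the container to port 8888 on the host, the interface can then be opened by visiting localhost:8888 in a browser on the host (supplying the printed token).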
Accessing local files within the container
So far, the container has given us a computational environment that we can use to execute code and process data. However, we don't yet have any of our own code or data within the container.
To allow the container to access local files, we can use the `--volume` argument to `docker run`. This will allow us to mount a local directory within the container. For example, a typical project might have directories like `code`, `data`, and `results` within a project directory. If using a Linux host, that project directory might be something like `/home/damien/myproject`. We can make this available to the container via:
docker run \
--rm \
--publish 8888:8888 \
--volume /home/damien/myproject:/home/labmember/myproject \
djmannion/labenv \
jupyter-lab --no-browser --ip=0.0.0.0
The files within `/home/damien/myproject` are then available in the `myproject` directory within the home directory of the container (and hence within the running Jupyter Lab).
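A quick way of confirming that the volume is behaving as expected is to list the directory contents from inside a throwaway container:

docker run \
    --rm \
    --volume /home/damien/myproject:/home/labmember/myproject \
    djmannion/labenv \
    ls myproject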
After that, we’re all set to go!
Limitations and potential issues
While Docker is relatively straightforward to use given the functionality that it affords, a couple of limitations and potential issues have arisen in my usage so far.
Awkwardness with shared volume permissions
As described above, we can use the `--volume` option to mount a local directory within the container. However, this can introduce some challenges on Linux (and probably Mac) systems regarding file permissions. For example, files in the local `myproject` directory might be only readable and/or writeable by the user `damien`. This may cause permission problems when the container user (`labmember`) tries to access and/or modify the files.
The best (though still clunky) solution that I have come up with thus far is to:
- Make all of the files in the `myproject` directory have a group ownership with an ID of `10001` (named `labmembers`) and have read and write permissions for the group.
- Set the setgid bit on the directory so that any new files created within it are owned by this group (e.g., `chmod g+s myproject`).
- Set the default Access Control List on the directory so that new files are created with group read/write permissions, overriding the active `umask` (which would normally not allow new files to be group-writeable); for example, `setfacl -d -m "g::rwx" myproject`. The full sequence of commands is collected below.
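Collected together, the sequence on the host might look something like the following sketch (assuming a Linux host with ACL support; note that `chgrp` accepts a numeric group ID directly, so a matching named group does not need to exist on the host):

# give the project files group ownership 10001 and group read/write access
# (X applies execute permission to directories, and to files that are already executable)
sudo chgrp -R 10001 myproject
chmod -R g+rwX myproject
# make new files and directories inherit the group ownership
chmod g+s myproject
# make new files group readable/writeable by default, overriding the umask
setfacl -d -m "g::rwx" myproject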
That way, the `labmember` user that we created in the container (and which we made part of a `labmembers` group) has appropriate access to the files, and any new files that it creates are appropriately accessible on the local system.
This approach does leave a bit of residual unpleasantness in that any files created within the container will be owned by the `labmember` user, who has a user ID of `10000`. However, that user ID is unlikely to map to a user on the local system, and so any such files will show up on the local system as being owned by `10000`.
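An alternative that I have not explored here would be to sidestep the group machinery by running the container with the host user's IDs, via the `--user` flag to `docker run`. Files created in the mounted volume are then owned by the host user, although the resulting container user has no matching entry in the container's `/etc/passwd` (and hence no home directory), which can cause its own problems:

docker run \
    --rm \
    --user "$(id -u):$(id -g)" \
    djmannion/labenv \
    id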
Docker installation
While making a computational environment able to be run on a local system is much less of a burden when using Docker relative to other alternatives, it is still a non-trivial requirement. Installing Docker involves some heavy-duty components and system processes, particularly on Windows. Furthermore, users may even need to dig into their BIOS to enable virtualisation settings before Docker will run—creating a barrier for less experienced users. This may also present somewhat of a risk and potential liability for those requiring a Docker installation from others (e.g., teachers requiring students to install Docker for a class) if something goes awry with these low-level system settings.
Future directions
Including optimised packages
The image we have prepared above has used standard software packages. However, the efficiency of some software can sometimes be improved by installing alternatives.
For example, a good candidate for optimisation in the current image is to install the Intel Math Kernel Library.
I have found this to offer fairly substantial speedups for packages such as `numpy`, `scipy`, and particularly `pymc3`. However, it would require compiling `numpy` and `scipy` and setting some runtime flags for `pymc3` when building the image, increasing the complexity of the `Dockerfile`.
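As an aside, one way of checking which BLAS/LAPACK implementation a particular `numpy` build is linked against is to ask `numpy` itself:

docker run \
    --rm \
    djmannion/labenv \
    python -c 'import numpy; numpy.show_config()'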
Containers for running experiments
Might these containers also be used for running experiments?
For example, could an image be created that would allow an experiment written in `psychopy` to be executed in a container?
This form of reproducibility—of the process used to collect the data—seems to receive a lot less attention than the reproducibility of an analysis given a dataset.
Unfortunately, it doesn’t seem like this sort of graphical or multimedia output is possible from a Docker container (at least not in a cross-platform way).
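On a Linux host specifically, there is a partial workaround: the host's X11 socket can be shared with the container, along the lines of the sketch below (untested with this image, which would also need `psychopy` itself installed; the host may need to permit the connection via `xhost`, `experiment.py` is a hypothetical script, and the timing precision required for stimulus presentation would need careful validation):

docker run \
    --rm \
    --env DISPLAY=${DISPLAY} \
    --volume /tmp/.X11-unix:/tmp/.X11-unix \
    djmannion/labenv \
    python experiment.py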
Hopefully there may be progress in this direction in the future.
References
- Clyburne-Sherin, A., Fei, X., & Green, S.A. (2019) Computational reproducibility via containers in psychology. Meta-Psychology, 3, 1–9.
- Peterson, L.M., Kersten, D.J., & Mannion, D.J. (2018) Surface curvature from kinetic depth can affect lightness. Journal of Experimental Psychology: Human Perception & Performance, 44(12), 1856–1864.
- Wiebels, K. & Moreau, D. (2021) Leveraging containers for reproducible psychological research. Advances in Methods and Practices in Psychological Science, 4(2), 1–18.