Setting up a TensorFlow dev env with Docker & NVIDIA Container Toolkit

Introduction

The first two sections are from Get Docker Engine - Community for Ubuntu, Post-installation steps for Linux, NVIDIA Container Toolkit, and the validation steps at TensorFlow Docker, so there's not really anything new in them. However, the subsequent five sections contain notes about using TensorFlow with GPU support in a Docker container interactively, building a Docker image, running an IDE within the container, running Jupyter Notebooks from the container, and moving the Docker data directory, which might be of more interest.

Install Docker

sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io

Test with:

sudo docker run hello-world

Add your user to the docker group so you don't have to run commands with sudo:

sudo usermod -aG docker $USER
newgrp docker

Verify with:

docker run hello-world

Install NVIDIA Container Toolkit and TensorFlow

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Test with:

docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
docker run --gpus all --rm nvidia/cuda nvidia-smi
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Note: Documentation says to run:

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

But with TensorFlow 2 this raises AttributeError: module 'tensorflow' has no attribute 'enable_eager_execution' (eager execution is on by default in TF2, and tf.random_normal has moved to tf.random.normal), so use the command above instead.

To confirm which version of TensorFlow is installed:

python -c 'import tensorflow as tf; print(tf.__version__)'

Using TensorFlow and Docker

For dev and testing purposes, you can simply start a container with a bind mount to a working directory, so that it can read the git-managed Python source files and save the trained model somewhere that will persist once the container is stopped.

Dev sessions are started via:

docker run --gpus all -it -u 1000:1000 -p 8888:8888 --mount type=bind,src=/home/<username>/projects,dst=/home/projects --env HOME=/home/projects -w /home/projects tensorflow/tensorflow:latest-gpu-py3 bash

Where:

  • 1000 and 1000 are the user and group IDs I want to run as (so files written by the container have the correct permissions outside the container)
  • /home/<username>/projects is the path on the host and /home/projects the path inside the container (so I have access to the local git repo, noting that git isn't installed in this container, so git pull/push etc. have to be performed outside it)
  • the $HOME environment variable is set to /home/projects rather than the default / (this is for the Visual Studio Code Remote - Containers extension)
  • the working directory is set to /home/projects
  • the TensorFlow image with GPU support and Python 3 (tensorflow/tensorflow:latest-gpu-py3) is used
  • a bash shell is started
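
That run command is long enough to be worth wrapping in a small shell function. A sketch — tfdev_cmd is my own name, not anything from the Docker or TensorFlow docs, and the defaults mirror the flags just described; it prints the command so you can inspect it before running it with eval:

```shell
# Build the docker run invocation above as a string, picking up the current
# user and group IDs so container-written files keep the right ownership.
tfdev_cmd() {
  projects_dir=$1
  image=${2:-tensorflow/tensorflow:latest-gpu-py3}
  printf 'docker run --gpus all -it -u %s:%s -p 8888:8888 --mount type=bind,src=%s,dst=/home/projects --env HOME=/home/projects -w /home/projects %s bash' \
    "$(id -u)" "$(id -g)" "$projects_dir" "$image"
}

# Inspect the command, then start a session with: eval "$(tfdev_cmd "$HOME/projects")"
echo "$(tfdev_cmd "$HOME/projects")"
```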

Building an image

If you want to install additional packages, it’ll be easier to build an image. In my case I want to install Jupyter Notebook, so I’ll create a basic Dockerfile like:

FROM tensorflow/tensorflow:latest-gpu-py3
WORKDIR /home/projects
RUN pip install notebook

Build with:

docker build --tag tfdev .

And start subsequent sessions with a simplified version of the original command, i.e.:

docker run --gpus all -it -u 1000:1000 -p 8888:8888 --mount type=bind,src=/home/<username>/projects,dst=/home/projects --env HOME=/home/projects tfdev bash

Of course this is still running it interactively, as a user that can directly read and write the source. If you want to build a completely self-contained image which you can move from dev through to production, you’ll need additional steps such as copying in the source code.
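
One natural extension, given the note above that git isn't installed in the base image, is to bake it in via the same Dockerfile (the package choice here is my own addition, not something from the TensorFlow image docs):

```dockerfile
FROM tensorflow/tensorflow:latest-gpu-py3
WORKDIR /home/projects
# git, so pull/push etc. can also be done inside the container
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
RUN pip install notebook
```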

Running Visual Studio Code from the TensorFlow Docker image

This is so that code completion etc. works in the IDE. I use Visual Studio Code, but there are probably similar approaches for other IDEs. Visual Studio Code uses the Remote - Containers extension, which in turn needs Docker Compose.

To install Docker Compose:

sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
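
Since the extension drives containers through Compose, it can help to capture the earlier docker run flags in a compose file. A minimal sketch (the service name and paths are my choices; note that docker-compose at this version has no equivalent of --gpus, so this falls back on runtime: nvidia, which requires the NVIDIA runtime to be registered with the Docker daemon):

```yaml
# docker-compose.yml (sketch; adjust paths to your own checkout)
version: "2.3"             # the 2.x schema supports the "runtime" key
services:
  tfdev:
    image: tensorflow/tensorflow:latest-gpu-py3
    runtime: nvidia        # stand-in for --gpus all
    user: "1000:1000"
    ports:
      - "8888:8888"
    environment:
      - HOME=/home/projects
    working_dir: /home/projects
    volumes:
      - /home/<username>/projects:/home/projects
    stdin_open: true       # -i
    tty: true              # -t
    command: bash
```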

To install the Visual Studio Code Remote Development Extension Pack, go to Go > Go to File… and enter:

ext install ms-vscode-remote.vscode-remote-extensionpack

Once set up, and a container is running, go to Remote Explorer in the Visual Studio Code Activity Bar, select the running container, right click, and choose "Attach to Container". Note that if --env HOME is not set in the original docker run command (or equivalent), it will try to create .vscode-server in /root/, which gives a "Command in container failed: mkdir -p /root/.vscode-server/" error due to insufficient permissions.

Running Jupyter Notebooks from within Docker

If you run pip list within the default tensorflow/tensorflow:latest-gpu image you'll see that Jupyter Notebook isn't installed by default. It can be installed with a custom Docker image as per the "Building an image" section above.

To start Jupyter Notebook within the interactive Docker container:

jupyter notebook --ip 0.0.0.0 --no-browser

where --ip 0.0.0.0 makes the server listen on all interfaces (so it's reachable from outside the container) and --no-browser tells it not to try to launch a browser session as it normally does. You can then access it outside the container via http://localhost:8888, noting that you'll be prompted to enter the token shown when the jupyter notebook process started.

Moving the Docker data directory

When installing some large models via Docker, e.g. GPT-2, I found my root partition, i.e. /, became full and the system barely usable, despite having separate partitions for home and data. This is because Docker on Ubuntu defaults to using /var/lib/docker.

It was possible to recover enough space on / for the system to become usable again by clearing unused Docker files via:

docker system prune

There were a couple of ways of moving the Docker data directory listed online, e.g. setting DOCKER_OPTS="-g /new/dir/" in /etc/default/docker, but these didn't seem to work on Ubuntu, so I simply used a symlink instead:

sudo service docker stop
sudo mv /var/lib/docker ~/docker.bak
sudo mkdir -p /data/docker
sudo chmod go-r /data/docker
sudo ln -s /data/docker /var/lib/docker
sudo service docker start
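
As an aside, the Docker daemon does support pointing its data directory elsewhere directly, via the data-root key in /etc/docker/daemon.json; if the symlink ever causes trouble, something like the following (using the same /data/docker target as above) is an alternative — stop the daemon, add the setting, move the data across, then start it again:

```json
{
    "data-root": "/data/docker"
}
```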