Introduction
The first two sections are taken from Get Docker Engine - Community for Ubuntu, Post-installation steps for Linux, NVIDIA Container Toolkit, and the validation steps at TensorFlow Docker, so there is nothing new in them. The subsequent sections, which may be of more interest, contain notes on using TensorFlow with GPU support in a Docker container interactively, building a Docker image, running an IDE within the container, running Jupyter Notebook from the container, and moving the Docker data directory.
Install Docker
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo docker run hello-world
Add your user to the docker group so you don’t have to run commands with sudo (log out and back in for the change to take effect):
sudo usermod -aG docker $USER
docker run hello-world
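To check whether the current session has picked up the new group, one option is to look for docker in the output of id -nG. This is a minimal sketch; the has_group helper is a hypothetical name introduced for illustration, not part of Docker:

```shell
# has_group GROUPS NAME: succeeds if NAME appears as a whole word in the
# space-separated list GROUPS (hypothetical helper for illustration).
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the groups of the current session.
if has_group "$(id -nG)" docker; then
    echo "docker group active: sudo should not be needed"
else
    echo "docker group not active yet: log out and back in, or run 'newgrp docker'"
fi
```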
Install NVIDIA Container Toolkit and TensorFlow
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
docker run --gpus all --rm nvidia/cuda nvidia-smi
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
Note: the documentation says to run:
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
but, as per AttributeError: module 'tensorflow' has no attribute 'enable_eager_execution', this gives an error with TensorFlow 2, so use the command above.
To confirm which version of TensorFlow is installed:
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c 'import tensorflow as tf; print(tf.__version__)'
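Beyond the version number, it can be useful to confirm that TensorFlow inside the container actually sees the GPU. The sketch below uses tf.config.list_physical_devices, which is available in TensorFlow 2.1 and later. The tf_gpu_check function and the DOCKER override variable are hypothetical conveniences, added so the command can be previewed with DOCKER=echo:

```shell
# tf_gpu_check: list the GPUs TensorFlow can see inside the container.
# DOCKER defaults to the real docker binary; set DOCKER=echo for a dry run
# (both the function name and the DOCKER variable are assumptions).
tf_gpu_check() {
    "${DOCKER:-docker}" run --gpus all --rm tensorflow/tensorflow:latest-gpu \
        python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
}
```

Calling tf_gpu_check should print a list with one entry per visible GPU; an empty list suggests the container is not seeing the GPU.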
Using TensorFlow and Docker
For development and testing, you can simply start a container with a bind mount to a working directory. The container can then read the git-managed Python source files and save the trained model somewhere that persists after the container is stopped.
Dev sessions are started via:
docker run --gpus all -it -u 1000:1000 -p 8888:8888 --mount type=bind,src=/home/<username>/projects,dst=/home/projects --env HOME=/home/projects -w /home/projects tensorflow/tensorflow:latest-gpu-py3 bash
- 1000 and 1000 are the user and group IDs I want to run as (so files written by the container have the correct ownership outside the container)
- port 8888 is published so that Jupyter Notebook running in the container can be reached from the host
- /home/<username>/projects is the directory on the host and /home/projects is where it is mounted inside the container (so I have access to the local git repo, noting that git isn’t installed in this container, so git pull/push etc. have to be performed outside the container)
- the $HOME environment variable is set to /home/projects rather than the default / (this is for the Visual Studio Code Remote - Containers extension)
- the working directory is set to /home/projects
- the TensorFlow image with GPU support and Python 3 (tensorflow/tensorflow:latest-gpu-py3) is used
- a bash shell is started
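The options above can be assembled without hard-coding the IDs or the user name. This is only a sketch: the dev_session_cmd function, its argument order, and its defaults are assumptions for illustration. It prints the command rather than running it, and paths containing spaces would need extra quoting:

```shell
# dev_session_cmd [PROJECT_DIR] [IMAGE]: print the docker run command for an
# interactive session, using the caller's own UID and GID instead of 1000:1000.
# The function name and argument order are illustrative assumptions.
dev_session_cmd() {
    project_dir=${1:-$HOME/projects}
    image=${2:-tensorflow/tensorflow:latest-gpu-py3}
    echo "docker run --gpus all -it -u $(id -u):$(id -g) -p 8888:8888" \
         "--mount type=bind,src=$project_dir,dst=/home/projects" \
         "--env HOME=/home/projects -w /home/projects $image bash"
}

# Example: dev_session_cmd "$HOME/projects"
```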
Building an image
If you want to install additional packages, it’ll be easier to build an image. In my case I want to install Jupyter Notebook, so I’ll create a basic Dockerfile like:
FROM tensorflow/tensorflow:latest-gpu-py3
RUN pip install notebook
Then build the image with:
docker build --tag tfdev .
And start subsequent sessions with a simplified version of the original command, i.e.:
docker run --gpus all -it -u 1000:1000 -p 8888:8888 --mount type=bind,src=/home/<username>/projects,dst=/home/projects --env HOME=/home/projects tfdev bash
Of course this is still running it interactively, as a user that can directly read and write the source. If you want to build a completely self-contained image which you can move from dev through to production, you’ll need additional steps such as copying in the source code.
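For the interactive tfdev image, a slightly fuller Dockerfile might also set the working directory, so that -w can be left off the docker run command. The sketch below writes such a file from the shell; the write_tfdev_dockerfile name is hypothetical, and the WORKDIR line is an assumed addition rather than something the original setup is known to include:

```shell
# write_tfdev_dockerfile DIR: write a small Dockerfile for the tfdev image
# into DIR. The function name is hypothetical, and WORKDIR is an assumed
# addition so that -w can be omitted from docker run.
write_tfdev_dockerfile() {
    cat > "$1/Dockerfile" <<'EOF'
FROM tensorflow/tensorflow:latest-gpu-py3
RUN pip install notebook
WORKDIR /home/projects
EOF
}

# Example: write_tfdev_dockerfile . && docker build --tag tfdev .
```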
Running Visual Studio Code from the TensorFlow Docker image
This is so that code completion etc. works in the IDE. I use Visual Studio Code, but there are probably similar approaches for other IDEs. Visual Studio Code uses the Remote - Containers extension, which in turn needs Docker Compose.
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
To install the Visual Studio Code Remote Development Extension Pack, go to Go > Go to File and enter:
ext install ms-vscode-remote.vscode-remote-extensionpack
Once set up, and with a container running, go to Remote Explorer in the Visual Studio Code Activity Bar, right-click the running container, and select “Attach to container”. Note that if --env HOME is not set in the original docker run command (or equivalent), it will try to create .vscode-server in /root/, which fails with a “Command in container failed: mkdir -p /root/.vscode-server/” error due to insufficient permissions.
Running Jupyter Notebooks from within Docker
If you run pip list within the default tensorflow/tensorflow:latest-gpu image, you’ll see that Jupyter Notebook isn’t installed by default. It can be installed with a custom Docker image as per the “Building an image” section above.
To start Jupyter Notebook within the interactive Docker container:
jupyter notebook --ip 0.0.0.0 --no-browser
where --ip 0.0.0.0 makes the server listen on all network interfaces inside the container (so it can be reached through the published port) and --no-browser tells it not to try to launch a browser session as it normally does. You can then access it from outside the container via http://localhost:8888, noting that you’ll be prompted to enter the token shown when the jupyter notebook process started.
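If the startup output has scrolled away, the token can usually be recovered from jupyter notebook list, which prints the URL of each running server. The sketch below pulls the token out of that output. The extract_token helper is a hypothetical name, and the assumption that each URL carries a token= query parameter is based on classic Notebook output, so it may need adjusting for other versions:

```shell
# extract_token: read `jupyter notebook list` output on stdin and print the
# value of the token= query parameter from each server URL
# (hypothetical helper; the URL format is an assumption).
extract_token() {
    sed -n 's/.*[?&]token=\([^ &]*\).*/\1/p'
}

# Example (inside the container): jupyter notebook list | extract_token
```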
Moving the Docker data directory
When installing some large models via Docker, e.g. GPT-2, I found my root partition, i.e. /, became full and the system barely usable, despite having separate partitions for home and data. This is because Docker on Ubuntu defaults to using /var/lib/docker.
It was possible to recover enough space on / for the system to become usable again by clearing unused Docker files via:
docker system prune
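To catch the problem before the system becomes unusable, a quick check of how full the root filesystem is can help (docker system df additionally reports how much space images, containers, and volumes are using). This is a minimal sketch; the usage_pct helper and the 90% threshold are arbitrary choices for illustration:

```shell
# usage_pct PATH: print the percentage of space used on the filesystem that
# holds PATH, as a bare integer (hypothetical helper built on POSIX df -P).
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the root filesystem is more than 90% full (threshold is arbitrary).
if [ "$(usage_pct /)" -gt 90 ]; then
    echo "root filesystem nearly full: consider 'docker system prune'"
fi
```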
There were a couple of ways of moving the Docker data directory listed online, e.g. DOCKER_OPTS="-g /new/dir/" in /etc/default/docker, but these didn’t seem to work on Ubuntu, so I simply used a symlink instead:
sudo service docker stop
sudo mv /var/lib/docker ~/docker.bak
sudo mkdir /data/docker
sudo chmod go-r /data/docker
sudo ln -s /data/docker /var/lib/docker
sudo service docker start
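After restarting Docker, it is worth confirming that /var/lib/docker really is a symlink to the new location. A small sketch, assuming GNU readlink as shipped with Ubuntu; the points_to helper is a hypothetical name:

```shell
# points_to LINK TARGET: succeeds if LINK is a symlink that resolves to the
# same real path as TARGET (hypothetical helper; relies on readlink -f).
points_to() {
    [ -L "$1" ] && [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Example: points_to /var/lib/docker /data/docker && echo "symlink in place"
```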
Updating the Docker images to use the latest TensorFlow
If you set all of the above up and revisit it later, you may need to update the images to pick up later versions of TensorFlow. This can be done via:
docker pull tensorflow/tensorflow:latest-gpu
docker pull tensorflow/tensorflow:latest-gpu-py3
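Pulling newer base images does not update images that were already built from them, such as tfdev, so those need rebuilding as well. The sketch below combines both steps; the update_tf_images function and the DOCKER override are hypothetical conveniences, and it assumes the Dockerfile for tfdev is in the current directory:

```shell
# update_tf_images: pull the two TensorFlow tags used in this post, then
# rebuild the derived tfdev image from the Dockerfile in the current directory.
# DOCKER defaults to docker; set DOCKER=echo to preview the commands
# (both the function name and the DOCKER variable are assumptions).
update_tf_images() {
    for tag in latest-gpu latest-gpu-py3; do
        "${DOCKER:-docker}" pull "tensorflow/tensorflow:$tag" || return 1
    done
    # --pull makes the build check for a newer base image as well.
    "${DOCKER:-docker}" build --pull --tag tfdev .
}
```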