Accessing GPUs from a Docker Swarm service

This article shows how to access GPUs from Docker Swarm services. In essence, we need to do two things:

Set the nodes in the cluster to advertise their GPUs as Docker generic resources;
Have the service specify the constraint that it needs GPU resources.

Once these are both in place, the swarm orchestrator can automatically allocate services that need GPUs to nodes that have GPUs, without us needing to manually place tasks on specific nodes. Yay!

However, please note that only one Docker service replica can be assigned to a given GPU; a GPU is not time-shared between tasks. In practice, assuming one GPU per node, this means you need at least as many nodes with GPUs as tasks that require them. If you have 5 nodes with GPUs and start 6 replicas of your service, one replica will stay pending due to lack of resources.
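
The capacity rule above comes down to simple arithmetic. As a minimal sketch (the function name pending_replicas is made up for illustration, and it assumes one GPU per node and one GPU per replica):

```shell
#!/bin/sh
# Hypothetical helper: how many replicas stay pending for lack of a free GPU.
# Assumes each node advertises exactly one GPU and each replica requests one.
pending_replicas() {
  gpu_nodes=$1
  replicas=$2
  if [ "$replicas" -gt "$gpu_nodes" ]; then
    echo $((replicas - gpu_nodes))
  else
    echo 0
  fi
}

pending_replicas 5 6   # prints 1
```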

This article assumes you are already familiar with a number of concepts. Here are some resources for more background information:

Why GPUs and Docker Swarm?

Why might you want to access GPUs from Docker Swarm services? For this article I’ll assume that you want to rapidly train a lot of neural networks using Apache Spark. We can use Docker Swarm to manage our Spark cluster, deploying the Spark master on one node and replicating the Spark workers across the remaining nodes. With this architecture, we can direct each worker to train a single network, and use the GPU on a given worker node to speed up the training time.

Accessing the GPU from your own software

Before we get to Spark workers and Docker services, we need to ensure that our neural network training code can access the GPU in the first place. The nodes in the cluster should have an NVIDIA GPU (e.g. AWS EC2 instance types whose names start with p) and the NVIDIA CUDA toolkit installed.

You also need a framework for designing and training neural networks, such as TensorFlow or Theano, or the higher-level wrapper Keras. If you install these Python packages yourself, make sure to install the GPU-enabled version of the backend (Keras uses the GPU through its backend), e.g.:

pip install tensorflow-gpu keras

If you are running on EC2, Amazon provides an AMI for its GPU-enabled instances that comes with CUDA, TensorFlow, and Python already installed.

Now your Keras or TensorFlow neural network program should run on the GPU!

Accessing the GPU from a Docker container

Containers are great for abstracting away the details of the native system that we’re running on, but the GPU is one of the details that gets abstracted away! In order for a Docker container to access the GPU, we need to use nvidia-docker instead of docker to run containers.

On Linux we install nvidia-docker through the package manager in the usual way (e.g. apt-get). Then launching a container becomes:

nvidia-docker run <image_name>

If we launch our Keras program in this container, it will run on the GPU!

However, originally nvidia-docker didn’t support Docker Swarm. This meant that Spark workers couldn’t be replicated across nodes in a cluster. The work-around was to manually allocate a Spark worker to a specific node by issuing an nvidia-docker run command on that node, instead of issuing a service create --replicas request to the swarm manager. It gets the job done, but it misses all the nice benefits of orchestration.

In December 2017, nvidia-docker2 was released, which supports Docker Swarm. Yay! The rest of this article draws on a GitHub comment from January this year that explains how to use nvidia-docker with Docker Swarm. If you previously had nvidia-docker installed, you need to uninstall it and switch to nvidia-docker2 for swarm support. For example:

sudo apt-get -y install nvidia-docker2

Accessing the GPU from a Docker service

So how do we get Docker services to use the GPU? Well, in addition to the requirements above (CUDA, a GPU-enabled framework, nvidia-docker2) we need to do three more things:

Configure the Docker daemon on each node to advertise its GPU
Make the Docker daemon on each node default to the nvidia runtime
Add a constraint to our Docker service specifying that it needs a GPU

Once we take these steps, the orchestrator will be able to see which nodes have GPUs and which services require them, and deploy our services accordingly!

Configuring the Docker daemon

The first step is to find the identifier of the GPU on a specific node, so we can pass it to the daemon later. We find it and store it in an environment variable with this command:

export GPU_ID=$(nvidia-smi -a | grep UUID | awk '{print substr($4,1,12)}')

What this is doing is running nvidia-smi -a, finding the line containing ‘UUID’, then extracting the first 12 characters of the 4th column of this line. You can see an example of the output of nvidia-smi -a in the comment here. Line 19 contains the UUID; columns 1, 2, and 3 are ‘GPU’, ‘UUID’, and ‘:’ respectively. The first 12 characters of column 4 should be enough to uniquely identify this GPU.

If we echo $GPU_ID, we can see it looks something like GPU-c143e771 or GPU-c5c84263.
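
To see what the pipeline extracts without needing a GPU at hand, here is a small sketch that runs the same grep and awk steps over a sample line. The sample UUID line is made up (only its GPU-c143e771 prefix comes from the example above), and the function name extract_gpu_id is an illustrative choice. It uses a start index of 1 in substr, since awk strings are indexed from 1 and this avoids relying on how a given awk treats a start index of 0.

```shell
#!/bin/sh
# Illustrative helper: read `nvidia-smi -a`-style text on stdin and print
# the first 12 characters of the 4th field of each line containing "UUID".
extract_gpu_id() {
  grep UUID | awk '{print substr($4, 1, 12)}'
}

# Made-up sample line in the shape described above:
# fields 1-3 are "GPU", "UUID", ":" and field 4 is the full identifier.
sample='    GPU UUID                        : GPU-c143e771-0000-0000-0000-000000000000'

GPU_ID=$(printf '%s\n' "$sample" | extract_gpu_id)
echo "$GPU_ID"   # prints GPU-c143e771
```

On a real node you would feed it live output instead, i.e. nvidia-smi -a | extract_gpu_id. On a node with several GPUs this prints one identifier per line, so the single-value GPU_ID used in this article assumes one GPU per node.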

Docker is launched and managed as a service through systemd. We can change its default behaviour by adding an override file, called /etc/systemd/system/docker.service.d/override.conf.

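This file should contain the following lines, with ${GPU_ID} replaced by the actual identifier found above (systemd does not see variables exported in your shell):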

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}

Note: the second line is essential, because it clears any previously set ExecStart commands. You’ll get an error if this is missing.

What is the third line doing? Three things. First, it says that when the Docker daemon starts, the default container runtime should be nvidia (the runtime installed with nvidia-docker2) instead of Docker’s standard runtime. Second, it says that this node provides a generic resource of type gpu. (The name gpu could be anything, but it must be the same across all the nodes in our cluster so that the orchestrator sees which nodes offer the same resource type.) Finally, it says that on this specific node, the generic gpu resource has the identifier we previously stored in $GPU_ID.
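
One detail worth spelling out: systemd does not read variables exported in an interactive shell, so the override file needs to end up containing the literal identifier. The sketch below prints the file contents for a given GPU ID so you can inspect them before writing anything under /etc. The helper name render_override is hypothetical, and the ID passed in is just the example value from above.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of the systemd override file
# for the GPU identifier given as the first argument.
render_override() {
  gpu_id=$1
  # Unquoted heredoc delimiter: the shell substitutes ${gpu_id} here,
  # so the output contains the literal identifier.
  cat <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia --node-generic-resource gpu=${gpu_id}
EOF
}

override=$(render_override "GPU-c143e771")
printf '%s\n' "$override"
```

Once the output looks right, it could be piped into sudo tee /etc/systemd/system/docker.service.d/override.conf on the node.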

Next, we modify the file /etc/nvidia-container-runtime/config.toml to allow the GPU to be advertised as a swarm resource. Uncomment or add the following line to this file:

swarm-resource = "DOCKER_RESOURCE_GPU"
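
If you script this edit, it helps to make it safe to run more than once. The following is a sketch under assumptions, not an official procedure: enable_swarm_resource is a made-up function name, and the sample file is a simplified stand-in for the real config.toml. It leaves an active swarm-resource line alone, uncomments a commented-out one, and otherwise adds the line at the top of the file so that it sits ahead of any [table] header.

```shell
#!/bin/sh
# Hypothetical helper: make sure the config file given as $1 has an active
# swarm-resource line, without adding a duplicate on repeated runs.
enable_swarm_resource() {
  config=$1
  if grep -q '^swarm-resource' "$config"; then
    return 0    # already active, nothing to do
  elif grep -q '^#[[:space:]]*swarm-resource' "$config"; then
    # Strip the leading "#" from the commented-out line.
    sed 's/^#[[:space:]]*\(swarm-resource.*\)$/\1/' "$config" > "$config.tmp"
  else
    # Prepend the key so it stays a top-level setting.
    { echo 'swarm-resource = "DOCKER_RESOURCE_GPU"'; cat "$config"; } > "$config.tmp"
  fi
  mv "$config.tmp" "$config"
}

# Try it on a throwaway file (contents are a simplified, made-up sample).
demo=$(mktemp)
printf '%s\n' 'disable-require = false' '#swarm-resource = "DOCKER_RESOURCE_GPU"' '[nvidia-container-cli]' > "$demo"
enable_swarm_resource "$demo"
enable_swarm_resource "$demo"   # second run changes nothing
cat "$demo"
```

Against the real file under /etc the function would need to run with root privileges.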

After taking these three steps, we need to have systemd reload its unit files (to pick up the new configuration override file), and then restart Docker so the daemon runs with the new options:

sudo systemctl daemon-reload
sudo systemctl restart docker

Scripting these steps

It’s a bit tedious to manually take these steps on every node in our cluster. They can be scripted as follows:

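# Find the first 12 characters of the GPU UUID (assumes one GPU per node)
export GPU_ID=$(nvidia-smi -a | grep UUID | awk '{print substr($4,1,12)}')
# Write the systemd override; the unquoted EOF lets the shell expand ${GPU_ID} now
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<EOF | sudo tee --append /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}
EOF
# Insert the swarm-resource key at the top of config.toml, ahead of any [table] header
sudo sed -i '1iswarm-resource = "DOCKER_RESOURCE_GPU"' /etc/nvidia-container-runtime/config.toml
# Reload the unit files and restart Docker so the new settings take effect
sudo systemctl daemon-reload
sudo systemctl restart docker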

Adding a service constraint

Now our cluster nodes are advertising to the swarm that they offer access to a GPU. The final step is to ensure that the service requests a GPU. We do this by adding --generic-resource "gpu=1" to the docker service create command. The container command is wrapped in bash -c (this assumes the image provides bash) so that the && is evaluated inside the container rather than by your local shell. The full command looks something like this:

docker service create --generic-resource "gpu=1" --replicas 10 \
--name sparkWorker <image_name> bash -c "service ssh start && \
/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<spark_master_ip>:7077"

The name of the generic resource being requested (gpu here) should match the name of the resource being advertised by the nodes.

Congratulations! The Docker swarm orchestrator will now distribute your Spark workers onto nodes with GPU capability.