Singularity and NVIDIA Containers

Theta GPU Nodes

NVIDIA provides Docker containers that bundle their latest releases of CUDA, TensorFlow, PyTorch, and related libraries. The full support matrix for all of their containers is available here:

NVIDIA support matrix

Building with Singularity

Most users cannot run Docker on ALCF's ThetaGPU system, but Singularity is available. To convert one of these images to a Singularity image, use the following command:

singularity build $OUTPUT_NAME $NVIDIA_CONTAINER_LOCATION

where $OUTPUT_NAME is the name of the Singularity image to create, typically of the form tf2_20.09-py3.simg, and $NVIDIA_CONTAINER_LOCATION is the source image, which can be a Docker URL such as docker://nvcr.io/nvidia/tensorflow:20.09-tf2-py3.
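As a concrete illustration of the command above, the following sketch fills in both variables for the TensorFlow 20.09 image. The helper image_name_from_url and the RUN_BUILD switch are hypothetical names introduced here for illustration; they are not part of Singularity. The script only prints the build command unless RUN_BUILD=1 is set.

```shell
#!/bin/bash
# Hypothetical helper (not part of Singularity): derive an output file name
# such as tensorflow-20.09-tf2-py3.simg from a docker:// URL.
image_name_from_url() {
  local url="$1"
  local ref="${url##*/}"   # drop everything up to the last '/': tensorflow:20.09-tf2-py3
  local name="${ref%%:*}"  # part before the ':' (image name)
  local tag="${ref#*:}"    # part after the ':' (image tag)
  printf '%s-%s.simg\n' "$name" "$tag"
}

NVIDIA_CONTAINER_LOCATION="docker://nvcr.io/nvidia/tensorflow:20.09-tf2-py3"
OUTPUT_NAME="$(image_name_from_url "$NVIDIA_CONTAINER_LOCATION")"

# Print the build command; RUN_BUILD is an assumed variable name used only
# in this sketch to opt in to actually running the build.
echo "singularity build $OUTPUT_NAME $NVIDIA_CONTAINER_LOCATION"
if [ "${RUN_BUILD:-0}" = "1" ]; then
  singularity build "$OUTPUT_NAME" "$NVIDIA_CONTAINER_LOCATION"
fi
```

Deriving the file name from the image tag keeps the release version visible in the file name, which helps when several converted images sit in the same directory.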

You can find the latest containers from NVIDIA here:

- TensorFlow 1 and 2
- PyTorch

For your convenience, we have converted these containers to Singularity images, which are available here:

/lus/theta-fs0/software/thetagpu/nvidia-containers/
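Before building an image yourself, it may be worth checking whether a converted copy already exists in that directory. The sketch below shows one way to do so. The helper find_prebuilt is hypothetical, and the image name is only an assumption; list the directory to see which files are actually present.

```shell
#!/bin/bash
# Directory of pre-converted images, as given above.
PREBUILT_DIR="/lus/theta-fs0/software/thetagpu/nvidia-containers"

# Hypothetical helper: print the full path of an image if it exists in the
# given directory, otherwise return a non-zero status.
find_prebuilt() {
  local dir="$1" image="$2"
  if [ -f "$dir/$image" ]; then
    printf '%s\n' "$dir/$image"
  else
    return 1
  fi
}

# The image name below is an assumption made for this example.
IMAGE="tf2_20.09-py3.simg"
if CONTAINER="$(find_prebuilt "$PREBUILT_DIR" "$IMAGE")"; then
  echo "using pre-converted image: $CONTAINER"
else
  echo "no pre-converted image named $IMAGE; build it with singularity build"
fi
```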

Running with Singularity

After logging into ThetaGPU with ssh thetagpusn1, you can submit a job that uses the container on a single node with:

qsub -n 1 -t 10 -A <project-name> submit.sh

where submit.sh contains the following bash script:

#!/bin/bash
CONTAINER=$HOME/tensorflow-20.08-tf2-py3.simg
# --nv makes the host's NVIDIA GPUs available inside the container
singularity exec --nv $CONTAINER python /usr/local/lib/python3.6/dist-packages/tensorflow/python/debug/examples/debug_mnist.py

Make sure the script is executable by running chmod a+x submit.sh.
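A job that points at a missing image only fails after it has waited in the queue, so a small guard in the submit script can make the error easier to spot. The following is a minimal sketch, not the documented ALCF approach; run_in_container is a hypothetical wrapper name, and using nvidia-smi as a quick GPU visibility check is an assumption.

```shell
#!/bin/bash
# Hypothetical wrapper: fail early with a readable message if the image file
# is missing, otherwise run the given command inside the container with
# GPU support enabled (--nv).
run_in_container() {
  local container="$1"
  shift
  if [ ! -f "$container" ]; then
    echo "container image not found: $container" >&2
    return 1
  fi
  singularity exec --nv "$container" "$@"
}

# Example call, left commented out so the sketch has no side effects:
# run_in_container "$HOME/tensorflow-20.08-tf2-py3.simg" nvidia-smi
```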

The log file <cobalt-jobid>.output should contain output similar to this:

Accuracy at step 0: 0.2159
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098

The numbers may be different.
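Since the exact values vary, a quick way to confirm that the job ran is to count the accuracy lines rather than compare numbers. The sketch below assumes the log format shown above; count_accuracy_lines is a hypothetical helper, and the temporary file stands in for the real <cobalt-jobid>.output.

```shell
#!/bin/bash
# Hypothetical helper: count the "Accuracy at step N: X" lines in a job log.
count_accuracy_lines() {
  grep -c '^Accuracy at step [0-9][0-9]*: ' "$1"
}

# Small stand-in log for demonstration; on ThetaGPU you would pass the
# <cobalt-jobid>.output file instead.
LOG="$(mktemp)"
printf 'Accuracy at step 0: 0.2159\nAccuracy at step 1: 0.098\nsome other line\n' > "$LOG"
echo "accuracy lines found: $(count_accuracy_lines "$LOG")"   # prints 2 for this stand-in log
rm -f "$LOG"
```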

Running TensorFlow-2 with Horovod on ThetaGPU

To run on ThetaGPU with MPI, you can try the following test:

git clone git@github.com:jtchilders/tensorflow_skeleton.git 
cd tensorflow_skeleton
qsub -n 2 -t 20 -A <project-name> submit_scripts/thetagpu_mnist.sh

You can inspect the submit script for details on how the job is constructed.

To extend the Python libraries in these containers, please see building python packages.

For issues with these containers, please email support@alcf.anl.gov.