NVIDIA delivers Docker containers that contain their latest releases of CUDA, TensorFlow, PyTorch, etc. You can see the full support matrix for all of their containers here:
Building with Singularity
Docker is not runnable on ALCF's ThetaGPU system for most users, but Singularity is. To convert one of these images to Singularity, you can use the following command:
singularity build $OUTPUT_NAME $NVIDIA_CONTAINER_LOCATION
$OUTPUT_NAME is typically of the form
$NVIDIA_CONTAINER_LOCATION can be a docker url such as
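As an illustration of how the two variables fit together, here is a minimal sketch. The Docker URL and the `.simg` naming are assumptions chosen to match the container file name used later on this page, and `image_to_output_name` is a hypothetical helper, not part of Singularity. The script only prints the build command so you can inspect it before running it.

```shell
#!/bin/bash
# Assumed example value; substitute the image you actually want.
NVIDIA_CONTAINER_LOCATION="docker://nvcr.io/nvidia/tensorflow:20.08-tf2-py3"

# Hypothetical helper: derive an output file name from the image name and tag,
# e.g. tensorflow:20.08-tf2-py3 -> tensorflow-20.08-tf2-py3.simg
image_to_output_name() {
    local image="${1##*/}"      # strip the scheme, registry and namespace
    echo "${image/:/-}.simg"    # replace the tag separator and add a suffix
}

OUTPUT_NAME="$(image_to_output_name "$NVIDIA_CONTAINER_LOCATION")"

# Print the command instead of running it, so it can be checked first.
echo "singularity build $OUTPUT_NAME $NVIDIA_CONTAINER_LOCATION"
```

Once the printed command looks right, it can be pasted into the shell or the `echo` can be removed.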
For your convenience, we've converted these containers to Singularity, and they are available here:
Running with Singularity
After logging into ThetaGPU with
ssh thetagpusn1, you can submit a job that uses the container on a single node by running:
qsub -n 1 -t 10 -A <project-name> submit.sh where
submit.sh contains the following bash script:
#!/bin/bash
CONTAINER=$HOME/tensorflow-20.08-tf2-py3.simg
singularity exec --nv $CONTAINER python /usr/local/lib/python3.6/dist-packages/tensorflow/python/debug/examples/debug_mnist.py
Make the script executable with
chmod a+x submit.sh.
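A job that fails because the container file is missing still waits in the queue first, so it can help to check the path before calling Singularity. The sketch below is one possible way to do that; `check_container` is a hypothetical helper name, not part of Singularity or Cobalt.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the container image exists and is
# readable, otherwise print a message to stderr and return 1.
check_container() {
    local container="$1"
    if [ ! -r "$container" ]; then
        echo "Container not found or unreadable: $container" >&2
        return 1
    fi
}

# Possible usage inside submit.sh (shown as comments, not executed here):
#   CONTAINER=$HOME/tensorflow-20.08-tf2-py3.simg
#   check_container "$CONTAINER" || exit 1
#   singularity exec --nv "$CONTAINER" python <script>
```

The `--nv` flag in the `singularity exec` line is what makes the host's NVIDIA GPUs and driver libraries available inside the container.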
The log file
<cobalt-jobid>.output should contain some text like this:
Accuracy at step 0: 0.2159
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098
The numbers may be different.
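Since the exact values vary, a quick check of the log can look for the presence of the accuracy lines rather than specific numbers. The following is a small sketch under the assumption that each step is logged on its own line in the format shown above; the function names are hypothetical.

```shell
#!/bin/bash
# Count the "Accuracy at step" lines in a job log.
count_accuracy_lines() {
    grep -c 'Accuracy at step' "$1"
}

# Print the accuracy value reported at the last logged step.
last_accuracy() {
    grep -o 'Accuracy at step [0-9]*: [0-9.]*' "$1" | tail -n 1 | awk '{print $NF}'
}
```

For example, `count_accuracy_lines <cobalt-jobid>.output` reports how many steps were logged, and `last_accuracy <cobalt-jobid>.output` prints the final value.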
Running TensorFlow-2 with Horovod on ThetaGPU
To run on ThetaGPU with MPI, you can run the following test:
git clone email@example.com:jtchilders/tensorflow_skeleton.git
cd tensorflow_skeleton
qsub -n 2 -t 20 -A <project-name> submit_scripts/thetagpu_mnist.sh
You can inspect the submit script for details on how the job is constructed.
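Horovod jobs are commonly launched with one MPI rank per GPU, so a multi-node script usually needs the total rank count. The sketch below shows one way to compute it; it does not reproduce the submit script in the repository. It assumes a Cobalt-style node file with one hostname per line (such as the file named by `$COBALT_NODEFILE`) and 8 GPUs per node, and both function names are hypothetical.

```shell
#!/bin/bash
# Assumption: 8 GPUs per node; override GPUS_PER_NODE if that is not the case.
GPUS_PER_NODE="${GPUS_PER_NODE:-8}"

# Count the nodes listed in a node file (one hostname per non-empty line).
count_nodes() {
    grep -c . "$1"
}

# Total number of MPI ranks when running one rank per GPU.
total_ranks() {
    echo $(( $(count_nodes "$1") * GPUS_PER_NODE ))
}
```

Inside a job, `total_ranks "$COBALT_NODEFILE"` would give the value to pass as the process count to the MPI launcher.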