NVIDIA Container Notes
Getting the container
To get NVIDIA Docker containers with the latest CUDA and TensorFlow installed, go to NVIDIA NGC, create an account, and search for TensorFlow. Notice that some containers are tagged with tf2. The page explains how to select the right one.
You can convert the pull command shown at the top of the page, for instance:
docker pull nvcr.io/nvidia/tensorflow:20.08-tf2-py3
to a Singularity command like this:
singularity build tensorflow-20.08-tf2-py3.simg docker://nvcr.io/nvidia/tensorflow:20.08-tf2-py3
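The same conversion can be applied to any other NGC tag. The following is a minimal sketch, not part of the original instructions: docker_to_singularity is a hypothetical helper name, and it only prints the build command rather than running it. It assumes the image reference has the form registry/path/name:tag.

```shell
#!/bin/bash
# Hypothetical helper: print the Singularity build command for a Docker image.
# Assumes the reference looks like registry/path/name:tag.
docker_to_singularity() {
    local image="$1"            # e.g. nvcr.io/nvidia/tensorflow:20.08-tf2-py3
    local name="${image##*/}"   # strip everything up to the last slash
    name="${name/:/-}"          # replace the first ':' with '-' for the file name
    echo "singularity build ${name}.simg docker://${image}"
}

docker_to_singularity nvcr.io/nvidia/tensorflow:20.08-tf2-py3
# prints: singularity build tensorflow-20.08-tf2-py3.simg docker://nvcr.io/nvidia/tensorflow:20.08-tf2-py3
```

Printing the command first makes it easy to check the output file name before starting a build.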
You'll need to run the singularity build command on a Theta login node (thetaloginX), since those nodes have network access. The containers from August 2020 are also all available, already converted to Singularity, here:
Running on ThetaGPU
After logging into ThetaGPU with ssh thetagpusn1, you can submit a job that uses the container on a single node with:
qsub -n 1 -t 10 -A <project-name> submit.sh
where submit.sh contains the following bash script:
#!/bin/bash
CONTAINER=$HOME/tensorflow-20.08-tf2-py3.simg
singularity exec --nv $CONTAINER python /usr/local/lib/python3.6/dist-packages/tensorflow/python/debug/examples/debug_mnist.py
Make sure the script is executable:
chmod a+x submit.sh
The log file <cobalt-jobid>.output should contain text like this:
Accuracy at step 0: 0.2159
Accuracy at step 1: 0.098
Accuracy at step 2: 0.098
Accuracy at step 3: 0.098
Accuracy at step 4: 0.098
Accuracy at step 5: 0.098
Accuracy at step 6: 0.098
Accuracy at step 7: 0.098
Accuracy at step 8: 0.098
Accuracy at step 9: 0.098
The numbers may be different.
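Since the exact values vary, a quick way to confirm the job ran is to count the accuracy reports rather than compare numbers. This is a minimal sketch: count_accuracy_lines is a hypothetical helper, and it assumes each report sits on its own line in the form shown above. The sample log is made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical helper: count the accuracy reports in a job log.
# Assumption: each report is on its own line as "Accuracy at step N: value".
count_accuracy_lines() {
    grep -c '^Accuracy at step [0-9][0-9]*:' "$1"
}

# Demo on a made-up log; a real check would pass <cobalt-jobid>.output instead.
sample_log=$(mktemp)
printf '%s\n' 'Accuracy at step 0: 0.2159' 'some other output' 'Accuracy at step 1: 0.098' > "$sample_log"
count_accuracy_lines "$sample_log"   # prints 2
rm -f "$sample_log"
```

For the ten-step example output above, the expected count would be 10.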
Running Tensorflow-2 with Horovod on ThetaGPU
To run on ThetaGPU with MPI, you can try the following test:
git clone firstname.lastname@example.org:jtchilders/tensorflow_skeleton.git
cd tensorflow_skeleton
qsub -n 2 -t 20 -A <project-name> submit_scripts/thetagpu_mnist.sh
You can inspect the submit script for details on how the job is constructed.
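One detail a multi-node Horovod job usually has to work out is the total number of MPI ranks. The sketch below is an illustration only and is not taken from thetagpu_mnist.sh: total_ranks is a hypothetical helper, and it assumes one rank per GPU, 8 GPUs per node, and a node file with one hostname per line (under Cobalt this would typically be the file named by $COBALT_NODEFILE). The hostnames in the demo are made up.

```shell
#!/bin/bash
# Hypothetical helper: total MPI ranks for a job, assuming one rank per GPU.
total_ranks() {
    local nodefile="$1"         # file with one hostname per line
    local ranks_per_node="$2"   # assumed number of GPUs per node
    local nodes
    nodes=$(grep -c . "$nodefile")   # count non-empty lines
    echo $(( nodes * ranks_per_node ))
}

# Demo with a made-up two-node file; a real job would pass "$COBALT_NODEFILE".
nodefile=$(mktemp)
printf '%s\n' 'node-a' 'node-b' > "$nodefile"
total_ranks "$nodefile" 8    # prints 16
rm -f "$nodefile"
```

The resulting number is what would normally be passed to mpirun -n when launching the containerized Python script.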