Machine Learning Tools on Cooley


Cooley is designed as a visualization cluster with Nvidia GPUs. However, it is also possible to run machine learning workflows on Cooley. The datascience group supports a number of containers for machine learning and deep learning workflows, built from Nvidia's containers with deep learning software added.

For more information about building containers for Cooley, please see the Singularity on Cooley page.  This page will focus on using containers for machine learning and deep learning workflows.

Available Containers

Four containers with machine learning frameworks have been tested and are available on Cooley:

  1. Tensorflow + Keras with GPU capability
  2. Tensorflow + Keras with GPU capability, plus mpich, mpi4py, and horovod for distributed learning.
  3. Pytorch with GPU capability
  4. Pytorch with GPU capability, plus mpich, mpi4py and horovod for distributed learning.

These container images are in /soft/datascience/singularity. Depending on your needs, you can use either pytorch or tensorflow. On a single GPU, running a container with or without horovod has no significant impact on performance. However, since the K80 nodes have 2 GPUs per node, it is recommended that you use horovod with data parallel learning to take advantage of both GPUs.
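As a rough illustration of data parallel learning across the two GPUs of a node, the sketch below uses Horovod's PyTorch interface. It assumes one of the horovod-enabled pytorch containers is in use. The model, the random data, the learning rate, and the function names `scaled_learning_rate` and `train` are placeholders invented for this example, not something shipped in the containers.

```python
def scaled_learning_rate(base_lr, world_size):
    """Scale the learning rate linearly with the number of workers.

    This is a common heuristic for data parallel training, not a requirement.
    """
    return base_lr * world_size


def train(steps=10, base_lr=0.01):
    # Imported inside the function so the helper above can be used
    # on systems where the frameworks are not installed.
    import torch
    import horovod.torch as hvd

    hvd.init()
    # Pin each process to one GPU: local_rank() is 0 or 1 on a 2-GPU node.
    torch.cuda.set_device(hvd.local_rank())

    model = torch.nn.Linear(32, 1).cuda()  # placeholder model
    optimizer = torch.optim.SGD(
        model.parameters(), lr=scaled_learning_rate(base_lr, hvd.size()))
    # Wrap the optimizer so gradients are averaged across all processes.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    # Start every process from the same weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    for step in range(steps):
        # Placeholder data: each rank would normally read its own shard.
        x = torch.randn(64, 32).cuda()
        y = torch.randn(64, 1).cuda()
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        if hvd.rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

# Add a call to train() at the end of the script you launch under mpirun.
```

With a final `train()` call added, a script like this would be started with one process per GPU, so that each process drives a different device.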

Running a machine learning workflow on Cooley

To run a deep learning workflow, you must use singularity to execute your python scripts. Because singularity sets up a containerized system, there are several important steps to note:

  1. Use the `--nv` option to `singularity exec` to enable Nvidia GPU drivers within the container.  Without this, you will not be able to take advantage of Nvidia GPU acceleration.
  2. Make sure you bind necessary directories correctly.  By default, not all areas mounted on the host system (outside the container) are available inside the container.  To access an area, you can bind it with the `-B outside_loc:inside_loc` syntax.  For example, to access the Theta projects area from inside a container on Cooley, use `-B /lus:/lus` as part of your singularity command.
  3. Run the container inside of mpirun calls.  For example, use `mpirun -n 2 singularity exec --nv -B /lus:/lus $IMAGE /path/to/python/` and NOT `singularity exec $IMAGE mpirun -n 2 /path/to/python/` (where $IMAGE is the path to the container you want to run).
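The three steps above can be combined in a small helper that assembles the launch command. This is only a sketch: the function name `build_launch_command`, the image path, and the training script path are hypothetical, and it assumes `python` is on the path inside the container.

```python
import shlex


def build_launch_command(image, script, n_procs=2, binds=("/lus:/lus",)):
    """Return the argument list for running `script` in `image` under MPI.

    mpirun is the outer command, so each MPI rank starts its own container.
    """
    cmd = ["mpirun", "-n", str(n_procs), "singularity", "exec", "--nv"]
    for bind in binds:
        cmd += ["-B", bind]  # outside_loc:inside_loc
    cmd += [image, "python", script]
    return cmd


if __name__ == "__main__":
    # Hypothetical image and script paths, for illustration only.
    cmd = build_launch_command("/path/to/image", "/path/to/train.py")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from a job script, or printed and pasted into a shell. Keeping `mpirun` at the front of the list mirrors the ordering recommended in step 3.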

Running the mpi containers with both GPUs per node has been demonstrated to scale to many nodes on Cooley, so distributed learning is feasible on Cooley.

Extending Available Software in Containers

The singularity containers already contain many important pieces of software for ML/DL workflows, but if you have custom software it is possible to use it inside the container.  The most straightforward path is to install it via pip while in the container, using the `--user` flag if you can.  In this way, you can add extensions to tensorflow/pytorch, IO frameworks, etc.  Alternatively, you can use the recipes from the alcf containers to bootstrap your own containers with everything available inside the portable container.  Email for questions concerning these techniques.
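After a `pip install --user`, it can be useful to confirm that the Python inside the container actually sees the user site directory. The sketch below uses only the standard library; the function name `user_site_status` is made up for this example, and it assumes your home directory is visible inside the container, which depends on how singularity is configured.

```python
import site
import sys


def user_site_status():
    """Report where `pip install --user` puts packages and if Python sees it."""
    user_site = site.getusersitepackages()
    return {
        "user_site": user_site,
        "enabled": bool(site.ENABLE_USER_SITE),
        # The directory is only added to sys.path if it exists at startup.
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_status().items():
        print(f"{key}: {value}")
```

If `on_sys_path` is false after installing a package, the user site directory may not have existed when the interpreter started, or user site packages may be disabled in that environment.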

Non-container software solutions

It is perfectly possible to run tensorflow, pytorch, etc. outside of a container on Cooley.  We don't provide officially supported builds or distributions of these, but because Nvidia GPUs are very common for ML and DL software, there are many excellent tools available for getting GPU-optimized tensorflow, pytorch, etc.  Solutions that can work on Cooley include pip, conda, and virtualenv, and possibly others.  Note that you will need to add CUDA libraries from softenv to use these tools.
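One way to check whether the CUDA libraries are visible to a non-container Python environment is to ask the dynamic loader for them. The sketch below uses only the standard library. The function name `find_cuda_libraries` and the list of library names are assumptions chosen for illustration; which libraries a given framework build needs will vary.

```python
import ctypes.util


def find_cuda_libraries(names=("cudart", "cublas", "cudnn")):
    """Map each library name to what the dynamic loader finds, or None."""
    return {name: ctypes.util.find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in find_cuda_libraries().items():
        print(f"lib{name}: {found or 'NOT FOUND'}")
```

A `NOT FOUND` entry suggests that the corresponding library is not on the loader's search path in the current environment.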