Machine Learning Tools

Machine Learning on Theta

The ALCF is working to support scalable machine learning on Theta. Our focus has been on TensorFlow, which has broad support from Cray and Intel and delivers large performance gains on the Intel KNL architecture relative to other frameworks.

We support two installations of TensorFlow: one via the Conda environment and one via a custom Cray plugin environment. We also provide easy-to-use datascience modules for mpi4py, TensorFlow, Keras, PyTorch, and Horovod.

Generic Environment Settings

First, some generic environment settings to experiment with when training TensorFlow models on Theta.

Google's TensorFlow performance documentation describes these variables in more detail.

In your batch submit script use the following:

  • `export OMP_NUM_THREADS=62` This should be set to the number of physical cores, although our local Cray expert suggested using 62 on Theta.

  • `export KMP_BLOCKTIME=0` Sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping. A value of 0 puts threads to sleep immediately.

  • `export KMP_AFFINITY="granularity=fine,compact,1,0"` Enables the run-time library to bind threads to physical processing units.
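These variables can also be set from Python before TensorFlow is first imported (the OpenMP runtime reads them when it initializes); a minimal sketch using the values suggested above:

```python
import os

# Set the OpenMP/KMP tuning variables before TensorFlow is imported,
# since the OpenMP runtime reads them at initialization time.
os.environ["OMP_NUM_THREADS"] = "62"  # suggested value for Theta's KNL nodes
os.environ["KMP_BLOCKTIME"] = "0"     # sleep immediately after a parallel region
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # bind threads to cores
```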

In addition, Tensorflow has the following internal settings that should be used to optimize performance:

  • intra_op_parallelism_threads: Setting this equal to the number of physical cores is recommended. The default value of 0 results in the value being set to the number of logical cores, which may be worth trying on some architectures. This value and OMP_NUM_THREADS should be equal.

  • inter_op_parallelism_threads: Setting this equal to the number of sockets is recommended. Setting the value to 0, which is the default, results in the value being set to the number of logical cores.

This can be added to your model with code like the following:

import tensorflow as tf

# Apply the thread settings when creating the session.
config = tf.ConfigProto()
config.allow_soft_placement = True
config.intra_op_parallelism_threads = FLAGS.num_intra_threads
config.inter_op_parallelism_threads = FLAGS.num_inter_threads
sess = tf.Session(config=config)
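To make the guidance concrete, the two thread-count settings can be derived from the node layout; the helper below is purely illustrative (not part of TensorFlow):

```python
def recommended_thread_counts(physical_cores_per_node, sockets_per_node):
    """Thread counts following the guidance above:
    intra-op parallelism = physical cores, inter-op parallelism = sockets."""
    intra_op = physical_cores_per_node
    inter_op = sockets_per_node
    return intra_op, inter_op

# Theta's KNL nodes have 64 physical cores on a single socket.
intra, inter = recommended_thread_counts(64, 1)
```

Remember that the intra-op value should match OMP_NUM_THREADS in your batch script.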

TensorFlow via Conda Environment

We've installed a Conda environment that includes the latest TensorFlow wheel from Intel, along with Horovod, a tool that uses MPI to run TensorFlow in a distributed fashion. This enables training TensorFlow models on Theta at large scale. Horovod provides examples for running TensorFlow natively or via Keras. It can be run using:

#!/bin/bash
#COBALT -n <num-nodes>
#COBALT -t <wall-time>
#COBALT -q <queue>
#COBALT -A <project>

module load miniconda-3.6/conda-4.5.12

aprun -n <num-ranks> -N <mpi-ranks-per-node> python script.py
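When filling in the placeholders, `<num-ranks>` should equal the number of nodes times `<mpi-ranks-per-node>`; a small, purely illustrative helper that builds the launch line:

```python
def aprun_command(num_nodes, ranks_per_node, script="script.py"):
    """Build the aprun line used in the batch script above.
    The total rank count is ranks-per-node on each of the nodes."""
    num_ranks = num_nodes * ranks_per_node
    return f"aprun -n {num_ranks} -N {ranks_per_node} python {script}"
```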

TensorFlow via Cray ML Plugin

Cray has provided a custom ML Plugin for running Tensorflow on Theta that provides performance benefits when using smaller local mini-batch sizes. 

There are two example batch scripts, for Python 2.7 and Python 3.6, which show how to set up the local environment:

/lus/theta-fs0/projects/SDL_Workshop/mendygra/cpe_plugin_py2.batch
/lus/theta-fs0/projects/SDL_Workshop/mendygra/cpe_plugin_py3.batch

This is the environment setup for Python 2.7:

module load cray-python
export PYTHONUSERBASE=/lus/theta-fs0/projects/SDL_Workshop/mendygra/pylibs
module load /lus/theta-fs0/projects/SDL_Workshop/mendygra/tmp_inst/modulefiles/craype-ml-plugin-py2/1.1.0

and for Python 3.6:

module load cray-python/3.6.1.1
export PYTHONUSERBASE=/lus/theta-fs0/projects/SDL_Workshop/mendygra/pylibs
module load /lus/theta-fs0/projects/SDL_Workshop/mendygra/tmp_inst/modulefiles/craype-ml-plugin-py3/1.1.0

After setting up one of these environments, you can see an example of implementing the plugin in this script:

less $CRAYPE_ML_PLUGIN_BASEDIR/examples/tf_mnist/mnist.py

Data Science Modules

The ALCF Data Science group provides modules to simplify the use of common data science tools such as TensorFlow, PyTorch, Horovod, and mpi4py. Users can see a list of available datascience modules with `module avail datascience`. More information about each module can be found by executing `module show <MODULENAME>`.

datascience/mpi4py

This module loads the environment required to run the MPI for Python (mpi4py) package. The installed version of mpi4py is 3.0.1a0.

Note: This module loads the intelpython35 and gcc/7.3.0 modules.

datascience/tensorflow-X

This module loads the environment required to run TensorFlow on Theta. Available versions are 1.4, 1.6, 1.8, 1.10, and 1.12. We also provide 1.13.0rc0, but since it is a release candidate, we recommend using 1.12, the current stable release.

Note: This module loads the intelpython35 and gcc/7.3.0 modules. You will get a core dump if you try to use TensorFlow on a login node, because the TensorFlow library was compiled to use AVX-512F instructions, which are available only on the compute nodes.

 
datascience/horovod-X
 
This module loads the environment required to run Horovod on Theta. Horovod is a distributed deep learning framework for TensorFlow, Keras, PyTorch, and MXNet. Available versions are 0.13.11, 0.14.1, 0.15.0, and 0.15.2. 
 
Note: This module loads the intelpython35 and gcc/7.3.0 modules. However, it does not load TensorFlow, Keras, or PyTorch; you must load one of those modules to use it together with Horovod.
 
 
datascience/keras-X
 
This module loads the environment required to run Keras, a high-level Python API that runs on top of TensorFlow, CNTK, or Theano. Currently, only version 2.2.2 is available on Theta, and it automatically loads TensorFlow 1.10.
 
Note: This module loads the intelpython35, gcc/7.3.0, and datascience/tensorflow-1.10 modules.
 
 
datascience/pytorch-X
 
This module loads the environment required to run PyTorch, a deep learning platform with Python and C++ APIs. Available versions are 0.5 and 1.0.
 
Note: This module loads the intelpython35, gcc/7.3.0, and datascience/tensorflow-1.10 modules.