Keras (https://keras.io) is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation: it allows for easy and fast prototyping, supports both convolutional and recurrent networks (as well as combinations of the two), and runs seamlessly on CPU and GPU.
On Theta, we support the TensorFlow backend for Keras. To use the datascience Keras module on Theta, please load the following two modules:
module load datascience/keras-2.2.4
module load datascience/tensorflow-1.12
Notice that the datascience/tensorflow-* modules were compiled with the AVX-512 extension for Theta. Therefore, they cannot run on the login nodes; attempting to do so will raise an "illegal instruction" fault. One has to submit the job to the KNL compute nodes (see the TensorFlow documentation for details).
Since we use TensorFlow as the backend, all of the optimal environment settings (threading + affinity) apply here as well. Please follow the TensorFlow documentation page (https://www.alcf.anl.gov/user-guides/tensorflow) for the optimal settings.
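As an illustration, the threading settings described on that page can be applied from a Keras script by configuring the TensorFlow 1.x session that Keras uses as its backend. This is a sketch only; the thread counts below are placeholder assumptions, not recommended values, so tune them for your node configuration:

```python
# Sketch: configure TensorFlow 1.x threading for the Keras backend.
# The specific thread counts here are illustrative placeholders.
import tensorflow as tf
import keras.backend as K

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 64  # e.g. one thread per KNL core
config.inter_op_parallelism_threads = 2
K.set_session(tf.Session(config=config))
```

Combined with the OMP_NUM_THREADS and affinity settings from the TensorFlow page, this controls how the backend parallelizes work within and across operations.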
We have not seen any incompatibility issues when using versions of Keras and TensorFlow other than those specified above, so feel free to load other versions. Currently, we provide Keras versions 2.2.2 and 2.2.4.
Distributed learning using Horovod
We support distributed learning using Horovod. To use it, please load the datascience/horovod-0.15.2 module and change your Python script as follows.
1) Initialize Horovod by adding the following lines to the beginning of your Python script:
import horovod.keras as hvd
hvd.init()
After this initialization, the total number of ranks and the rank ID can be accessed through the hvd.size() and hvd.rank() functions, respectively.
2) Scale the learning rate.
Since we use multiple workers, the global batch size is usually increased n times (where n is the number of workers). The learning rate should be increased proportionally, as follows (here the base learning rate of Adadelta is 1.0):
opt = keras.optimizers.Adadelta(1.0 * hvd.size())
In some cases the scaled learning rate might be too large, and one might want to warm up over a few initial epochs with a smaller learning rate.
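The scaling-plus-warmup idea can be sketched as a plain function; the linear warmup schedule, base rate, and warmup length here are illustrative assumptions, not recommended values:

```python
def scaled_lr(base_lr, num_workers, epoch, warmup_epochs=5):
    """Linearly warm up from base_lr to base_lr * num_workers,
    then keep the fully scaled rate."""
    target = base_lr * num_workers
    if epoch >= warmup_epochs:
        return target
    # Linear interpolation between base_lr and target during warmup.
    return base_lr + (target - base_lr) * epoch / warmup_epochs

# With 8 workers and base rate 1.0, the rate reaches 8.0 after warmup.
print(scaled_lr(1.0, 8, epoch=10))  # -> 8.0
```

Horovod also ships a callback for this purpose (hvd.callbacks.LearningRateWarmupCallback), which may be preferable to a hand-rolled schedule.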
3) Wrap the optimizer with the Horovod distributed optimizer:
opt = hvd.DistributedOptimizer(opt)
The wrapped optimizer will automatically average the gradients among all the workers before performing the update.
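Conceptually, the distributed optimizer performs an allreduce that averages each gradient elementwise across workers. The function below is a pure-Python sketch of that averaging step for illustration only, not Horovod's actual implementation:

```python
def allreduce_average(per_worker_grads):
    """Average one gradient vector across workers, elementwise.

    per_worker_grads: a list with one entry per worker, each an
    equal-length list of gradient components.
    """
    n = len(per_worker_grads)
    return [sum(components) / n for components in zip(*per_worker_grads)]

# Three workers, each holding a 2-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_average(grads))  # -> [3.0, 4.0]
```

In Horovod the averaging is done with an efficient MPI/NCCL allreduce rather than by gathering all gradients on one worker.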
4) Broadcast the initial model state from rank 0, so that all workers start from the same point:
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
Notice that by default TensorFlow initializes the parameters randomly, so different workers would start with different parameter values. It is therefore crucial to broadcast the model from rank 0 to the other ranks.
5) Let only rank 0 write checkpoints, so that workers do not overwrite each other's files:
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
6) Load data according to rank ID.
Since we are using a data-parallel scheme, different ranks should process different data. One has to change the data-loading part of the Python script to ensure that different ranks read different mini-batches of data.
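One simple way to partition a dataset across ranks is to give each rank a strided shard. The helper below is an illustrative sketch, not part of the module; in a real script the rank and size arguments would come from hvd.rank() and hvd.size():

```python
def shard(samples, rank, size):
    """Return the strided subset of samples owned by this rank."""
    return samples[rank::size]

# Example: 10 samples split across 4 ranks.
data = list(range(10))
print(shard(data, 0, 4))  # rank 0 -> [0, 4, 8]
print(shard(data, 1, 4))  # rank 1 -> [1, 5, 9]
```

Every sample is read by exactly one rank, so the global batch covers the dataset without duplication.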
A simple example of linear regression using Keras + Horovod is provided in the following directory on Theta:
linreg_keras.py is the Python script, and qsub.sc is the Cobalt submission script.