PyTorch on Polaris
PyTorch is a popular, open-source deep learning framework developed and released by Facebook. The PyTorch home page has more information about PyTorch, which you can refer to. For troubleshooting on Polaris, please contact firstname.lastname@example.org.
Installation on Polaris
PyTorch is already installed on Polaris and is available in the `conda` module. To use it from a compute node, please do:

```bash
module load conda
conda activate
```
Then, you can load PyTorch in python as usual (below showing results from the `conda` module):

```python
>>> import torch
>>> torch.__version__
'1.12.0a0+git67ece03'
```
This installation of PyTorch was built from source, and the CUDA libraries it uses are found via the `CUDA_HOME` environment variable (below showing results from the `conda` module):

```bash
$ echo $CUDA_HOME
/soft/datascience/cuda/cuda_11.5.2_495.29.05_linux
```
If you need to build applications that use this version of PyTorch and CUDA, we recommend using these CUDA libraries to ensure compatibility. We periodically update the PyTorch release, though updates will come in the form of new versions of the `conda` module.
PyTorch is also available through NVIDIA containers that have been translated to Singularity containers. For more information about containers, please see the containers documentation page.
PyTorch Best Practices on Polaris
Single Node Performance
When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
Use Reduced Precision. Reduced precision is available on A100 via tensor cores and is supported by PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision (AMP) package, as described in the mixed precision documentation. In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators provide easy tools to do this.
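As a sketch, a typical AMP training step looks like the following (the toy model, optimizer, and data here are placeholders; the `enabled` flag simply lets the snippet also run on CPU, where AMP is a no-op):

```python
import torch

# Toy model and optimizer; substitute your own.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# GradScaler manages loss scaling; autocast manages casting.
# Both are no-ops when enabled=False, so this also runs on CPU.
use_amp = device == "cuda"
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
# The autocast context casts eligible ops to reduced precision,
# engaging the A100 tensor cores.
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()  # scale the loss to avoid gradient underflow
scaler.step(optimizer)
scaler.update()
```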
PyTorch has a `JIT` module as well as backends to support op fusion, similar to TensorFlow's `tf.function` tools. However, PyTorch's JIT capabilities are newer and may not yield performance improvements. Please see TorchScript for more information.
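For illustration, a function of elementwise ops can be scripted with TorchScript, which makes it a candidate for op fusion (the `fused_gelu` function here is a hypothetical example, not part of any API; measure before assuming a speedup):

```python
import torch

# Scripting compiles this function; chains of elementwise ops like the
# ones below are the main candidates for fusion.
@torch.jit.script
def fused_gelu(x: torch.Tensor) -> torch.Tensor:
    # tanh approximation of GELU, written as plain elementwise math
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

x = torch.randn(4, 4)
y = fused_gelu(x)
```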
Multi-GPU / Multi-Node Scale up
PyTorch can scale up to multiple GPUs per node, and across multiple nodes. Good scaling performance has been seen up to the entire Polaris system, > 2048 GPUs. Good performance with PyTorch has been seen with both DDP and Horovod. For details, please see the Horovod documentation or the Distributed Data Parallel documentation. Some Polaris-specific details that may be helpful to you:
- CPU affinity and NCCL settings can improve scaling performance, particularly at the largest scales. In particular, we encourage users to try their scaling measurements with the following settings:
    - Set the environment variable
    - Set the environment variable
    - Manually set the CPU affinity via `mpiexec`, such as with
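For reference, a minimal DDP setup looks like the sketch below. The `gloo` backend and the fallback environment-variable defaults here are assumptions so the snippet runs as a single process; on Polaris GPUs you would typically use the `nccl` backend and the rank variables your launcher provides:

```python
import os
import torch
import torch.distributed as dist

# Rank information normally comes from the launcher; default to a
# single-process run for illustration.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

# On Polaris GPUs you would use backend="nccl" instead of "gloo".
dist.init_process_group("gloo", rank=rank, world_size=world_size)

# Wrap the model; DDP synchronizes gradients across ranks in backward().
model = torch.nn.parallel.DistributedDataParallel(torch.nn.Linear(8, 2))

dist.destroy_process_group()
```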
- Horovod and DDP work best when you limit the visible devices to only one GPU. Note that if you import `horovod` and then do something like `os.environ["CUDA_VISIBLE_DEVICES"] = hvd.local_rank()`, it may not actually work! You must set the `CUDA_VISIBLE_DEVICES` environment variable prior to doing `MPI.COMM_WORLD.init()`, which is done in `horovod.init()` as well as implicitly in `from mpi4py import MPI`. On Polaris specifically, you can use the environment variable `PMI_LOCAL_RANK` (as well as `PMI_LOCAL_SIZE`) to learn information about the node-local MPI ranks.
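Putting that ordering requirement together, a safe pattern is to pin the GPU from `PMI_LOCAL_RANK` at the very top of your script, before anything MPI-aware is imported (the default of `0` is just so the snippet also runs outside a job):

```python
import os

# Pin this rank to one GPU *before* any MPI initialization, i.e. before
# `import horovod.torch` or `from mpi4py import MPI`.
# PMI_LOCAL_RANK is set by the Polaris job launcher.
local_rank = int(os.environ.get("PMI_LOCAL_RANK", "0"))
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

# Only after this point is it safe to bring in MPI-aware libraries:
# import horovod.torch as hvd
# hvd.init()
```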
DeepSpeed is also available and usable on Polaris. For more information, please see the DeepSpeed documentation directly.
Please note there is a bug that causes a hang when using PyTorch data loaders together with distributed training (Horovod, DDP, etc.). To work around this, NVIDIA recommends setting `num_workers=0` in the dataloader configuration.
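Concretely, the workaround is just a constructor argument (the toy dataset below is a placeholder for your own):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; in a real job this would be your training data.
dataset = TensorDataset(torch.arange(100).float())

# num_workers=0 keeps data loading in the main process, avoiding the
# reported hang when combining data loaders with distributed training.
loader = DataLoader(dataset, batch_size=10, num_workers=0)

for (batch,) in loader:
    pass  # training step would go here
```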