The ALCF's Simulation, Data, and Learning Workshop is designed to help researchers improve the performance and productivity of simulation, data science, and machine learning applications on ALCF systems. Workshop participants will have the opportunity to:
- Work directly with ALCF staff experts during dedicated hands-on sessions
- Learn how to use available tools and frameworks to improve productivity
- Test and debug codes with exclusive system reservations on ALCF computing resources
- Get assistance with Director's Discretionary projects to help prepare for a major allocation award
- Improve the performance of existing ALCF projects
- Plan ahead for 2021-2022 allocation proposal submissions
Registration deadlines:
- U.S. Citizens: November 23, 2020
- Foreign Nationals: November 13, 2020
Note: Registrants will be reviewed for experience level and will be asked to provide goals for attending.
12/8 Day One (10AM-3PM Central Time) will be a hands-on tutorial introducing distributed data-parallel training on ALCF systems. Experts will be on hand as you run through examples from our Git repo. These examples will teach you how to run deep learning training on multi-GPU nodes and across multiple nodes of ThetaGPU, or on a multi-CPU system like Theta. We will also discuss how to build data pipelines that keep your workflows running efficiently.
- Introductory on-boarding for ALCF systems
- Data-parallel training with TensorFlow and Horovod
- Hands-on session
- Building data pipelines that use accelerators effectively
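For readers new to the topic, the idea behind synchronous data-parallel training can be sketched without any framework. The example below is only an illustration and is not taken from the workshop's Git repo: it uses neither TensorFlow nor Horovod, and the function names (`gradient`, `allreduce_mean`, `train`) and the toy model y = w * x are assumptions made for this sketch. Each simulated worker computes a gradient on its own shard of the batch, and the gradients are averaged before every update, which is the role an allreduce plays in a real multi-GPU or multi-node run.

```python
# Framework-free sketch of synchronous data-parallel SGD (illustrative only).
# All names here are hypothetical; no TensorFlow or Horovod calls are used.

def gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an allreduce: every worker receives the mean of all values."""
    return sum(values) / len(values)


def train(data, num_workers=4, lr=0.01, steps=200):
    # Split the batch into equal, non-overlapping shards, one per worker.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0  # every worker starts from the same weight
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        local_grads = [gradient(w, shard) for shard in shards]
        # After averaging, every worker applies the identical update.
        w -= lr * allreduce_mean(local_grads)
    return w


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples from y = 3x
w_parallel = train(data, num_workers=4)
w_serial = train(data, num_workers=1)
```

Because the shards are the same size, the averaged gradient matches the full-batch gradient, so `w_parallel` and `w_serial` agree up to floating-point rounding. In a real run the shards live on different GPUs or nodes and the averaging is a collective communication step.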
12/9 Day Two (10AM-3PM Central Time) will focus on DeepHyper, a tool for distributed hyperparameter optimization. As on Day One, this will be a tutorial built on examples from our Git repos, with experts walking you through the steps. We will also cover how to identify performance issues using common profilers such as VTune, TAU, and the built-in TensorFlow Profiler.
- Running distributed hyperparameter optimization with DeepHyper
- Profiling deep learning frameworks to optimize your workflow
- Using TensorFlow Profiler
- Profiling with TAU and VTune
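As a rough picture of what distributed hyperparameter optimization involves, the sketch below evaluates many candidate configurations concurrently and keeps the best one. It is a simplified stand-in rather than DeepHyper: it uses plain random search from the Python standard library and does not show DeepHyper's API or search algorithms. The `objective` function, the search space, and all names are hypothetical; in practice each evaluation would be a full training run on its own node or GPU.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def objective(config):
    """Hypothetical validation score; it peaks at lr=0.1 and batch_size=64."""
    lr_term = (config["lr"] - 0.1) ** 2
    batch_term = ((config["batch_size"] - 64) / 64) ** 2
    return -lr_term - batch_term


def sample_config(rng):
    """Draw one configuration from the (assumed) search space."""
    return {
        "lr": 10 ** rng.uniform(-4, 0),  # log-uniform learning rate
        "batch_size": rng.choice([16, 32, 64, 128, 256]),
    }


def random_search(num_trials=50, max_workers=4, seed=0):
    rng = random.Random(seed)  # seeded so the search is reproducible
    configs = [sample_config(rng) for _ in range(num_trials)]
    # Evaluate configurations concurrently; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(objective, configs))
    best = max(range(num_trials), key=scores.__getitem__)
    return configs[best], scores[best]


best_config, best_score = random_search()
```

Random search treats every trial independently, which makes it easy to parallelize. Model-based tools aim to do better by using the results of finished trials to choose which configurations to try next.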
12/10 Day Three (10AM-3PM Central Time) will cover what to do once you have a performant, trained deep network: how to deploy it at scale in a simulation. Integrating model inference into distributed simulations will be covered using tutorials from our Git repo.
- Integrating inference into distributed simulation
- Demonstration using example simulations