Researchers scale code for INCITE at ALCF’s Mira Boot Camp

2015 Mira Performance Boot Camp

From May 19-21, the annual Mira Performance Boot Camp once again drew new and experienced supercomputer users from around the globe to the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science User Facility. A cornerstone of the ALCF’s user outreach program, the three-day onsite Boot Camp, now in its seventh year, provides a timely opportunity for the community to tap into the expertise of assembled ALCF staff and invited guests for assistance ramping up their code’s scalability as they prepare to submit a proposal for an INCITE award.

The INCITE program is the mechanism through which the majority of available compute hours at the DOE leadership computing facilities are allocated to projects aimed at accelerating discoveries in science and engineering. The deadline for 2016 INCITE proposals was June 26, 2015. To be eligible for an INCITE award, projects must undergo a rigorous review for scientific merit and must demonstrate the ability to utilize the massive compute resources of leadership-class systems, like those available at the ALCF. The ALCF’s Boot Camp helps researchers demonstrate that computational readiness.

The bulk of this year’s event was devoted to hands-on, one-on-one tuning of applications. In addition, ALCF experts spoke on topics of interest, including Blue Gene/Q architecture, ensemble jobs, parallel I/O, and data analysis. Guest speakers from tool and debugger vendors provided information and individualized assistance to attendees.

Reservation queues created specifically for the event gave attendees quick, uninterrupted access to ALCF resources, allowing them to run 835 jobs and use over 18.8 million core-hours as they diagnosed code issues and tuned performance. This year, with expert assistance and newly acquired knowledge, several groups were able to complete full-machine runs on Mira (786,432 cores) and generate plots to incorporate into their INCITE proposals.

View the event agenda and links to the presentation slides.

Highlighted Accomplishments

  • Researchers from Missouri University of Science and Technology received assistance from ALCF performance engineers to improve their code and I/O performance in preparation for an INCITE proposal submission. By a careful choice of compiler options, the group obtained a 9X speed-up of their code over their baseline performance. This team’s work is part of a larger effort that will ultimately aid in the improved design of supersonic aircraft.
  • Using performance-profiling tools highlighted at this year’s Boot Camp, Argonne and vendor experts successfully eliminated a major bottleneck in the application used by a University of California-based INCITE team researching the evolution and present dynamical states of galaxies, stars and other celestial bodies. By removing the bottleneck in code initialization and improving the use of MPI, their application, which previously scaled inefficiently to only 131,072 cores, was able to scale cleanly to 262,144 cores of Mira.
  • The Virtual Engine Research Institute and Fuels Initiative (VERIFI) is a multidisciplinary team of Argonne scientists and engineers utilizing the breadth of the laboratory’s state-of-the-art resources (including the ALCF’s leadership-class supercomputers) to aid industry in next-generation engine design. At Boot Camp, the VERIFI team compiled the latest version of the CFD software Converge (2.3), and identified and addressed a bug related to the writing of the restart file. In addition, ALCF staff later resolved a hanging issue identified during the workshop. Altogether, the improvements allowed the code to use twice as many cores on Mira as before, jumping from 4,096-core runs to 8,192-core runs.
  • Researchers at the University of Köln and TU Bergakademie Freiberg at work on EXASTEEL, a project aimed at creating tools for simulating high-strength steels on exascale systems, had previously scaled their code to full-machine runs on the German JUQUEEN Blue Gene/Q machine. While their code was, in principle, scalable to the full size of Mira, it initially crashed in large-scale run attempts at Boot Camp. Using bgq_stack and DDT, Argonne staff and invited experts helped the team identify the error in the source code that prevented it from scaling beyond 262,144 cores. The team was then able to fix the code and scale it to the full Mira machine (786,432 cores).

View our Event Schedule for upcoming user training opportunities at the ALCF, or sign up for our Newsbytes, the ALCF’s monthly newsletter.