Aurora ESP Call for Proposals

The Call for Proposals for the Aurora Early Science Program is now closed.

In late 2018, the Argonne Leadership Computing Facility (ALCF) will deploy Aurora, a new Intel-Cray system based on the third-generation Intel® Xeon Phi™ processor (code name: Knights Hill (KNH)). Aurora will be capable of 180 petaflops. It will have over 50,000 nodes and over 7 petabytes of total memory, consisting of high-bandwidth on-package memory, local memory, and persistent memory.

For the Aurora Early Science Program (ESP), we will select 10 projects, all chosen competitively based on a call for proposals. Theta ESP teams will, of course, be excellent candidates to submit Aurora ESP proposals, but the call is open to the general community. Because Aurora will be a much bigger and faster machine than Theta or Mira, the three months of pre-production Early Science time will be a large and valuable allocation of core-hours, with the potential for unprecedented computational science. ALCF will fully fund 10 postdoctoral appointees for Aurora ESP, one for each selected project.

Below is a rough timeline showing both Early Science Programs (Theta and Aurora). The row labeled “Aurora ESP projects” denotes the central effort of the projects, namely developing, porting, and tuning code for the target systems:

[Figure: Aurora timeline]

We believe that the transition from Theta, the system based on the second-generation Intel® Xeon Phi™ processor (code name: Knights Landing (KNL)), to Aurora should be relatively straightforward. Theta and Aurora share many code development aspects: both support large numbers of threads; both KNL and KNH cores support four hardware threads per core; both have multiple memory levels with very different bandwidths and capacities; and both have Intel x86-64 architecture compatibility. Here are some of the publicly available details about Aurora:

  • Peak System Performance: 180 - 450 petaflops
  • Processor: Third-generation Intel® Xeon Phi™ processors (code name: Knights Hill (KNH))
  • Number of compute nodes: > 50,000
  • Compute Platform: Cray Shasta next-generation supercomputing platform
  • Total Memory: over 7 petabytes in total, combining high-bandwidth on-package memory, local memory, and persistent memory
  • System Interconnect: Second-generation Intel® Omni-Path Architecture with silicon photonics
  • Interconnect Interface: Integrated
  • Burst Buffer Storage: Intel® SSDs, second-generation Intel® Omni-Path Architecture
  • File System: Intel Lustre® File System
  • File System Capacity: > 150 petabytes
  • File System Throughput: > 1 terabyte/s
  • Peak Power Consumption: > 13 megawatts
  • FLOPS/watt: > 13 GFLOPS/watt
  • Delivery Timeline: 2018
  • Facility Area: ~3000 sq. ft.
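As a rough consistency check on the figures above, the sketch below derives a few per-watt and per-node ratios. It assumes nominal values taken from the low end of each published figure (180 petaflops, 13 megawatts, 7 petabytes, 50,000 nodes) and decimal units; because most of the published numbers are lower bounds, the results are only indicative and are not official specifications.

```python
# Nominal values from the list above (assumptions: low end of each
# published figure, decimal units). Results are indicative only.
PEAK_FLOPS = 180e15          # 180 petaflops
PEAK_POWER_WATTS = 13e6      # 13 megawatts
TOTAL_MEMORY_BYTES = 7e15    # 7 petabytes
NODES = 50_000               # 50,000 compute nodes

gflops_per_watt = PEAK_FLOPS / PEAK_POWER_WATTS / 1e9
memory_per_node_gb = TOTAL_MEMORY_BYTES / NODES / 1e9
tflops_per_node = PEAK_FLOPS / NODES / 1e12

print(f"{gflops_per_watt:.1f} GFLOPS/watt")        # 13.8
print(f"{memory_per_node_gb:.0f} GB per node")     # 140
print(f"{tflops_per_node:.1f} TFLOPS per node")    # 3.6
```

Under these assumptions, the nominal efficiency of about 13.8 GFLOPS/watt agrees with the "> 13 GFLOPS/watt" figure in the list.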

We believe the ESP projects’ allocations of Aurora core-hours will support innovative science that is not possible on today’s leadership-class systems, such as Mira, or on the intermediate compute platform, Theta. The ESP allocations will be roughly 500 to 700 million core-hours.

Support for Aurora ESP Projects

Postdocs

ALCF will directly support efforts on the ESP projects by hiring postdocs to work with some of the project teams. We will fund up to 10 dedicated postdocs, who will be employed by ALCF but will work directly with project investigators on efforts needed to prepare for Early Science runs. Some postdocs may split their time between the ALCF and project-investigator locations. Generally, a postdoc will work on only one Aurora ESP project.

ALCF Staff

We will assign one ALCF staff member, most likely from our Catalyst team, to each ESP project. This person will spend a fraction of their time collaborating with the ESP project and mentoring the postdoc on computational aspects of the project.

Collaboration with Intel & Cray

Another type of support for the ESP projects will be access to applications experts from the Aurora vendors—Intel and Cray. These expert consultants will be made available through various avenues to assist with ESP code porting and tuning.

Training and Hardware Access

ALCF and Intel will offer training targeted toward the ESP projects. This will include a virtual kick-off workshop about the Aurora hardware and programming the system, and a hands-on workshop as soon as we have sufficient Theta-generation hardware to support testing and debugging of project applications. Before then, we will also offer access to advanced Aurora simulator software, and allocations on our production systems, Mira and Theta, for ESP development work that does not depend strongly on having the new hardware (e.g., new algorithms, new physics modules, basic introduction of threads). Where appropriate, we will offer joint training opportunities with OLCF, in support of as much portability as possible among our future systems and OLCF’s IBM/NVIDIA-based Summit system. Finally, ALCF will provide access to small KNH systems via Argonne’s Joint Laboratory for System Evaluation (JLSE).

Aurora’s Early Science Period

The Early Science period on Aurora is a span of about three months, beginning after system acceptance, but before turnover to production users. During this time, projects selected for the Aurora ESP will have dedicated access to the full system, with large allocations of time in support of their major scientific run campaigns. Access will continue for the remainder of a year, but will be shared with the production users.

Aurora ESP Proposal Process

Proposals for the Aurora ESP must include a plan for the science to be accomplished and a description of the application development that will occur over the duration of the project. In addition, each selected project’s home institution must pursue an appropriate non-disclosure agreement (NDA) with Intel/Cray for access to needed information on the next-generation architecture (and with IBM/NVIDIA to take advantage of portable-application-development training including Summit’s architecture). ALCF will provide instructions for obtaining the NDAs to the selected project teams.

The forms and instructions should include everything needed to submit a proposal. These are, roughly, a simplified version of an INCITE proposal. Please direct any questions to earlyscience@alcf.anl.gov. As part of a DOE user facility collaboration on application readiness, the proposal form will ask for disclosure of participation in NERSC’s NESAP program and OLCF’s CAAR program. (These are the equivalents of ESP at the other centers.)

Evaluation of Proposals

ALCF, with the help of internal and external science-domain experts, will evaluate proposals on the strength of:

  • Potential impact of proposed Early Science run campaign
  • Proposed runs that are beyond what can be done with today’s machines such as Mira
  • Computational readiness
  • Scaling to hundreds of thousands of cores (8 racks on Blue Gene/Q, for example)
  • An existing or reasonably well-planned implementation to make use of thread concurrency on Aurora. ALCF expects to see strong-scaling up to at least 8 threads per MPI rank, with greater than 75% efficiency.
  • A plan for portability to other architectures, especially the distinct architecture of OLCF’s future Summit system.
  • Appropriateness of development team: are the expertise and person-hours proposed likely to succeed, yielding a code ready to run the proposed science on Aurora starting in the Early Science period?
  • Diversity of science domains and algorithms; we want the ESP projects to be a reasonable sample set of the science and algorithms in production at the LCF site
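The thread-scaling criterion above (strong scaling to at least 8 threads per MPI rank with greater than 75% efficiency) can be checked with a short calculation. The sketch below is only an illustration: it assumes the usual definition of strong-scaling efficiency (speedup divided by the increase in thread count, at fixed problem size), and the function name and timings are hypothetical, not ALCF-provided values or tools.

```python
def strong_scaling_efficiency(timings):
    """Return parallel efficiency per thread count.

    timings maps threads-per-MPI-rank to wall-clock seconds for the
    same fixed problem size. Efficiency is measured relative to the
    smallest thread count present in the mapping.
    """
    base_threads = min(timings)
    base_time = timings[base_threads]
    return {
        n: (base_time * base_threads) / (t * n)
        for n, t in sorted(timings.items())
    }

# Hypothetical measurements (seconds), for illustration only.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

eff = strong_scaling_efficiency(timings)
meets_threshold = eff[8] > 0.75

for n, e in eff.items():
    print(f"{n} threads/rank: {e:.1%} efficiency")
print("Meets the 75% criterion at 8 threads:", meets_threshold)
```

With these made-up timings, the run at 8 threads per rank is 6.25 times faster than the single-thread run, giving about 78% efficiency, which would satisfy the stated threshold.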

Aurora ESP Proposal Timeline (2016)

  • July 6, 2016: Call for Proposals issued
  • September 2, 2016: Call for Proposals closes
  • End of 2016: Aurora ESP awards announced
  • January 2017: Kick-Off Workshop for ESP Teams (virtual/webinar)

Outcomes and Expectations of ESP Projects

We expect ESP projects to share best practices and lessons learned (the fruits of their labors developing, porting, and optimizing their applications for Aurora) in technical reports and in presentations to the community. ALCF will organize a community workshop around the end of the Early Science period. We also expect projects to share and publish their scientific results in appropriate venues, acknowledging the ALCF and its Early Science Program.

We expect project participants to partner with ALCF and Intel/Cray to assess the robustness and correctness of the new hardware and software, and to help identify the root causes of problems and find potential workarounds.