The Theta ESP Program ended in July 2017.
In late 2016, the Argonne Leadership Computing Facility (ALCF) will deploy Theta, a next-generation Cray/Intel system based on the 2nd Generation Intel® Xeon Phi™ processor (code name: Knights Landing (KNL)). The system will have over 2500 nodes, each with a KNL 60-or-more-core processor having up to 16 GB of high-bandwidth in-package memory (IPM) and 192 GB of DDR4 RAM. The aggregate peak compute speed will be over 8.5 PFLOPS. It will have an initial 10 PB Lustre parallel file system. Theta will help bridge the gap between Mira and our ALCF-3 machine Aurora, a 180 PFLOPS machine we will deploy in late 2018. Aurora will have 3rd Generation Intel® Xeon Phi™ processors (code name: Knights Hill (KNH)). It will have over 50,000 nodes and over 7 PB total combination of high bandwidth on-package memory, local memory, and persistent memory. Theta will also enable exploration of data-intensive applications.
We believe Theta will be a highly useful stepping-stone to Aurora. Many of the code development aspects for Aurora are shared by Theta: Both support large numbers of threads; KNL cores support 4 hardware threads. Both have multiple memory levels with very different bandwidths and capacities. Both have Intel x86-64 architecture compatibility. While Theta’s peak speed is only about the same as our current BG/Q production system Mira, we expect some scientific codes to achieve much higher performance on Theta than on Mira, because of:
- High bandwidth of the IPM (over 400 GB/s), with many applications running entirely in IPM or using it effectively with the DRAM
- Better single thread performance
- Potentially much better vectorization with AVX 512
- Features beneficial to data-intensive operations, such as large total memory per node (up to 208 GB vs. 16 GB on Mira)
We believe the Early Science projects’ allocations of Theta core-hours will support innovative science not possible on today’s leadership-class systems such as Mira. The Early Science allocations will be in the range of 50 to 100 million core-hours.
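To put the allocation range in perspective, the sketch below converts core-hours into equivalent days of full-machine time. It assumes exactly 2500 nodes and 60 cores per node, which are the lower bounds quoted above ("over 2500 nodes", "60-or-more-core processor") and not final system specifications. The function name `full_machine_days` is hypothetical and used only for illustration.

```python
# Assumed lower bounds taken from the system description above;
# the delivered system may have more nodes and more cores per node.
NODES = 2500
CORES_PER_NODE = 60


def full_machine_days(core_hours, nodes=NODES, cores_per_node=CORES_PER_NODE):
    """Convert a core-hour allocation into days of whole-machine usage."""
    node_hours = core_hours / cores_per_node
    machine_hours = node_hours / nodes
    return machine_hours / 24


for allocation in (50e6, 100e6):
    days = full_machine_days(allocation)
    print(f"{allocation / 1e6:.0f}M core-hours ~ {days:.1f} full-machine days")
```

Under these assumptions, the range corresponds to roughly two to four weeks of dedicated full-system time. Because both figures are lower bounds, the equivalent time on the delivered system would be somewhat shorter.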
Support for Theta ESP Projects
ALCF will directly support efforts on the ESP projects by hiring postdocs to work with some of the project teams. We will fund up to 4 dedicated postdocs, who will be employed by ALCF, but work directly with project investigators on efforts needed to prepare for Early Science runs. Some of these may split time between the ALCF and project-investigator locations. Generally, a postdoc will work on only one Theta ESP project.
We will assign one ALCF staff member, most likely from our Catalyst team, to each Early Science project. This person will spend a fraction of his/her time collaborating with the ESP project and mentoring the postdoc on computational aspects of the project.
Center of Excellence
The ALCF plans a center of excellence in partnership with the Theta vendors. This center, when established, will support vendor staff dedicated to applications readiness for Theta and Aurora. These staff will be available as expert consultants on ESP code porting and tuning.
Training and Hardware Access
ALCF and Intel/Cray will offer training targeted toward the ESP projects. This will include a virtual kick-off workshop about the Theta hardware and programming it, and a hands-on workshop as soon as we have sufficient Theta-generation hardware to support testing and debugging of project applications. Before then, we will also offer access to advanced Theta simulator software, and allocations on our production system Mira for ESP development work that does not depend strongly on having the new hardware (e.g. new algorithms, new physics modules, basic introduction of threads).
Where appropriate, we will offer joint training opportunities with NERSC, whose NERSC-8 system Cori is based on KNL nodes with similar configuration; and with OLCF, in support of as much portability as possible among our future systems and OLCF’s IBM/NVIDIA-based Summit system. Since Cori will be available before Theta, we will offer modest discretionary access to Cori for Theta ESP projects—especially in support of scaling (Cori will be a much bigger system than Theta).
Finally, ALCF will provide access to small KNL systems via Argonne’s Joint Laboratory for System Evaluation (JLSE).
Theta’s Early Science Period
The Early Science period on Theta is a span of about 3 months, beginning after system acceptance, but before turnover to production users. During this time, projects selected for the Theta Early Science Program (ESP) will have dedicated access to the full system, with large allocations of time in support of the major scientific run campaigns the projects proposed. Access will continue for the remainder of a year, but will be shared with the production users.
The Theta ESP Proposal Process
Because of the short timeframe for deployment of Theta, and its modest capability increase relative to our present-day system Mira, ALCF will limit the scope of Theta ESP to about 6 projects, of which we are pre-selecting 2-4 projects for which we have in-house expertise and strategic interest. We will select the remaining projects competitively, based on this call for proposals.
Proposals for the Theta ESP must have a plan for the science to be accomplished, and a description of what application development will occur throughout the duration of the project. In addition, each selected project’s home institution must pursue an appropriate Non-disclosure Agreement (NDA) with Intel/Cray for access to needed information on the next generation architecture (and with IBM/NVIDIA to take advantage of portable-application-development training including Summit’s architecture). ALCF will provide instructions for obtaining the NDAs to the selected project teams.
The forms and instructions should include everything needed to submit a proposal. These are, roughly, a simplified version of an INCITE proposal. Please direct any questions to firstname.lastname@example.org. As part of a DOE user facility collaboration on application readiness, the proposal form will ask for disclosure of participation in NERSC’s NESAP program and OLCF’s CAAR program. (These are the equivalents of ESP at the other centers.)
Evaluation of Proposals
ALCF, with the help of internal and external science-domain experts, will evaluate proposals on the strength of
- Potential impact of proposed Early Science run campaign
- Proposed runs that are beyond what can be done with today’s machines such as Mira
- Computational readiness:
- Scaling to tens of thousands of cores (8 racks on Blue Gene/Q, for example)
- An existing or reasonably well-planned implementation to make use of thread concurrency on Theta. ALCF expects to see strong-scaling up to at least 8 threads per MPI rank, with greater than 75% efficiency.
- A plan for portability to other architectures, especially the distinct architecture of OLCF’s future Summit system.
- Appropriateness of development team: are the proposed expertise and person-hours likely to yield a code ready to run the proposed science on Theta starting in the Early Science period?
- Diversity of science domains and algorithms; we want the ESP projects to be a reasonable sample set of the science and algorithms in production at the LCF sites
- Prospects as an Aurora application, and intent to use Aurora
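As an illustration of the thread strong-scaling criterion under computational readiness, the sketch below computes parallel efficiency for a fixed problem size as the measured speedup divided by the ideal speedup. The timing values are hypothetical placeholders, not measurements from any real code, and the function name `strong_scaling_efficiency` is chosen only for this example.

```python
def strong_scaling_efficiency(t_base, threads_base, t_scaled, threads_scaled):
    """Parallel efficiency of a fixed-size problem relative to a baseline run.

    Efficiency = (t_base / t_scaled) / (threads_scaled / threads_base).
    A value of 1.0 means perfect strong scaling.
    """
    speedup = t_base / t_scaled
    ideal_speedup = threads_scaled / threads_base
    return speedup / ideal_speedup


# Hypothetical wall-clock times in seconds for one MPI rank,
# same problem size, increasing thread count.
timings = {1: 120.0, 2: 62.0, 4: 33.0, 8: 18.5}

for threads, seconds in sorted(timings.items()):
    eff = strong_scaling_efficiency(timings[1], 1, seconds, threads)
    print(f"{threads} threads per rank: efficiency {eff:.1%}")

# Check the 8-thread run against the 75% threshold named in the criteria.
meets_target = strong_scaling_efficiency(timings[1], 1, timings[8], 8) > 0.75
print("Meets 75% target at 8 threads:", meets_target)
```

A proposal could report a table of this kind for each thread count measured, using timings from a representative problem on current hardware.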
Theta ESP Proposal Timeline (2015)
- April 22 Call for Proposals issued
- May 22 Call for Proposals closes
- Mid June Theta ESP Awards Announced
- July Kick-Off Workshop for ESP Teams (virtual/webinar)
Outcomes and Expectations of ESP Projects
We expect ESP projects to share best practices and lessons learned—the fruits of your labors developing, porting, and optimizing your applications for Theta—in technical reports and in presentations to the community. ALCF will organize a community workshop around the end of the Early Science period. We also expect projects to share and publish their scientific results in appropriate venues, acknowledging the ALCF and its Early Science Program.
We expect project participants to partner with ALCF and Intel/Cray to assess the robustness and correctness of the new hardware and software, and to help identify the root cause of problems and find potential workarounds.
The Future: Aurora ESP
There will be a second, larger Early Science Program for our next-generation production system, Aurora. For this program, we will select 10 projects, all chosen competitively based on a call for proposals. Theta project teams will, of course, be excellent candidates to submit Aurora ESP proposals, but the call will be open to the general community. Aurora being a much bigger and faster machine than Theta or Mira, the three months of pre-production Early Science time will be a large and valuable chunk of core hours, with the potential for truly unprecedented computational science. ALCF will fully fund 10 postdoctoral appointees for Aurora ESP—one for each selected project. Below is a rough timeline showing both Early Science Programs. The rows labeled “Aurora/Theta ESP projects” denote the central effort of the projects: developing, porting and tuning code for the target systems: