DOE Research Group Makes Case for Exascale


Exascale computing promises incredible science breakthroughs, but it won't come easily, and it won't come free. That's the premise of a feature story from the DOE's Office of Advanced Scientific Computing Research, whose mission is "to discover, develop, and deploy the computational and networking tools that enable researchers in the scientific disciplines to analyze, model, simulate, and predict complex phenomena important to the Department of Energy."

The article makes the case for exascale computing, citing some of the scientific breakthroughs such a leap would enable: precise long-range weather forecasting, innovative alternative fuels, and advances in disease research. The ability to represent many more variables will lead to more realistic models. For example, future researchers will be able to create a global climate model with a level of resolution that is now possible only for regional studies.

Three main obstacles stand in the way of tomorrow's exascale behemoths, and according to Rick Stevens, Argonne National Laboratory's associate director for computing, environment and life sciences and a University of Chicago computer science professor, all of them are potential showstoppers.

The current exascale model predicts a machine with a billion cores, so the first challenge is creating software that can take advantage of all of them. This is parallelism in the extreme: applications have been developed that can achieve 250,000-way parallelism, but exaflop-class machines will be called upon to exhibit 1-billion-way parallelism.
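To put that jump in perspective, here is a rough back-of-the-envelope sketch in Python. The 250,000-way and 1-billion-way figures come from the article; the node and thread counts used for the illustrative machine layout are assumptions, not details from the story.

```python
# Back-of-the-envelope look at the concurrency jump described above.
# The 250,000 and 1-billion figures are from the article; the node/thread
# breakdown further down is purely illustrative.

current_parallelism = 250_000          # ~250,000-way parallelism in today's applications
exascale_parallelism = 1_000_000_000   # ~1-billion-way parallelism expected at exascale

increase = exascale_parallelism / current_parallelism
print(f"Required increase in concurrency: {increase:,.0f}x")  # ~4,000x

# One hypothetical layout for such a machine (numbers assumed):
nodes = 100_000
threads_per_node = 10_000
print(f"Illustrative layout: {nodes:,} nodes x {threads_per_node:,} threads each = "
      f"{nodes * threads_per_node:,} concurrent threads")
```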

Another daunting concern is power. Stevens says that a billion-processor computer built with today's technology would consume more than a gigawatt of electricity. According to the DOE's Energy Information Administration, the top US utility plants generate only a few gigawatts, with most producing less than four. That means a single exascale machine could require its own power plant. GPU computing is being explored as one way to curb energy demands.
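The arithmetic behind that concern is simple enough to sketch. Using only the figures quoted here (a billion processors, roughly a gigawatt of draw, large plants producing less than four gigawatts), the snippet below works out the implied per-processor power budget and the share of a big plant's output one machine would claim; it is an illustration, not an engineering estimate.

```python
# Rough arithmetic behind the power concern described above. The gigawatt
# and plant-capacity figures are from the article; the rest is division.

cores = 1_000_000_000           # a billion-processor machine, per Stevens
machine_power_watts = 1e9       # "more than a gigawatt" with today's technology

watts_per_core = machine_power_watts / cores
print(f"Implied average power budget: {watts_per_core:.1f} W per processor")

plant_capacity_watts = 4e9      # most large US plants produce less than 4 GW
share = machine_power_watts / plant_capacity_watts
print(f"Machine draw as a share of a 4 GW plant: {share:.0%}")
```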

The enormous increase in the number of processing cores leads to the third major challenge: reliability. Whatever reliability issues exist in a modern system will be magnified a thousandfold, such that, according to Stevens, "If you just scale up from today's technology, an exascale computer wouldn't stay up for more than a few minutes at a time." Practically speaking, a machine's mean time between failures must be about a week or more. To illustrate, Lawrence Livermore National Laboratory's IBM BlueGene/L fails about once every two weeks.
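The scaling argument can be made concrete with the standard approximation that, if node failures are independent, a system's mean time between failures shrinks in proportion to the number of nodes. The per-node reliability figure in the sketch below is an assumption chosen for illustration; only the scaling behavior reflects the article's point.

```python
# Why failure rates balloon with scale: assuming independent node failures,
# system MTBF is roughly per-node MTBF divided by the node count.
# The per-node MTBF below is an assumed value, not a figure from the article.

def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Approximate whole-system MTBF assuming independent node failures."""
    return node_mtbf_hours / node_count

node_mtbf = 5 * 365 * 24   # assume each node fails about once every five years

for nodes in (10_000, 100_000, 1_000_000):
    mtbf = system_mtbf_hours(node_mtbf, nodes)
    print(f"{nodes:>9,} nodes -> system MTBF ~ {mtbf:.2f} hours ({mtbf * 60:.0f} minutes)")
```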

At the heart of all these challenges is funding, specifically government funding. This is the fundamental factor on which the success or failure of exaflop-level computing hinges. As explained in the article, scientific computing is a niche market, not sustained by the overall IT industry, which is driven by consumer electronics innovations. Therefore, "complex and coordinated R&D efforts [are required] to bring down the cost of memory, networking, disks and all of the other essential components of an exascale system".