Weathering the Flood of Big Data in Climate Research

Jon Bashor


From data to discovery: months to weeks to days to hours to minutes

Big Data, it seems, is everywhere, usually characterized as a Big Problem. But researchers at Lawrence Berkeley National Laboratory are adept at accessing, sharing, moving and analyzing massive scientific datasets. At a July 14-16 workshop focused on climate science, Berkeley Lab experts shared their expertise with other scientists working with big datasets.

In climate research, both observational data and datasets generated by large-scale models using supercomputers provide the foundation for understanding what has happened in the past and what the future holds for our planet. Researchers around the world regularly share their results with the goal of increasing the accuracy of the various computer models and the future scenarios.

“It’s not a leap of faith to characterize climate science as a big data problem,” said Prabhat, a climate scientist in Berkeley Lab’s Computational Research Division. “Right now, we have many petabytes of climate data worldwide. And as we create climate models with higher and higher resolution, combined with more powerful computers, climate science is expected to be one of the first disciplines to generate an exabyte.”

How much is a petabyte? If you had a petabyte of digital songs in MP3 format, it would take about 2,000 years to listen to all the files. An exabyte of music, or 1,000 petabytes, would make a 2 million-year playlist.
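These playlist figures can be roughly checked with a short calculation. The sketch below assumes MP3 files encoded at 128 kilobits per second and decimal units (1 petabyte = 10^15 bytes). Neither assumption is stated in the article, so the result is only an order-of-magnitude check.

```python
# Assumptions (not from the article): 128 kbit/s MP3 encoding, decimal units.
BITRATE_BITS_PER_SECOND = 128_000
BYTES_PER_PETABYTE = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def playback_years(num_bytes, bitrate=BITRATE_BITS_PER_SECOND):
    """Years of continuous playback for num_bytes of audio at the given bit rate."""
    seconds = num_bytes * 8 / bitrate
    return seconds / SECONDS_PER_YEAR


petabyte_years = playback_years(BYTES_PER_PETABYTE)
exabyte_years = playback_years(1000 * BYTES_PER_PETABYTE)

print(f"1 petabyte: about {petabyte_years:,.0f} years of music")
print(f"1 exabyte:  about {exabyte_years:,.0f} years of music")
```

Under these assumptions a petabyte works out to just under 2,000 years of listening, and an exabyte to just under 2 million years, in line with the figures above.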

The climate datasets, however, are created using different climate models developed by different research institutions around the world. Typically, those sites store their own data, but share the information with the global climate science community. According to Prabhat, the data is being stored on servers and is waiting to be downloaded and analyzed.

A number of factors contribute to the size of climate datasets, including the resolution of the model, the number of output variables and the number of time steps. For example, a model that divides the earth into 25-kilometer grid spaces, includes the interactions of 20 different variables and generates a new step every six hours would create a much larger dataset than a model with a 100-km grid that uses 12 variables and saves data only at every 24-hour step. And to get the most useful results, researchers may run the model over and over, perturbing initial conditions and model parameters, and then comparing the results. Such a process could require a million processor-hours or more on a large supercomputer.
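To put a rough number on "much larger," the sketch below compares the two example models. It assumes that output size scales with the number of horizontal grid cells, the number of variables and the number of saved time steps, and that everything else (vertical levels, simulated period, numeric precision) is the same in both models. The function name and this scaling rule are illustrative, not taken from any particular climate model.

```python
def relative_output_size(grid_km, num_variables, hours_between_steps):
    """Relative output size, up to a constant factor shared by both models."""
    cells = (1.0 / grid_km) ** 2          # horizontal cells grow as 1 / spacing^2
    steps_per_day = 24.0 / hours_between_steps
    return cells * num_variables * steps_per_day


fine = relative_output_size(grid_km=25, num_variables=20, hours_between_steps=6)
coarse = relative_output_size(grid_km=100, num_variables=12, hours_between_steps=24)
ratio = fine / coarse

print(f"The finer model writes roughly {ratio:.0f} times more data")
```

Under these assumptions the ratio is about 107: a factor of 16 from the finer grid, 4 from the more frequent output and about 1.7 from the extra variables.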

The First Step – Defining the Science Problem

For his research into extreme weather events associated with climate change, Prabhat was interested in the results from about 25 climate modeling groups collectively known as the fifth phase of the Coupled Model Intercomparison Project (CMIP5). Prabhat wanted to access subsets of those datasets that showed three variables indicative of extra-tropical cyclones, a class of storms that originate in the middle latitudes, outside the earth’s tropical zones, and influence day-to-day weather.

“I’m interested in quantifying how extreme weather is likely to change in the future: is the frequency and intensity going to change?” he said. “For this project, I wanted to see what the CMIP5 datasets tell me about this question.”

To increase the usefulness of the data for his research, Prabhat wanted to see the results as they were produced in six-hour time steps. In all, the datasets amounted to 56 terabytes, with the largest single dataset being over 20 terabytes.

Downloading that amount of data using a desktop computer and a typical 5-megabit-per-second network connection could take more than two-and-a-half years. But since Berkeley Lab is home to the National Energy Research Scientific Computing Center (NERSC), one of the world’s leading supercomputing centers, and the Department of Energy’s ESnet, the world’s fastest network dedicated to science, Prabhat was able to tap world-class infrastructure and expertise.
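The download estimate follows from simple arithmetic on the 56-terabyte total. The sketch below assumes decimal units (1 terabyte = 10^12 bytes) and a connection that sustains its full rated speed with no protocol overhead or interruptions, which is an idealization.

```python
# Assumptions (not from the article): decimal units and a link that runs
# at its full rated speed for the entire transfer.
DATASET_BYTES = 56 * 10**12
LINK_BITS_PER_SECOND = 5 * 10**6


def transfer_seconds(num_bytes, bits_per_second):
    """Idealized transfer time: total bits divided by link speed."""
    return num_bytes * 8 / bits_per_second


seconds = transfer_seconds(DATASET_BYTES, LINK_BITS_PER_SECOND)
days = seconds / 86_400
years = days / 365.25

print(f"{days:,.0f} days, or about {years:.1f} years")
```

This gives roughly 1,037 days, or about 2.8 years, consistent with "more than two-and-a-half years." Real-world overhead would only lengthen the transfer.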

Step Two – Moving the Data

Eli Dart, an ESnet network engineer, specializes in eliminating network bottlenecks to speed up data transfers. All of the datasets were accessed through a portal created by the Earth System Grid Federation (ESGF) to facilitate the sharing of data. The portal allows users to refine their search parameters to download specific datasets. In this case, that was atmospheric model data at six-hour increments running out to the year 2100. The data was stored at 21 sites around the world, including sites in Norway, the UK, France, Japan and the United States.

The portal software then automatically begins fetching the defined datasets, getting the files one at a time, and then checks the integrity of the files once the transfer is complete. Although the process sounds straightforward, there are a number of complications, Dart said.
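The integrity check at the end of a transfer is typically a checksum comparison. The sketch below shows the general idea, not the ESGF portal's actual code: the function names are hypothetical, the choice of SHA-256 is an assumption, and a small temporary file stands in for a downloaded dataset file.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's checksum matches the published value."""
    return sha256_of_file(path) == expected_hex.lower()


# Demo: a small local file stands in for a downloaded dataset file.
payload = b"stand-in for a downloaded climate data file"
expected = hashlib.sha256(payload).hexdigest()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

intact_ok = verify_download(demo_path, expected)

# Simulate corruption in transit by appending a stray byte.
with open(demo_path, "ab") as handle:
    handle.write(b"x")
corrupted_ok = verify_download(demo_path, expected)

os.remove(demo_path)
print(intact_ok, corrupted_ok)
```

A file that fails the check would normally be fetched again, which is one reason long transfers need monitoring.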

First, not all data servers are connected equally. Some have very fast connections, while others have lower bandwidth. Some are tuned for faster transfers, others aren’t.

“There was one site where the network access was particularly bad, so I sent them some information on how they could optimize the network,” Dart said. “They did implement some of the fixes, which upgraded their system from really terrible to pretty bad – however, it was a significant improvement. In any system there is always one component, whether it’s software, hardware or whatever that keeps you from going faster – finding that limiting factor is critical to improving performance.”

The current ESGF tools handle some errors gracefully, but others require manual intervention. The net effect is that a person needs to shepherd the process. “The way the ESGF is currently set up, it just takes a while to download data at that scale,” Dart said. “You need to keep the process running, which means you need to regularly check to ensure it’s running correctly.” For example, to ensure that the person requesting the data has permission, credentials need to be issued and verified. These credentials have a limited lifetime and must be refreshed often during the data transfer.

From start to finish, the data staging took about three months – Dart began moving the data on July 16, 2013, and completed it on Oct. 20.

All of the data was transferred to NERSC, where the Cray supercomputer known as “Hopper” was used to standardize and clean up the data in preparation for extensive analysis. That preprocessing took about two weeks and resulted in a final 15-terabyte dataset.

Prabhat had received an allocation on “Mira,” a 10-petaflops IBM Blue Gene/Q supercomputer at Argonne National Laboratory in Illinois. In preparation for his analysis of the data, Prabhat has been modifying TECA, the Toolkit for Extreme Climate Analysis that he and colleagues at Berkeley Lab have been developing over the past three years.

Moving the data from NERSC to the Argonne Leadership Computing Facility (ALCF) was accomplished using ESnet’s 100-gigabit-per-second network backbone. Globus, a software package developed for moving massive datasets easily, further sped things up. Several datasets were moved during the project, with much better performance than the original data staging. For example, a replication of the entire raw project dataset (56 TB) from NERSC to ALCF took only two days.
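The gap between the two transfers can be expressed as average throughput. The sketch below assumes both transfers moved the full 56 terabytes (decimal units) and spreads the data evenly over the elapsed time, including any idle periods, so the results are averages rather than peak rates.

```python
from datetime import date

# Assumption: both transfers moved the full 56 TB, with 1 TB = 10**12 bytes.
DATASET_BITS = 56 * 10**12 * 8
SECONDS_PER_DAY = 86_400


def average_rate_bps(total_bits, elapsed_days):
    """Average throughput in bits per second over the whole elapsed period."""
    return total_bits / (elapsed_days * SECONDS_PER_DAY)


# Staging dates as reported: July 16 to Oct. 20, 2013.
staging_days = (date(2013, 10, 20) - date(2013, 7, 16)).days
staging_rate = average_rate_bps(DATASET_BITS, staging_days)
replication_rate = average_rate_bps(DATASET_BITS, 2)
speedup = replication_rate / staging_rate

print(f"Staging:     {staging_days} days, about {staging_rate / 1e6:.0f} Mbit/s")
print(f"Replication: 2 days, about {replication_rate / 1e9:.1f} Gbit/s")
print(f"Speedup:     about {speedup:.0f}x")
```

On these assumptions the original staging averaged about 54 megabits per second over 96 days, while the NERSC-to-ALCF replication averaged about 2.6 gigabits per second, roughly 48 times faster.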

At Argonne, computer scientist Venkat Vishwanath helped optimize the code to scale to Mira’s massively parallel architecture and high performance storage system, thereby enabling TECA to fully exploit the supercomputer. With these optimizations, Prabhat’s job ran simultaneously on 750,000 of Mira’s 786,432 processor-cores. In just an hour and a half, he used 1 million processor hours.
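The processor-hour figure is the product of the core count and the run time. The sketch below treats "an hour and a half" as exactly 1.5 hours, which is an approximation of the reported duration.

```python
CORES_USED = 750_000
CORES_TOTAL = 786_432
RUN_HOURS = 1.5  # assumption: "an hour and a half" taken as exactly 1.5 hours

core_hours = CORES_USED * RUN_HOURS
fraction_of_machine = CORES_USED / CORES_TOTAL

print(f"{core_hours:,.0f} processor-hours")
print(f"{fraction_of_machine:.1%} of Mira's cores")
```

That comes to about 1.1 million processor-hours on roughly 95 percent of the machine, consistent with the rounded figure of 1 million.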

“In one shot, I obtained statistics about extratropical cyclones for every ensemble run in CMIP5 – this particular analysis might have taken over a decade to complete on a conventional desktop system,” said Prabhat, who is now studying the results and looking to publish several research papers on the findings. One area he will be looking at is the effect, if any, of increased carbon emissions on the frequency and intensity of such storms.

Now that he has the analysis in hand, Prabhat is assessing how well the CMIP5 models reproduce the statistics of extra-tropical cyclones. This assessment will be done by comparing the model results with standard reanalysis products used in the climate community. Once he’s compared the baseline statistics for the historical period (1979-present day), he will look for changes in extra-tropical cyclone tracks, track densities and intensities for future climate scenarios. This analysis will inform researchers as to whether extra-tropical cyclones should be expected to intensify (or weaken) in the future under global warming, and how precipitation patterns related to extra-tropical cyclones may change.

Getting a handle on Big Data will require better workflows, Prabhat said, noting that the majority of the time was spent on downloading and cleaning up the datasets before sophisticated data analysis tools could be applied. “I believe that workflows will be a critical technology for gaining insights from Big Data,” he said.

And Dart also sees that space as an area where improvements need to be made.

“Datasets will grow exponentially, but human capabilities can’t, so the systems need to be able to scale up to handle the increasingly large datasets,” Dart said. “If we don’t do this, the scientists will have to wait even longer. But if we can make the data more easily accessible, think about what more we could do. Think about what more we could learn.”