Scalable Unsupervised Learning Approaches for Analysis of Large Geospatiotemporal Datasets

Richard Mills

The increasing availability of high-resolution geospatiotemporal datasets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery and mining of climate and ecological datasets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of datasets of this size; however, new implementations that effectively utilize the complex memory hierarchies and the extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. I will describe some unsupervised knowledge discovery and anomaly detection approaches based on highly scalable parallel algorithms for k-means clustering and singular value decomposition, optimized for manycore processors such as the Intel Knights Landing Xeon Phi and NVIDIA GPGPUs. I will then consider a few practical applications thereof to remotely sensed vegetation phenology and eco-climatological datasets, and speculate on some of the new possibilities that such scalable analysis methods may enable. Time permitting, I will end with a discussion of one (highly speculative) new science possibility: hyperresolution global land-surface modeling, using machine-learning-based approaches to bridge between processes operating on different scales.