Using Cloud Computing for Processing Dynamic Data Streams

Luis Pineda
Seminar

With the rapidly growing number of dynamic data streams produced by sensing and experimental devices as well as social networks, scientists have an unprecedented opportunity to explore a variety of environmental and social phenomena, ranging from weather and climate to population dynamics. One of the main challenges is that dynamic data streams and their computational requirements are volatile: sensors or social networks may generate data at highly variable rates, processing time in an application may change significantly from one stage to the next, or different phenomena may simply generate different levels of interest. Cloud computing is a promising platform for coping with such volatility because it enables us to allocate computational resources on demand, for short periods of time, and at an acceptable cost. At the same time, using clouds for this purpose is challenging because an application may perform very differently depending on the hosting infrastructure, requiring us to pay special attention to how and where we schedule resources.
 
In this presentation, I will describe our experiences with an application that relies on input from social networks, notably geo-located tweets, to discover correlations between users’ work and home locations, with a focus on the Illinois area. Our overall intent is to assess the impact of running the same application on offerings from different providers; to this end, we execute data filtering and per-user classification applications on two flavors of Chameleon cloud instances, namely bare-metal and KVM. We also analyze specific configuration parameters, such as data block size, replication factor, and degree of parallel processing, in order to statistically model the application's performance on a given infrastructure. We then identify and discuss the key parameters that influence execution time. Finally, we look into the gains brought by accounting for data proximity when scheduling a resource in a multi-site environment.