Managing Provenance for Reproducibility and Beyond

Juliana Freire
Seminar

Computing has been an enormous accelerator for science and industry alike, and it has led to an information explosion in many fields. The unprecedented volume of data acquired by sensors, derived by simulations and analysis processes, and shared on the Web opens up new opportunities, but it also creates many challenges in managing and analyzing these data.
In this talk, I will discuss the importance of maintaining detailed provenance (also referred to as lineage and pedigree) for digital data. Provenance provides important documentation that is key to preserving data, determining its quality and authorship, and understanding, reproducing, and validating results. Besides presenting techniques we have developed to efficiently manage and reuse provenance information, I will give an overview of the provenance infrastructure we have built for the open-source VisTrails system. I will also describe emerging applications and novel uses of provenance for enabling collaborative data analysis, teaching science, and publishing reproducible results.