Supporting Scientific (Meta) Data Management with Globus

Kyle Chard
Seminar

As research becomes increasingly data-intensive, researchers face new data management challenges arising from the size and distributed nature of data. Current approaches for organizing, discovering, describing, and publishing data often do not scale with increasing data sizes, so new techniques are required to support research data management. In this talk, I will describe how we are enhancing Globus to support the extraction, definition, association, and use of scientific metadata. While a wealth of valuable metadata is already hidden within files, encoded in directories, and associated with data via references, using this metadata is difficult because it is stored in science-specific formats, encoded in proprietary binary formats, or unstructured (or at least does not follow standard conventions). By extracting scientific metadata and augmenting it with user-annotated and curated metadata, we form the foundation for higher-level abstractions, with the goal of further simplifying research activities across the entire research data lifecycle. I will describe efforts related to metadata extraction and representation; search and discovery across Globus’ large network of accessible endpoints; support for active research data management throughout the research lifecycle via a scalable metadata catalog; and publishing of immutable and identifiable research data and metadata.