Efficient lineage tracking for scientific workflows

Data lineage and data provenance are key to the management of scientific data. Not knowing the exact provenance and processing pipeline used to produce a derived data set often renders the data set useless from a scientific point of view. On the positive side, capturing provenance information is facilitated by the widespread use of workflow tools for processing scientific data. The workflow process describes all the steps involved in producing a given data set and, hence, captures its lineage. On the negative side, efficiently storing and querying workflow-based data lineage is not trivial. All existing solutions use recursive queries and even recursive tables to represent the workflows. Such solutions do not scale and are rather inefficient. In this paper, we propose an alternative approach to storing lineage information captured as a workflow process. We use a space- and query-efficient interval representation for dependency graphs and show how to transform arbitrary workflow processe...
Thomas Heinis, Gustavo Alonso
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2008