This document discusses the need for infrastructure to preserve and share scientific workflows and data. It provides two use cases, one from astronomy and one from biology, to illustrate challenges around collaboration, data management and reproducibility. Research Objects are proposed as a way to bundle workflows and data with metadata, provenance and lifecycles. Linked data could help support collaborative science by linking research objects and enabling discovery. Ensuring user-driven provenance is also discussed as important for adoption.
2011 03-provenance-workshop-edingurgh
1. Provenance in the Dynamic, Collaborative New Science
Dr Jun Zhao
Department of Zoology
University of Oxford
jun.zhao@zoo.ox.ac.uk
5. Technological infrastructure for the preservation and efficient retrieval and reuse of scientific workflows in a range of disciplines
7. Astronomy Use Case: A Repeater's Story
Dealing with large amounts of tabular data
Many small scripts, to avoid creating a black-box process
Local resource sharing; public access only after publication
Data must be frequently updated from external data repositories
Data updates must be tested before being executed
Data must be stored locally with versioning
... we don't like to spread [the tasks]
and lose controls who is doing
what ...
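The two data-management requirements on this slide (test an update before applying it, and keep versioned local copies) could be sketched roughly as follows. This is a minimal in-memory illustration, not part of the talk: the names `validate_table` and `VersionedStore`, the column names, and the sample rows are all hypothetical.

```python
import csv
import hashlib
import io
from dataclasses import dataclass, field


def validate_table(text: str, expected_columns: list[str]) -> bool:
    """Hypothetical acceptance test: header matches and every row is complete."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != expected_columns:
        return False
    return all(len(row) == len(expected_columns) for row in rows[1:])


@dataclass
class VersionedStore:
    """Keeps every accepted snapshot of a table, oldest first (in memory only)."""
    expected_columns: list[str]
    versions: list[tuple[str, str]] = field(default_factory=list)  # (sha256, text)

    def update(self, text: str) -> bool:
        """Test the incoming data first; record a new version only if it passes."""
        if not validate_table(text, self.expected_columns):
            return False  # rejected: the current version stays in place
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.versions and self.versions[-1][0] == digest:
            return True  # unchanged upstream: nothing new to record
        self.versions.append((digest, text))
        return True

    def latest(self) -> str:
        """Return the most recently accepted snapshot."""
        return self.versions[-1][1]


# Made-up sample data, for illustration only.
store = VersionedStore(expected_columns=["name", "ra", "dec"])
store.update("name,ra,dec\nM31,10.68,41.27\n")
store.update("name,ra\nM33,23.46\n")  # malformed update: rejected
store.update("name,ra,dec\nM31,10.68,41.27\nM33,23.46,30.66\n")
```

Because rejected updates never touch `versions`, earlier snapshots remain available for re-running older analyses. A real setup would persist the snapshots to disk and would likely apply domain-specific checks rather than this simple column test.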
8. Research Objects
http://www.wf4ever-project.org
Aggregation: pointers or literals of internal and external content;
Identity: equivalence, equality;
Metadata: a reusable object;
Lifecycle: stages of development; impacts on available functionality;
Versioning: recording changes;
Security: access, authentication, ownership, trust;
Graceful Degradation of Understanding: opaque RO domain content.
Mixed stewardship
ROs are Content Aware Objects
Provenance:
Of compound objects that bundle things together
Of evolutions
Of dynamic objects and static objects
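A few of the Research Object properties listed on this slide (aggregation, identity, metadata, lifecycle, versioning) can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: it is not the Wf4Ever model, the class names and the three lifecycle stages are invented here, and the URL is a placeholder.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Stage(Enum):
    """Illustrative lifecycle stages; the slide does not name specific stages."""
    DRAFT = "draft"
    PUBLISHED = "published"
    ARCHIVED = "archived"


@dataclass(frozen=True)
class Resource:
    """One aggregated item: a pointer (URI) to content, or a literal value."""
    value: str
    is_pointer: bool = True


@dataclass(frozen=True)
class ResearchObject:
    """Hypothetical immutable research object; each change yields a new version."""
    identifier: str
    version: int = 1
    stage: Stage = Stage.DRAFT
    resources: tuple[Resource, ...] = ()
    metadata: tuple[tuple[str, str], ...] = ()

    def aggregate(self, resource: Resource) -> "ResearchObject":
        """Return a new version that also aggregates `resource`."""
        if self.stage is not Stage.DRAFT:
            # The lifecycle stage limits the available functionality.
            raise ValueError("only draft objects can be modified")
        return replace(self, version=self.version + 1,
                       resources=self.resources + (resource,))

    def publish(self) -> "ResearchObject":
        """Return a copy of this object in the published stage."""
        return replace(self, stage=Stage.PUBLISHED)


ro = ResearchObject("ro-example")
draft = ro.aggregate(Resource("https://example.org/workflow"))  # placeholder URI
published = draft.publish()
```

Making the object immutable means older versions are never overwritten, which is one simple way to keep the record of changes that versioning and provenance both rely on.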
9. Biology Use Case: A Reuser's Story
Takes a set of genes from gene experiment results produced by others, as reported in a scientific paper
Performs a 'dry' analysis to understand which genes and which biological processes were disturbed by which chemical compounds:
Basic Affymetrix data processing
Statistical analysis to identify genes that are significantly differentially expressed under different conditions (with/without the compounds)
Finding the pathways that are most prominent among the filtered genes
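The statistical filtering step can be illustrated with a toy example. The sketch below uses Welch's t-statistic as a stand-in; the slide does not say which test the biologist used, and a real analysis would normally rely on a dedicated package, compute p-values, and correct for multiple testing. The gene names, expression values, and threshold are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t-statistic for two independent samples with unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def filter_genes(expression: dict[str, tuple[list[float], list[float]]],
                 threshold: float) -> list[str]:
    """Keep genes whose |t| between treated and control exceeds `threshold`."""
    return sorted(
        gene for gene, (treated, control) in expression.items()
        if abs(welch_t(treated, control)) > threshold
    )


# Made-up toy values: (with compound, without compound) per hypothetical gene.
expression = {
    "geneA": ([8.1, 8.3, 8.2], [5.0, 5.2, 5.1]),
    "geneB": ([6.0, 6.4, 5.8], [6.1, 5.9, 6.3]),
}
selected = filter_genes(expression, threshold=4.0)  # arbitrary cut-off
```

With these toy numbers only `geneA` passes the filter; the resulting gene list is what the subsequent pathway step would take as input.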
10. Biology Use Case: A Reuser's Story
Search for existing experiments on myExperiment (http://myexperiment.org)
Challenge: understand the workflow
Perform test runs with test data and his own data
Read others' logs
Read annotations to workflows
Reuse scripts from colleagues and perform tests that his colleagues are familiar with
11. How Can It be Supported?
A reference to the source of the data and the people to acknowledge for it
The initial hypothesis
The conceptual workflow or a summary of the experiment plan
References to workflows that were tested, with comments on their applicability to the user's use case
The user's own workflow, possibly with a backlog of previous versions that the user wishes to keep for reference (with notes and comments)
The runs of the user's own workflow, the results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
The final hypothesis, with comments
A reference to the results of the workflow
Design logs that record the user's considerations while making the workflow
Run logs that record the user's considerations while running and interpreting the workflow
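Several of the items above amount to a structured, user-annotated experiment record. One possible shape for such a record is sketched below; the field names, class names, and sample values are hypothetical, and the JSON output is only a convenient stand-in for whatever serialization a real system would use.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """One run of the user's workflow, with free-text notes for later reference."""
    workflow_version: str
    parameters: dict[str, str]
    result_ref: str
    comments: list[str] = field(default_factory=list)


@dataclass
class ExperimentLog:
    """Hypothetical bundle of the items a reuser wants to keep together."""
    data_source: str
    acknowledgements: list[str]
    initial_hypothesis: str
    runs: list[RunRecord] = field(default_factory=list)
    final_hypothesis: str = ""

    def to_json(self) -> str:
        """Serialize the whole log, including nested runs, to JSON."""
        return json.dumps(asdict(self), indent=2)


# Placeholder values, for illustration only.
log = ExperimentLog(
    data_source="https://example.org/dataset",
    acknowledgements=["Original data producers"],
    initial_hypothesis="Compound X disturbs pathway Y",
)
log.runs.append(RunRecord(
    workflow_version="v2",
    parameters={"threshold": "A"},
    result_ref="results/run-001",
    comments=["Here I used parameter A, next time I may try B"],
))
serialized = log.to_json()
```

Keeping the comments next to the parameters and result reference of each run is what makes the provenance user-driven: the record holds what the user chose to note, not only what the system could capture automatically.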
15. Take home
Provenance should be user-driven
Linked Data should be a means to an end
http://www.wf4ever-project.org
16. Acknowledgement
Marco Roos of Leiden University (NL) and Jose Enrique Ruiz of Instituto de Astrofísica de Andalucía (Spain)
Carole Goble of University of Manchester (UK) and Jose Manuel Gomez of iSOCO (Spain)
Hui Hua and Jenny Molly of University of Oxford (UK)