Provenance in the Dynamic, Collaborative New Science




                    Dr Jun Zhao
               Department of Zoology
                University of Oxford
              jun.zhao@zoo.ox.ac.uk
2011 03-provenance-workshop-edingurgh
Technological infrastructure for the preservation and efficient
retrieval and reuse of scientific workflows in a range of disciplines
Packaging, preserving and publishing
Astronomy Use Case: A Repeater's Story
   Dealing with large amounts of tabular
    data
   A lot of small scripts to avoid creating
    black-box processes
   Local resource sharing, public
    access only after publication
   Data must be frequently updated
    from external data repositories
   Data updates must be tested before
    being executed
   Data must be locally stored with
    versioning
   ... we don't like to spread [the tasks]
    and lose controls who is doing
    what ...
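
The requirements above (test data updates before applying them, keep local versions) could be sketched roughly as follows. This is only an illustration under stated assumptions: `VersionedStore`, `apply_update`, and `has_required_columns`, as well as the `id` and `flux` column names, are hypothetical and do not come from the astronomy use case itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Table = List[Dict[str, object]]


@dataclass
class VersionedStore:
    """Keeps every accepted version of a locally stored table (hypothetical)."""
    versions: List[Table] = field(default_factory=list)

    def latest(self) -> Table:
        """Return the most recent accepted version, or an empty table."""
        return self.versions[-1] if self.versions else []

    def apply_update(self, candidate: Table, check: Callable[[Table], bool]) -> bool:
        """Store `candidate` as a new version only if `check` accepts it."""
        if not check(candidate):
            return False
        # Copy the rows so later changes to `candidate` do not alter history.
        self.versions.append([dict(row) for row in candidate])
        return True


def has_required_columns(table: Table) -> bool:
    """Hypothetical test: every row must carry 'id' and 'flux'."""
    return all({"id", "flux"}.issubset(row) for row in table)


store = VersionedStore()
accepted = store.apply_update([{"id": 1, "flux": 0.5}], has_required_columns)
rejected = store.apply_update([{"id": 2}], has_required_columns)
```

Here the second update fails the check, so the store still holds only the first version and nothing already accepted is overwritten.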
Research Objects
http://www.wf4ever-project.org
   ROs are Content Aware Objects that bundle things together
   Aggregation: pointers or literals of internal and external content
   Identity: equivalence, equality
   Metadata: a reusable object
   Lifecycle: stages of development; impacts on available functionality
   Versioning: recording changes
   Security: access, authentication, ownership, trust
   Graceful Degradation of Understanding: opaque RO domain content
   Mixed stewardship
   Provenance
       Of compound objects
       Of evolutions
       Of dynamic objects and static objects
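
To make a few of the Research Object facets above more concrete (aggregation, metadata, lifecycle, versioning), here is a minimal sketch. It is not the Wf4Ever data model: the class, its field names, the stage labels, and the `example.org` URIs are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResearchObject:
    """Illustrative container; field names are assumptions, not the Wf4Ever model."""
    identifier: str
    # Aggregation: pointers (URIs) to internal or external content.
    aggregates: List[str] = field(default_factory=list)
    # Metadata: free-form descriptive key/value pairs.
    metadata: Dict[str, str] = field(default_factory=dict)
    # Lifecycle: current stage of development.
    stage: str = "draft"
    # Versioning: a simple log of recorded changes.
    history: List[str] = field(default_factory=list)

    def aggregate(self, uri: str, note: str) -> None:
        """Add a pointer to the bundle and record the change."""
        self.aggregates.append(uri)
        self.history.append(f"added {uri} ({note})")

    def advance(self, new_stage: str) -> None:
        """Move to another lifecycle stage and record the change."""
        self.history.append(f"stage {self.stage} -> {new_stage}")
        self.stage = new_stage


ro = ResearchObject("http://example.org/ro/1", metadata={"creator": "example user"})
ro.aggregate("http://example.org/workflows/w1", "workflow under study")
ro.aggregate("http://example.org/data/d1", "input table")
ro.advance("published")
```

The `history` list stands in for provenance of evolution: every aggregation or stage change leaves a record that can be read back later.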
Biology Use Case: A Reuser's Story
   Take a set of genes from the results of gene experiments
    performed by others, as reported in a scientific paper
   Perform 'dry' analysis to understand which genes and
    which biological processes were disturbed by which
    chemical compounds
       basic Affymetrix data processing
       statistical analysis to identify genes that are significantly
        differentially expressed under different conditions (with/without the
        compounds)
       find those pathways that are most prominent among the filtered
        genes
Biology Use Case: A Reuser's Story
   Search for existing experiments on
    myExperiment (http://myexperiment.org)
   Challenge: Understand the workflow
       Perform test runs with test data and his own data
       Read others' logs
       Read annotations to workflows
   Reuse scripts from colleagues and perform
    tests that his colleagues are familiar with
How Can It be Supported?
   A reference to the source of the data and the people to acknowledge for it.
   The initial hypothesis
   The conceptual workflow or a summary of the experiment plan
   References to workflows that were tested, with comments on their applicability to
    the user's use case
   The user's own workflow, possibly with a backlog of previous versions that the
    user wishes to keep for reference (with notes and comments)
   The runs of the user's own workflow, results and the recorded steps that led to
    the results, in some cases with comments for later reference (e.g. 'here I used
    parameter A, next time I may try B')
   The final hypothesis, with comments.
   A reference to the results of the workflow
   Design logs that record the user's considerations while making the workflow
   Run logs that record the user's considerations while running and interpreting the
    workflow
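
One way to read the list above is as a checklist of what a reuser's bundle should contain. The sketch below reports which items are still missing from a bundle. The item names, the `missing_items` helper, and the sample values are hypothetical labels derived from the list, not an existing schema.

```python
from typing import Dict, List

# Hypothetical keys, one per item in the list above.
EXPECTED_ITEMS: List[str] = [
    "data_source",          # source of the data and people to acknowledge
    "initial_hypothesis",
    "conceptual_workflow",  # or a summary of the experiment plan
    "tested_workflows",     # references plus comments on applicability
    "own_workflow",         # possibly with previous versions
    "workflow_runs",        # runs, results, recorded steps
    "final_hypothesis",
    "results",
    "design_log",
    "run_log",
]


def missing_items(bundle: Dict[str, object]) -> List[str]:
    """Return the expected items that are absent or empty in `bundle`."""
    return [name for name in EXPECTED_ITEMS if not bundle.get(name)]


# Placeholder values for illustration only.
bundle = {
    "data_source": "http://example.org/paper/123",
    "initial_hypothesis": "compound X disturbs pathway Y",
    "own_workflow": ["v1", "v2"],
    "workflow_runs": [],
}
gaps = missing_items(bundle)
```

An empty entry such as `workflow_runs` counts as missing, so the check flags items that were planned but never recorded.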
Where is Linked Data?
The Role of Linked Data in Wf4Ever
   Collaborative science
   Dynamic science
   Open science
Provenance Challenge
   Identity
   Context
   Storage
   Retrieval
Take home
   Provenance should be user-driven
   Linked Data should be a means to an end
   http://www.wf4ever-project.org
Acknowledgement
   Marco Roos of Leiden University (NL) and Jose
    Enrique Ruiz of Instituto de Astrofísica de
    Andalucía (Spain)
   Carole Goble of University of Manchester (UK)
    and Jose Manuel Gomez of iSOCO (Spain)
   Hui Hua and Jenny Molly of University of
    Oxford (UK)
