Data Provenance for Phyloinformatics:
Introduction & Survey Results

Elliott Hauser
UNC Information Science

Karen Cranston
NESCent Informatics
Overview:
What is Phylogenetics?
What is Phylogenetic Data?

...many things!

Source: DRAFT: Current Best Practices for Publishing Trees Electronically, 2010. Stoltzfus et al.
http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/LinkingTrees2010
What is Phylogenetic Data?

<A sample NeXML file>

Source: http://github.com/miapa/miapa-etl/tree/master/nexmlex
What is a
Minimum Information Standard?
The answer to this question, for a domain:

"What is the minimum information necessary for an independent scientist to carry out an independent analysis of the data?"
(Quackenbush, 2005)

For phylogenetics, this is MIAPA:
Minimum Information About a Phylogenetic Analysis
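To make the idea of a minimum information standard concrete, here is a minimal sketch that checks a metadata record against a list of required fields. The field names and the example record are hypothetical placeholders chosen for illustration; they are not the official MIAPA checklist.

```python
# Hypothetical required fields for illustration; NOT the official MIAPA checklist.
REQUIRED_FIELDS = (
    "objectives",
    "taxa",
    "character_data",
    "alignment_method",
    "inference_method",
    "tree",
)


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in required if not record.get(name)]


# A made-up record that omits the alignment method.
example_record = {
    "objectives": "Resolve relationships among sampled taxa",
    "taxa": ["Taxon A", "Taxon B", "Taxon C"],
    "character_data": "matrix.nex",
    "inference_method": "maximum likelihood",
    "tree": "((Taxon A,Taxon B),Taxon C);",
}

print(missing_fields(example_record))
```

A record that returns an empty list meets this toy standard; anything else names what an independent scientist would still need to ask for.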
What do we need to know to analyze
this tree?
Overview:
What is MIAPA?

Source: Leebens-Mack et al. 2006
Overview:
Producers' and Consumers' attitudes

[Figure; labels: "Most important metadata type", "Least important metadata type"]

Source: Cranston MIAPA survey, 2012 (unpublished)
Half of all metadata types are
critically important to two or more subfields

Source: Cranston MIAPA survey, 2012 (unpublished)
The majority of metadata types are
easy to produce for all subfields

Source: Cranston MIAPA survey, 2012 (unpublished)
How to balance the needs of
Producers and Consumers?

[Figure; labels: "Most important metadata type", "Least important metadata type"]

Source: Cranston MIAPA survey, 2012 (unpublished)
Metadata at work:
The Open Tree of Life Project

Conflicting Data, Conflicting Needs:
  • A Single, 'Best' Tree of Life
  • Access to Underlying, Conflicting Trees
A new research area:
Computational data provenance

...Huh?
A new research area:
Computational data provenance

Computational: The result of a computation

Data provenance: Where/how it came to be

As science becomes more and more computational, we need to know more about our data!
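As a rough illustration of these two definitions, the sketch below records where and how a computed result came to be: a fingerprint of the input, the program and parameters used, and a fingerprint of the output. The `ProvenanceRecord` class, its field names, the program name, and the sample data are all hypothetical; the slides do not prescribe a format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of where and how a computed result came to be."""
    input_sha256: str   # fingerprint of the input data
    program: str        # software that produced the result (illustrative name)
    version: str
    parameters: dict
    output_sha256: str  # fingerprint of the computed result


def sha256_of(text):
    """Hex SHA-256 digest of a text string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_provenance(input_text, output_text, program, version, parameters):
    """Bundle the inputs, settings, and output of one computation."""
    return ProvenanceRecord(
        input_sha256=sha256_of(input_text),
        program=program,
        version=version,
        parameters=dict(parameters),
        output_sha256=sha256_of(output_text),
    )


# Made-up alignment and tree, standing in for real analysis files.
alignment = ">A\nACGT\n>B\nACGA\n"
tree = "(A,B);"
record = record_provenance(
    alignment, tree, "example-tree-builder", "0.1", {"model": "JC69"}
)
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Storing hashes rather than the data itself keeps the record small while still letting a consumer verify that a tree was derived from a particular input.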
Reprise:
What is Phylogenetics?

A perfect field for computational data provenance!
Discussion
Will our survey results predict actual behavior?

What tools, if any, will preserve and encourage
submission of computational data provenance?

Is computational data different from measurement
data, classification data, or other types of
metadata? If so, does that affect our work?
Thanks!
eah13@mac.com
Reprise: balancing the needs of
Producers and Consumers?

[Figure; labels: "Most important metadata type", "Least important metadata type"]
