Talk given at CHOP bioinformatics retreat where I describe the technical and cultural ingredients needed to foster a reproducible research culture in a bioinformatics core
13. Why Snakemake?
Addresses Makefile weaknesses:
Difficult to implement control flow
No cluster support
Inflexible wildcards
Too much reliance on sentinel files
No reporting mechanism
Keeps the good stuff:
Implicit dependency resolution
Johannes Köster
18. Acknowledgements
BiG
Deanne Taylor
Batsal Devkota
Noor Dawany
Perry Evans
Juan Perin
Pichai Raman
Ariella Sasson
Hongbo Xie
Zhe Zhang
TIU
Jeff Pennington
Byron Ruth
Kevin Murphy
DBHi
Bryan Wolf
Mark Porter
#3: ---
Slide 2
Because I'm giving a talk on reproducible research I am legally obligated to open with the cautionary tale of the Anil Potti scandal. This is the Duke researcher who from 2006 to 2009 used microarray expression data to predict the sensitivity of clinical tumor samples to various chemotherapy treatments.
As early as 2007 Keith Baggerly and Kevin Coombes from MD Anderson tried to reproduce the initial analyses and obtained radically different results.
So, lacking much of the code, they had to reverse engineer what they believed occurred, and they discovered a number of off-by-one errors (one of which was induced by a program that didn't want a header row), genes in the training set that were missing from the test set, genes that appeared in a later email from the authors but were not even in the probeset, and a huge number of duplicated samples, some of which were labeled both sensitive and resistant.
Despite years of effort by Baggerly and Coombes to have the authors come clean with a working analysis, this went all the way to clinical trials, which were halted only when it was discovered that Anil Potti had lied on his CV and was not in fact a Rhodes Scholar.
I really encourage everyone to view the presentation by Keith Baggerly on YouTube: The Importance of Reproducible Research in High-Throughput Biology.
The real point of the story is that the manipulation of data might have begun with a cell shift and off-by-one errors, stupid mistakes that anyone can make but that are virtually impossible to detect unless you submit a reproducible workflow with your paper and allow reviewers to run it on a novel dataset.
While the withholding of data was likely a sign of active obfuscation, I suspect many of the initial errors in these papers were due to stupid mistakes.
So for me preventing the outright falsification of data is not even in my top reasons for reproducible research. If someone is determined to lie they're going to find a way to do it.
The thing is, Duke fired Anil Potti, but they didn't fire Microsoft Excel. Excel still has tenure at Duke for all I know. And that's a shame, because Excel was likely a partner in this crime.
#4: ---
Slide 3
Reproducible research also tends to get conflated with a bunch of hot topics: open access, open data, software carpentry, and a bunch of other stuff people don't want to do. This is what I call the reproducible research guilt trip. Out in the big bad mean world, the relationship between journals, funding institutions, and reviewers is essentially adversarial.
Inside a group like ours there are a lot more incentives and ways of enforcing reproducibility, and if an investigator wants to, say, publish an open data set and a fully reproducible analysis, it should be possible. If not, that's fine; these things still benefit us.
I tend to blur the lines between good practice, automation, and reproducibility. I consider this a branch of software or process engineering rather than ethics. This goes hand-in-hand with optimizing our practices, becoming a more efficient and productive group, and producing analyses that will live up to the increased scrutiny that is coming from the journals. So this is not feel-good reproducibility, and it is not just for the benefit of the people we work with.
So I want to talk about our values, our practices, and our habits.
If I were a biologist, I wouldn't work with a core that didn't have reproducibility as a standard. That might be because I've seen how the sausage is made, but I think there are sound reasons why this should be a guiding principle for how we do our work.
#5: ---
Slide 4
The whys of reproducible research, other than, umm, "this is science," are for me:
Sanity: being able to reliably derive results from raw data, because if I can't do that then I don't have a leg to stand on.
Reuse: if others or my future self want to reproduce an analysis from the start, that should be possible.
Redundancy: if someone gets hit by a bus, the rest of us can pick up where they left off.
Evaluation: we have a group with a lot of different strengths and weaknesses across software development, statistics, systems biology, sequencing, and disease domains. If we don't have a codebase that is shared, open, and reproducible, I really don't see any reason to have a group at all. We might as well just be fully embedded analysts. I'm sure there are some people here who feel this would be a better arrangement, and I can respect that viewpoint, but I think there is some real synergistic benefit to having a group of analysts rather than 8 scattered throughout an organization.
#6: ---
Slide 5
I'm here today to speak about two areas of very domestic interest to me, areas where I think DBHi can be a standard bearer and innovator.
First is the marriage, or "bondage," between code, data and its tracking metadata (which is called data provenance), and results (or what are sometimes called deliverables); so version control, reproducible reports, MyBiC, tools that are more or less in place.
My other interest is in accelerating what I call the edit cycle, the spin cycle that occurs when you present results to an investigator and they want to tweak parameters and redo everything. The challenge is keeping the work in a reproducible context while still allowing people to explore their own data.
#7: ---
Slide 6
So a couple years ago there was this paper outlining the ten simple rules for reproducible computational research.
And these are great rules; I can't argue with any of them:
Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Etc., etc. Ten rules is actually a bit much to take in at once, but I think we can melt these down to one principal component, which is...
#8: ---
Slide 7
You wouldn't think you'd need to say that.
But a lot of people come from life science disciplines with very strict traditions of keeping lab notebooks, and when they come to bioinformatics they just say, well, I'll do whatever the hell I want.
This is quite dangerous, because the less experienced you are as a programmer the more prone you are to making stupid mistakes: you write repetitive code, and you don't write sanity checks or test cases.
So where do we write stuff down?
#9: ---
Slide 8
So where do we write stuff down? In version control. We use Git for version control, and we're lucky to have a 40-seat Github Enterprise instance behind the firewall.
If your code is not in Github, it doesn't exist.
You've probably used something akin to automatic version control in Google Docs.
#10: ---
Slide 9
The difference with Git is that the changes are explicit, manually denoted by commits to which you attach a message saying what the significance of the change was.
Repositories are distributed so you can work on something without a constant internet connection to the server
Commits can be organized into branches. This is a branching pattern we use in software development where you have a master branch of major releases, a development branch, and then feature branches that become more experimental as you move left here.
What distinguishes Git from older version control systems is the ability to do very graceful merges. So person A and person B can both make changes and then merge those changes into a common branch; if the changes are mutually exclusive, they'll merge transparently. If there are conflicts, if two people modify the exact same line of code, then Git will flag that and ask you to decide how to proceed.
It's not unusual for projects to have 20 or 30 branches that come and go: one for each feature and one for each bug.
#11: ---
Slide 10
This is a non-normalized heatmap of Git commits on our enterprise Github from members of the bioinformatics group. What I like most is seeing who is committing stuff on Saturdays and Sundays. Typically it's people without kids.
I want to discuss this because reproducible research culture has to come from the top down.
The previous director gave us zero support for this initiative; for him, enforcing any software development standards was just a brake on the wheel. He said, "You don't want to get bogged down in process."
This ball didn't get rolling until Mark Porter stepped in as interim director. And he kind of held people's feet to the fire and said, hey, you need to push your damn code.
Despite being thrust unwillingly into the director position he actually wound up being one of the best directors we ever had.
And pretty soon cores won't be hiring programmers and analysts who don't have a body of work on Github. This does factor into hiring. That's just where it's headed.
We had one applicant who, for whatever reason, just to acquiesce to the journal, put her code in the README section of the repository, the part where you describe your repository. And the code was awful.
So for me this is fundamental. If you value reproducibility it starts here.
I'd like to hear your thoughts if you would like to join me at the roundtable.
#12: ---
Slide 11
So it's not enough that we just produce reproducible code if it's some inscrutable black box. We need something that is both reproducible and literate.
That's where Sweave comes in. Sweave is actually pronounced S-weave (S is the predecessor to R).
In Sweave you wrap your R code chunks in these tags, and the code is embedded, or woven, into LaTeX-formatted markup that describes what the code is doing; we use this to produce PDF reports. This is generally the last step in an analysis pipeline, but the one that involves the vast majority of tweaks, edits, and actual analysis and statistics.
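As a minimal sketch of what a chunk looks like (the input file and the text here are placeholders, not one of our actual reports):

    \documentclass{article}
    \begin{document}
    \section*{Library size summary}

    <<load-counts, echo=TRUE>>=
    counts <- read.delim("counts.txt", row.names = 1)  # placeholder input file
    summary(colSums(counts))
    @

    Because the numbers above are computed when the document is built,
    the narrative and the results cannot drift apart.
    \end{document}

Running R CMD Sweave on the .Rnw file, followed by pdflatex on the resulting .tex, produces the PDF.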
So when people ask why Excel is such a pariah in the bioinformatics community (no self-respecting analyst would use Excel), is it because it mangles files or turns gene names into dates? No; they could fix that and it would still suck. The reason Excel is not a tool we use for scientific research is that what you do in it is not reproducible, not automated, and not literate.
Jim has really run with the report concept and made great use of it for both the NGS and microarray expression reports that he has developed with Deborah Watson.
In the R community Sweave has been supplanted by a package called knitr which has a lot more features in terms of caching and variable execution of chunks as well as support for Markdown so you can easily produce web reports.
I'm still somewhat partial to PDFs because they have a beginning and an end, you can print them out, you can time-stamp them, and you can stamp them with Git commit hashes.
#13: ---
Slide 12
This Git hash looks pretty hairy, but it actually provides a hook by which we can really hold onto provenance.
So we can include the metadata we get from the LIMS, the alignments and variant calls we get from CBMi-Seq, and everything downstream from there.
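One way that stamping could be wired up (a sketch, assuming the .Rnw report is compiled from inside the analysis repository) is to let the report ask Git for its own commit:

    <<git-commit, echo=FALSE>>=
    # Record exactly which revision of the code produced this report
    commit <- system("git rev-parse --short HEAD", intern = TRUE)
    @
    This report was generated from commit \texttt{\Sexpr{commit}}.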
#14: ---
Slide 13
So what's the glue that brings us from raw data to Sweave reports?
For a long time it was Make. Make is a build system designed to compile C programs, but it has been really useful for bioinformatics because it provides a syntax for describing how to convert one type of file to another based on filename suffixes.
That is 95% of all analytical pipelines.
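For example, a suffix-based rule in Make looks roughly like this (samtools here is just a stand-in for any converter, and recipe lines must start with a tab):

    # Any target ending in .bam can be built from the matching .sam
    %.bam: %.sam
        samtools sort -o $@ $<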
But Make has some limitations that are frustrating to work with. And that's where this genius Johannes Köster comes in: he basically solved all of those by subsuming the entirety of Python into the domain-specific language.
#15: ---
Slide 14
But what is really attractive for me is the ability to keep an entire workflow encapsulated in the Snakefile.
So all the inputs, outputs, and intermediates are first-class citizens in the Snakefile; the same code that runs the alignments can also kick off the Sweave scripts and produce Markdown web pages that we can display in a portal.
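As a rough sketch of that shape (the sample names, tools, and report rule here are illustrative, not our actual pipeline):

    SAMPLES = ["sampleA", "sampleB"]

    rule all:
        input: "report/summary.pdf"

    # One rule per file conversion, with wildcards instead of suffix matching
    rule align:
        input: "fastq/{sample}.fastq.gz"
        output: "bam/{sample}.bam"
        shell: "bwa mem ref.fa {input} | samtools sort -o {output} -"

    # The same Snakefile that runs alignments also renders the Sweave report
    rule report:
        input: expand("bam/{sample}.bam", sample=SAMPLES)
        output: "report/summary.pdf"
        shell: "cd report && R CMD Sweave --pdf summary.Rnw"

Running snakemake then rebuilds only the pieces whose inputs have changed, which is the same implicit dependency resolution we liked in Make.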
#16: ---
Slide 15
So what is that portal? Where do we put the deliverables?
For a long time they were being emailed, or put in Dropbox. Both present big disadvantages in terms of persistence and access, and also, technically, I'm not sure we're supposed to use Dropbox.
We could in theory use Github, since it has a lot of utility for displaying Markdown and it has issue tracking, but the authentication in Github is crude, you can't really put big files in there, and it's just not organized the way an investigator would want to use it.
MyBiC is a Django-based portal that I created to serve as a delivery point for analyses. MyBiC provides a Users/Labs/Groups authentication scheme, search, news, and tracking. Projects within MyBiC can be created in Markdown or HTML and can be loaded directly from Github or from disk. The MyBiC server has a read-only mount to the Isilon, so even very large files can be served.
#17: ---
Slide 16
OS-level virtualization
Unlike a virtual machine, which talks to the hardware through a translator, the Docker engine is much more lightweight.
And unlike a VM, which is kind of a stateful machine you massage into place, a Docker container is run off of a reproducible configuration script called a Dockerfile, which makes it literate, if that were a term DevOps people used.
Once a program is dockerized it can be run without installing it.
It lives inside a container and it only knows what you tell it as far as network ports, permissions, file volumes.
This has great appeal for anyone who has ever tried to install software.
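A Dockerfile for a single command-line tool might look roughly like this (the base image and tool are placeholders, not one of our actual containers):

    FROM ubuntu:14.04

    # Install the tool; ideally pin an exact version so the image is rebuildable later
    RUN apt-get update && apt-get install -y samtools && \
        rm -rf /var/lib/apt/lists/*

    # The container only sees what gets mounted into /data at run time
    VOLUME /data
    WORKDIR /data

    ENTRYPOINT ["samtools"]

An image built from this can then be run with something like docker run -v /path/to/project:/data <image> view sample.bam, without samtools ever being installed on the host.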
#18: ---
Slide 17
The question is, what do we get from dockerizing an entire analysis?
For one, it will be trivial to send an entire workflow to a colleague or a journal and say have at it, and they can hit the ground running.
But even if we don't do that, it should be much easier for a website like MyBiC to execute a workflow.
In my mind I can see how this could accelerate the edit cycle I mentioned earlier.
Sometimes they're hunting for p-values, but often they are just exploring the data. The problem is that it ties up a lot of our time just reading emails and re-running analyses. This is what I call parameter purgatory.
So traditionally we could build full-scale web apps in Shiny (Pichai and Jim have done this), but that's really better suited to traditional database-driven portals; sometimes we just want to write an analysis once and then choose parameters from there. You don't want to have to build a whole app just to do some kind of meta-analysis.