This document discusses considerations for enabling access to and use of data from the International Cancer Genome Consortium (ICGC) in a biomedical compute cloud. It notes that the ICGC has over 25,000 tumors across 53 projects and 16 countries/regions, with about 100GB of open access analysis results and 700TB of controlled access sequencing and array data hosted across various repositories. It raises questions about how to aggregate this distributed data through a single access point, what compute and analysis resources users may need, who would create and maintain common pipelines, how to ensure authorization and compliance of cloud-based data users, and what metadata is required to make the data useful.
2013-B_Whitty-biomedical_cloud
1. Brett Whitty
ICGC Data Coordination Center Curation Manager
Ontario Institute for Cancer Research
Open Cloud Consortium
Towards a Biomedical Commons Cloud Working Group
April, 2013
Some Considerations for Enabling Users of International Cancer Genome Consortium (ICGC) Data in a Biomedical Compute Cloud
3. ICGC Data
Current data:
(represents ~1/3 of goal)
~100GB of gzipped analysis results (open access)
hosted via HTTP(S)/FTP at ICGC DCC data portal
~700TB raw sequencing and array datasets* (controlled access)
hosted at EBI EGA repository (and other public repos)
*excluding data from TCGA projects (~50% of ICGC member projects are TCGA projects)
4. ICGC Data Access
Blanket access to ICGC data granted by ICGC Data Access & Compliance Office (DACO)
Excludes TCGA data for which access is granted by the TCGA project
DACO, ICGC.org & DCC support OpenID for authentication
Access to ICGC & TCGA data at NCBI, CGHub, and EBI EGA uses different authentication mechanisms
ICGC datasets are presently distributed across several public repositories
Presents a challenge to end users
Need to aggregate the data through a single access point, virtually if not physically
Ideally a single user sign-on method would be recognized by all resources
May be impossible due to technical/organizational challenges
5. ICGC Computes (1)
No common ICGC data analysis centers (yet)
No common ICGC workflow systems (yet)
No common ICGC pipelines (yet)
6. ICGC Computes (2)
Who are the cloud-based data consumers?
What do they need/want?
Sufficient to have ICGC simply provide datasets?
Does ICGC need to also provide canned analysis pipelines?
Reproduce methods used in ICGC publications?
Who creates/maintains these?
Using which workflow system?
7. Other Issues
Can ICGC DACO assure authorization and compliance of cloud-based data consumers?
Auditing, revoking access, etc.
How is this achieved?
What are the support needs of ICGC Cloud users?
How much effort will they require?
From whom?
What is the minimal metadata we need to collect to make the data useful?
Who ensures this?