Brett Whitty
ICGC Data Coordination Center Curation Manager
Ontario Institute for Cancer Research
Open Cloud Consortium
Towards a Biomedical Commons Cloud Working Group
April, 2013
Some Considerations for Enabling Users of International Cancer Genome Consortium (ICGC) Data in a Biomedical Compute Cloud
53 projects, 16 countries/regions, > 25,000 tumors committed
ICGC Data
Current data (represents ~1/3 of goal):
• ~100 GB of gzipped analysis results (open access)
    • hosted via HTTP(S)/FTP at the ICGC DCC data portal
• ~700 TB of raw sequencing and array datasets* (controlled access)
    • hosted at the EBI EGA repository (and other public repos)
* Excluding data from TCGA projects (~50% of ICGC member projects are TCGA projects)
ICGC Data Access
• Blanket access to ICGC data is granted by the ICGC Data Access & Compliance Office (DACO)
    • Excludes TCGA data, for which access is granted by the TCGA project
• DACO, ICGC.org & DCC support OpenID for authentication
    • Access to ICGC & TCGA data at NCBI, CGHub, and EBI EGA uses different authentication mechanisms
• ICGC datasets are presently distributed across several public repositories
    • Presents a challenge to end users
    • Need to aggregate the data through a single access point, virtually if not physically
    • Ideally a single user sign-on method would be recognized by all resources
        • May be impossible due to technical/organizational challenges
ICGC Computes (1)
• No common ICGC data analysis centers (yet)
• No common ICGC workflow systems (yet)
• No common ICGC pipelines (yet)
ICGC Computes (2)
• Who are the cloud-based data consumers?
• What do they need/want?
    • Sufficient to have ICGC simply provide datasets?
    • Does ICGC need to also provide canned analysis pipelines?
        • Reproduce methods used in ICGC publications?
        • Who creates/maintains these?
        • Using which workflow system?
Other Issues
• Can ICGC DACO assure authorization and compliance of cloud-based data consumers?
    • Auditing, revoking access, etc.
    • How is this achieved?
• What are the support needs of ICGC Cloud users?
    • How much effort will they require?
    • From whom?
• What is the minimal metadata we need to collect to make the data useful?
    • Who ensures this?
