This document provides an overview of computational techniques for analyzing metabolomics data. It describes several web-based tools and R packages that can be used for biomarker discovery, data analysis, and pathway analysis using metabolomics data. These include MetaboAnalyst for statistical analysis and visualization, xmsPANDA for preprocessing, biomarker discovery, clustering and network analysis, and Mummichog for pathway analysis. The document then discusses specific workflows and parameters for preprocessing raw LC-MS data, performing quality control checks, and conducting statistical analysis and visualization in MetaboAnalyst and xmsPANDA.
Cardiology_Metabolomics_workshop_2016_v2
1. Computational techniques for Metabolomics Data
Analysis
Sophia A. Banton and Karan Uppal
Clinical Biomarkers Laboratory
Emory University School of Medicine
sbanton@emory.edu, kuppal2@emory.edu
Integrated Health Science and Facilities Core
NIEHS P30 ES019776
August 11, 2016
2. Topics covered in this workshop
• Overview of metabolomics data
• Web-based tools for biomarker discovery and data analysis
  - MetaboAnalyst 3.0 (hands-on)
• Using R for biomarker discovery and data analysis
  - xmsPANDA (hands-on)
  - Runs on R >= 3.2.0
• Mummichog for pathway analysis
  - Runs on Python 2.7
5. HRM: Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty Liver Disease: An Untargeted, High Resolution Metabolomics Study
Jin and Banton, et al. Amino Acid Metabolism is Altered in Adolescents with Nonalcoholic Fatty Liver Disease: An Untargeted, High Resolution Metabolomics Study. The Journal of Pediatrics, Volume 172, May 2016, Pages 14-19.e5.
7. Connecting HRM: Plasma Metabolomics of Common Marmosets (Callithrix
jacchus) to Evaluate Diet and Feeding Husbandry
Banton et al. Plasma Metabolomics of Common
Marmosets (Callithrix jacchus) to Evaluate Diet and
Feeding Husbandry. JAALAS. March 2016
8. Data Analysis Workflow
Input: LC-Orbitrap MS raw data
• Raw data processing with built-in feature and sample quality assessment (apLCMS with xMSanalyzer)
• Data exploratory analysis (box plots, histograms, etc.)
• Batch-effect evaluation and correction (using ComBat); void volume filtering
• Annotation of metabolites (xMSannotator)
• Metabolite prediction based on MS/MS
  - Metlin (known)
  - MassBank (known/unknown)
• MS/MS validation and deconvolution
  - DeconMSn
• Pathway analysis (Mummichog, MetaboAnalyst, MetaCore, MSEA)
• Biomarker and network analysis (xmsPANDA, MetabNet, MetaboAnalyst)
  - Univariate: Limma t-test, paired t-test, ANOVA, time-series
  - Multivariate and predictive analysis: Support vector machine, Random forest, PLSDA
  - Clustering: Two-way hierarchical clustering analysis
  - Targeted and untargeted MWAS
Final deliverables:
1. Untargeted feature table
2. Targeted feature table
3. Annotated feature table
10. LC/MS data processing using apLCMS or XCMS with xMSanalyzer R package
• Peak detection and alignment using apLCMS or XCMS at different parameter settings
  - Noise removal and peak detection in each run
  - Peak grouping after retention time alignment
  - Recovery of weaker signals or filling missing peaks
  - Summary feature table
• Feature and sample quality assessment
• Merge results from different parameter settings
• Mass calibration, batch-effect evaluation and correction
• Annotation of metabolites
Outputs:
1. Untargeted feature table
2. Targeted feature table
3. Annotated feature table
4. EIC and QC plots
11. Quality evaluation and assurance
A. xMSanalyzer has built-in routines that evaluate the quality of both features and samples
  - Each sample is run in triplicate, which allows the quality of features and samples to be evaluated from the coefficient of variation (CV) and the Pearson correlation within technical replicates, respectively
  - Only features with median CV < 50% and samples whose technical replicates have an average pairwise Pearson correlation > 0.7 are retained for further analysis
  - A quality score is assigned to each measured m/z that takes into account both reliability and reproducibility of detection
B. Batch-effect evaluation using Principal Component Analysis
C. Batch-effect correction using ComBat (Johnson 2007, Biostatistics)
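The two retention rules on this slide (median CV < 50% per feature, mean pairwise Pearson correlation > 0.7 per sample) can be sketched in base R. This is only an illustration on simulated data, not xMSanalyzer's own routine; the object names (`intensities`, `sample_id`) and the toy values are assumptions.

```r
# Sketch only: base R, not xMSanalyzer's internal routine.
# Toy data: 4 features x 2 biological samples x 3 technical replicates.
set.seed(1)
sample_id <- rep(c("S1", "S2"), each = 3)        # replicate-to-sample map (hypothetical)
base <- c(1000, 5000, 200, 800)
intensities <- sapply(seq_along(sample_id), function(i) base * rnorm(4, 1, 0.05))
intensities[3, ] <- c(10, 400, 90, 300, 5, 250)  # deliberately noisy feature
rownames(intensities) <- paste0("feature", 1:4)

# Percent CV of each feature within each biological sample's replicates
cv_pct <- function(x) 100 * sd(x) / mean(x)
cv_by_sample <- sapply(unique(sample_id), function(s) {
  apply(intensities[, sample_id == s, drop = FALSE], 1, cv_pct)
})
median_cv <- apply(cv_by_sample, 1, median)
keep_features <- median_cv < 50                  # feature rule from the slide

# Mean pairwise Pearson correlation between the replicates of each sample
mean_pairwise_cor <- sapply(unique(sample_id), function(s) {
  r <- cor(intensities[, sample_id == s, drop = FALSE], method = "pearson")
  mean(r[upper.tri(r)])
})
keep_samples <- mean_pairwise_cor > 0.7          # sample rule from the slide
```

Here `keep_features` flags the noisy third feature for removal, while both samples pass the replicate-correlation rule.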
12. Feature table: column headings
mz: Median measured mass-to-charge across all samples
time: Median retention time at which the ion elutes
mz.min: Minimum measured mass-to-charge across all samples
mz.max: Maximum measured mass-to-charge across all samples
NumPres.All.Samples: Number of samples with non-missing/non-zero values
NumPres.Biol.Samples: Number of biological samples for which 2 out of the 3 replicates have non-missing/non-zero values
median_CV: Median coefficient of variation (%) within technical replicates
Qscore: Quality score, defined as the ratio of the percentage of biological samples for which > 50% of technical replicates have a signal to the % median CV. A higher Qscore means the feature is more quantitatively reproducible within technical replicates and is detected across a large percentage of biological samples
Max.intensity: Maximum intensity of the feature across all samples
VT_SampleRunDate_RunNumber.cdf: Integrated peak area (ion intensity) in each sample; each sample has 3 technical replicates (e.g., VT_130712_002, VT_130712_004, VT_130712_006)
Feature Quality Assessment
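The Qscore definition in the table can be written out as a short function. The sketch below follows the wording of the definition only; how xMSanalyzer treats missing values when computing the CV is not stated on the slide, so dropping zero/NA replicates before the CV is an assumption, as are the function and argument names.

```r
# Sketch only: follows the definition in the table, not xMSanalyzer's source code.
qscore <- function(x, sample_id) {
  # x: intensities of one feature across all replicate runs (0 or NA = missing)
  present <- !is.na(x) & x > 0
  # Fraction of technical replicates with a signal, per biological sample
  frac_present <- tapply(present, sample_id, mean)
  pct_samples_detected <- 100 * mean(frac_present > 0.5)
  # Percent CV per biological sample (assumption: missing replicates are dropped)
  cv <- tapply(x, sample_id, function(v) {
    v <- v[!is.na(v) & v > 0]
    if (length(v) < 2) NA_real_ else 100 * sd(v) / mean(v)
  })
  pct_samples_detected / median(cv, na.rm = TRUE)
}

sample_id <- rep(c("S1", "S2"), each = 3)
x <- c(90, 100, 110, 0, 0, 50)   # detected in all S1 replicates, one S2 replicate
qscore(x, sample_id)
```

In this toy case 50% of biological samples have a signal in more than half of their replicates and the median CV is 10%, so the score is 50 / 10 = 5.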
14. Biomarker and statistical analysis using
MetaboAnalyst3.0
(http://www.metaboanalyst.ca/)
15. Various options for feature selection and predictive evaluation
• Univariate:
  - T-test, paired t-test, LIMMA-based t-test
    • P-values from moderated t-tests were adjusted for multiple hypothesis testing using the Benjamini-Hochberg false discovery rate (FDR) correction method
  - Manhattan plot to visualize metabolome-wide statistically significant changes
• Multivariate and data mining:
  - Supervised:
    • Support Vector Machine
    • Partial Least Squares Discriminant Analysis
    • Random Forest
  - Unsupervised:
    • Principal Component Analysis
    • Two-way hierarchical clustering analysis
    • K-means clustering
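The univariate step with Benjamini-Hochberg correction can be sketched in base R. Note the substitution: this example uses an ordinary Welch t-test from `t.test`, not the moderated t-test from LIMMA that the slide refers to, and the data are simulated, so it shows only the shape of the procedure.

```r
# Sketch only: Welch t-test per feature (a stand-in for the moderated t-test),
# followed by Benjamini-Hochberg FDR adjustment.
set.seed(42)
n_features <- 200
group <- factor(rep(c("control", "case"), each = 10))
X <- matrix(rnorm(n_features * 20), nrow = n_features)     # features x samples (log scale)
X[1:10, group == "case"] <- X[1:10, group == "case"] + 5   # 10 truly shifted features

p_raw <- apply(X, 1, function(v) t.test(v ~ group)$p.value)
p_fdr <- p.adjust(p_raw, method = "BH")                    # BH-adjusted p-values
significant <- which(p_fdr < 0.05)
```

`p.adjust` never returns a value smaller than the raw p-value, which is a quick sanity check when wiring this into a larger pipeline.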
36. (EXTREMELY) Useful resources
• Xia J. and Wishart D., Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst, Nature Protocols 2011
• Sugimoto et al., Bioinformatics Tools for Mass Spectroscopy-Based Metabolomic Data Processing and Analysis, Current Bioinformatics 2012
37. xmsPANDA: R package for pre-processing, biomarker discovery,
clustering, and network analysis
38. xmsPANDA workflow
Module a) Data pre-processing (Stage 1)
• Replicate summarization
• Data filtering: missing values, relative standard deviation
• Data transformation (log, z-score)
• Normalization (quantile)
Module b) Data mining (Stage 2)
• Univariate: Limma t-test, paired t-test, Wilcoxon test, mixed effects model, ANOVA
• Multivariate and predictive analysis for regression and classification: Support vector machine, MARS, Random forest, PLS, sPLS
• Unsupervised: PCA, two-way hierarchical clustering analysis
Module c) Metabolome-wide association (correlation) analysis (Stage 3)
• Global: Pairwise correlation and network of all metabolites
• Targeted: Pairwise correlation and network of targeted metabolites
• Developed by Karan Uppal Ph.D., MSc., Assistant Professor, Emory University School of Medicine
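The Stage 1 steps (replicate summarization, missing-value filtering, log transformation, quantile normalization) can be approximated with base R. This is a sketch under assumptions, not xmsPANDA's API: the median as the summary statistic, the 50% missing-value cut-off, and all object names are illustrative choices.

```r
# Sketch only: base R stand-ins for the Stage 1 steps; not xmsPANDA's functions.
set.seed(7)
sample_id <- rep(paste0("S", 1:4), each = 3)          # 4 samples x 3 replicates
raw <- matrix(rlnorm(6 * 12, meanlog = 10), nrow = 6,
              dimnames = list(paste0("feature", 1:6), NULL))
raw[6, 1:9] <- NA                                     # feature6 detected in one sample only

# 1. Replicate summarization (assumed: median of technical replicates)
summarized <- t(apply(raw, 1, function(v) tapply(v, sample_id, median, na.rm = TRUE)))

# 2. Filter features missing in more than 50% of samples (assumed cut-off)
keep <- rowMeans(is.na(summarized)) <= 0.5
filtered <- summarized[keep, , drop = FALSE]

# 3. log2 transformation
logged <- log2(filtered)

# 4. Quantile normalization (requires a matrix without missing values)
quantile_normalize <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  sorted_means <- rowMeans(apply(m, 2, sort))         # mean of each order statistic
  out <- apply(ranks, 2, function(r) sorted_means[r])
  dimnames(out) <- dimnames(m)
  out
}
normalized <- quantile_normalize(logged)
```

After quantile normalization every sample column contains the same set of values, only in a different order, which is what makes the sample distributions directly comparable.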
39. xmsPANDA: Various options for feature selection and predictive evaluation
• Univariate:
  - T-test, paired t-test, LIMMA, linear regression, ANOVA
    • P-values from moderated t-tests were adjusted for multiple hypothesis testing using the Benjamini-Hochberg false discovery rate (FDR) correction method
  - Manhattan plot to visualize metabolome-wide statistically significant changes
• Multivariate and data mining:
  - Supervised:
    • Support Vector Machine
    • Partial Least Squares and Discriminant Analysis variants (PLS, PLSDA, sPLS, sPLSDA)
    • Random Forest
    • Splines based (MARS)
  - Unsupervised:
    • Principal Component Analysis
    • Two-way hierarchical clustering analysis
• Correlation/network analysis using MetabNet (Uppal 2015):
  - Untargeted: Correlations with all metabolites
  - Targeted: Correlations with metabolites from a specific pathway, clinical parameters
40. xmsPANDA: Sample input files
a. Feature table
b. Class labels file
The order of the sample IDs must be identical in both files.
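Because the sample order must agree between the two input files, it is worth checking before running an analysis. The sketch below assumes a feature table whose first two columns are `mz` and `time` followed by one column per sample, and a class labels table with `SampleID` and `Class` columns; these layouts and names are assumptions for illustration.

```r
# Sketch only: hypothetical file layouts held in data frames.
feature_table <- data.frame(mz = c(132.1019, 203.0526), time = c(45, 60),
                            S2 = c(10, 20), S1 = c(11, 19), S3 = c(9, 25))
class_labels <- data.frame(SampleID = c("S1", "S2", "S3"),
                           Class = c("case", "control", "case"),
                           stringsAsFactors = FALSE)

sample_cols <- setdiff(names(feature_table), c("mz", "time"))

# Both files must describe exactly the same samples
stopifnot(setequal(sample_cols, class_labels$SampleID))

# Reorder the class labels to follow the feature table's column order
class_labels <- class_labels[match(sample_cols, class_labels$SampleID), ]
stopifnot(identical(sample_cols, class_labels$SampleID))
```

Reordering the labels rather than the intensity columns keeps the feature table untouched, which makes it easier to trace results back to the raw file.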
42. xmsPANDA Manhattan plots: the y-axis corresponds to -log10(p-value); the FDR cut-off is represented by the horizontal line
a) -logP vs m/z
b) -logP vs retention time
Annotated regions on the plots: amino acids; lipids, steroids
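A Manhattan plot of this kind can be drawn with base graphics. In this sketch the horizontal line is placed at the -log10 of the largest raw p-value that still passes the FDR cut-off; that placement, the 0.05 cut-off, and the simulated data are assumptions, and xmsPANDA may draw its line differently.

```r
# Sketch only: simulated p-values, base graphics.
set.seed(3)
mz <- sort(runif(500, 85, 1250))
p <- runif(500)
p[1:15] <- 10^-runif(15, 4, 8)          # a few strongly associated features
p_fdr <- p.adjust(p, method = "BH")
fdr_cutoff <- 0.05                      # assumed cut-off

neg_log_p <- -log10(p)
passing <- p_fdr <= fdr_cutoff
# Line height: -log10 of the largest raw p-value that passes the FDR cut-off
line_y <- if (any(passing)) -log10(max(p[passing])) else NA

plot(mz, neg_log_p, pch = 20, col = ifelse(passing, "red", "grey40"),
     xlab = "m/z", ylab = "-log10(p-value)")
abline(h = line_y, lty = 2)
```

Replacing `mz` with retention time on the x-axis gives the second panel described on the slide.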
43. xmsPANDA PCA and cluster analysis
Principal Component Analysis (PCA): scores plot, PC1 vs PC2
Hierarchical Clustering Analysis (HCA): heatmap of samples vs m/z features
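Both views can be reproduced in outline with `prcomp` and `hclust` from base R. The sketch uses simulated data with two sample groups; the distance measures and linkage method are illustrative assumptions rather than xmsPANDA's defaults.

```r
# Sketch only: PCA and two-way hierarchical clustering on simulated data.
set.seed(11)
X <- matrix(rnorm(30 * 12), nrow = 30,
            dimnames = list(paste0("mz", 1:30), paste0("S", 1:12)))
X[1:10, 7:12] <- X[1:10, 7:12] + 6       # samples S7-S12 differ in 10 features

# PCA on samples (rows must be samples for prcomp)
pca <- prcomp(t(X), center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]                   # PC1 and PC2 coordinates per sample
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Two-way HCA: cluster samples and features separately
sample_tree  <- hclust(dist(t(X)), method = "complete")
feature_tree <- hclust(as.dist(1 - cor(t(X))), method = "complete")
sample_clusters <- cutree(sample_tree, k = 2)

heatmap(X, Rowv = as.dendrogram(feature_tree),
        Colv = as.dendrogram(sample_tree), scale = "row")
```

Plotting `scores` colored by class label gives the PC1 vs PC2 view, and the heatmap shows the samples by m/z features layout.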
45. xmsPANDA Network analysis using MetabNet (Stage 3)
Legend: correlated m/z features (|cor| > 0.4 at FDR 0.2); putative biomarkers from PLS
• Targeted metabolome-wide association study (MWAS) of specific metabolites (biomarkers, environmental exposures, etc.)
• Facilitates detection of related metabolic pathways and network structures
• Correlation-based network analysis
• Each node corresponds to a metabolite and each edge to the correlation coefficient, Cij
• Two metabolites are linked if |Cij| > threshold at a user-defined significance level
• Pearson, Spearman, and partial correlation
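The linking rule (|Cij| above a threshold at a chosen significance level) can be sketched with `cor.test` and `p.adjust`. The thresholds mirror the figure legend (|cor| > 0.4, FDR 0.2); the data are simulated and the code is not MetabNet's implementation.

```r
# Sketch only: correlation-based edge list in base R; not MetabNet's code.
set.seed(5)
n <- 40
driver <- rnorm(n)
X <- cbind(m1 = driver + rnorm(n, sd = 0.3),
           m2 = driver + rnorm(n, sd = 0.3),
           m3 = -driver + rnorm(n, sd = 0.3),
           m4 = rnorm(n), m5 = rnorm(n))

pairs <- t(combn(colnames(X), 2))        # all metabolite pairs
stats <- t(sapply(seq_len(nrow(pairs)), function(i) {
  ct <- cor.test(X[, pairs[i, 1]], X[, pairs[i, 2]], method = "pearson")
  c(cor = unname(ct$estimate), p = ct$p.value)
}))
edges <- data.frame(from = pairs[, 1], to = pairs[, 2], stats,
                    stringsAsFactors = FALSE)
edges$fdr <- p.adjust(edges$p, method = "BH")

# Keep an edge only if it passes both the correlation and the FDR threshold
network <- edges[abs(edges$cor) > 0.4 & edges$fdr < 0.2, ]
```

Switching `method` to `"spearman"` in `cor.test` gives the rank-based variant; partial correlation would need an additional step that is not shown here.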
46. Summary
• xmsPANDA provides an automated workflow for analyzing metabolomics data (the package can be adapted to work with other -omics data)
• The package facilitates network-level investigation of significant or differentially expressed metabolites
• Includes independent functions for hierarchical clustering analysis, PCA, and boxplots
• Availability
  - Emory IT Box (accessible under the MetabolomicsWorkshopSummer2016 folder on Box)
  - Email: kuppal2@emory.edu
48. Mummichog for pathway enrichment analysis
A) In the workflow of untargeted metabolomics, the conventional approach requires the metabolites to be identified before pathway/network analysis, while mummichog (blue arrow) predicts functional activity bypassing metabolite identification.
B) Each row of dots represents possible matches of metabolites from one m/z feature: red is the true metabolite, gray the false matches. The conventional approach first requires the identification of metabolites before mapping them to the metabolic network.
C) mummichog maps all possible metabolite matches to the network and looks for local enrichment, which reflects the true activity because the false matches will distribute randomly.
• Developed by Shuzhao Li Ph.D., Assistant Professor, Emory University School of Medicine
• Li et al. 2013. PLoS Computational Biology
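The idea that true matches concentrate in a pathway while false matches scatter randomly can be illustrated with a small permutation test. This is a deliberately simplified stand-in written in R, not Mummichog's implementation (Mummichog itself is a Python program and the published method is more elaborate); every name and number below is hypothetical.

```r
# Sketch only: permutation-based enrichment over ambiguous m/z-to-metabolite matches.
set.seed(9)
n_features <- 300
metabolites <- paste0("C", 1:400)
# Each m/z feature matches one to three candidate metabolites (mostly false matches)
candidates <- lapply(seq_len(n_features),
                     function(i) sample(metabolites, sample(1:3, 1)))
pathway <- paste0("C", 1:20)               # hypothetical pathway members

# Plant 8 features that truly map to the pathway, each with one false match
candidates[1:8] <- lapply(1:8, function(i) c(pathway[i], sample(metabolites, 1)))
significant <- c(1:8, sample(9:n_features, 12))

# Number of distinct pathway members hit by all candidate matches of a feature set
overlap <- function(idx) length(intersect(unlist(candidates[idx]), pathway))

observed <- overlap(significant)
# Null distribution: same-sized feature lists drawn at random
null <- replicate(1000, overlap(sample(n_features, length(significant))))
p_perm <- (sum(null >= observed) + 1) / (length(null) + 1)
```

No feature is identified before testing: all candidate matches are mapped, and the enrichment signal comes from the planted true matches exceeding what random feature lists produce.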
50. Metabolite annotation
• >10,000 reproducible signals can be detected using liquid chromatography high-resolution mass spectrometry
• Simple database searches can result in a large number of false positives
51. Metabolite Annotation: mapping m/z from LC-MS data to known metabolites in databases
Many-to-many relationship between m/z (m/z 1, m/z 2) and metabolites
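A mass-only database search shows why the mapping is ambiguous. The sketch below matches observed m/z values against M+H ions within a 10 ppm tolerance; the three-entry database and the observed values are made up for illustration, while the monoisotopic masses are standard values. Leucine and isoleucine are isomers, so a single m/z matches both.

```r
# Sketch only: ppm-tolerance matching of observed m/z to a tiny hypothetical database.
ppm_error <- function(observed, theoretical) {
  1e6 * abs(observed - theoretical) / theoretical
}

# Monoisotopic neutral masses (Da)
db <- data.frame(name = c("Leucine", "Isoleucine", "Glucose"),
                 mass = c(131.09463, 131.09463, 180.06339),
                 stringsAsFactors = FALSE)
proton <- 1.007276
db$mz_MH <- db$mass + proton             # expected m/z of the M+H ion

observed_mz <- c(132.1019, 181.0707, 250.0000)
max_ppm <- 10

matches <- do.call(rbind, lapply(observed_mz, function(mz) {
  hit <- ppm_error(mz, db$mz_MH) <= max_ppm
  if (!any(hit)) return(NULL)            # no database match for this feature
  data.frame(observed_mz = mz, name = db$name[hit],
             ppm = ppm_error(mz, db$mz_MH[hit]))
}))
matches
```

The first feature returns two candidates, the second returns one, and the third returns none, so mass alone cannot resolve the first annotation.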
52. Main goals of xMSannotator
• Incorporate multiple layers of information (m/z, retention time, intensity profiles, isotope patterns, pathway membership) to enhance confidence in annotations and prioritize candidates for validation using MS/MS and chemical standards
• Perform suspect screening (exposure to environmental chemicals, drugs)
• Allow use of cluster/module membership to facilitate generating hypotheses about biochemical roles of features with no database matches
• Developed by Karan Uppal Ph.D., MSc., Assistant Professor, Emory University School of Medicine
53. Databases supported by xMSannotator
• Human Metabolome Database (HMDB)
  - About 41,000 metabolites
    • 2,824 (Detected and Quantified)
    • 251 (Detected but not Quantified)
    • 38,439 (Expected but not detected)
• LipidMaps
  - 36,269 lipids
• The Toxin and Toxin Target Database (T3DB)
  - 2,097 toxic chemicals
• KEGG
  - 15,298 chemicals
54. xMSannotator functions
• multilevelannotation(): multi-criteria based annotation that assigns annotations into confidence levels (high, medium, low, none)
• get_mz_by_KEGGspecies:
  - generate list of expected m/z based on adducts for all metabolites associated with a species in KEGG
• get_mz_by_KEGGpathwayIDs:
  - generate list of expected m/z based on adducts for all metabolites in specific pathways
• get_mz_by_KEGGcompoundIDs:
  - generate list of expected m/z based on adducts for a given KEGG compound ID
• get_kegg_map:
  - download KEGG map as a PNG file with color-coded KEGG IDs
• ChemSpider.annotation:
  - m/z based annotation for select databases in ChemSpider
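Generating expected m/z values from adducts comes down to one formula: (number of molecules × neutral mass + adduct mass difference) / charge. The sketch below applies it with a small hand-written adduct table; the table layout and the function name are hypothetical, and the table shipped with xMSannotator via data(adduct_table) may be organized differently.

```r
# Sketch only: expected m/z from adducts; hypothetical mini adduct table.
adducts <- data.frame(
  adduct    = c("M+H", "M+Na", "M+2H", "2M+H"),
  n_mol     = c(1, 1, 1, 2),             # number of molecules in the ion
  charge    = c(1, 1, 2, 1),
  mass_diff = c(1.007276, 22.989218, 2.014552, 1.007276),
  stringsAsFactors = FALSE
)

expected_mz <- function(neutral_mass, adducts) {
  mz <- (adducts$n_mol * neutral_mass + adducts$mass_diff) / adducts$charge
  setNames(mz, adducts$adduct)
}

# Glucose, monoisotopic neutral mass 180.06339 Da
glucose_mz <- expected_mz(180.06339, adducts)
glucose_mz
```

One metabolite yields several expected m/z values, which is the other half of the many-to-many relationship between m/z features and metabolites.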
55. xMSannotator example script (R)
library(xMSannotator)
# Package data files
data(example_data)   # example peak intensity matrix
data(adduct_table)
data(adduct_weights)
#data(customIDs)     # example for custom IDs
#data(customDB)      # example for custom DB
#data(hmdbAllinf)
#data(keggotherinf)
#data(t3dbotherinf)
########### Parameters to change ##############
# Input feature table (tab-delimited)
dataA<-read.table("/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/50marmosets_rawdata_averaged.txt",sep="\t",header=TRUE)
#OR
#dataA<-example_data
# Output folder
outloc<-"/Users/karanuppal/Documents/Emory/JonesLab/Projects/xMSannotator/testBloodSpotv1.1.2T3DB/"
max.mz.diff<-10   # mass search tolerance for DB matching in ppm
max.rt.diff<-10   # retention time tolerance between adducts/isotopes
corthresh<-0.7    # correlation threshold between adducts/isotopes
max_isp<-5
mass_defect_window<-0.01
num_nodes<-4      # number of cores to be used; 2 is recommended for desktop computers due to high memory consumption
db_name<-"HMDB"   # other options: KEGG, LipidMaps, T3DB
status<-NA        # other options: "Detected", NA, "Expected and Not Quantified"
num_sets<-300     # number of sets into which the total number of database entries should be split
mode<-"pos"       # ionization mode
queryadductlist<-c("M+2H","M+H+NH4","M+ACN+2H","M+2ACN+2H","M+H","M+NH4","M+Na","M+ACN+H","M+ACN+Na","M+2ACN+H","2M+H","2M+Na",
"2M+ACN+H","M+2Na-H","M+H-H2O","M+H-2H2O")
# other options: c("M-H","M-H2O-H","M+Na-2H","M+Cl","M+FA-H"); c("positive"); c("negative");
# c("all"); see data(adduct_table) for complete list
#########################
dataA<-unique(dataA)   # drop duplicated rows
print(dim(dataA))
# Run the multi-level annotation and report the elapsed time
system.time(annotres<-multilevelannotation(dataA=dataA,max.mz.diff=max.mz.diff,max.rt.diff=max.rt.diff,cormethod="pearson",num_nodes=num_nodes,queryadductlist=queryadductlist,
mode=mode,outloc=outloc,db_name=db_name, adduct_weights=adduct_weights,num_sets=num_sets,allsteps=TRUE,
corthresh=corthresh,NOPS_check=TRUE,customIDs=NA,missing.value=NA,hclustmethod="complete",deepsplit=2,networktype="unsigned",
minclustsize=10,module.merge.dissimilarity=0.2,filter.by=c("M+H"),biofluid.location=NA,origin=NA,status=status,boostIDs=NA,max_isp=max_isp,
HMDBselect="union",mass_defect_window=mass_defect_window,pathwaycheckmode="pm",mass_defect_mode="pos")
)
62. xmsPANDA
• Results
• ReadME.txt
  - Stage 1 results: Preprocessing (normalization, transformation)
  - Stage 2 results: Feature selection and evaluation results (Manhattan plots, PCA, HCA, boxplots, table of significant features, clustering results)
  - Stage 3 results: Correlation-based network analysis
63. xmsPANDA Stage 2 Results
• Results
  - Pages 9 and 10: Type I and II Manhattan plots
  - Page 11: 2-way HCA heatmap
  - Final page(s): box plots
Cotinine