This document discusses methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. It presents three methods - eigenvalue decomposition, uniformization, and exponentiation - and evaluates their accuracy and speed for different models and time points. For accuracy, it finds that exponentiation is most accurate. For speed, exponentiation is slowest while eigenvalue decomposition and uniformization trade off accuracy and speed depending on the model and time points. Uniformization has potential to be improved through better cutoff points or adaptive uniformization techniques.
This document outlines a timeline that begins with low stress work on a thesis such as collecting papers or reformatting a document. Stress levels then increase as the focus shifts to the thesis itself through drafting and revisions. Finally, the timeline shows stress peaking as deadlines approach but notes taking control to reduce stress through chilling out.
1. The document discusses methods for calculating summary statistics for continuous-time Markov chains (CTMCs), which are used to model processes like DNA sequence evolution.
2. It focuses on calculating the expected number of jumps between states and expected waiting time in a state, conditioned on the start and end points. These statistics are needed to estimate model parameters using maximum likelihood approaches like the EM algorithm.
3. As an application, the document describes using a 61x61 CTMC rate matrix to model codon sequence evolution, with the matrix parameters estimated via the EM algorithm using the summary statistics. Expected jumps between states differentiated by transition type are of particular interest.
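The 61 states of such a codon model, and the categories a single-nucleotide jump can fall into, can be sketched in a few lines. This is only an illustration under assumptions: it uses the standard genetic code, and it reads "transition type" as the usual codon-model categories (transition versus transversion, synonymous versus non-synonymous), which the summary does not spell out. The names `SENSE_CODONS` and `classify_jump` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The 61 sense codons are the states of the 61x61 rate matrix.
SENSE_CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]

PURINES = {"A", "G"}

def classify_jump(codon_from, codon_to):
    """Classify a jump between two codons, or return None if the codons
    do not differ in exactly one nucleotide."""
    diffs = [(x, y) for x, y in zip(codon_from, codon_to) if x != y]
    if len(diffs) != 1:
        return None
    x, y = diffs[0]
    # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
    kind = "transition" if (x in PURINES) == (y in PURINES) else "transversion"
    same_aa = GENETIC_CODE[codon_from] == GENETIC_CODE[codon_to]
    return kind, "synonymous" if same_aa else "non-synonymous"

# Count the ordered single-nucleotide jumps between sense codons by category.
jump_counts = Counter(
    category
    for a in SENSE_CODONS
    for b in SENSE_CODONS
    if (category := classify_jump(a, b)) is not None
)

if __name__ == "__main__":
    print(len(SENSE_CODONS), "sense codons")
    for category, count in sorted(jump_counts.items()):
        print(category, count)
```

In an EM setting, the expected jump counts would be summed within each of these categories rather than tracked for every ordered codon pair.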
The markovchain R package provides tools for creating, representing, and analyzing discrete time Markov chains (DTMCs). It allows users to easily define DTMC objects, perform structural analysis of transition matrices, estimate transition matrices from data, and simulate stochastic sequences from DTMCs. The package aims to make working with Markov chains straightforward for R programmers through S4 classes and methods.
Probability formula sheet
Set theory, sample space, events, concepts of randomness and uncertainty, basic principles of probability, axioms and properties of probability, conditional probability, independent events, Bayes formula, Bernoulli trials, sequential experiments, discrete and continuous random variables, distribution and density functions, one- and two-dimensional random variables, marginal and joint distributions and density functions. Expectations, probability distribution families (binomial, Poisson, hypergeometric, geometric, normal, uniform and exponential), mean, variance, standard deviation, moments and moment generating functions, law of large numbers, limit theorems
for more visit http://tricntip.blogspot.com/
This document describes three methods (eigenvalue decomposition, uniformization, and matrix exponentiation) for computing the sufficient statistics of continuous-time Markov chains (CTMCs) that are needed for maximum likelihood estimation. The eigenvalue decomposition method is prone to large errors, the uniformization method sums many small numbers, and the matrix exponentiation method is the most accurate but also the slowest. All three methods were implemented and compared for speed and accuracy when computing the statistics inside an expectation-maximization algorithm.
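As a rough illustration of the uniformization approach, the sketch below computes the conditional expected waiting time in a state and the conditional expected number of jumps for a small CTMC. It is a minimal sketch and not the implementation the document evaluates. It relies on the standard identities E[T_c | a, b] = I_cc(a, b) / P_ab(t) and E[N_cd | a, b] = q_cd I_cd(a, b) / P_ab(t), where I_cd(a, b) is the integral of P_ac(s) P_db(t - s) over s from 0 to t. The function names, the truncation rule (stop once the Poisson mass reaches 1 - tol), and the plain-Python matrix product are all assumptions made for the example.

```python
import math

def matmul(A, B):
    """Plain-Python product of two square matrices given as lists of lists."""
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]

def uniformization(Q, t, c, d, tol=1e-12, max_terms=10_000):
    """Return (P, I): P[a][b] = P_ab(t) and
    I[a][b] = integral_0^t P_ac(s) * P_db(t - s) ds.
    Assumes Q has at least one non-zero rate and mu * t is moderate
    (exp(-mu * t) must not underflow)."""
    n = len(Q)
    mu = max(-Q[i][i] for i in range(n))            # uniformization rate
    R = [[float(i == j) + Q[i][j] / mu for j in range(n)] for i in range(n)]
    powers = [[[float(i == j) for j in range(n)] for i in range(n)]]  # R^0
    P = [[0.0] * n for _ in range(n)]
    I = [[0.0] * n for _ in range(n)]
    weight = math.exp(-mu * t)                      # Poisson(0; mu * t)
    mass = 0.0
    for m in range(max_terms):
        next_weight = weight * mu * t / (m + 1)     # Poisson(m + 1; mu * t)
        for a in range(n):
            for b in range(n):
                P[a][b] += weight * powers[m][a][b]
                inner = sum(powers[l][a][c] * powers[m - l][d][b]
                            for l in range(m + 1))
                I[a][b] += next_weight / mu * inner
        mass += weight
        if 1.0 - mass < tol:                        # Poisson tail is negligible
            break
        powers.append(matmul(powers[-1], R))
        weight = next_weight
    return P, I

def expected_waiting_time(Q, t, a, b, c):
    """E[time spent in state c | X(0) = a, X(t) = b]."""
    P, I = uniformization(Q, t, c, c)
    return I[a][b] / P[a][b]

def expected_jumps(Q, t, a, b, c, d):
    """E[number of jumps c -> d | X(0) = a, X(t) = b], for c != d."""
    P, I = uniformization(Q, t, c, d)
    return Q[c][d] * I[a][b] / P[a][b]

if __name__ == "__main__":
    Q = [[-1.0, 1.0], [1.0, -1.0]]                  # symmetric two-state chain
    print(expected_waiting_time(Q, 1.0, 0, 1, 0))
    print(expected_jumps(Q, 1.0, 0, 1, 0, 1))
```

Every term in these sums is a product of a Poisson probability and entries of powers of a stochastic matrix, so nothing is subtracted; the cost is that many small non-negative numbers are added, and the number of terms grows with mu * t.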
This document outlines Paula Tataru's qualification exam, which focuses on developing statistical theory and algorithms for analyzing molecular data. It summarizes her past work on using stochastic context-free grammars (SCFGs) for RNA secondary structure prediction and calculating expectations for continuous-time Markov chains (CTMCs). It then discusses her current work on hidden Markov models (HMMs), including algorithms for patterns in HMMs and using HMMs to infer population parameters. Future work includes applying these methods to real data and extending models to incorporate additional properties.
Finding a Dynamical Model of a Social Norm Physical Activity Intervention, by Victor Asanza
Low levels of physical activity in sedentary individuals constitute a major concern in public health.
Physical activity interventions can be designed relying on mobile technologies such as smartphones.
The purpose of this work is to find a dynamical model of a social norm physical activity intervention relying on Social Cognitive Theory, and using a data set obtained from a previous experiment.
The model will serve as a framework for the design of future optimized interventions. To obtain model parameters, two strategies are developed: first, an algorithm is proposed that randomly varies the values of each model parameter around initial guesses.
The second approach utilizes traditional system identification concepts to obtain model parameters relying on semi-physical identification routines. For both cases, the obtained model is assessed through the computation of percentage fits to a validation data set, and by the development of a correlation analysis.
Computational Intelligence for Time Series Prediction, by Gianluca Bontempi
This document provides an overview of computational intelligence methods for time series prediction. It begins with introductions to time series analysis and machine learning approaches for prediction. Specific models discussed include autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) processes. Parameter estimation techniques for AR models are also covered. The document outlines applications in areas like forecasting, wireless sensors, and biomedicine and concludes with perspectives on future directions.
This document provides an overview of signal fundamentals, including definitions, examples, and properties of signals. It discusses topics such as signal energy and power, signal transformations, periodic and exponential signals. Examples are provided to illustrate concepts such as determining if a signal has finite energy/power, applying signal transformations, decomposing signals into even and odd components, and plotting exponential signals. The document is from a university course on signal fundamentals and is intended to introduce basic signal processing concepts.
rbs - presentation about applications of machine learning, by ChellamuthuMech
This document discusses applications of machine learning for interdisciplinary research. It covers topics like data science examples, decision tree algorithms and their applications. Numerical examples are provided to illustrate decision tree analysis for manufacturing and wear rate prediction datasets. Association rule mining is also discussed along with other potential areas of implementation like credit cards, grocery shopping, visa processing and meteorology. Algorithms mentioned include decision trees, artificial neural networks, support vector machines, Apriori and clustering. The document emphasizes the significance of converting low-level data to high-level knowledge through machine learning techniques.
LSSC2011 Optimization of intermolecular interaction potential energy paramete..., by Dragan Sahpaski
Optimization of intermolecular interaction potential energy parameters for Monte-Carlo and Molecular dynamics simulations using Genetic Algorithms (GA)
This document describes using sequential Monte Carlo methods like the sequential importance sampling (SIS) filter, sequential importance resampling (SIR) filter, and bootstrap filter to estimate parameters of linear time-invariant systems subjected to non-stationary earthquake excitations. It presents simulations applying these filters to identify parameters of a single-degree-of-freedom oscillator and a 3-story shear building model using synthetic earthquake data. The performance of different filters and resampling algorithms are compared based on identified natural frequencies and parameter convergence.
This document summarizes different modeling and control approaches for continuous and discrete event systems. It discusses how Petri nets can be used to model discrete event systems and manufacturing processes. It also provides an example of using a Petri net to model a batch plant and analyzing its performance. Finally, it mentions how Petri nets can be combined with artificial intelligence search methods for scheduling.
diCal-IBD is a tool that offers improved identity by descent (IBD) detection compared to existing programs. It uses the coalescent with recombination process to model IBD tracts, allows IBD tracts to contain mutations, and can process full sequence data rather than just SNP data. The tool was tested on simulated European population data and was able to detect shorter IBD tracts than previous programs.
Paula Tataru defended her PhD thesis on using mathematical modeling and computational tools to analyze population genetics and molecular data. Her research involved using stochastic context-free grammars, hidden Markov models, discrete time Markov chains, and continuous time Markov chains to study problems involving RNA structure prediction, motif discovery, identity-by-descent modeling, and modeling allele frequency data. She described these modeling techniques and their applications to several of her published works and papers in preparation.
The document describes the Bioinformatics Research Centre (BiRC) in Denmark. It was established in 2001 and has approximately 45 employees, including researchers, PhD students, masters students, and programmers. The BiRC uses computers and algorithms to analyze biological data and answer questions in fields like evolutionary genomics, mathematical biology, medical genetics, and structural bioinformatics. It develops tools for applications such as gene annotation, sequence alignment, phylogenetic analysis, and protein modeling.
This document contains questions from various categories related to phylogenetics including distances & substitution models, phylogenies, selection, and random processes. It asks about key concepts like p-distance, the Jukes-Cantor model, gene trees, synonymous and non-synonymous substitutions, the Last Universal Common Ancestor (LUCA), bootstrapping, and incomplete lineage sorting. The questions are from a game of Jeopardy! about phylogenetic trees and molecular evolution.
This document discusses RNA secondary structure prediction using stochastic context-free grammars (SCFGs). It provides examples of SCFGs that can generate RNA sequences and assign probabilities to different secondary structure conformations. The CYK algorithm is described for finding the most probable structure for an RNA sequence using dynamic programming in O(n^3) time and O(n^2) space. Backtracking allows recovering the structure in O(n^2) time. Advantages of the grammatical approach include fully probabilistic modeling and biologically-based scoring of structural elements, while disadvantages include difficulty scoring more complex structures without increased computational costs.
This document discusses metrics for comparing the secondary structures of RNA sequences. It introduces true/false positives/negatives for structure comparisons and defines sensitivity as the ability to identify positive results and specificity as the ability to identify negative results. It also introduces the Matthews correlation coefficient for structure comparisons and shows example calculations and results. Finally, it questions how different distance metrics may result in different distances when comparing structures.
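A small sketch of these comparison metrics, assuming a structure is represented as a set of base pairs (i, j) with i < j and that every one of the n(n-1)/2 position pairs counts as a candidate (so true negatives are pairs absent from both structures). That convention for negatives is an assumption; the summarized slides may count them differently. The function name `compare_structures` is hypothetical.

```python
import math

def compare_structures(predicted, reference, n):
    """Compare two sets of base pairs over a sequence of length n.
    Returns (sensitivity, specificity, mcc)."""
    tp = len(predicted & reference)          # pairs found in both
    fp = len(predicted - reference)          # predicted but not in reference
    fn = len(reference - predicted)          # in reference but missed
    tn = n * (n - 1) // 2 - tp - fp - fn     # pairs absent from both
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc

if __name__ == "__main__":
    reference = {(0, 9), (1, 8), (2, 7)}
    predicted = {(0, 9), (1, 8), (3, 6)}
    print(compare_structures(predicted, reference, n=10))
```

Because true negatives vastly outnumber the other counts for long sequences, specificity defined this way is close to 1 for almost any prediction, which is one reason the Matthews correlation coefficient is reported alongside it.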
The document discusses RNA secondary structure prediction based on multiple sequence alignments. It explains that homologous RNAs can share a common secondary structure without high sequence similarity due to compensatory mutations. Comparative sequence analysis can reveal conserved base pairs through frequent correlated compensatory mutations detected as high mutual information between columns in an alignment. The mutual information measure is described and secondary structure can be predicted through a greedy approach pairing columns with highest mutual information. Refining the alignment based on predicted structure can improve predictions.
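The mutual information between two alignment columns can be sketched as follows. This is a minimal example, assuming gap-free columns given as equal-length strings and frequencies estimated directly from the observed counts; the function name is hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns,
    each given as a string with one character per sequence."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)                 # single-column counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint counts
    return sum(
        (count / n) * log2((count / n) / ((f_i[x] / n) * (f_j[y] / n)))
        for (x, y), count in f_ij.items()
    )

if __name__ == "__main__":
    # Perfectly covarying columns (compensatory mutations): high information.
    print(mutual_information("GCAU", "CGUA"))
    # A fully conserved column carries no information about its partner.
    print(mutual_information("AAAA", "GCAU"))
```

A greedy predictor would compute this value for every pair of columns and repeatedly pair the two unpaired columns with the highest score.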
The document discusses RNA secondary structure prediction using the Nussinov algorithm. It begins by providing background on RNA structure and representation. It then describes the Nussinov algorithm, which calculates the maximum number of base pairs in a RNA sequence using dynamic programming. The algorithm fills a 2D matrix representing all subsequence scores in O(n^3) time and O(n^2) space. Backtracking through the matrix reconstructs the predicted secondary structure.
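A compact sketch of a Nussinov-style dynamic program with backtracking is given below. It uses an equivalent form of the recursion (position i is either unpaired or paired with some k), allows only Watson-Crick pairs, and by default imposes no minimum hairpin loop length; these choices, and the names used, are assumptions for illustration rather than details taken from the summarized slides.

```python
def nussinov(seq, min_loop=0):
    """Return (max_pairs, dot_bracket) for an RNA sequence.
    Fills an n x n table in O(n^3) time and O(n^2) space."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    N = [[0] * n for _ in range(n)]

    def score(i, j):
        # Empty or single-base subsequences contain no pairs.
        return N[i][j] if i < j else 0

    # Fill by increasing subsequence length.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = score(i + 1, j)                       # i left unpaired
            for k in range(i + 1 + min_loop, j + 1):     # i paired with k
                if (seq[i], seq[k]) in allowed:
                    best = max(best, 1 + score(i + 1, k - 1) + score(k + 1, j))
            N[i][j] = best

    # Backtrack to recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if N[i][j] == score(i + 1, j):
            stack.append((i + 1, j))
            continue
        for k in range(i + 1 + min_loop, j + 1):
            if (seq[i], seq[k]) in allowed and \
                    N[i][j] == 1 + score(i + 1, k - 1) + score(k + 1, j):
                structure[i], structure[k] = "(", ")"
                stack.append((i + 1, k - 1))
                stack.append((k + 1, j))
                break
    return (N[0][n - 1] if n else 0), "".join(structure)

if __name__ == "__main__":
    print(nussinov("GGGAAAUCC"))
```

Setting `min_loop` to 3 would forbid hairpin loops shorter than three unpaired bases, which is a common refinement.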
The document describes an evolutionary algorithm for learning stochastic context-free grammars (SCFGs) to model RNA secondary structure. The algorithm starts with an initial population of grammars and uses mutations like adding/deleting rules and breeding grammars to generate new grammars. Grammars are selected for the next generation based on fitness metrics like sensitivity and specificity on RNA structure data. The results show grammars evolved by the algorithm achieve higher sensitivity and specificity than baseline grammars from Dowell & Eddy, although their data contains complex pseudoknot structures not modeled by SCFGs.
Stochastic context free grammars (SCFGs) are context free grammars that assign probabilities to generated strings. SCFGs can generate the same string using different derivation rules, with each set of rules assigning a different probability to the string. SCFGs are useful for modeling RNA sequences and predicting secondary structure, as they can generate RNA sequences and assign probabilities to potential secondary structures. The Cocke-Younger-Kasami algorithm is commonly used to calculate the most probable structure for an RNA sequence using a SCFG.
RNA secondary structure can be predicted using comparative sequence analysis of multiple RNA alignments. Conserved base pairs are often revealed by frequent compensatory mutations that maintain base pairing complementarity. The covariance method measures sequence covariation between columns in an alignment using mutual information to identify compensatory mutations. Predictions start with an initial alignment that is then refined based on the predicted secondary structure in an iterative process. Stochastic context-free grammars can be modified to generate RNA secondary structure by modeling the probability of columns being single or forming base pairs.
The document discusses RNA secondary structure prediction. It explains that RNA molecules can fold into structures through base pairing, and that the secondary structure provides the scaffold for the tertiary structure. It then describes the Nussinov algorithm for predicting the secondary structure with the maximum number of base pairs in cubic time and quadratic space. The algorithm uses a dynamic programming approach to calculate scores for all substructures and find the optimal structure through backtracking.
The document presents a new "Beta with spikes" approximation for modeling allele frequency distributions under the Wright-Fisher model. It improves upon the standard Beta approximation by incorporating loss and fixation probabilities, better modeling behavior at the boundaries. Simulation results show it provides more accurate inference of population divergence times than the Beta approximation. Future work includes incorporating selection, which introduces a non-linear evolutionary force.
This document describes using a Beta approximation to model the Wright-Fisher model of genetic drift in population genetics. It discusses using a moment-based approach to calculate the mean and variance of allele frequencies over time, allowing the distribution to be approximated by a Beta distribution. It also describes adding "spikes" to the Beta distribution to better model loss and fixation probabilities at the boundaries of 0 and 1.
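The moment-matching step can be sketched as below, under the simplifying assumption of pure genetic drift (no mutation or selection) in a population of constant size with `two_n` gene copies. In that case the mean stays at the initial frequency and the variance after t generations is x0 (1 - x0) (1 - (1 - 1/two_n)^t). The spikes at 0 and 1 are not modelled here, and the function names are hypothetical.

```python
def wright_fisher_moments(x0, two_n, generations):
    """Mean and variance of the allele frequency after the given number of
    generations of pure drift, starting from frequency x0."""
    mean = x0
    var = x0 * (1.0 - x0) * (1.0 - (1.0 - 1.0 / two_n) ** generations)
    return mean, var

def beta_from_moments(mean, var):
    """Shape parameters (alpha, beta) of the Beta distribution with the
    given mean and variance."""
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("need 0 < var < mean * (1 - mean)")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

if __name__ == "__main__":
    mean, var = wright_fisher_moments(x0=0.3, two_n=200, generations=50)
    alpha, beta = beta_from_moments(mean, var)
    print(mean, var, alpha, beta)
```

As the variance approaches x0 (1 - x0), both shape parameters tend to zero and the Beta density piles up near 0 and 1, which is the boundary behaviour that explicit spikes are meant to capture.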
This document presents a new method called the "Beta with spikes" approach for modeling allele frequency data under the Wright-Fisher model of genetic drift, mutation, and selection over time. The method uses a recursive calculation of mean and variance to approximate allele frequency distributions as a Beta distribution with additional point masses accounting for loss and fixation. It provides a consistent approximation compared to other methods like diffusion approximations. The Beta with spikes approach can be used to infer population genetic parameters and split times from DNA sequence data.
The document announces an event titled "Mathematics and Genetics of Selection and Adaptation" organized by Susan F. Bailey, Thomas Bataillon, Asger Hobolth, and Paula Tataru from October 23-24, 2014 in Aarhus, Denmark. The event will bring together theorists, statisticians, and experimentalists working in areas related to genetics, selection, adaptation, and the connection between theoretical, population, and empirical data on variants and their effects. Free registration is required at the provided website.
The document presents a new approximation called Beta with spikes to model allele frequency data under the Wright-Fisher model of genetic drift, mutation, and selection. The Beta with spikes approximation fits the true Wright-Fisher model better than the commonly used Beta distribution approximation. Simulation results show the Beta with spikes approximation estimates population split times similarly to methods based on the diffusion approximation but more accurately than the Beta distribution approximation. Future work will focus on inferring parameters like mutation rates, selection coefficients, and variable population sizes using the Beta with spikes approximation.
This presentation summarizes methods for detecting identity by descent (IBD) tracts from sequence data and compares the performance of existing algorithms. It introduces IBD and current population-based methods like GERMLINE, FastIBD, and RefinedIBD. A coalescence-based method called SMCSD is described that uses hidden Markov models and can infer shorter IBD tracts. Simulation results on European and African populations show that while existing methods perform well for long tracts, SMCSD has higher recall, precision, and F-score for shorter tracts due to its probabilistic modeling of recombination breakpoints.
4. Modeling DNA data
Paula Tataru, Conditional expectations of sufficient statistics for CTMCs (slide 4/30; sections: Motivation, Methods, Results)
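1st base | 2nd base T | 2nd base C | 2nd base A | 2nd base G | 3rd base
T        | TTT Phe    | TCT Ser    | TAT Tyr    | TGT Cys    | T
T        | TTC Phe    | TCC Ser    | TAC Tyr    | TGC Cys    | C
T        | TTA Leu    | TCA Ser    | TAA Stop   | TGA Stop   | A
T        | TTG Leu    | TCG Ser    | TAG Stop   | TGG Trp    | G
C        | CTT Leu    | CCT Pro    | CAT His    | CGT Arg    | T
C        | CTC Leu    | CCC Pro    | CAC His    | CGC Arg    | C
C        | CTA Leu    | CCA Pro    | CAA Gln    | CGA Arg    | A
C        | CTG Leu    | CCG Pro    | CAG Gln    | CGG Arg    | G
A        | ATT Ile    | ACT Thr    | AAT Asn    | AGT Ser    | T
A        | ATC Ile    | ACC Thr    | AAC Asn    | AGC Ser    | C
A        | ATA Ile    | ACA Thr    | AAA Lys    | AGA Arg    | A
A        | ATG Met    | ACG Thr    | AAG Lys    | AGG Arg    | G
G        | GTT Val    | GCT Ala    | GAT Asp    | GGT Gly    | T
G        | GTC Val    | GCC Ala    | GAC Asp    | GGC Gly    | C
G        | GTA Val    | GCA Ala    | GAA Glu    | GGA Gly    | A
G        | GTG Val    | GCG Ala    | GAG Glu    | GGG Gly    | G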
30. Thank you!
Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains
BMC Bioinformatics 12(1):465, 2011