This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.
2. 12.1 Factor analysis
• In a mixture model, each observation is generated from a single discrete latent variable zi ∈ {1, 2, ..., K}.
• An alternative is to use a vector of real-valued latent variables, zi ∈ R^L.
• The observations are then modelled as p(xi | zi, θ) = N(xi | Wzi + μ, Ψ), where W is a D×L matrix, known as the factor loading matrix, and Ψ is a D×D covariance matrix.
• We take Ψ to be diagonal, since the whole point of the model is to "force" zi to explain the correlation, rather than "baking it in" to the observation's covariance.
• The special case in which Ψ = σ²I is called probabilistic principal components analysis or PPCA.
• The reason for this name will become apparent later.
3. 12.1.1 FA is a low rank parameterization of an MVN
• FA can be thought of as a way of specifying a joint density model on x using a small number of parameters.
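• Integrating out zi (with the prior p(zi) = N(zi | 0, I)) gives the induced marginal p(xi | θ) = N(xi | μ, WWᵀ + Ψ): a "low rank plus diagonal" covariance that needs only O(LD) parameters, compared with O(D²) for a full covariance matrix.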
4. 12.1 Factor analysis
• The generative process, where L=1, D=2 and Ψ is diagonal, is illustrated in Figure 12.1.
• We take an isotropic Gaussian "spray can" and slide it along the 1d line defined by wzi + μ.
• This induces an elongated (and hence correlated) Gaussian in 2d.
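A minimal numpy sketch of this generative process; the particular values of w, μ, and Ψ below are illustrative choices, not taken from the book:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, L = 500, 2, 1
    w = np.array([[2.0], [1.0]])        # D x L factor loading matrix (illustrative)
    mu = np.array([5.0, 3.0])           # mean of the observations
    Psi = np.diag([0.5, 0.5])           # diagonal (here isotropic) observation noise

    z = rng.standard_normal((N, L))                       # zi ~ N(0, I)
    eps = rng.multivariate_normal(np.zeros(D), Psi, N)    # observation noise
    x = z @ w.T + mu + eps                                # xi = W zi + mu + noise
    # x is an elongated (correlated) Gaussian cloud in 2d, even though Psi is diagonal.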
5. 12.1.2 Inference of the latent factors
• We hope that the latent factors z will reveal something interesting about the data.
• The posterior over zi summarizes each D-dimensional observation xi by an L-dimensional representation.
• This lets us represent (and visualize) the training set in L dimensions instead of D.
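• With the prior p(zi) = N(0, I), this posterior is Gaussian: p(zi | xi, θ) = N(zi | mi, Σz), where Σz = (I + Wᵀ Ψ⁻¹ W)⁻¹ and mi = Σz Wᵀ Ψ⁻¹ (xi − μ); the posterior means mi are the low-dimensional scores used in the example on the next slide.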
6. 12.1.2 Inference of the latent factors
• Example
• D = 11 features describing each car, N = 328 examples (cars), L = 2.
• Each of the 11 feature dimensions corresponds to a unit vector e1 = (1,0,...,0), e2 = (0,1,0,...,0), ..., which is projected into the 2d latent space and drawn as a line or arrow (this is what makes the plot a biplot).
• The biplot shows how each feature relates to the two latent factors, and hence which features vary together.
• The training examples themselves are also projected from D dimensions down to L = 2 and plotted as points (see the sketch below).
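A hypothetical sketch of how such a biplot could be produced with scikit-learn's FactorAnalysis; the data matrix X below is a random placeholder, not the car dataset from the book:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import FactorAnalysis

    X = np.random.randn(328, 11)             # placeholder for an N=328 x D=11 data matrix
    fa = FactorAnalysis(n_components=2).fit(X)

    Z = fa.transform(X)                       # posterior means of zi: the points of the biplot
    W = fa.components_.T                      # D x 2 loading matrix: one arrow per feature

    plt.scatter(Z[:, 0], Z[:, 1], s=5)                        # projected examples
    for j in range(W.shape[0]):
        plt.arrow(0, 0, W[j, 0], W[j, 1], head_width=0.05)    # loading of feature j
    plt.xlabel("factor 1"); plt.ylabel("factor 2")
    plt.show()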
7. 12.1.3 Unidentifiability
• Just like with mixture models, FA is also unidentifiable.
• As with topic models such as LDA, the latent factors z are not uniquely identified, so their meaning is ambiguous.
• This does not affect predictive performance, but it does affect how we interpret the loading matrix and the latent factors.
• Possible solutions:
• Forcing W to be orthonormal: perhaps the cleanest solution to the identifiability problem is to force W to be orthonormal, and to order the columns by decreasing variance of the corresponding latent factors. This is the approach adopted by PCA, which we will discuss in Section 12.2.
• Forcing W to be orthonormal does not necessarily make the factors more interpretable, but at least the solution is unique.
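• Concretely: for any orthogonal matrix R (RRᵀ = I), replacing W with WR leaves the likelihood unchanged, since cov[x] = WRRᵀWᵀ + Ψ = WWᵀ + Ψ; geometrically this corresponds to rotating z before generating x.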
9. 12.1.4 Mixtures of factor analysers
• Let the k'th linear subspace, of dimensionality Lk, be represented by Wk, for k = 1:K.
• Suppose we have a latent indicator qi ∈ {1,...,K} specifying which subspace we should use to generate the data.
• We then sample zi from a Gaussian prior and pass it through the Wk matrix (where k = qi), and add noise.
• In other words, each xi is generated by the k'th factor analyser, analogous to how a GMM generates each point from one of K Gaussians.
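• The corresponding model is p(qi | θ) = Cat(qi | π), p(zi | θ) = N(zi | 0, I), and p(xi | zi, qi = k, θ) = N(xi | μk + Wk zi, Ψ), with Ψ diagonal as before.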
10. 12.1.5 EM for factor analysis models
Expected log likelihood
ESS(Expected Sufficient Statistics)
11. 12.1.5 EM for factor analysis models
• E-step
• M-step
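A sketch of the two steps in a standard form, written here assuming the data have been centered so that μ can be dropped:
• E-step: for each i compute the posterior moments E[zi] = Σz Wᵀ Ψ⁻¹ xi and E[zi ziᵀ] = Σz + E[zi] E[zi]ᵀ, where Σz = (I + Wᵀ Ψ⁻¹ W)⁻¹.
• M-step: set W = [Σi xi E[zi]ᵀ] [Σi E[zi ziᵀ]]⁻¹ and Ψ = (1/N) diag{ Σi (xi − W E[zi]) xiᵀ }.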
12. 12.2 Principal components analysis (PCA)
• Consider the FA model where we constrain Ψ = σ²I, and W to be orthonormal.
• It can be shown (Tipping and Bishop 1999) that, as σ² → 0, this model reduces to classical (nonprobabilistic) principal components analysis (PCA).
• The version where σ² > 0 is known as probabilistic PCA (PPCA).
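• For PPCA, Tipping and Bishop (1999) show that the MLE is W = V(Λ − σ²I)^(1/2) R, where the columns of V are the first L eigenvectors of the sample covariance S, Λ holds the corresponding eigenvalues, and R is an arbitrary orthogonal matrix; the MLE of the noise variance is the average discarded eigenvalue, σ² = (1/(D−L)) Σ_{j=L+1..D} λj. As σ² → 0 the model projects onto the same subspace as classical PCA.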
15. proof of PCA
• wj ∈ R^D denotes the j'th principal direction
• xi ∈ R^D denotes the i'th high-dimensional observation
• zi ∈ R^L denotes the i'th low-dimensional representation
• Let us start by estimating the best 1d solution, w1 ∈ R^D, and the corresponding projected points z̃1 ∈ R^N.
• So the optimal reconstruction weights are obtained by orthogonally projecting the data onto the first principal direction (see the derivation below).
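• In more detail: the 1d objective is the reconstruction error J(w1, z1) = (1/N) Σi ‖xi − zi1 w1‖², subject to ‖w1‖ = 1; setting ∂J/∂zi1 = 0 gives zi1 = w1ᵀ xi, which is exactly this orthogonal projection.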
16. proof of PCA
• Plugging the optimal weights zi1 = w1ᵀ xi back into the objective shows that minimizing the reconstruction error is equivalent to maximizing the variance of the projected data.
• The direction that maximizes the variance is an eigenvector of the covariance matrix.
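• To see this: J(w1) = const − w1ᵀ S w1, where S = (1/N) Σi xi xiᵀ is the empirical covariance of the (centered) data. Maximizing w1ᵀ S w1 subject to w1ᵀ w1 = 1 with a Lagrange multiplier λ1 gives S w1 = λ1 w1, so w1 is an eigenvector of S; since the projected variance equals λ1, the best choice is the eigenvector with the largest eigenvalue.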
17. proof of PCA
• For the second principal direction w2 we minimize (1/N) Σi ‖xi − zi1 w1 − zi2 w2‖², subject to w1ᵀw1 = 1, w2ᵀw2 = 1 and w1ᵀw2 = 0. Optimizing wrt w1 and z1 gives the same solution as before.
• The proof continues in this way. (Formally one can use induction.)
26. 12.2.5 EM algorithm for PCA
• M-step: multi-output linear regression (Equation 7.89).
• Multi-output linear regression is linear regression where the output y is a vector rather than a scalar.
• Here the expected latent scores zi play the role of the inputs and the observations xi the role of the outputs.
• In other words, the M-step finds the mapping W from the expected zi to the observed x that minimizes the squared reconstruction error (see the sketch below).
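A minimal numpy sketch of this EM algorithm in the σ² → 0 limit, using the updates Z = (WᵀW)⁻¹ Wᵀ X̃ (E-step) and W = X̃ Zᵀ (Z Zᵀ)⁻¹ (M-step); the function name, initialization and iteration count below are my own choices:

    import numpy as np

    def em_pca(X, L, n_iters=200, seed=0):
        """EM for PCA on an (N, D) data matrix X; returns a D x L orthonormal basis
        spanning the principal subspace."""
        rng = np.random.default_rng(seed)
        Xc = (X - X.mean(axis=0)).T                # D x N centered data, one column per example
        W = rng.standard_normal((Xc.shape[0], L))  # random initial loading matrix
        for _ in range(n_iters):
            Z = np.linalg.solve(W.T @ W, W.T @ Xc)   # E-step: L x N latent scores
            W = Xc @ Z.T @ np.linalg.inv(Z @ Z.T)    # M-step: multi-output linear regression
        W, _ = np.linalg.qr(W)                       # orthonormalize; spans same subspace as top L eigenvectors
        return W

    # usage: W = em_pca(np.random.randn(500, 10), L=2)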
27. 12.2.5 EM algorithm for PCA
• Advantages of EM:
• EM can be faster.
• EM can be implemented in an online fashion, i.e., we can update our estimate of W as the data streams in.