Gaussian Process
Jungkyu Lee
Daum Search Quality Team
Intuition
• http://explainaway.wordpress.com/2008/12/01/ridiculous-stats/
• x axis: year, y axis: 100 m sprint time, shaded band = predictive uncertainty
• Extrapolating a straight-line fit leads to absurd conclusions (e.g. predictions for 2030 and beyond), with no sense of how unreliable the extrapolation is
• What we want: given a new x (year), predict the corresponding output (100 m time) together with a measure of how confident we are in that prediction
regression
Prerequisite: Conditioning of Multivariate Gaussians
• Take a random vector x ∈ R^n with x ~ N(μ, Σ), partitioned into blocks x_A and x_B
• Then the marginals are given by x_A ~ N(μ_A, Σ_AA) and x_B ~ N(μ_B, Σ_BB)
• and the posterior conditional is given by x_A | x_B ~ N(μ_A + Σ_AB Σ_BB^{-1} (x_B − μ_B), Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA)
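Below is a minimal numpy sketch of these two rules (the block indices, the toy μ and Σ, and the function name are illustrative, not from the slides):

```python
# A minimal sketch of conditioning a multivariate Gaussian, assuming the
# usual block partition x = (x_A, x_B) with x ~ N(mu, Sigma).
import numpy as np

def condition_gaussian(mu, Sigma, idx_A, idx_B, x_B):
    """Return mean/cov of p(x_A | x_B) for x ~ N(mu, Sigma)."""
    mu_A, mu_B = mu[idx_A], mu[idx_B]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    # Posterior conditional: mu_A + S_AB S_BB^{-1} (x_B - mu_B)
    G = S_AB @ np.linalg.inv(S_BB)
    mu_cond = mu_A + G @ (x_B - mu_B)
    Sigma_cond = S_AA - G @ S_AB.T   # S_AA - S_AB S_BB^{-1} S_BA
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
print(condition_gaussian(mu, Sigma, [0], [1], np.array([2.0])))
```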
Prerequisite: Bayesian Linear Regression
• Reference:
• http://play.daumcorp.com/display/~sweaterr/7.+Linear+Regression#7.LinearRegression-7.6Bayesianlinearregression
• If the prior on w is Gaussian and the likelihood is Gaussian, then the posterior on w is also Gaussian, and it can be computed in closed form using the linear Gaussian system results
• The posterior predictive is also tractable in closed form: to make a prediction we average the predictions of all weight vectors w, each weighted by its posterior probability
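As a concrete illustration, here is a minimal sketch of that closed-form posterior and posterior predictive, assuming a zero-mean Gaussian prior w ~ N(0, τ²I) and noise variance σ² (both hyperparameter values are illustrative, not from the slides):

```python
# A minimal sketch of Bayesian linear regression as a linear Gaussian system.
import numpy as np

def blr_posterior(X, y, sigma2=0.1, tau2=1.0):
    D = X.shape[1]
    S_N = np.linalg.inv(np.eye(D) / tau2 + X.T @ X / sigma2)  # posterior covariance
    m_N = S_N @ X.T @ y / sigma2                              # posterior mean
    return m_N, S_N

def blr_predict(x_star, m_N, S_N, sigma2=0.1):
    # Posterior predictive is Gaussian: averaging over w gives
    # mean m_N^T x* and variance x*^T S_N x* + sigma2.
    return x_star @ m_N, x_star @ S_N @ x_star + sigma2

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + input
y = np.array([0.1, 1.1, 1.9])
m_N, S_N = blr_posterior(X, y)
print(blr_predict(np.array([1.0, 3.0]), m_N, S_N))
```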
Parametric Models vs. Non-parametric Models
A non-parametric model, unlike a parametric model, does not fix the form of the model in advance; its effective complexity grows with the amount of data.
Parametric models:
• Linear Regression
• GMM
Non-parametric models:
• KNN
• Kernel Regression
• Gaussian Process
Non-parametric Models
Kernel Regression (Non-Parametric, Non-Bayes) vs. GP Regression (Non-Parametric, Bayes)
The two produce similar-looking predictions, but only GP regression attaches a confidence (uncertainty) to them
15.1 Introduction
• In supervised learning, we observe some inputs xi and some outputs yi.
• We assume yi = f(xi) for some unknown function f mapping xi to yi, possibly corrupted by noise.
• The optimal approach is to infer a distribution over functions given the data, and then use it to make predictions for new inputs.
• Until now the function was given a parametric form, so instead of p(f|D) we inferred a distribution over parameters, p(θ|D).
• A Gaussian process instead performs Bayesian inference over the functions themselves.
Probability distributions over functions with finite domains
• Suppose we have a finite set of training examples X = {x1, . . . , xm}
• We can define a probability distribution over functions whose domain is this set
• Since the domain has only m elements, each function can be represented compactly as an m-dimensional vector of its values
• So a distribution over these functions is just a distribution over m-dimensional vectors
• In other words, on a finite domain a function is nothing more than a vector.
• Sampling once from this distribution yields concrete values h(x1), . . . , h(xm), i.e. an entire function
• But when the domain is infinite, how do we define a probability distribution over functions?
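A minimal sketch of this idea: on a 3-point domain, a distribution over functions is literally a 3-dimensional Gaussian (the domain, mean, and covariance below are illustrative):

```python
# A "distribution over functions" on a finite domain X = {x1, ..., xm} is just
# an m-dimensional Gaussian over the value vector [h(x1), ..., h(xm)];
# sampling it samples a whole function at once.
import numpy as np

X = np.array([1.0, 2.0, 3.0])                 # finite domain, m = 3
mu = np.zeros(3)                               # mean of each h(xi)
Sigma = np.eye(3)                              # any PSD matrix works
h = np.random.multivariate_normal(mu, Sigma)   # one sampled function
print({x: v for x, v in zip(X, h)})            # the function as a table x -> h(x)
```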
Probability distributions over functions with infinite domains
• A stochastic process is a collection of random variables indexed by some set
• A Gaussian process is a stochastic process in which every finite sub-collection has a joint probability that is a multivariate Gaussian distribution (MVN).
• A Gaussian Process is thus defined over infinitely many random variables, but any finite subset of them is jointly MVN
• So if we pick any points x1, . . . , xm, the function values h(x1), . . . , h(xm) follow an MVN,
• and we need some way to specify its mean and covariance.
• Over an infinite domain, the covariance is specified by a kernel.
• Because the number of variables is not a fixed m, we cannot tabulate the covariance as a matrix; instead the correlation between any two variables is computed from their inputs
• by a kernel function.
• The usual prior assumption is that inputs close to each other should have similar, strongly correlated function values
• A Gaussian Process is therefore a distribution over functions (on an infinite domain), from which we can sample entire functions
• In practice, however, we only ever need the function values at a finite set of points, so every computation reduces to a finite MVN
Graphical Model for GP
• A Gaussian Process with 2 training points and 1 test point,
• drawn as a mixed directed and undirected graphical model
• The fi = f(xi) are the function values at the data points; they are hidden nodes, fully interconnected with each other
• The test input x* enters the model exactly like a training input, and y* exactly like a training output.
• In this respect a GP resembles kernel regression: the prediction at a test point is driven by the f (and hence y) values of the training points near it
Different kernels
• Since a GP is a distribution over functions, we can sample whole functions from it
• Samples drawn under different kernels have visibly different shapes
• Choosing the kernel is therefore the heart of GP design: it encodes our assumptions about the function being modeled
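The sketch below samples functions from zero-mean GP priors with squared-exponential kernels of different length scales; shorter length scales yield wigglier samples (kernel choice and parameter values are illustrative):

```python
# A minimal sketch of sampling functions from a GP prior.
import numpy as np

def se_kernel(X1, X2, ell=1.0, sf=1.0):
    # k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2))
    d = X1[:, None] - X2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

xs = np.linspace(-5, 5, 100)
for ell in (0.3, 1.0, 3.0):
    K = se_kernel(xs, xs, ell) + 1e-8 * np.eye(len(xs))  # jitter for PSD
    f = np.random.multivariate_normal(np.zeros(len(xs)), K)
    # Wigglier samples have larger point-to-point increments:
    print(f"ell={ell}: sample std of increments = {np.std(np.diff(f)):.3f}")
```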
15.2 GPs for regression
• We place a GP prior on the regression function:

f(x) ~ GP(m(x), κ(x, x′)), with m(x) = E[f(x)] and κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]

• where m(x) is the mean function and κ(x, x′) is the kernel or covariance function
• For any finite set of points X = {x1, . . . , xN}, the function values are jointly Gaussian (this is the definition of a GP):

p(f|X) = N(f|μ, K), with Kij = κ(xi, xj) and μ = (m(x1), . . . , m(xN))
15.2.1 Predictions using noise-free observations
• Suppose we observe fi = f(xi) at the training inputs xi with no noise
• Given a test set X* of size N* × D, we want to predict the function outputs f*
• If we ask the GP for f(x) at an x it has already seen, it should return the observed value with no uncertainty; the test values f* and the training values f are modeled as one joint Gaussian (assuming a zero mean for simplicity):

(f, f*) ~ N(0, [[K, K*], [K*^T, K**]])

• By the standard rules for conditioning Gaussians (Section 4.3), the posterior has the following form:

p(f*|X*, X, f) = N(f*|K*^T K^{-1} f, K** − K*^T K^{-1} K*)

• With a zero mean, if K were the identity matrix the posterior mean would reduce to K*^T f: the training values f, weighted by the similarity between the test point and each training point
• K = κ(X, X): Gram matrix between the training points
• K* = κ(X, X*): Gram matrix between the training and test points
• K** = κ(X*, X*): Gram matrix between the test points
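A minimal sketch of these formulas, assuming a squared-exponential kernel with unit length scale (the training data are a toy example):

```python
# Noise-free GP regression: condition the joint Gaussian over (f, f*)
# on the observed f, using the block Gram matrices above.
import numpy as np

def se_kernel(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

X = np.array([-2.0, 0.0, 1.5])        # training inputs
f = np.sin(X)                         # noise-free observations f = f(X)
Xs = np.linspace(-3, 3, 7)            # test inputs

K   = se_kernel(X, X) + 1e-10 * np.eye(len(X))   # training Gram matrix (+ jitter)
Ks  = se_kernel(X, Xs)                            # train-test Gram matrix
Kss = se_kernel(Xs, Xs)                           # test Gram matrix

mu_star  = Ks.T @ np.linalg.solve(K, f)           # K*^T K^{-1} f
cov_star = Kss - Ks.T @ np.linalg.solve(K, Ks)    # K** - K*^T K^{-1} K*
print(mu_star)            # interpolates: at x in X, mu_star equals f(x)
print(np.diag(cov_star))  # variance is ~0 at the training points
```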
15.2.1 Predictions using noise-free observations
• Left: samples from the GP prior
• Right: samples from the GP posterior, after conditioning on 5 noise-free observations
• The predictive uncertainty shrinks to zero at the observations and grows as we move away from them
15.2.2 Predictions using noisy observations
• Now suppose the observations are noisy: y = f(x) + ε
• In this case the model is not required to interpolate the observed responses exactly,
• but it must pass close to them; the covariance of the noisy responses is

cov[yp, yq] = κ(xp, xq) + σ_y² δpq, i.e. cov[y|X] = K + σ_y² I_N ≜ K_y

• where δpq = I(p = q). In other words,
• the noise term appears only on the diagonal, because the noise on each observation is assumed independent
• As before, the joint probability of the observed data and the noise-free function value at a test point is

(y, f*) ~ N(0, [[K_y, K*], [K*^T, K**]])

• where, to keep the notation simple, we assume the mean is zero
• Hence the posterior predictive density is

p(f*|X*, X, y) = N(f*|K*^T K_y^{-1} y, K** − K*^T K_y^{-1} K*)

• The only change from the noise-free case is that K is replaced by K_y, i.e. σ_y² is added to the diagonal
15.2.2 Predictions using noisy observations
• For a single test point x* this reduces to

p(f*|x*, X, y) = N(f*|k*^T K_y^{-1} y, k** − k*^T K_y^{-1} k*)

• where k* = [κ(x*, x1), . . . , κ(x*, xN)] and k** = κ(x*, x*).
• Another way to write the posterior mean is as follows:

f̄* = k*^T α = Σi αi κ(xi, x*)

• where α = K_y^{-1} y. We will revisit this expression later
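A minimal sketch of the single-test-point formulas, using a Cholesky factorization of K_y and the α = K_y^{-1} y form of the mean (kernel and data are toy choices):

```python
# GP regression with noisy observations.
import numpy as np

def se_kernel(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

sigma_y2 = 0.1
X  = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y  = np.sin(X) + np.sqrt(sigma_y2) * np.random.randn(len(X))
Xs = np.array([0.5])

Ky = se_kernel(X, X) + sigma_y2 * np.eye(len(X))     # K + sigma_y^2 I
L  = np.linalg.cholesky(Ky)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K_y^{-1} y

ks  = se_kernel(X, Xs)[:, 0]      # k* = [k(x*, x1), ..., k(x*, xN)]
kss = se_kernel(Xs, Xs)[0, 0]     # k** = k(x*, x*)
mean = ks @ alpha                 # posterior mean k*^T alpha
v = np.linalg.solve(L, ks)
var = kss - v @ v                 # posterior variance k** - k*^T K_y^{-1} k*
print(mean, var)
```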
15.2.3 Effect of the kernel parameters
• The predictions of a GP are determined entirely by the kernel we choose and its parameters.
• Suppose we choose the following squared-exponential (SE) kernel for the noisy observations:

κ_y(xp, xq) = σ_f² exp(−(xp − xq)² / (2ℓ²)) + σ_y² δpq

• ℓ is the horizontal length scale over which the function varies, σ_f² controls the vertical scale of the function, and σ_y² is the noise variance
• (a) ℓ = 1: a good fit
• (b) ℓ = 0.3: too short → the fit wiggles rapidly and chases the noise
• (c) ℓ = 3: too long → the fit is over-smoothed
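For concreteness, here is this kernel as code (hyperparameter values are illustrative; note that δpq formally compares indices p and q, which coincides with comparing inputs when all inputs are distinct):

```python
# The SE kernel for noisy observations:
# k_y(xp, xq) = sigma_f^2 exp(-(xp - xq)^2 / (2 l^2)) + sigma_y^2 delta_pq
import numpy as np

def se_kernel_noisy(xp, xq, ell=1.0, sigma_f=1.0, sigma_y=0.1):
    delta = 1.0 if xp == xq else 0.0   # stand-in for delta_pq (index equality)
    return sigma_f**2 * np.exp(-(xp - xq) ** 2 / (2 * ell**2)) + sigma_y**2 * delta

print(se_kernel_noisy(0.0, 0.0))            # sigma_f^2 + sigma_y^2 on the diagonal
print(se_kernel_noisy(0.0, 3.0, ell=1.0))   # nearly 0: "far" at a short length scale
print(se_kernel_noisy(0.0, 3.0, ell=3.0))   # larger: ell stretches what counts as near
```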
15.2.4 Estimating the kernel parameters
• We could estimate the kernel parameters by exhaustive search over a discrete grid with CV as the objective, but that is expensive
• Instead we maximize the marginal likelihood with respect to the kernel parameters:

log p(y|X) = −½ y^T K_y^{-1} y − ½ log|K_y| − (N/2) log(2π)

• The first term is a data fit term, the second a model complexity term
• For an RBF kernel with a small bandwidth, the similarity to distant points drops to 0 while nearby points remain similar, so GPR fits the data very tightly. As the bandwidth shrinks further, K_y approaches a diagonal matrix (even near points get similarity ≈ 0); since the determinant of a diagonal matrix is the product of its diagonal entries, log|K_y| becomes large, even as the fit term keeps improving
• Conversely, as the kernel parameter (bandwidth) grows, the data fit error increases while log|K_y| decreases
• So the first term acts as the likelihood (data fit) and the second as a model complexity penalty, and there is a trade-off between them.
• We choose the kernel parameter ℓ that maximizes the marginal likelihood, which strikes this balance.
• The gradient with respect to the kernel parameters is available in closed form,
• so we can use gradient descent or any standard gradient-based optimizer
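A minimal sketch of this procedure, selecting ℓ by grid search over the log marginal likelihood (a gradient optimizer would use the analytic gradient instead; data and grid are illustrative):

```python
# Choose the length scale by maximizing
# log p(y|X) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - N/2 log(2 pi).
import numpy as np

def se_kernel(A, B, ell):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

def log_marginal_likelihood(X, y, ell, sigma_y2=0.1):
    N = len(X)
    Ky = se_kernel(X, X, ell) + sigma_y2 * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = -0.5 * y @ alpha                    # data fit term
    complexity = -np.sum(np.log(np.diag(L)))  # -1/2 log|Ky| via Cholesky
    return fit + complexity - 0.5 * N * np.log(2 * np.pi)

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.3 * np.random.randn(20)
ells = [0.1, 0.3, 1.0, 3.0, 10.0]
print(max(ells, key=lambda e: log_marginal_likelihood(X, y, e)))
```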
15.3 GPs meet GLMs
• GPs can be extended to the GLM setting
• Instead of passing x through a parametric function f(x) = w^T x, we let f ~ GP(0, κ)
• In 15.3.1 we look at the binary classification case
• We define the model as

p(yi|xi) = σ(yi fi)

• where yi ∈ {−1, +1}, and we let σ(z) = sigm(z), the sigmoid used in logistic regression
• if yi = 1, the probability exceeds 0.5 exactly when fi > 0
• if yi = −1, the probability exceeds 0.5 exactly when fi < 0
• As for GP regression, we assume f ~ GP(0, κ).
Prerequisite: Gaussian approximation (aka Laplace approximation)
• http://play.daumcorp.com/pages/viewpage.action?pageId=157627536#8Logisticregression-8.4.1Laplaceapproximation
• We approximate the posterior with a Gaussian:

p(θ|D) ≈ N(θ|θ̂, H^{-1})

• the mean of the Gaussian is the MAP estimate θ̂ (the mode of the posterior), and its covariance is the inverse of the Hessian H of the negative log posterior, evaluated at that mode
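A minimal 1-D sketch of the idea: Newton's method climbs to the mode, and the curvature there supplies the Gaussian (the toy log posterior and the numerical derivatives are illustrative):

```python
# Laplace approximation in 1-D: find the mode theta_hat, then approximate the
# posterior as N(theta_hat, H^{-1}), H = Hessian of the NEGATIVE log posterior.
import numpy as np

# Toy unnormalized log posterior (illustrative choice, not from the slides).
log_post = lambda t: -0.5 * t**2 + np.log(1 / (1 + np.exp(-2 * t)))
grad = lambda t, h=1e-5: (log_post(t + h) - log_post(t - h)) / (2 * h)
hess = lambda t, h=1e-4: (log_post(t + h) - 2 * log_post(t) + log_post(t - h)) / h**2

t = 0.0
for _ in range(20):            # Newton iterations to the mode
    t = t - grad(t) / hess(t)
H = -hess(t)                   # curvature of the negative log posterior
print(f"Laplace approx: N(mean={t:.4f}, var={1 / H:.4f})")
```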
Prerequisite: Gaussian approximation for logistic regression
• Concretely, for logistic regression ŵ is the MAP estimate of the weights and H is the Hessian of the negative log posterior at ŵ, so the posterior is approximated as p(w|D) ≈ N(ŵ, H^{-1})
15.3 GPs meet GLMs
15.3.1.1 Computing the posterior
• Define the log of the unnormalized posterior as follows:

ℓ(f) = log p(y|f) − ½ f^T K^{-1} f − ½ log|K| − (N/2) log(2π)

• Let J(f) ≜ −ℓ(f) be the function we want to minimize
• The gradient and Hessian of this are given by

∇J(f) = −∇ log p(y|f) + K^{-1} f
∇∇J(f) = W + K^{-1}, where W ≜ −∇∇ log p(y|f)

• We find the minimum using Newton's method, iterating

f ← (K^{-1} + W)^{-1} (W f + ∇ log p(y|f))

• At convergence this yields f̂, the mode (MAP estimate) of the posterior over f
• The Gaussian approximation of the posterior is then p(f|X, y) ≈ N(f̂, (K^{-1} + W)^{-1})
• (the σ appearing in the likelihood is the sigmoid, under which W is diagonal, not a variance parameter)
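A minimal sketch of these Newton iterations for a toy 1-D classification problem with the sigmoid likelihood (data and kernel are illustrative; a production implementation would use the numerically stable form in Rasmussen and Williams, Algorithm 3.1):

```python
# Laplace approximation for GP binary classification:
# iterate f <- (K^{-1} + W)^{-1} (W f + grad log p(y|f)), y in {-1, +1}.
import numpy as np

def se_kernel(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

sigm = lambda z: 1 / (1 + np.exp(-z))

X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = se_kernel(X, X) + 1e-8 * np.eye(len(X))

f = np.zeros(len(X))
for _ in range(20):
    pi = sigm(f)
    t = (y + 1) / 2                    # targets rewritten in {0, 1}
    grad = t - pi                      # d/df log p(y|f) for the sigmoid
    W = np.diag(pi * (1 - pi))         # -Hessian of the log likelihood (diagonal)
    f = np.linalg.solve(np.linalg.inv(K) + W, W @ f + grad)
f_hat = f                              # MAP estimate of the latent f
print(f_hat)                           # positive where y = +1, negative where y = -1
```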
15.3 GPs meet GLMs
15.3.1.2 Computing the posterior predictive
• The posterior predictive mean for a test point x* is

E[f*|x*, X, y] ≈ k*^T K^{-1} f̂

• obtained by replacing the unknown f with its MAP estimate f̂
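Continuing the sketch from 15.3.1.1 (it reuses X, K, W, f_hat, se_kernel, and sigm defined there), the predictive mean is then squashed through the sigmoid, here averaged over the Gaussian by simple Monte Carlo (the variance formula k** − k*^T (K + W^{-1})^{-1} k* follows Rasmussen and Williams, eq. 3.24):

```python
# Predictive distribution at a test point, plugging in the MAP estimate f_hat.
Xs = np.array([0.5])
ks = se_kernel(X, Xs)[:, 0]
mu_star = ks @ np.linalg.solve(K, f_hat)               # k*^T K^{-1} f_hat
var_star = se_kernel(Xs, Xs)[0, 0] - ks @ np.linalg.solve(K + np.linalg.inv(W), ks)
samples = mu_star + np.sqrt(max(var_star, 0)) * np.random.randn(5000)
print(sigm(samples).mean())                             # predictive p(y* = 1 | x*)
```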
15.3 GPs meet GLMs
15.3.1.3 Computing the marginal likelihood
• To optimize the kernel parameters we need the marginal likelihood, p(y|X) = ∫ p(y|f) p(f|X) df, again handled with the Laplace approximation
• Differentiating log p(D) with respect to the kernel parameters is more involved than in the regression case, because W, K, and f̂ all depend on the kernel parameters (Rasmussen and Williams 2006, p. 125)
• The computation reuses the quantities already obtained in 15.3.1.1 (Computing the posterior)
Summary
• Why Gaussian Processes?
• They are a Bayesian method:
• predictions come with uncertainty estimates
• model selection and hyperparameter selection can also be handled with Bayesian machinery
• e.g. the bandwidth of an RBF kernel can be chosen by maximizing the marginal likelihood
• They are non-parametric:
• predictions at an input point are computed directly from the training points, with no assumed model form (no model assumption)
