3. Prerequisite: Conditioning of Multivariate Gaussians
• Consider a random vector x ∈ Rⁿ with x ~ N(μ, Σ), partitioned into blocks x = (x₁, x₂), μ = (μ₁, μ₂), Σ = [[Σ₁₁, Σ₁₂], [Σ₂₁, Σ₂₂]]
• Then the marginals are given by x₁ ~ N(μ₁, Σ₁₁) and x₂ ~ N(μ₂, Σ₂₂)
• and the posterior conditional is given by x₁ | x₂ ~ N(μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁)
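As a concrete illustration, here is a minimal NumPy sketch of these two formulas (the function name and index arguments are our own, not from the text):

```python
import numpy as np

def condition_gaussian(mu, Sigma, i1, i2, x2_obs):
    """Mean and covariance of p(x1 | x2 = x2_obs) for x ~ N(mu, Sigma)."""
    mu1, mu2 = mu[i1], mu[i2]
    S11 = Sigma[np.ix_(i1, i1)]
    S12 = Sigma[np.ix_(i1, i2)]
    S22 = Sigma[np.ix_(i2, i2)]
    # The marginal of x1 is simply N(mu1, S11); conditioning shifts the mean
    # toward the observation and shrinks the covariance.
    S12_S22inv = S12 @ np.linalg.inv(S22)
    mu_cond = mu1 + S12_S22inv @ (x2_obs - mu2)
    Sigma_cond = S11 - S12_S22inv @ S12.T
    return mu_cond, Sigma_cond
```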
10. Graphical Model for GP
• A Gaussian process for 2 training points and 1 test point,
• represented as a mixed directed and undirected graphical model,
• where fᵢ = f(xᵢ) is the hidden node attached to each data point, and the hidden nodes are all coupled to one another
• The test point x* is connected to the training x's, and y* to the training y's, only through these hidden f nodes
• Consequently, unlike the kernel regression seen earlier, the prediction at a test point here depends on the training y's through f
12. 15.2 GPs for regression
• We assume the prior on the regression function is a GP: f(x) ~ GP(m(x), κ(x, x′))
• where m(x) = E[f(x)] is the mean function and κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))] is the kernel or covariance function
• For any finite set of points, the corresponding function values are required to have a joint Gaussian distribution (this is the definition of a GP):
• p(f | X) = N(f | μ, K), where Kᵢⱼ = κ(xᵢ, xⱼ) and μ = (m(x₁), …, m(x_N))
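To make the definition concrete, here is a small sketch (our own, assuming a squared-exponential kernel) that draws sample functions from a zero-mean GP prior by evaluating the joint Gaussian on a finite grid:

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / ell**2)

# Any finite set of inputs defines a joint Gaussian N(0, K);
# sampling from it gives draws of the random function on that grid.
xs = np.linspace(-5, 5, 100)
K = se_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # jitter for stability
f_prior = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)
```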
13. 15.2.1 Predictions using noise-free observations
• Suppose we observe a training set D = {(xᵢ, fᵢ)}, where fᵢ = f(xᵢ) is the function value observed at xᵢ without noise
• Given a test set X* of size N* × D, we want to predict the corresponding function outputs f*
• Because the observations are noise-free, if we ask the GP to predict f(x) at an x it has already seen, it should return that f(x) with no uncertainty; that is, the GP should act as an interpolator of the training data
• By the definition of a GP, the test outputs f* and the training outputs f are jointly Gaussian:
  (f, f*) ~ N( (μ, μ*), [[K, K*], [K*ᵀ, K**]] )
• By the standard rules for conditioning Gaussians (Section 4.3), the posterior has the following form:
  p(f* | X*, X, f) = N(f* | μ*, Σ*), μ* = μ(X*) + K*ᵀK⁻¹(f − μ(X)), Σ* = K** − K*ᵀK⁻¹K*
• If the mean is 0 and K were the identity matrix, the posterior mean of f* at a test point would reduce to K*ᵀf, a sum of the training f values weighted by how close each training point is to the test point
• K = κ(X, X): Gram matrix among the training points (N × N)
• K* = κ(X, X*): Gram matrix between training points and test points (N × N*)
• K** = κ(X*, X*): Gram matrix among the test points (N* × N*)
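A minimal sketch of the noise-free posterior above (zero mean assumed; se_kernel is the illustrative kernel from earlier):

```python
import numpy as np

def gp_predict_noiseless(X, f, Xs, kernel):
    """mu* = K*^T K^{-1} f,  Sigma* = K** - K*^T K^{-1} K*."""
    K   = kernel(X, X) + 1e-10 * np.eye(len(X))  # jitter for invertibility
    Ks  = kernel(X, Xs)
    Kss = kernel(Xs, Xs)
    mu_s    = Ks.T @ np.linalg.solve(K, f)
    Sigma_s = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu_s, Sigma_s
# At a training input, mu_s reproduces f exactly: the GP interpolates.
```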
16. 15.2.2 Predictions using noisy observations
• With noisy observations y = f(x) + ε, ε ~ N(0, σ_y²), the covariance of the observed responses is cov[y | X] = K + σ_y²·I_N ≜ K_y
• For a single test point x*, the posterior predictive can be written as
  p(f* | x*, X, y) = N(f* | k*ᵀK_y⁻¹y, k** − k*ᵀK_y⁻¹k*)
• where k* = [κ(x*, x₁), …, κ(x*, x_N)] and k** = κ(x*, x*).
• Another way to write the posterior mean is as follows:
  f̄(x*) = k*ᵀK_y⁻¹y = Σᵢ₌₁ᴺ αᵢ κ(xᵢ, x*)
• where α = K_y⁻¹y. We will revisit this expression later.
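A sketch of the noisy-case predictive equations, using a Cholesky factorization of K_y so that α = K_y⁻¹y is computed stably (function and variable names are ours):

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

def gp_predict_noisy(X, y, Xs, kernel, sigma_y2):
    Ky = kernel(X, X) + sigma_y2 * np.eye(len(X))   # K + sigma_y^2 I
    L = cholesky(Ky, lower=True)
    alpha = cho_solve((L, True), y)                 # alpha = Ky^{-1} y
    Ks = kernel(X, Xs)
    mu_s = Ks.T @ alpha                             # posterior mean
    v = np.linalg.solve(L, Ks)                      # L^{-1} K*
    Sigma_s = kernel(Xs, Xs) - v.T @ v              # posterior covariance
    return mu_s, Sigma_s
```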
17. 15.2.3 Effect of the kernel parameters
• The predictive performance of a GP depends exclusively on the suitability of the chosen kernel.
• Suppose we choose the following squared-exponential (SE) kernel for the noisy observations:
  κ_y(x_p, x_q) = σ_f² exp(−(x_p − x_q)² / (2ℓ²)) + σ_y² δ_pq
• (a) ℓ = 1: a good fit
• (b) ℓ = 0.3: shrinking the length scale makes the function wiggly and widens the uncertainty
• (c) ℓ = 3: growing it makes the fitted function overly smooth
• σ_f² controls the vertical scale of the function
• σ_y² is the noise variance
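The SE kernel above, written out with its three hyperparameters (a sketch; the Kronecker delta is handled by adding noise on the diagonal only when the two input sets are the same):

```python
import numpy as np

def se_kernel_noisy(X1, X2, ell, sf2, sy2):
    """kappa_y(xp, xq) = sf2 * exp(-(xp - xq)^2 / (2 ell^2)) + sy2 * delta_pq."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    K = sf2 * np.exp(-0.5 * d2 / ell**2)
    if X1 is X2:                      # delta_pq: noise on the diagonal only
        K = K + sy2 * np.eye(len(X1))
    return K
# Varying ell reproduces the three panels: ell=0.3 wiggly, ell=1 good, ell=3 smooth.
```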
18. 15.2.4 Estimating the kernel parameters
• To estimate the kernel parameters, we could run an exhaustive grid search with CV as the objective, but that is slow
• Instead, we estimate the kernel parameters by maximizing the marginal likelihood:
  log p(y | X) = −½ yᵀK_y⁻¹y − ½ log|K_y| − (N/2) log(2π)
• The first term is a data fit term, the second is a model complexity term, and the third is a constant
• If the kernel is an RBF with a small bandwidth, the similarity between distant points goes to 0, so GPR can fit the data very tightly. As the bandwidth shrinks further, K_y becomes nearly a diagonal matrix (even nearby points have similarity close to 0), so log|K_y| grows (the determinant of a diagonal matrix is the product of its diagonal entries) while the fit error shrinks
• Conversely, as the kernel parameter (bandwidth) grows, the data fit error grows and log|K_y| shrinks
• In summary, the first term plays the role of the likelihood (data fit) and the second that of model complexity, and there is a trade-off between them.
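A sketch of this objective: the negative log marginal likelihood for the SE kernel, with the data-fit and complexity terms labeled, optimized over log-hyperparameters for positivity (function names are ours):

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve
from scipy.optimize import minimize

def neg_log_marginal_lik(log_params, X, y):
    ell, sf2, sy2 = np.exp(log_params)               # positivity via log-space
    d2 = (X[:, None] - X[None, :]) ** 2
    Ky = sf2 * np.exp(-0.5 * d2 / ell**2) + sy2 * np.eye(len(X))
    L = cholesky(Ky, lower=True)
    alpha = cho_solve((L, True), y)
    data_fit   = 0.5 * y @ alpha                     # 0.5 y^T Ky^{-1} y
    complexity = np.log(np.diag(L)).sum()            # 0.5 log|Ky|
    const      = 0.5 * len(X) * np.log(2 * np.pi)
    return data_fit + complexity + const

# res = minimize(neg_log_marginal_lik, x0=np.zeros(3), args=(X, y))
```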
23. 15.3 GPs meet GLMs
15.3.1.1 Computing the posterior
• Define the log of the unnormalized posterior as follows:
  ℓ(f) = log p(y | f) + log p(f | X) = log p(y | f) − ½ fᵀK⁻¹f − ½ log|K| − (N/2) log(2π)
• Let J(f) ≜ −ℓ(f) be the function we want to minimize
• The gradient and Hessian of this are given by
  g = −∇ log p(y | f) + K⁻¹f
  H = −∇∇ log p(y | f) + K⁻¹ = W + K⁻¹, where W ≜ −∇∇ log p(y | f)
• We iterate Newton's method to find the minimum:
  f_new = f − H⁻¹g = (K⁻¹ + W)⁻¹(Wf + ∇ log p(y | f))
• Let f̂ denote the converged f, the value at which the posterior is maximized (the MAP estimate)
• Then the Gaussian approximation to the posterior of f is
  p(f | X, y) ≈ N(f | f̂, (K⁻¹ + W)⁻¹)
• Here the covariance is the inverse of the Hessian of J evaluated at the mode f̂
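A minimal sketch of these Newton iterations for binary GP classification with a logistic likelihood (an assumed example; y ∈ {0, 1} and the names are ours):

```python
import numpy as np

def laplace_gp_mode(K, y, n_iter=25):
    """Newton's method for the mode f_hat of p(f | X, y)."""
    N = len(y)
    f = np.zeros(N)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))     # sigmoid(f_i)
        grad = y - pi                     # gradient of log p(y|f)
        W = pi * (1.0 - pi)               # -Hessian of log p(y|f), diagonal
        # f_new = (K^{-1} + W)^{-1}(W f + grad) = (I + K W)^{-1} K (W f + grad)
        b = W * f + grad
        f = np.linalg.solve(np.eye(N) + K * W[None, :], K @ b)
    return f  # posterior approx: N(f_hat, (K^{-1} + diag(W))^{-1})
```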
24. 15.3 GPs meet GLMs
15.3.1.2 Computing the posterior predictive
• The posterior predictive for a test point x* is obtained by marginalizing out f:
  p(f* | x*, X, y) = ∫ p(f* | x*, X, f) p(f | X, y) df
• Integrating f out under the Gaussian approximation, and plugging in the MAP estimate f̂ of f, gives E[f* | x*, X, y] ≈ k*ᵀK⁻¹f̂
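Plugging the MAP estimate into the predictive mean (a sketch reusing the names above):

```python
import numpy as np

def gp_predict_mean(K, Ks, f_hat):
    """E[f* | x*, X, y] ≈ K*^T K^{-1} f_hat (f_hat from laplace_gp_mode)."""
    return Ks.T @ np.linalg.solve(K, f_hat)
```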