Gaussian Process
Jungkyu Lee
Daum Search Quality Team
Intuition
• http://explainaway.wordpress.com/2008/12/01/ridiculous-stats/
• x axis: year, y axis: 100 m sprint time, shaded band = predictive uncertainty
• Extrapolating a straight-line fit leads to absurd conclusions (e.g. predictions for 2030 and beyond), with no sense of how unreliable the extrapolation is
• What we want: given a new x (year), predict the corresponding output (100 m time) together with a measure of how confident we are in that prediction
regression
Prerequisite: Conditioning of Multivariate Gaussians
• Take a random vector x ∈ R^n with x ~ N(μ, Σ), partitioned into blocks x_A and x_B
• Then the marginals are given by x_A ~ N(μ_A, Σ_AA) and x_B ~ N(μ_B, Σ_BB)
• and the posterior conditional is given by x_A | x_B ~ N(μ_A + Σ_AB Σ_BB^{-1} (x_B − μ_B), Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA)
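Below is a minimal numpy sketch of these two rules (the block indices, the toy μ and Σ, and the function name are illustrative, not from the slides):

```python
# A minimal sketch of conditioning a multivariate Gaussian, assuming the
# usual block partition x = (x_A, x_B) with x ~ N(mu, Sigma).
import numpy as np

def condition_gaussian(mu, Sigma, idx_A, idx_B, x_B):
    """Return mean/cov of p(x_A | x_B) for x ~ N(mu, Sigma)."""
    mu_A, mu_B = mu[idx_A], mu[idx_B]
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    # Posterior conditional: mu_A + S_AB S_BB^{-1} (x_B - mu_B)
    G = S_AB @ np.linalg.inv(S_BB)
    mu_cond = mu_A + G @ (x_B - mu_B)
    Sigma_cond = S_AA - G @ S_AB.T   # S_AA - S_AB S_BB^{-1} S_BA
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
print(condition_gaussian(mu, Sigma, [0], [1], np.array([2.0])))
```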
Prerequisite: Bayesian Linear Regression
• Reference:
• http://play.daumcorp.com/display/~sweaterr/7.+Linear+Regression#7.LinearRegression-7.6Bayesianlinearregression
• If the prior on w is Gaussian and the likelihood is Gaussian, then the posterior on w is also Gaussian, and it can be computed in closed form using the linear Gaussian system results
• The posterior predictive is also tractable in closed form: to make a prediction we average the predictions of all weight vectors w, each weighted by its posterior probability
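As a concrete illustration, here is a minimal sketch of that closed-form posterior and posterior predictive, assuming a zero-mean Gaussian prior w ~ N(0, τ²I) and noise variance σ² (both hyperparameter values are illustrative, not from the slides):

```python
# A minimal sketch of Bayesian linear regression as a linear Gaussian system.
import numpy as np

def blr_posterior(X, y, sigma2=0.1, tau2=1.0):
    D = X.shape[1]
    S_N = np.linalg.inv(np.eye(D) / tau2 + X.T @ X / sigma2)  # posterior covariance
    m_N = S_N @ X.T @ y / sigma2                              # posterior mean
    return m_N, S_N

def blr_predict(x_star, m_N, S_N, sigma2=0.1):
    # Posterior predictive is Gaussian: averaging over w gives
    # mean m_N^T x* and variance x*^T S_N x* + sigma2.
    return x_star @ m_N, x_star @ S_N @ x_star + sigma2

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias column + input
y = np.array([0.1, 1.1, 1.9])
m_N, S_N = blr_posterior(X, y)
print(blr_predict(np.array([1.0, 3.0]), m_N, S_N))
```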
Parametric Models vs. Non-parametric Models
A non-parametric model, unlike a parametric model, does not fix the form of the model in advance; its effective complexity grows with the amount of data.
Parametric models:
• Linear Regression
• GMM
Non-parametric models:
• KNN
• Kernel Regression
• Gaussian Process
Non-parametric Models
Kernel Regression (Non-Parametric, Non-Bayes) vs. GP Regression (Non-Parametric, Bayes)
The two produce similar-looking predictions, but only GP regression attaches a confidence (uncertainty) to them
15.1 Introduction
• In supervised learning, we observe some inputs xi and some outputs yi.
• We assume yi = f(xi) for some unknown function f mapping xi to yi, possibly corrupted by noise.
• The optimal approach is to infer a distribution over functions given the data, and then use it to make predictions for new inputs.
• Until now the function was given a parametric form, so instead of p(f|D) we inferred a distribution over parameters, p(θ|D).
• A Gaussian process instead performs Bayesian inference over the functions themselves.
Probability distributions over functions with finite domains
• Suppose we have a finite set of training examples X = {x1, . . . , xm}
• We can define a probability distribution over functions whose domain is this set
• Since the domain has only m elements, each function can be represented compactly as an m-dimensional vector of its values
• So a distribution over these functions is just a distribution over m-dimensional vectors
• In other words, on a finite domain a function is nothing more than a vector.
• Sampling once from this distribution yields concrete values h(x1), . . . , h(xm), i.e. an entire function
• But when the domain is infinite, how do we define a probability distribution over functions?
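A minimal sketch of this idea: on a 3-point domain, a distribution over functions is literally a 3-dimensional Gaussian (the domain, mean, and covariance below are illustrative):

```python
# A "distribution over functions" on a finite domain X = {x1, ..., xm} is just
# an m-dimensional Gaussian over the value vector [h(x1), ..., h(xm)];
# sampling it samples a whole function at once.
import numpy as np

X = np.array([1.0, 2.0, 3.0])                 # finite domain, m = 3
mu = np.zeros(3)                               # mean of each h(xi)
Sigma = np.eye(3)                              # any PSD matrix works
h = np.random.multivariate_normal(mu, Sigma)   # one sampled function
print({x: v for x, v in zip(X, h)})            # the function as a table x -> h(x)
```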
Probability distributions over functions with infinite domains
• A stochastic process is a collection of random variables indexed by some set
• A Gaussian process is a stochastic process in which every finite sub-collection has a joint probability that is a multivariate Gaussian distribution (MVN).
• A Gaussian Process is thus defined over infinitely many random variables, but any finite subset of them is jointly MVN
• So if we pick any points x1, . . . , xm, the function values h(x1), . . . , h(xm) follow an MVN,
• and we need some way to specify its mean and covariance.
• Over an infinite domain, the covariance is specified by a kernel.
• Because the number of variables is not a fixed m, we cannot tabulate the covariance as a matrix; instead the correlation between any two variables is computed from their inputs
• by a kernel function.
• The usual prior assumption is that inputs close to each other should have similar, strongly correlated function values
• A Gaussian Process is therefore a distribution over functions (on an infinite domain), from which we can sample entire functions
• In practice, however, we only ever need the function values at a finite set of points, so every computation reduces to a finite MVN
Graphical Model for GP
• A Gaussian Process with 2 training points and 1 test point,
• drawn as a mixed directed and undirected graphical model
• The fi = f(xi) are the function values at the data points; they are hidden nodes, fully interconnected with each other
• The test input x* enters the model exactly like a training input, and y* exactly like a training output.
• In this respect a GP resembles kernel regression: the prediction at a test point is driven by the f (and hence y) values of the training points near it
Different kernels
• Since a GP is a distribution over functions, we can sample whole functions from it
• Samples drawn under different kernels have visibly different shapes
• Choosing the kernel is therefore the heart of GP design: it encodes our assumptions about the function being modeled
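The sketch below samples functions from zero-mean GP priors with squared-exponential kernels of different length scales; shorter length scales yield wigglier samples (kernel choice and parameter values are illustrative):

```python
# A minimal sketch of sampling functions from a GP prior.
import numpy as np

def se_kernel(X1, X2, ell=1.0, sf=1.0):
    # k(x, x') = sf^2 exp(-(x - x')^2 / (2 ell^2))
    d = X1[:, None] - X2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

xs = np.linspace(-5, 5, 100)
for ell in (0.3, 1.0, 3.0):
    K = se_kernel(xs, xs, ell) + 1e-8 * np.eye(len(xs))  # jitter for PSD
    f = np.random.multivariate_normal(np.zeros(len(xs)), K)
    # Wigglier samples have larger point-to-point increments:
    print(f"ell={ell}: sample std of increments = {np.std(np.diff(f)):.3f}")
```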
15.2 GPs for regression
• We place a GP prior on the regression function:

f(x) ~ GP(m(x), κ(x, x′)), with m(x) = E[f(x)] and κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]

• where m(x) is the mean function and κ(x, x′) is the kernel or covariance function
• For any finite set of points X = {x1, . . . , xN}, the function values are jointly Gaussian (this is the definition of a GP):

p(f|X) = N(f|μ, K), with Kij = κ(xi, xj) and μ = (m(x1), . . . , m(xN))
15.2.1 Predictions using noise-free observations
• Suppose we observe fi = f(xi) at the training inputs xi with no noise
• Given a test set X* of size N* × D, we want to predict the function outputs f*
• If we ask the GP for f(x) at an x it has already seen, it should return the observed value with no uncertainty; the test values f* and the training values f are modeled as one joint Gaussian (assuming a zero mean for simplicity):

(f, f*) ~ N(0, [[K, K*], [K*^T, K**]])

• By the standard rules for conditioning Gaussians (Section 4.3), the posterior has the following form:

p(f*|X*, X, f) = N(f*|K*^T K^{-1} f, K** − K*^T K^{-1} K*)

• With a zero mean, if K were the identity matrix the posterior mean would reduce to K*^T f: the training values f, weighted by the similarity between the test point and each training point
• K = κ(X, X): Gram matrix between the training points
• K* = κ(X, X*): Gram matrix between the training and test points
• K** = κ(X*, X*): Gram matrix between the test points
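A minimal sketch of these formulas, assuming a squared-exponential kernel with unit length scale (the training data are a toy example):

```python
# Noise-free GP regression: condition the joint Gaussian over (f, f*)
# on the observed f, using the block Gram matrices above.
import numpy as np

def se_kernel(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

X = np.array([-2.0, 0.0, 1.5])        # training inputs
f = np.sin(X)                         # noise-free observations f = f(X)
Xs = np.linspace(-3, 3, 7)            # test inputs

K   = se_kernel(X, X) + 1e-10 * np.eye(len(X))   # training Gram matrix (+ jitter)
Ks  = se_kernel(X, Xs)                            # train-test Gram matrix
Kss = se_kernel(Xs, Xs)                           # test Gram matrix

mu_star  = Ks.T @ np.linalg.solve(K, f)           # K*^T K^{-1} f
cov_star = Kss - Ks.T @ np.linalg.solve(K, Ks)    # K** - K*^T K^{-1} K*
print(mu_star)            # interpolates: at x in X, mu_star equals f(x)
print(np.diag(cov_star))  # variance is ~0 at the training points
```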
15.2.1 Predictions using noise-free observations
• Left: samples from the GP prior
• Right: samples from the GP posterior, after conditioning on 5 noise-free observations
• The predictive uncertainty shrinks to zero at the observations and grows as we move away from them
15.2.2 Predictions using noisy observations
• Now suppose the observations are noisy: y = f(x) + ε
• In this case the model is not required to interpolate the observed responses exactly,
• but it must pass close to them; the covariance of the noisy responses is

cov[yp, yq] = κ(xp, xq) + σ_y² δpq, i.e. cov[y|X] = K + σ_y² I_N ≜ K_y

• where δpq = I(p = q). In other words,
• the noise term appears only on the diagonal, because the noise on each observation is assumed independent
• As before, the joint probability of the observed data and the noise-free function value at a test point is

(y, f*) ~ N(0, [[K_y, K*], [K*^T, K**]])

• where, to keep the notation simple, we assume the mean is zero
• Hence the posterior predictive density is

p(f*|X*, X, y) = N(f*|K*^T K_y^{-1} y, K** − K*^T K_y^{-1} K*)

• The only change from the noise-free case is that K is replaced by K_y, i.e. σ_y² is added to the diagonal
15.2.2 Predictions using noisy observations
• For a single test point x* this reduces to

p(f*|x*, X, y) = N(f*|k*^T K_y^{-1} y, k** − k*^T K_y^{-1} k*)

• where k* = [κ(x*, x1), . . . , κ(x*, xN)] and k** = κ(x*, x*).
• Another way to write the posterior mean is as follows:

f̄* = k*^T α = Σi αi κ(xi, x*)

• where α = K_y^{-1} y. We will revisit this expression later
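A minimal sketch of the single-test-point formulas, using a Cholesky factorization of K_y and the α = K_y^{-1} y form of the mean (kernel and data are toy choices):

```python
# GP regression with noisy observations.
import numpy as np

def se_kernel(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

sigma_y2 = 0.1
X  = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y  = np.sin(X) + np.sqrt(sigma_y2) * np.random.randn(len(X))
Xs = np.array([0.5])

Ky = se_kernel(X, X) + sigma_y2 * np.eye(len(X))     # K + sigma_y^2 I
L  = np.linalg.cholesky(Ky)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K_y^{-1} y

ks  = se_kernel(X, Xs)[:, 0]      # k* = [k(x*, x1), ..., k(x*, xN)]
kss = se_kernel(Xs, Xs)[0, 0]     # k** = k(x*, x*)
mean = ks @ alpha                 # posterior mean k*^T alpha
v = np.linalg.solve(L, ks)
var = kss - v @ v                 # posterior variance k** - k*^T K_y^{-1} k*
print(mean, var)
```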
15.2.3 Effect of the kernel parameters
• The predictions of a GP are determined entirely by the kernel we choose and its parameters.
• Suppose we choose the following squared-exponential (SE) kernel for the noisy observations:

κ_y(xp, xq) = σ_f² exp(−(xp − xq)² / (2ℓ²)) + σ_y² δpq

• ℓ is the horizontal length scale over which the function varies, σ_f² controls the vertical scale of the function, and σ_y² is the noise variance
• (a) ℓ = 1: a good fit
• (b) ℓ = 0.3: too short → the fit wiggles rapidly and chases the noise
• (c) ℓ = 3: too long → the fit is over-smoothed
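For concreteness, here is this kernel as code (hyperparameter values are illustrative; note that δpq formally compares indices p and q, which coincides with comparing inputs when all inputs are distinct):

```python
# The SE kernel for noisy observations:
# k_y(xp, xq) = sigma_f^2 exp(-(xp - xq)^2 / (2 l^2)) + sigma_y^2 delta_pq
import numpy as np

def se_kernel_noisy(xp, xq, ell=1.0, sigma_f=1.0, sigma_y=0.1):
    delta = 1.0 if xp == xq else 0.0   # stand-in for delta_pq (index equality)
    return sigma_f**2 * np.exp(-(xp - xq) ** 2 / (2 * ell**2)) + sigma_y**2 * delta

print(se_kernel_noisy(0.0, 0.0))            # sigma_f^2 + sigma_y^2 on the diagonal
print(se_kernel_noisy(0.0, 3.0, ell=1.0))   # nearly 0: "far" at a short length scale
print(se_kernel_noisy(0.0, 3.0, ell=3.0))   # larger: ell stretches what counts as near
```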
15.2.4 Estimating the kernel parameters
• We could estimate the kernel parameters by exhaustive search over a discrete grid with CV as the objective, but that is expensive
• Instead we maximize the marginal likelihood with respect to the kernel parameters:

log p(y|X) = −½ y^T K_y^{-1} y − ½ log|K_y| − (N/2) log(2π)

• The first term is a data fit term, the second a model complexity term
• For an RBF kernel with a small bandwidth, the similarity to distant points drops to 0 while nearby points remain similar, so GPR fits the data very tightly. As the bandwidth shrinks further, K_y approaches a diagonal matrix (even near points get similarity ≈ 0); since the determinant of a diagonal matrix is the product of its diagonal entries, log|K_y| becomes large, even as the fit term keeps improving
• Conversely, as the kernel parameter (bandwidth) grows, the data fit error increases while log|K_y| decreases
• So the first term acts as the likelihood (data fit) and the second as a model complexity penalty, and there is a trade-off between them.
• We choose the kernel parameter ℓ that maximizes the marginal likelihood, which strikes this balance.
• The gradient with respect to the kernel parameters is available in closed form,
• so we can use gradient descent or any standard gradient-based optimizer
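A minimal sketch of this procedure, selecting ℓ by grid search over the log marginal likelihood (a gradient optimizer would use the analytic gradient instead; data and grid are illustrative):

```python
# Choose the length scale by maximizing
# log p(y|X) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - N/2 log(2 pi).
import numpy as np

def se_kernel(A, B, ell):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

def log_marginal_likelihood(X, y, ell, sigma_y2=0.1):
    N = len(X)
    Ky = se_kernel(X, X, ell) + sigma_y2 * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = -0.5 * y @ alpha                    # data fit term
    complexity = -np.sum(np.log(np.diag(L)))  # -1/2 log|Ky| via Cholesky
    return fit + complexity - 0.5 * N * np.log(2 * np.pi)

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.3 * np.random.randn(20)
ells = [0.1, 0.3, 1.0, 3.0, 10.0]
print(max(ells, key=lambda e: log_marginal_likelihood(X, y, e)))
```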
15.3 GPs meet GLMs
• GPs can be extended to the GLM setting
• Instead of passing x through a parametric function f(x) = w^T x, we let f ~ GP(0, κ)
• In 15.3.1 we look at the binary classification case
• We define the model as

p(yi|xi) = σ(yi fi)

• where yi ∈ {−1, +1}, and we let σ(z) = sigm(z), the sigmoid used in logistic regression
• if yi = 1, the probability exceeds 0.5 exactly when fi > 0
• if yi = −1, the probability exceeds 0.5 exactly when fi < 0
• As for GP regression, we assume f ~ GP(0, κ).
Prerequisite: Gaussian approximation (aka Laplace approximation)
• http://play.daumcorp.com/pages/viewpage.action?pageId=157627536#8Logisticregression-8.4.1Laplaceapproximation
• We approximate the posterior with a Gaussian:

p(θ|D) ≈ N(θ|θ̂, H^{-1})

• the mean of the Gaussian is the MAP estimate θ̂ (the mode of the posterior), and its covariance is the inverse of the Hessian H of the negative log posterior, evaluated at that mode
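A minimal 1-D sketch of the idea: Newton's method climbs to the mode, and the curvature there supplies the Gaussian (the toy log posterior and the numerical derivatives are illustrative):

```python
# Laplace approximation in 1-D: find the mode theta_hat, then approximate the
# posterior as N(theta_hat, H^{-1}), H = Hessian of the NEGATIVE log posterior.
import numpy as np

# Toy unnormalized log posterior (illustrative choice, not from the slides).
log_post = lambda t: -0.5 * t**2 + np.log(1 / (1 + np.exp(-2 * t)))
grad = lambda t, h=1e-5: (log_post(t + h) - log_post(t - h)) / (2 * h)
hess = lambda t, h=1e-4: (log_post(t + h) - 2 * log_post(t) + log_post(t - h)) / h**2

t = 0.0
for _ in range(20):            # Newton iterations to the mode
    t = t - grad(t) / hess(t)
H = -hess(t)                   # curvature of the negative log posterior
print(f"Laplace approx: N(mean={t:.4f}, var={1 / H:.4f})")
```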
Prerequisite: Gaussian approximation for logistic regression
• Concretely, for logistic regression ŵ is the MAP estimate of the weights and H is the Hessian of the negative log posterior at ŵ, so the posterior is approximated as p(w|D) ≈ N(ŵ, H^{-1})
15.3 GPs meet GLMs
15.3.1.1 Computing the posterior
• Define the log of the unnormalized posterior as follows:

ℓ(f) = log p(y|f) − ½ f^T K^{-1} f − ½ log|K| − (N/2) log(2π)

• Let J(f) ≜ −ℓ(f) be the function we want to minimize
• The gradient and Hessian of this are given by

∇J(f) = −∇ log p(y|f) + K^{-1} f
∇∇J(f) = W + K^{-1}, where W ≜ −∇∇ log p(y|f)

• We find the minimum using Newton's method, iterating

f ← (K^{-1} + W)^{-1} (W f + ∇ log p(y|f))

• At convergence this yields f̂, the mode (MAP estimate) of the posterior over f
• The Gaussian approximation of the posterior is then p(f|X, y) ≈ N(f̂, (K^{-1} + W)^{-1})
• (the σ appearing in the likelihood is the sigmoid, under which W is diagonal, not a variance parameter)
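A minimal sketch of these Newton iterations for a toy 1-D classification problem with the sigmoid likelihood (data and kernel are illustrative; a production implementation would use the numerically stable form in Rasmussen and Williams, Algorithm 3.1):

```python
# Laplace approximation for GP binary classification:
# iterate f <- (K^{-1} + W)^{-1} (W f + grad log p(y|f)), y in {-1, +1}.
import numpy as np

def se_kernel(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)

sigm = lambda z: 1 / (1 + np.exp(-z))

X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = se_kernel(X, X) + 1e-8 * np.eye(len(X))

f = np.zeros(len(X))
for _ in range(20):
    pi = sigm(f)
    t = (y + 1) / 2                    # targets rewritten in {0, 1}
    grad = t - pi                      # d/df log p(y|f) for the sigmoid
    W = np.diag(pi * (1 - pi))         # -Hessian of the log likelihood (diagonal)
    f = np.linalg.solve(np.linalg.inv(K) + W, W @ f + grad)
f_hat = f                              # MAP estimate of the latent f
print(f_hat)                           # positive where y = +1, negative where y = -1
```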
15.3 GPs meet GLMs
15.3.1.2 Computing the posterior predictive
• The posterior predictive mean for a test point x* is

E[f*|x*, X, y] ≈ k*^T K^{-1} f̂

• obtained by replacing the unknown f with its MAP estimate f̂
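Continuing the sketch from 15.3.1.1 (it reuses X, K, W, f_hat, se_kernel, and sigm defined there), the predictive mean is then squashed through the sigmoid, here averaged over the Gaussian by simple Monte Carlo (the variance formula k** − k*^T (K + W^{-1})^{-1} k* follows Rasmussen and Williams, eq. 3.24):

```python
# Predictive distribution at a test point, plugging in the MAP estimate f_hat.
Xs = np.array([0.5])
ks = se_kernel(X, Xs)[:, 0]
mu_star = ks @ np.linalg.solve(K, f_hat)               # k*^T K^{-1} f_hat
var_star = se_kernel(Xs, Xs)[0, 0] - ks @ np.linalg.solve(K + np.linalg.inv(W), ks)
samples = mu_star + np.sqrt(max(var_star, 0)) * np.random.randn(5000)
print(sigm(samples).mean())                             # predictive p(y* = 1 | x*)
```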
15.3 GPs meet GLMs
15.3.1.3 Computing the marginal likelihood
• To optimize the kernel parameters we need the marginal likelihood, p(y|X) = ∫ p(y|f) p(f|X) df, again handled with the Laplace approximation
• Differentiating log p(D) with respect to the kernel parameters is more involved than in the regression case, because W, K, and f̂ all depend on the kernel parameters (Rasmussen and Williams 2006, p. 125)
• The computation reuses the quantities already obtained in 15.3.1.1 (Computing the posterior)
Summary
• Why Gaussian Processes?
• They are a Bayesian method:
• predictions come with uncertainty estimates
• model selection and hyperparameter selection can also be handled with Bayesian machinery
• e.g. the bandwidth of an RBF kernel can be chosen by maximizing the marginal likelihood
• They are non-parametric:
• predictions at an input point are computed directly from the training points, with no assumed model form (no model assumption)
