03. Linear Regression
Jeonghun Yoon
Last time: the Naive Bayes Classifier

arg max_y P(x_1, ..., x_d | y) P(y) = arg max_y P(y) āˆ_{i=1}^{d} P(x_i | y)

That is, for each class y we multiply the probability of class y by the probabilities that each element x_i of the feature vector (a word, in the document example) occurs in data labeled y, and pick the class with the largest product.

ex) To decide whether (I, love, you) is spam or not, we compare
the proportion of spam in the labeled data, multiplied by the probabilities that I, love, and you occur in documents labeled spam,
against the proportion of ham in the labeled data, multiplied by the probabilities that I, love, and you occur in documents labeled ham.
ģ§€ė‚œ ģ‹œź°„ ėÆøė¹„ķ–ˆė˜ ģ  ė“¤...
1. Laplacian Smoothing (appendix ģ°øź³ )
2. MLE / MAP
1
Bayes' Rule

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ

Here p(θ | x) is the posterior, p(x | θ) is the likelihood, and p(θ) is the prior.

Posterior: the probability of the parameter after the observations have been made.
Prior: the probability of the parameter before the observations are made.
Likelihood: the probability that the observations occur, given a value of the parameter.
Maximum Likelihood Estimate

The likelihood is defined as follows:

L(θ) = p(x | θ)

the probability of obtaining the observed data set x = (x_1, ..., x_n) given the parameter θ.
Note that p(x | θ) is a function of θ; it is not a pdf of θ.
The Maximum Likelihood Estimate is defined as follows: the MLE is the θ under which the probability of obtaining the observed data set x = (x_1, ..., x_n) is largest.

θ̂ = arg max_θ L(θ) = arg max_θ p(x | θ)

[Figure: the curves p(x | θ_1), p(x | θ_2), p(x | θ_3) plotted over the data; the observed data set is most probable under θ_2, so here θ̂ = θ_2.]
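A worked example (mine, not from the slides): the MLE of a Bernoulli parameter θ from x = (x_1, ..., x_n) with x_i ∈ {0, 1} and p(x_i | θ) = θ^{x_i}(1 āˆ’ θ)^{1 āˆ’ x_i}.

```latex
\mathcal{L}(\theta) = p(\mathbf{x}\mid\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}
\;\Rightarrow\;
\log\mathcal{L}(\theta) = \Big(\sum_i x_i\Big)\log\theta + \Big(n-\sum_i x_i\Big)\log(1-\theta)

\frac{d}{d\theta}\log\mathcal{L}(\theta)
 = \frac{\sum_i x_i}{\theta} - \frac{n-\sum_i x_i}{1-\theta} = 0
\;\Rightarrow\;
\hat{\theta}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i
```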
When we know the likelihood function p(x | θ) and the prior p(θ), Bayes' rule lets us compute the value of the posterior:

p(θ | x) āˆ p(x | θ) p(θ)

Maximum A Posteriori Estimate

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ
(posterior āˆ likelihood Ɨ prior)
Likelihood š‘(š•©|šœƒ)
Prior š‘(šœƒ)
Posterior
š‘ šœƒ š•© āˆ š‘ š•© šœƒ š‘(šœƒ)
Likelihood š‘(š•©|šœƒ)
Prior š‘(šœƒ)
Posterior
š‘ šœƒ š•© āˆ š‘ š•© šœƒ š‘(šœƒ)
šœ½ = ššš«š  š¦ššš±
šœ½
š’‘(šœ½|š•©)
Likelihood š‘(š•©|šœƒ)
Prior š‘(šœƒ)
Posterior
š‘ šœƒ š•© āˆ š‘ š•© šœƒ š‘(šœƒ)
Regression

Example 1)
I am the CEO of a large shoe company with many branches, and I want to open a new branch. In which region should I open it?
It would be a big help if I could estimate the expected profit of the regions I am considering.
The data I have are the profits of each existing branch and the population of the region each branch is in.
The solution: Linear Regression!
With it, once we know the population of a new region we can estimate the expected profit in that region.

Example 2)
I have just moved to Pittsburgh and I want to find an apartment at the most reasonable price.
These are the things I consider when buying a home: square footage, number of bedrooms, distance to school, ...
So what would a house with the size and number of bedrooms I want actually cost?
ā‘  Given an input x we would like to compute an output y.
(When I enter the size of the house I want and the number of rooms, compute a predicted house price.)
ā‘” For example
1) Predict height from age (height = y, age = x)
2) Predict Google's price from Yahoo's price (Google's price = y, Yahoo's price = x)

y = θ_0 + θ_1 x

In other words, if we find the line y = θ_0 + θ_1 x from the existing data (learning/training), then when a new value x_new is given we can predict the corresponding value of y (prediction).
Input : ģ§‘ģ˜ ķ¬źø°(š‘„1), ė°©ģ˜ ź°œģˆ˜(š‘„2), ķ•™źµź¹Œģ§€ģ˜ ź±°ė¦¬(š‘„3),.....
(š‘„1, š‘„2, ā€¦ , š‘„ š‘›) : ķŠ¹ģ„± ė²”ķ„° feature vector
Output : ģ§‘ ź°’(š‘¦)
š’š = šœ½ šŸŽ + šœ½ šŸ š’™ šŸ + šœ½ šŸ š’™ šŸ + ā‹Æ + šœ½ š’ š’™ š’
training setģ„ ķ†µķ•˜ģ—¬ ķ•™ģŠµ(learning)
Simple Linear Regression

Given the i-th observation (x_i, y_i), the simple regression model is

y_i = θ_0 + θ_1 x_i + ε_i

where y is the dependent variable, x is the explanatory (independent) variable, and ε_i is the difference (error) between the regression line we are trying to find and the actually observed y_i at the i-th observation.

We want the line that makes the sum of squared errors as small as possible; that is, we want to estimate the θ_0 and θ_1 that achieve this.
How? The Least Squares Method:

min Σ_i (y_i āˆ’ (θ_0 + θ_1 x_i))^2 = min Σ_i ε_i^2

[Figure: the observed values y_i scattered around the fitted line y = θ_0 + θ_1 x, whose values on the line are the "ideal" values.]
min š‘¦š‘– āˆ’ šœƒ0 + šœƒ1 š‘„š‘–
2
š‘–
= min šœ–š‘–
2
š‘–
ģ‹¤ģ œ ź“€ģø” ź°’ ķšŒź·€ ģ§ģ„ ģ˜ ź°’(ģ“ģƒģ ģø ź°’)
ģœ„ģ˜ ģ‹ģ„ ģµœėŒ€ķ•œ ė§Œģ”± ģ‹œķ‚¤ėŠ” šœƒ0, šœƒ1ģ„ ģ¶”ģ •ķ•˜ėŠ” ė°©ė²•ģ€ ė¬“ģ—‡ģ¼ź¹Œ?
(ģ“ėŸ¬ķ•œ šœƒ1, šœƒ2ė„¼ šœƒ1, šœƒ2 ė¼ź³  ķ•˜ģž.)
- Normal Equation
- Steepest Gradient Descent
Ė† Ė†
What is the normal equation?

To find a maximum or minimum, we differentiate the given expression and look for the values that set the derivative to 0.

min Σ_i (y_i āˆ’ (θ_0 + θ_1 x_i))^2

First, differentiate with respect to θ_0:
∂/∂θ_0 Σ_i (y_i āˆ’ (θ_0 + θ_1 x_i))^2 = āˆ’2 Σ_i (y_i āˆ’ (θ_0 + θ_1 x_i)) = 0

Next, differentiate with respect to θ_1:
∂/∂θ_1 Σ_i (y_i āˆ’ (θ_0 + θ_1 x_i))^2 = āˆ’2 Σ_i (y_i āˆ’ (θ_0 + θ_1 x_i)) x_i = 0

We just have to find the θ_0, θ_1 that make the two expressions above equal to 0. When we have a system of 2 equations in 2 unknowns like this, we call the system the normal equations.
The normal equation form

Let x_i = (1, x_i)^T, Θ = (θ_0, θ_1)^T, y = (y_1, y_2, ..., y_n)^T, let X be the n Ɨ 2 matrix whose i-th row is (1, x_i), and let ε = (ε_1, ..., ε_n)^T.

Assume the n observations (x_i, y_i) follow the regression model

y_1 = θ_0 + θ_1 x_1 + ε_1
y_2 = θ_0 + θ_1 x_2 + ε_2
...
y_n = θ_0 + θ_1 x_n + ε_n

Stacking these rows gives the matrix form

y = XΘ + ε.
šœ–š‘—
2
š‘›
š‘—=1
= š•– š‘‡
š•– = š•Ŗ āˆ’ š‘‹Ī˜ š‘‡
(š•Ŗ āˆ’ š‘‹Ī˜)
= š•Ŗ š‘‡
š•Ŗ āˆ’ Ī˜ š‘‡
š‘‹ š‘‡
š•Ŗ āˆ’ š•Ŗ š‘‡
š‘‹Ī˜ + Ī˜ š‘‡
š‘‹ š‘‡
š‘‹Ī˜
= š•Ŗ š‘‡
š•Ŗ āˆ’ 2Ī˜ š‘‡
š‘‹ š‘‡
š•Ŗ + Ī˜ š‘‡
š‘‹ š‘‡
š‘‹Ī˜
1 by 1 ķ–‰ė ¬ģ“ėƀė”œ
ģ „ģ¹˜ķ–‰ė ¬ģ˜ ź°’ģ“ ź°™ė‹¤!
šœ•(š•– š‘‡
š•–)
šœ•Ī˜
= šŸŽ
šœ•(š•– š‘‡
š•–)
šœ•Ī˜
= āˆ’2š‘‹ š‘‡
š•Ŗ + 2š‘‹ š‘‡
š‘‹Ī˜ = šŸŽ
š‘‹ š‘‡
š‘‹ššÆ = š‘‹ š‘‡
š•Ŗ ššÆ = š‘‹ š‘‡
š‘‹ āˆ’1
š‘‹ š‘‡
š•ŖĖ†
ģ •ź·œė°©ģ •ģ‹
š•Ŗ = š‘‹Ī˜ + š•– š•– = š•Ŗ āˆ’ š‘‹Ī˜
Minimize šœ–š‘—
2
š‘›
š‘—=1
What is Gradient Descent?

In machine learning the parameters (θ_0, θ_1 in linear regression) are usually vectors with tens to hundreds of dimensions, and there is no guarantee that the objective function (Σ ε_i^2 in linear regression) is differentiable everywhere.
So there are quite a few situations where the solution cannot be obtained by a single closed-form derivation.
In such cases we use a numerical method that starts from an initial solution and repeatedly improves it. (Derivatives are used.)
The iterative scheme, as a flowchart:
1. Set an initial solution α_0 and let t = 0.
2. Is α_t satisfactory? If yes, stop and return α̂ = α_t.
3. If not, update α_{t+1} = U(α_t), set t = t + 1, and go back to step 2.
Gradient Descent: from the current position, find the direction in which the surface descends most steeply, move a small step in that direction to obtain a new position, and repeat this process to reach the lowest point (the minimum).

Gradient Ascent: from the current position, find the direction in which the surface ascends most steeply, move a small step in that direction to obtain a new position, and repeat this process to reach the highest point (the maximum).
Gradient Descent update:

α_{t+1} = α_t āˆ’ ρ Ā· ∂J/∂α |_{α_t}

where J is the objective function and ∂J/∂α |_{α_t} is the value of the derivative ∂J/∂α at α_t.

[Figure: J plotted against α, showing α_t and the next point α_{t+1} reached by the step āˆ’Ļ Ā· ∂J/∂α|_{α_t}.]

Why the minus sign? In the figure the derivative at α_t is negative, so if we added ∂J/∂α|_{α_t} we would move to the left, that is, in the direction in which the objective function increases. Therefore we subtract ∂J/∂α|_{α_t}, and we multiply by a suitable step size ρ so that we move only a little at a time.
Gradient Descent:  α_{t+1} = α_t āˆ’ ρ Ā· ∂J/∂α |_{α_t}
Gradient Ascent:   α_{t+1} = α_t + ρ Ā· ∂J/∂α |_{α_t}

(J is the objective function; ∂J/∂α |_{α_t} is the value of ∂J/∂α at α_t.)

Gradient Descent and Gradient Ascent are typical greedy algorithms: they pick the most advantageous next position given only the current situation, without considering the past or the future, so they may end up at a local optimum.
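A minimal sketch of the generic update α_{t+1} = α_t āˆ’ ρ Ā· ∂J/∂α|_{α_t}, using J(α) = (α āˆ’ 3)^2 as a stand-in objective (my choice, not from the slides):

```python
def dJ(a):                 # derivative of J(a) = (a - 3)**2
    return 2.0 * (a - 3.0)

a, rho = 10.0, 0.1         # initial solution alpha_0 and step size rho
for t in range(100):
    step = rho * dJ(a)
    a = a - step           # move a little against the gradient
    if abs(step) < 1e-8:   # "is alpha_t satisfactory?"
        break
print(a)                   # converges toward the minimizer a = 3
```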
š½ Ī˜ =
1
2
šœƒ0 + šœƒ1 š‘„š‘– āˆ’ š‘¦š‘–
2
š‘›
š‘–=1
=
1
2
Ī˜ š‘‡ š•©š‘– āˆ’ š‘¦š‘–
2
š‘›
š‘–=1
š•©š‘– = 1, š‘„š‘–
š‘‡
, Ī˜ = šœƒ0, šœƒ1
š‘‡
, š•Ŗ = š‘¦1, š‘¦2, ā€¦ , š‘¦š‘›
š‘‡
, š‘‹ =
1
1
ā€¦
š‘„1
š‘„2
ā€¦
1 š‘„ š‘›
, š•– = (šœ–1, ā€¦ , šœ– š‘›) ė¼ź³  ķ•˜ģž.
šœƒ0
š‘”+1
= šœƒ0
š‘”
āˆ’ š›¼
šœ•
šœ•šœƒ0
š½(Ī˜)š‘”
šœƒ1
š‘”+1
= šœƒ1
š‘”
āˆ’ š›¼
šœ•
šœ•šœƒ1
š½(Ī˜)š‘”
šœƒ0ģ˜ š‘”ė²ˆģ§ø ź°’ģ„,
š½(Ī˜)ė„¼ šœƒ0ģœ¼ė”œ ėÆøė¶„ķ•œ ģ‹ģ—ė‹¤ź°€ ėŒ€ģž….
ź·ø ķ›„ģ—, ģ“ ź°’ģ„ šœƒ0ģ—ģ„œ ė¹¼ ģ¤Œ.
ėÆøė¶„ķ•  ė•Œ ģ“ģš©.
Gradient descentė„¼ ģ¤‘ģ§€ķ•˜ėŠ”
źø°ģ¤€ģ“ ė˜ėŠ” ķ•Øģˆ˜
š½ Ī˜ =
1
2
šœƒ0 + šœƒ1 š‘„š‘– āˆ’ š‘¦š‘–
2
š‘›
š‘–=1
=
1
2
Ī˜ š‘‡ š•©š‘– āˆ’ š‘¦š‘–
2
š‘›
š‘–=1
š•©š‘– = 1, š‘„š‘–
š‘‡
, Ī˜ = šœƒ0, šœƒ1
š‘‡
, š•Ŗ = š‘¦1, š‘¦2, ā€¦ , š‘¦š‘›
š‘‡
, š‘‹ =
1
1
ā€¦
š‘„1
š‘„2
ā€¦
1 š‘„ š‘›
, š•– = (šœ–1, ā€¦ , šœ– š‘›) ė¼ź³  ķ•˜ģž.
Gradient of š½(Ī˜)
šœ•
šœ•šœƒ0
š½ šœƒ = (Ī˜ š‘‡ š•©š‘– āˆ’ š‘¦š‘–)
š‘›
š‘–=1
1
šœ•
šœ•šœƒ1
š½ šœƒ = (Ī˜ š‘‡ š•©š‘– āˆ’ š‘¦š‘–)
š‘›
š‘–=1
š‘„š‘–
š›»š½ Ī˜ =
šœ•
šœ•šœƒ0
š½ Ī˜ ,
šœ•
šœ•šœƒ1
š½ Ī˜
š‘‡
= Ī˜ š‘‡
š•©š‘– āˆ’ š‘¦š‘– š•©š‘–
š‘›
š‘–=1
š•©š‘– = 1, š‘„š‘–
š‘‡
, Ī˜ = šœƒ0, šœƒ1
š‘‡
, š•Ŗ = š‘¦1, š‘¦2, ā€¦ , š‘¦š‘›
š‘‡
, š‘‹ =
1
1
ā€¦
š‘„1
š‘„2
ā€¦
1 š‘„ š‘›
, š•– = (šœ–1, ā€¦ , šœ– š‘›) ė¼ź³  ķ•˜ģž.
šœƒ0
š‘”+1
= šœƒ0
š‘”
āˆ’ š›¼ (Ī˜ š‘‡ š•©š‘– āˆ’ š‘¦š‘–)
š‘›
š‘–=1
1
ė‹Ø, ģ“ ė•Œģ˜ Ī˜ģžė¦¬ģ—ėŠ”
š‘”ė²ˆģ§øģ— ģ–»ģ–“ģ§„ Ī˜ź°’ģ„ ėŒ€ģž…ķ•“ģ•¼ ķ•œė‹¤.
šœƒ1
š‘”+1
= šœƒ1
š‘”
āˆ’ š›¼ Ī˜ š‘‡
š•©š‘– āˆ’ š‘¦š‘– š‘„š‘–
š‘›
š‘–=1
Steepest Descent

Θ^{(t+1)} = Θ^{(t)} āˆ’ α Σ_{i=1}^{n} {(Θ^{(t)})^T x_i āˆ’ y_i} x_i

Pros: easy to implement, conceptually clean, guaranteed convergence.
Cons: often slow converging.

Normal Equations

Θ̂ = (X^T X)^{āˆ’1} X^T y

Pros: a single-shot algorithm! Easiest to implement.
Cons: need to compute the pseudo-inverse (X^T X)^{āˆ’1}, expensive, numerical issues (e.g., the matrix may be singular), although there are ways to get around this.
Multivariate Linear Regression

y = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n

Simple linear regression has a single input variable; multiple (multivariate) linear regression has two or more.
Example: predict Google's stock price (y) from Yahoo's stock price (x_1) and Microsoft's stock price (x_2).
š’š = šœ½ šŸŽ + šœ½ šŸ š’™ šŸ
šŸ + šœ½ šŸ š’™ šŸ
šŸ’ + š
ģ˜ˆė„¼ ė“¤ģ–“, ģ•„ėž˜ģ™€ ź°™ģ€ ģ‹ģ„ ģ„ ķ˜•ģœ¼ė”œ ģƒź°ķ•˜ģ—¬ ķ’€ ģˆ˜ ģžˆėŠ”ź°€?
ė¬¼ė” , input ė³€ģˆ˜ź°€ polynomial(ė‹¤ķ•­ģ‹)ģ˜ ķ˜•ķƒœģ“ģ§€ė§Œ, coefficients šœƒš‘–ź°€
ģ„ ķ˜•(linear)ģ“ėƀė”œ ģ„ ķ˜• ķšŒź·€ ė¶„ģ„ģ˜ ķ•“ė²•ģœ¼ė”œ ķ’€ ģˆ˜ ģžˆė‹¤.
ššÆ = š‘‹ š‘‡ š‘‹ āˆ’1 š‘‹ š‘‡ š•ŖĖ†
šœƒ0, šœƒ1, ā€¦ , šœƒ š‘›
š‘‡
General Linear Regression

Multiple regression:  y = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
General regression:   y = θ_0 + θ_1 g_1(x_1) + θ_2 g_2(x_2) + ... + θ_n g_n(x_n)

Here g_j can be a function such as x^j, (x āˆ’ μ_j)/(2σ_j), or 1/(1 + exp(āˆ’s_j x)), and so on.
This, too, can be solved with the linear-regression solution method.
š‘¤ š‘‡
= (š‘¤0, š‘¤1, ā€¦ , š‘¤ š‘›)
šœ™ š‘„ š‘– š‘‡
= šœ™0 š‘„ š‘–
, šœ™1 š‘„ š‘–
, ā€¦ , šœ™ š‘› š‘„ š‘–
š‘¤ š‘‡
= (š‘¤0, š‘¤1, ā€¦ , š‘¤ š‘›)
šœ™ š‘„ š‘– š‘‡
= šœ™0 š‘„ š‘–
, šœ™1 š‘„ š‘–
, ā€¦ , šœ™ š‘› š‘„ š‘–
normal equation
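A sketch of the general case: pick basis functions φ_j, build the design matrix whose i-th row is φ(x^(i))^T, and solve the same least-squares problem for w. The particular basis functions and data below are arbitrary illustrative choices.

```python
import numpy as np

def phi(x):
    """Basis functions: constant, x, and a sigmoid feature."""
    return np.array([1.0, x, 1.0 / (1.0 + np.exp(-5.0 * x))])

x = np.linspace(-2, 2, 40)
y = 0.5 + 1.5 / (1.0 + np.exp(-5.0 * x))            # made-up target: w = (0.5, 0, 1.5)

Phi = np.array([phi(xi) for xi in x])                # design matrix, rows φ(x^(i))^T
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # w from the normal equation
print(w)                                             # ā‰ˆ [0.5, 0, 1.5]
```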
[ ģžė£Œģ˜ ė¶„ģ„ ]
ā‘  ėŖ©ģ  : ģ§‘ģ„ ķŒ”źø° ģ›ķ•Ø. ģ•Œė§žģ€ ź°€ź²©ģ„ ģ°¾źø° ģ›ķ•Ø.
ā‘” ź³ ė ¤ķ•  ė³€ģˆ˜(feature) : ģ§‘ģ˜ ķ¬źø°(in square feet), ģ¹Øģ‹¤ģ˜ ź°œģˆ˜, ģ§‘ ź°€ź²©
(ģ¶œģ²˜ : http://aimotion.blogspot.kr/2011/10/machine-learning-with-python-linear.html)
ā‘¢ ģ£¼ģ˜ģ‚¬ķ•­ : ģ§‘ģ˜ ķ¬źø°ģ™€ ģ¹Øģ‹¤ģ˜ ź°œģˆ˜ģ˜ ģ°Øģ“ź°€ ķ¬ė‹¤. ģ˜ˆė„¼ ė“¤ģ–“, ģ§‘ģ˜ ķ¬źø°ź°€ 4000 square feetģøė°,
ģ¹Øģ‹¤ģ˜ ź°œģˆ˜ėŠ” 3ź°œģ“ė‹¤. ģ¦‰, ė°ģ“ķ„° ģƒ featureė“¤ ź°„ ź·œėŖØģ˜ ģ°Øģ“ź°€ ķ¬ė‹¤. ģ“ėŸ“ ź²½ģš°,
featureģ˜ ź°’ģ„ ģ •ź·œķ™”(normalizing)ė„¼ ķ•“ģ¤€ė‹¤. ź·øėž˜ģ•¼, Gradient Descentė„¼ ģˆ˜ķ–‰ķ•  ė•Œ,
ź²°ź³¼ź°’ģœ¼ė”œ ė¹ ė„“ź²Œ ģˆ˜ė “ķ•˜ė‹¤.
ā‘£ ģ •ź·œķ™”ģ˜ ė°©ė²•
- featureģ˜ mean(ķ‰ź· )ģ„ źµ¬ķ•œ ķ›„, featureė‚“ģ˜ ėŖØė“  dataģ˜ ź°’ģ—ģ„œ meanģ„ ė¹¼ģ¤€ė‹¤.
- dataģ—ģ„œ meanģ„ ė¹¼ ģ¤€ ź°’ģ„, ź·ø dataź°€ ģ†ķ•˜ėŠ” standard deviation(ķ‘œģ¤€ ķŽøģ°Ø)ė”œ ė‚˜ėˆ„ģ–“ ģ¤€ė‹¤. (scaling)
ģ“ķ•“ź°€ ģ•ˆ ė˜ė©“, ģš°ė¦¬ź°€ ź³ ė“±ķ•™źµ ė•Œ ė°°ģ› ė˜ ģ •ź·œė¶„ķ¬ė„¼ ķ‘œģ¤€ģ •ź·œė¶„ķ¬ė”œ ė°”ź¾øģ–“ģ£¼ėŠ” ź²ƒģ„ ė– ģ˜¬ė ¤ė³“ģž.
ķ‘œģ¤€ģ •ź·œė¶„ķ¬ė„¼ ģ‚¬ģš©ķ•˜ėŠ” ģ“ģœ  ģ¤‘ ķ•˜ė‚˜ėŠ”, ģ„œė”œ ė‹¤ė„ø ė‘ ė¶„ķ¬, ģ¦‰ ė¹„źµź°€ ė¶ˆź°€ėŠ„ķ•˜ź±°ė‚˜ ģ–“ė ¤ģš“ ė‘ ė¶„ķ¬ė„¼ ģ‰½ź²Œ
ė¹„źµķ•  ģˆ˜ ģžˆź²Œ ķ•“ģ£¼ėŠ” ź²ƒģ“ģ—ˆė‹¤.
š‘ =
š‘‹ āˆ’ šœ‡
šœŽ
If š‘‹~(šœ‡, šœŽ) then š‘~š‘(1,0)
References
1. http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture5-LiR.pdf
2. http://www.cs.cmu.edu/~10701/lecture/RegNew.pdf
3. ķšŒź·€ė¶„ģ„ (Regression Analysis), 3rd ed., ė°•ģ„±ķ˜„
4. ķŒØķ„“ģøģ‹ (Pattern Recognition), ģ˜¤ģ¼ģ„
5. ģˆ˜ė¦¬ķ†µź³„ķ•™ (Mathematical Statistics), 3rd ed., ģ „ėŖ…ģ‹
Appendix: Laplacian Smoothing

Consider a multinomial random variable z that can take the values 1 to k.
We have m independent observations z^(1), ..., z^(m).
From these observations we want to estimate p(z = i) for i = 1, ..., k.

The maximum likelihood estimate is

p(z = j) = ( Σ_{i=1}^{m} I{z^(i) = j} ) / m

where I{Ā·} is the indicator function; we estimate using the frequency of each value within the observations.

One thing to keep in mind: what we are estimating is the population parameter p(z = i); the sample is only used to estimate it.
For example, if z^(i) ≠ 3 for all i = 1, ..., m, then the estimate becomes p(z = 3) = 0.
Statistically this is not a good idea: setting a population parameter to 0 merely because the value never appears in the sample is a bad idea. (This is a weakness of the MLE.)
ģ“ź²ƒģ„ ź·¹ė³µķ•˜źø° ģœ„ķ•“ģ„œėŠ”,
ā‘  ė¶„ģžź°€ 0ģ“ ė˜ģ–“ģ„œėŠ” ģ•ˆ ėœė‹¤.
ā‘” ģ¶”ģ • ź°’ģ˜ ķ•©ģ“ 1ģ“ ė˜ģ–“ģ•¼ ķ•œė‹¤. š‘ š‘§ = š‘—š‘§ =1 (āˆµ ķ™•ė„ ģ˜ ķ•©ģ€ 1ģ“ ė˜ģ–“ģ•¼ ķ•Ø)
ė”°ė¼ģ„œ,
š’‘ š’› = š’‹ =
š‘° š’› š’Š
= š’‹ + šŸš’Ž
š’Š=šŸ
š’Ž + š’Œ
ģ“ė¼ź³  ķ•˜ģž.
ā‘ ģ˜ ģ„±ė¦½ : test set ė‚“ģ— š‘—ģ˜ ź°’ģ“ ģ—†ģ–“ė„, ķ•“ė‹¹ ģ¶”ģ • ź°’ģ€ 0ģ“ ė˜ģ§€ ģ•ŠėŠ”ė‹¤.
ā‘”ģ˜ ģ„±ė¦½ : š‘§(š‘–)
= š‘—ģø dataģ˜ ģˆ˜ė„¼ š‘›š‘—ė¼ź³  ķ•˜ģž. š‘ š‘§ = 1 =
š‘›1+1
š‘š+š‘˜
, ā€¦ , š‘ š‘§ = š‘˜ =
š‘› š‘˜+1
š‘š+š‘˜
ģ“ė‹¤. ź° ģ¶”ģ • ź°’ģ„ ė‹¤ ė”ķ•˜ź²Œ ė˜ė©“ 1ģ“ ė‚˜ģ˜Øė‹¤.
ģ“ź²ƒģ“ ė°”ė”œ Laplacian smoothingģ“ė‹¤.
š‘§ź°€ ė  ģˆ˜ ģžˆėŠ” ź°’ģ“ 1ė¶€ķ„° š‘˜ź¹Œģ§€ ź· ė“±ķ•˜ź²Œ ė‚˜ģ˜¬ ģˆ˜ ģžˆė‹¤ėŠ” ź°€ģ •ģ“ ģ¶”ź°€ė˜ģ—ˆė‹¤ź³ 
ģ§ź“€ģ ģœ¼ė”œ ģ•Œ ģˆ˜ ģžˆė‹¤.