Support Vector Machine
Easy Tutorial (I wish)
(1) y(x) = wᵀφ(x) + b
Two assumptions used throughout this tutorial:
1. The data, once mapped into feature space by φ, is linearly separable.
2. The labels t_n are +1 and −1.
(1) y(x) = wᵀφ(x) + b

(2) The distance from a feature vector x_n to the decision hyperplane:

    t_n (wᵀφ(x_n) + b) / ‖w‖    (by (1))
  = t_n y(x_n) / ‖w‖            (∵ we assume linear separability, so t_n y(x_n) > 0)
Perceptron algorithm

w′ = w + t_n x_n    if t_n ≠ sign{y(x_n)}

Whenever a mistake is made, the sample is added to the normal vector of the hyperplane, tilting the hyperplane toward classifying that sample correctly. This is why the perceptron is called a mistake-driven algorithm.
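As a concrete illustration (not from the slides; the function name and data handling are assumptions), a minimal Python sketch of this mistake-driven loop:

```python
import numpy as np

def perceptron(X, t, n_epochs=100):
    """Mistake-driven perceptron: w' = w + t_n x_n on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # mistake: t_n != sign(y(x_n))
                w = w + t_n * x_n      # move w toward the misclassified sample
                mistakes += 1
        if mistakes == 0:              # a full clean pass: converged
            break
    return w
```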
Perceptron, convergence

θ^(k): the parameter vector after the k-th update.

Claim: the perceptron algorithm indeed converges in a finite number of updates.
Assume that

  ‖x_t‖ ≤ R for all t and some finite R, and

  ∃ γ > 0 s.t. y_t (θ*)ᵀ x_t ≥ γ for some separator θ*.
Convergence proof

θ′ = θ + y_t x_t    if y_t ≠ f(x_t; θ)

(θ*)ᵀθ^(k) = (θ*)ᵀθ^(k−1) + y_t (θ*)ᵀx_t
           ≥ (θ*)ᵀθ^(k−1) + γ
           = (θ*)ᵀ(θ^(k−2) + y_t x_t) + γ
           = (θ*)ᵀθ^(k−2) + y_t (θ*)ᵀx_t + γ
           ≥ (θ*)ᵀθ^(k−2) + 2γ
           …
           ≥ kγ

‖θ^(k)‖² = ‖θ^(k−1) + y_t x_t‖²
         = ‖θ^(k−1)‖² + 2 y_t (θ^(k−1))ᵀx_t + ‖x_t‖²
         ≤ ‖θ^(k−1)‖² + ‖x_t‖²    (the cross term is ≤ 0, since x_t was misclassified)
         ≤ ‖θ^(k−1)‖² + R²
         …
         ≤ kR²

1 ≥ cos(θ*, θ^(k)) = (θ*)ᵀθ^(k) / (‖θ^(k)‖ ‖θ*‖) ≥ kγ / (√(kR²) ‖θ*‖)

so √k · γ ≤ R‖θ*‖, or k ≤ R² ‖θ*‖² / γ².
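A toy numerical check of this bound (everything below, from the separator θ* to the sampled data, is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -1.0])            # a known separator
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ theta_star) > 0.2]           # enforce a positive margin gamma
y = np.sign(X @ theta_star)

theta, k = np.zeros(2), 0                     # run the perceptron, counting updates
changed = True
while changed:
    changed = False
    for x_t, y_t in zip(X, y):
        if y_t * (theta @ x_t) <= 0:          # mistake on x_t
            theta, k, changed = theta + y_t * x_t, k + 1, True

R = np.max(np.linalg.norm(X, axis=1))         # ||x_t|| <= R
gamma = np.min(y * (X @ theta_star))          # y_t (theta*)^T x_t >= gamma
print(k, "<=", R**2 * (theta_star @ theta_star) / gamma**2)
```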
Support vectors and the margin

The samples closest to the decision hyperplane are called support vectors, and the distance from a support vector to the hyperplane is called the margin.
The motivation and intuition of the SVM: among all separating hyperplanes, choose the one with the largest margin, since a larger margin should give a smaller generalization error → the maximal margin classifier!!

(3) arg max_{w,b} { (1/‖w‖) min_n [ t_n (wᵀφ(x_n) + b) ] }

Solving (3) directly is too unwieldy, so we convert it into an equivalent, easier problem.
Linear Classifiers

[Four slides of figures: scatter plots of labeled samples, where one marker denotes +1 and the other denotes −1, classified through the pipeline x → f → y_est.]

f(x, w, b) = sign(w·x − b)

How would you classify this data? Any of these would be fine.. ..but which is best?
Simplifying the SVM optimization problem

The distance t_n (wᵀφ(x_n) + b) / ‖w‖ is unchanged if we rescale w → kw, b → kb:

t_n (kwᵀφ(x_n) + kb) / ‖kw‖ = t_n k (wᵀφ(x_n) + b) / (k‖w‖) = t_n (wᵀφ(x_n) + b) / ‖w‖

So we may fix the scale of w and b by requiring, for the point closest to the hyperplane,

(4) t_n (wᵀφ(x_n) + b) = 1,

and therefore, for every data point (since the data is linearly separable),

(5) t_n (wᵀφ(x_n) + b) ≥ 1.

With this choice, the distance (2) of the closest point, by (1), becomes

t_n (wᵀφ(x_n) + b) / ‖w‖ = 1/‖w‖ = margin.

Maximizing 1/‖w‖ is the same as minimizing ‖w‖, and it is convenient to minimize (1/2)‖w‖² instead (the square and the 1/2 simplify the derivatives):

(6) arg min_{w,b} (1/2)‖w‖² ; this is the problem we actually solve.
The SVM optimization problem

arg min_{w,b} (1/2)‖w‖²    subject to t_n (wᵀφ(x_n) + b) ≥ 1

This is the SVM optimization problem!!!!!!
Since the objective is quadratic and the constraints are linear, this is not linear programming but quadratic programming (QP).
To convert this QP into its dual form, we use the method of Lagrange multipliers.
Why convert to the dual form?
-> it lets us apply the kernel trick,
-> and the conversion is possible because the original problem is convex.
Convex vs. non-convex problems:
-> for a convex problem, any point where the gradient is 0 is guaranteed to be the global minimum.
Lagrange multipliers

g(x + ε) ≈ g(x) + εᵀ∇g(x)    (first-order Taylor expansion)
If x and x + ε both lie on the constraint surface, then g(x) = g(x + ε), so εᵀ∇g(x) ≈ 0.
Taking |ε| → 0, εᵀ∇g(x) = 0, i.e., ∇g(x) is orthogonal to the constraint surface g(x) = 0.
At a constrained optimum, ∇f is also orthogonal to the surface, so ∇f is parallel to ∇g(x):
∇f + λ∇g(x) = 0, where λ ≠ 0.
L(x, λ) = f + λg(x)    (the Lagrangian; generalized by the KKT conditions)
ex) L(x, λ) = 1 − x₁² − x₂² + λ(x₁ + x₂ − 1)
  −2x₁ + λ = 0
  −2x₂ + λ = 0
  x₁ + x₂ − 1 = 0
⇒ x₁ = x₂ = 1/2 ; the constrained optimum.
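A quick check of this worked example, sketched with sympy (my own addition, not part of the slides):

```python
# Solve the stationarity equations of L(x, lam) = 1 - x1^2 - x2^2 + lam*(x1 + x2 - 1).
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
L = 1 - x1**2 - x2**2 + lam * (x1 + x2 - 1)
sol = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam))
print(sol)   # {x1: 1/2, x2: 1/2, lam: 1}
```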
Converting the SVM problem to dual form (the dual of the SVM): introduce one multiplier a_n per constraint.

(7) L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n [t_n (wᵀφ(x_n) + b) − 1]    (Lagrangian of the primal)

∂L(w,b,a)/∂w = w − Σ_{n=1}^N a_n t_n φ(x_n) = 0  ⇒  (8) w = Σ_{n=1}^N a_n t_n φ(x_n)

∂L(w,b,a)/∂b = −Σ_{n=1}^N a_n t_n = 0  ⇒  (9) 0 = Σ_{n=1}^N a_n t_n    (setting the derivatives of (7) to zero)

Substituting (8) and (9) back into (7):

L(w, b, a) = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)ᵀφ(x_m) − Σ_{n=1}^N a_n t_n wᵀφ(x_n) − Σ_{n=1}^N a_n t_n b + Σ_{n=1}^N a_n

⇒ (10) L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)ᵀφ(x_m)
             = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

(11) a_n ≥ 0 ; the constraints of (10)
(12) Σ_{n=1}^N a_n t_n = 0 ; from (9)
Why convert to the dual form?
The kernel trick: the key advantage of the SVM.

(7) L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n [t_n (wᵀφ(x_n) + b) − 1] ; the Lagrangian of the primal

→ solving this QP (over w) costs roughly O(M³), where M is the number of features.

(10) L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m) ; the dual obtained by substituting (8), (9) into (7)

→ solving this QP (over a) costs roughly O(N³), where N is the number of data points.

If M > N, the dual form is the cheaper problem to solve.
Moreover, M can even be infinite, and the dual can still be solved, because the kernel function k(x_n, x_m) computes the inner product φ(x_n)ᵀφ(x_m) without ever forming φ. This is the kernel trick.
e.g.) With 3 samples and M = 10,000,000 features, the Gram matrix Φ(X)Φ(X)ᵀ is (3×10000000)(10000000×3) = 3×3!!!!!!! The kernel gives those nine entries directly.
Using the SVM

(1) y(x) = wᵀφ(x) + b and (8) w = Σ_{n=1}^N a_n t_n φ(x_n) give

(13) y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b ;

so learning an SVM amounts to finding a_n and b.

L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

subject to a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 ;

finding the a_n means solving a QP.
QP is not a problem specific to SVMs; well-studied QP solvers already exist and we can simply use one, e.g., SMO (sequential minimal optimization) or MATLAB's 'quadprog'.
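For concreteness, a minimal sketch (assumed helper names, not the tutorial's own code) of handing the dual (10)-(12) to a QP solver, here cvxopt, and recovering b via (18). cvxopt minimizes (1/2)aᵀPa + qᵀa subject to Ga ≤ h, Aa = b, so the dual objective is negated to turn the maximization into a minimization:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard(X, t, kernel=np.dot):
    t = np.asarray(t, dtype=float)
    N = len(t)
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    P = matrix(np.outer(t, t) * K)                   # P_nm = t_n t_m k(x_n, x_m)
    q = matrix(-np.ones(N))                          # minimize -sum_n a_n
    G = matrix(-np.eye(N)); h = matrix(np.zeros(N))  # (11): a_n >= 0
    A = matrix(t.reshape(1, -1)); b0 = matrix(0.0)   # (12): sum_n a_n t_n = 0
    a = np.array(solvers.qp(P, q, G, h, A, b0)['x']).ravel()
    sv = a > 1e-6                                    # support vectors: a_n > 0, by (16)
    b = np.mean([t[n] - np.sum(a[sv] * t[sv] * K[n, sv])
                 for n in np.where(sv)[0]])          # (18), averaged over the SVs
    return a, b
```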
(7) L(w, b, a) = (1/2)‖w‖² − Σ_{n=1}^N a_n [t_n y(x_n) − 1]    (Lagrangian of the primal)

The solution must satisfy the following (KKT conditions):
(14) a_n ≥ 0
(15) t_n y(x_n) − 1 ≥ 0
(16) a_n [t_n y(x_n) − 1] = 0 ;
     a_n = 0 or t_n y(x_n) = 1

→ For every data point, either a_n = 0 or x_n is a support vector.
→ That is, solving the QP also identifies the SVs.
→ And the SVs alone are enough to classify new points via (13).
Finding b

(13) y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b, and every support vector satisfies t_n y(x_n) = 1, so

(17) t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1 ;

multiplying (17) by t_n (using t_n² = 1) and averaging over the N_S support vectors:

(18) b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) )

Averaging over all the support vectors, rather than using a single one, gives a numerically stable value.
When the data is not linearly separable

We have assumed the samples become linearly separable after the mapping. For data that is not, one could keep mapping into ever higher dimensions until it becomes separable, but such a classifier easily overfits. Instead, we keep the mapping and allow the classifier to make some mistakes: samples may violate the margin, at a price.
def. slack: ξ_n = |t_n − y(x_n)| ; one slack variable per data point

(20) t_n y(x_n) ≥ 1 − ξ_n, n = 1, …, N ; this relaxes constraint (5).

ξ_n = 0: correctly classified; 0 < ξ_n ≤ 1: inside the margin but still correctly classified; ξ_n > 1: misclassified.

(21) C Σ_{n=1}^N ξ_n + (1/2)‖w‖² ; minimize the slack penalty and maximize the margin at the same time,
where C trades one against the other ; compare the separable objective (6), arg min_{w,b} (1/2)‖w‖².

(22) L(w, b, ξ, a, μ) = (1/2)‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n [t_n y(x_n) − 1 + ξ_n] − Σ_{n=1}^N μ_n ξ_n

KKT conditions:
(23) a_n ≥ 0
(24) t_n y(x_n) − 1 + ξ_n ≥ 0
(25) a_n (t_n y(x_n) − 1 + ξ_n) = 0
(26) μ_n ≥ 0
(27) ξ_n ≥ 0
(28) μ_n ξ_n = 0
The non-separable case, continued

Setting the derivatives of (22) to zero:

(29) ∂L/∂w = 0 ⇒ w = Σ_{n=1}^N a_n t_n φ(x_n)
(30) ∂L/∂b = 0 ⇒ Σ_{n=1}^N a_n t_n = 0
(31) ∂L/∂ξ_n = 0 ⇒ a_n = C − μ_n

Substituting these into (22):

(32) L(w, b, ξ, a, μ) = (1/2)‖w‖² + C Σ ξ_n − Σ a_n t_n y(x_n) + Σ a_n − Σ a_n ξ_n − Σ μ_n ξ_n
   = (1/2)‖w‖² − Σ a_n t_n y(x_n) + Σ ξ_n (C − a_n − μ_n) + Σ a_n

⇒ L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m) ;

the same dual as before; only the constraints change:

(33) a_n = C − μ_n (31) and a_n ≥ 0 (23) and μ_n ≥ 0 (26) ⇒ 0 ≤ a_n ≤ C
(34) Σ_{n=1}^N a_n t_n = 0 (by (30))
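In code, only the inequality block of the hard-margin QP sketch shown earlier changes. A sketch of the box constraint (33) in cvxopt's Ga ≤ h form (helper name assumed):

```python
import numpy as np
from cvxopt import matrix

def soft_margin_box(N, C):
    """Encode (33), 0 <= a_n <= C, as G a <= h for the dual QP."""
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -a_n <= 0 and a_n <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    return G, h
```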
Example.

ω₁: (1, 5)ᵗ, (−2, −4)ᵗ
ω₂: (2, 3)ᵗ, (−1, 5)ᵗ

Find the SVM classifier separating the two classes above, using the polynomial kernel

k(x, z) = (1 + xᵀz)² = (1 + x₁z₁ + x₂z₂)²
        = 1 + 2x₁z₁ + 2x₂z₂ + x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂²
        = (1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²) (1, √2·z₁, √2·z₂, z₁², √2·z₁z₂, z₂²)ᵀ
        = φ(x)ᵀφ(z)
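A quick numeric check of this identity (my own sketch; φ is the 6-vector written above, and the two test points are sample points from this slide):

```python
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, np.sqrt(2)*x1*x2, x2**2])

x, z = np.array([1.0, 5.0]), np.array([2.0, 3.0])
assert np.isclose((1 + x @ z)**2, phi(x) @ phi(z))   # both equal 324.0
```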
L̃(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m)

subject to a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 ;

for a problem this small, we can hand the QP to a solver or work it out by hand with Lagrange multipliers.

(8) w = Σ_{n=1}^N a_n t_n φ(x_n) ;

with the multipliers a_n in hand, recover w from (8) and b from (18), which gives the separating hyperplane and the support vectors for the example.
References
1. ???
2. MIT OCW, 6.867 Machine Learning (Fall 2006) lecture notes:
http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Fall-2006/LectureNotes/index.htm
3. Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization.

Jung-Kyu Lee @ MLLAB
SVM revisited
• The (dual) optimization problem we covered in SVM class:

max_α W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j ⟨x_i, x_j⟩
s.t. 0 ≤ α_i ≤ C, i = 1, …, m
     Σ_{i=1}^m α_i y_i = 0

• Solving this optimization problem for large-scale applications with generic quadratic programming is computationally very expensive.
• John Platt, at Microsoft, proposed an efficient algorithm called SMO (Sequential Minimal Optimization) to actually solve this optimization problem.
Digression: Coordinate ascent (1/3)
• Consider trying to solve the following unconstrained optimization problem:

max_α W(α₁, α₂, …, α_m)

• Coordinate ascent: optimize over one α_i at a time, holding all the others fixed, and sweep the coordinates until convergence (a sketch follows below).
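A minimal sketch of that loop (the objective W below is an assumed toy quadratic, maximized at (0,0), matching the example on the next slide):

```python
import numpy as np

# W(a1, a2) = -(a1^2 + a2^2 + a1*a2); each inner step maximizes W exactly
# along one coordinate while holding the other fixed.
def coordinate_ascent(a, n_sweeps=50):
    for _ in range(n_sweeps):
        a[0] = -a[1] / 2.0    # dW/da1 = -(2*a1 + a2) = 0  =>  a1 = -a2/2
        a[1] = -a[0] / 2.0    # dW/da2 = -(2*a2 + a1) = 0  =>  a2 = -a1/2
    return a

print(coordinate_ascent(np.array([2.0, -1.0])))    # converges to ~[0, 0]
```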
Digression: Coordinate ascent (2/3)
• How does coordinate ascent work?
• We optimize a quadratic function; its minimum is at (0,0).
• First, minimize it w.r.t. α₁;
• next, minimize it w.r.t. α₂;
• with the same argument, iterate until convergence.
Digression: Coordinate ascent (3/3)
• Coordinate ascent will usually take many more steps than other iterative optimization algorithms such as gradient descent, Newton's method, conjugate gradient descent, etc.
• However, there are many optimization problems for which it's particularly easy to fix all but one of the parameters and optimize with respect to just that one parameter.
• It turns out that this is true when we modify this algorithm to solve the SVM optimization problem.
SMO
• Coordinate ascent in its basic form cannot be applied to the SVM dual optimization problem:

max_α W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j ⟨x_i, x_j⟩
s.t. 0 ≤ α_i ≤ C, i = 1, …, m
     Σ_{i=1}^m α_i y_i = 0

• As coordinate ascent progresses, we can't change a single α_i without violating the constraint Σ_{i=1}^m α_i y_i = 0:

α₁ y₁ = −Σ_{i=2}^m α_i y_i
⇒ α₁ = −y₁ Σ_{i=2}^m α_i y_i    (∵ y₁ ∈ {1, −1}, so y₁² = 1)

so α₁ is completely determined by the remaining α's.
Outline of SMO
• The SMO (sequential minimal optimization) algorithm therefore changes two α's at a time instead of one, so that the constraints keep holding.
• The term 'minimal' refers to the fact that two is the smallest number of α's we can change together.
Convergence of SMO
• To test for convergence of the SMO algorithm, we can check whether the KKT conditions are satisfied to within some tolerance:

α_i = 0 ⇒ y_i (wᵀx_i + b) ≥ 1
α_i = C ⇒ y_i (wᵀx_i + b) ≤ 1
0 < α_i < C ⇒ y_i (wᵀx_i + b) = 1

• The key reason that SMO is an efficient algorithm is that the update of α_i, α_j can be computed very efficiently.
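As a sketch (all names here are my assumptions), that tolerance test might look like:

```python
import numpy as np

def kkt_satisfied(alpha, y, f_vals, C, tol=1e-3):
    """f_vals[i] = w^T x_i + b; checks the three conditions above within tol."""
    m = y * f_vals
    ok_zero = (alpha > tol) | (m >= 1 - tol)            # alpha_i = 0  =>  m_i >= 1
    ok_C    = (alpha < C - tol) | (m <= 1 + tol)        # alpha_i = C  =>  m_i <= 1
    ok_mid  = (alpha <= tol) | (alpha >= C - tol) | (np.abs(m - 1) <= tol)
    return bool(np.all(ok_zero & ok_C & ok_mid))
```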
Detail of SMO
• Suppose we've decided to hold α₃, …, α_m fixed and update α₁, α₂. The constraint Σ_{i=1}^m α_i y_i = 0 gives

α₁ y₁ + α₂ y₂ = −Σ_{i=3}^m α_i y_i
⇒ α₁ y₁ + α₂ y₂ = ζ    (a constant)
Detail of SMO
• We can thus picture the constraints on α₁, α₂ as follows:
• α₁ and α₂ must lie within the box [0,C]×[0,C];
• α₁ and α₂ must lie on the line α₁ y₁ + α₂ y₂ = ζ.
• From these constraints, we know: L ≤ α₂ ≤ H.
Detail of SMO
• From α₁ y₁ + α₂ y₂ = ζ:

α₁ y₁ = ζ − α₂ y₂
α₁ y₁² = (ζ − α₂ y₂) y₁
⇒ α₁ = (ζ − α₂ y₂) y₁

• W(α) becomes

W(α₁, α₂, …, α_m) = W((ζ − α₂ y₂) y₁, α₂, …, α_m) = a α₂² + b α₂ + c

• This is a standard quadratic function of α₂.
Detail of SMO
• The quadratic a α₂² + b α₂ + c is really easy to optimize: set its derivative to zero.
• If you end up with a value outside [L, H], you must clip your solution.
• That gives the optimal solution of this quadratic optimization problem subject to your solution satisfying the box constraint and lying on the straight line.
Detail of SMO
• In conclusion,

α₂^new = H,                  if α₂^(new,unclipped) > H
       = α₂^(new,unclipped), if L ≤ α₂^(new,unclipped) ≤ H
       = L,                  if α₂^(new,unclipped) < L

α₁ = (ζ − α₂ y₂) y₁

• This process can be done very quickly, which makes the inner loop of the SMO algorithm very efficient.
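Putting the pieces together, a sketch of one inner-loop update for a chosen pair (i, j), loosely following the simplified SMO handout cited below (alpha, y, the kernel matrix K, the errors E_k = f(x_k) − y_k, and C are assumed to be maintained by the caller):

```python
import numpy as np

def smo_step(i, j, alpha, y, K, E, C):
    if i == j:
        return False
    # Ends L, H of the feasible segment for alpha_j on the constraint line
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]    # second derivative of W along the line
    if L == H or eta >= 0:
        return False
    a_j = alpha[j] - y[j] * (E[i] - E[j]) / eta   # unconstrained optimum
    a_j = np.clip(a_j, L, H)                      # clip to [L, H]
    alpha[i] += y[i] * y[j] * (alpha[j] - a_j)    # keep a_i y_i + a_j y_j = zeta
    alpha[j] = a_j
    return True
```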
Conclusion
• SMO
 - scales to large data sets,
 - makes SVM training computationally inexpensive,
 - can be implemented easily because it has a simple training structure (coordinate ascent).
Reference
• [1] Platt, John. Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).
• [2] The Simplified SMO Algorithm, Machine Learning course @ Stanford: http://www.stanford.edu/class/cs229/materials/smo.pdf
