This document discusses principal component analysis (PCA), a dimensionality reduction technique. It explains that PCA calculates the covariance matrix of the data, finds the eigenvectors and eigenvalues, and uses the top eigenvectors as principal components to represent the data in a lower dimensional space while preserving as much information as possible. The original data can then be recovered from the lower dimensional representation.
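As a rough illustration of the procedure summarized above, the following NumPy sketch computes principal components from the covariance matrix and reconstructs the data from the reduced representation; the data matrix, the helper name, and the number of components are illustrative assumptions, not taken from the document.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                               # center the data
    cov = np.cov(Xc, rowvar=False)              # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues, largest first
    components = eigvecs[:, order[:k]]          # top-k eigenvectors = principal components
    Z = Xc @ components                         # lower-dimensional representation
    X_recovered = Z @ components.T + mean       # approximate reconstruction of the data
    return Z, X_recovered

# Illustrative usage with random data
X = np.random.rand(100, 5)
Z, X_hat = pca_reduce(X, k=2)
```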
Dimensionality reduction techniques like principal component analysis (PCA) and singular value decomposition (SVD) are important for analyzing high-dimensional data by finding patterns in the data and expressing the data in a lower-dimensional space. PCA and SVD decompose a data matrix into orthogonal principal components/singular vectors that capture the maximum variance in the data, allowing the data to be represented in fewer dimensions without losing much information. Dimensionality reduction is useful for visualization, removing noise, discovering hidden correlations, and more efficiently storing and processing the data.
Machine Learning Foundations for Professional Managers, by Albert Y. C. Chen
2018/05/26 @ Taiwan AI Academy, Professional Managers Class.
Covering important concepts of classical machine learning, in preparation for the deep learning topics to follow. Topics include regression (linear, polynomial, Gaussian and sigmoid basis functions), dimensionality reduction (PCA, LDA, ISOMAP), clustering (K-means, GMM, Mean-Shift, DBSCAN, Spectral Clustering), classification (Naive Bayes, Logistic Regression, SVM, kNN, Decision Tree, Classifier Ensembles, Bagging, Boosting, AdaBoost), and semi-supervised learning techniques. Emphasis is on sampling, probability, the curse of dimensionality, decision theory, and classifier generalizability.
This document discusses data analysis and dimensionality reduction techniques including PCA and LDA. It provides an overview of feature transformation and why it is needed for dimensionality reduction. It then describes the steps of PCA including standardization of data, obtaining eigenvalues and eigenvectors, principal component selection, projection matrix, and projection into feature space. The steps of LDA are also outlined including computing mean vectors, scatter matrices, eigenvectors and eigenvalues, selecting linear discriminants, and transforming samples. Examples applying PCA and LDA to iris and web datasets are presented.
Parallel Algorithms for Geometric Graph Problems (at Stanford), by Grigory Yaroslavtsev
This document summarizes work on developing parallel algorithms for approximating problems on geometric graphs. Specifically, it presents algorithms for computing a (1+ε)-approximate minimum spanning tree (MST) and earth-mover distance in O(1) rounds of parallel computation using a "solve-and-sketch" framework. The MST algorithm imposes a randomly shifted grid tree and computes MSTs within cells, using only short edges and representative points between cells. This achieves an approximation ratio of 1+O(ε) in O(1) rounds. The framework is also extended to compute a (1+ε)-approximate transportation cost.
Automated attendance system based on facial recognition, by Dhanush Kasargod
A MATLAB-based system to take attendance in a classroom automatically using a camera. This project was carried out as a final-year project in our Electronics and Communications Engineering course. I have uploaded the entire MATLAB code to mathworks.com, and the full report will be available on my academia.edu page. I will be delighted to hear from you.
- Dimensionality reduction techniques assign instances to vectors in a lower-dimensional space while approximately preserving similarity relationships. Principal component analysis (PCA) is a common linear dimensionality reduction technique.
- Kernel PCA performs PCA in a higher-dimensional feature space implicitly defined by a kernel function. This allows PCA to find nonlinear structure in data. Kernel PCA computes the principal components by finding the eigenvectors of the normalized kernel matrix.
- For a new data point, its representation in the lower-dimensional space is given by projecting it onto the principal components in feature space using the kernel trick, without explicitly computing features.
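To make the kernel PCA steps above concrete, here is a minimal NumPy sketch using an RBF kernel; the kernel choice, its width `gamma`, the helper names, and the toy data are assumptions for illustration (in practice one would typically reach for scikit-learn's KernelPCA).

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(X, k, gamma=1.0):
    """Embed the training points X into k dimensions via kernel PCA."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center the kernel in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)                # eigenvectors of the centered kernel
    order = np.argsort(eigvals)[::-1][:k]
    alphas, lambdas = eigvecs[:, order], eigvals[order]
    # Embedding of the training points: eigenvectors scaled by sqrt of their eigenvalues
    return alphas * np.sqrt(np.maximum(lambdas, 0.0))

X = np.random.rand(50, 3)
Z = kernel_pca(X, k=2, gamma=0.5)   # 50 x 2 nonlinear embedding
```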
The document describes a generative model for networks called the Affiliation Graph Model (AGM). The AGM models how communities in a network "generate" the edges between nodes. It represents each node's membership in multiple communities as strengths in a membership matrix. The probability of an edge between two nodes depends on the product of their membership strengths in common communities. The maximum likelihood estimation technique can be used to estimate the community membership strengths matrix that best explains a given network.
This document provides an overview of dimensionality reduction techniques. It discusses how increasing dimensionality can negatively impact classification accuracy due to the curse of dimensionality. Dimensionality reduction aims to select an optimal set of features of lower dimensionality to improve accuracy. Feature extraction and feature selection are two common approaches. Principal component analysis (PCA) is described as a popular linear feature extraction method that projects data to a lower dimensional space while preserving as much variance as possible.
Lecture 9: Dimensionality Reduction, Singular Value Decomposition (SVD), Principal Component Analysis (PCA). (ppt,pdf)
Appendices A, B from the book Introduction to Data Mining by Tan, Steinbach, Kumar.
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of data by transforming it to a new coordinate system. It works by finding the principal components - linear combinations of variables with the highest variance - and using those to project the data to a lower dimensional space. PCA is useful for visualizing high-dimensional data, reducing dimensions without much loss of information, and finding patterns. It involves calculating the covariance matrix and solving the eigenvalue problem to determine the principal components.
Explainable algorithm evaluation from lessons in education, by CSIRO
How can we evaluate a portfolio of algorithms to extract meaningful interpretations about them? Suppose we have a set of algorithms. These can be classification, regression, clustering or any other type of algorithm. And suppose we have a set of problems that these algorithms can work on. We can evaluate these algorithms on the problems and get the results. From these results, can we explain the algorithms in a meaningful way? The easy option is to find which algorithm performs best for each problem and find the algorithm that performs best on the greatest number of problems. But, there is a limitation with this approach. We are only looking at the overall best! Suppose a certain algorithm gives the best performance on hard problems, but not on easy problems. We would miss this algorithm by using the overall best approach. How do we obtain a salient set of algorithm features?
Predictive modeling aims to generate accurate estimates of future outcomes by analyzing current and historical data using statistical and machine learning techniques. It involves gathering data, exploring the data, building predictive models using algorithms like regression, decision trees, and neural networks, and evaluating the models. Some common predictive modeling techniques include time series analysis, regression analysis, and clustering algorithms.
Distributed Architecture of Subspace Clustering and Related, by Pei-Che Chang
Distributed Architecture of Subspace Clustering and Related
Sparse Subspace Clustering
Low-Rank Representation
Least Squares Regression
Multiview Subspace Clustering
Data Science and Machine Learning with Tensorflow, by Shubham Sharma
Importance of Machine Learning and AI: emerging applications, end-use
Pictures (Amazon recommendations, Driverless Cars)
Relationship between Data Science and AI.
Overall structure and components
What tools can be used: technologies, packages
List of tools and their classification
List of frameworks
Artificial Intelligence and Neural Networks
Basics of ML, AI, and Neural Networks with implementations
Machine Learning Depth : Regression Models
Linear Regression : Math Behind
Non-Linear Regression : Math Behind
Machine Learning Depth : Classification Models
Decision Trees : Math Behind
Deep Learning
Mathematics Behind Neural Networks
Terminologies
What are the opportunities for data analytics professionals
Computer vision uses machine learning techniques to recognize objects in large amounts of images. A key development was the use of deep neural networks, which can recognize new images not in the training data as well as or better than humans. Graphics processing units (GPUs) enabled this breakthrough due to their ability to accelerate deep learning algorithms. Computer vision tasks involve both unsupervised learning, such as clustering visually similar images, and supervised learning, where algorithms are trained on labeled image data to learn visual classifications and recognize objects.
This document provides an overview of Linear Discriminant Analysis (LDA) for dimensionality reduction. LDA seeks to perform dimensionality reduction while preserving class discriminatory information as much as possible, unlike PCA which does not consider class labels. LDA finds a linear combination of features that separates classes best by maximizing the between-class variance while minimizing the within-class variance. This is achieved by solving the generalized eigenvalue problem involving the within-class and between-class scatter matrices. The document provides mathematical details and an example to illustrate LDA for a two-class problem.
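As a minimal sketch of the two-class case described above, the Fisher discriminant direction can be computed from the within-class scatter matrices; the toy data, dimensions, and helper name below are illustrative assumptions.

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher discriminant direction for a two-class problem.

    X1, X2: arrays of shape (n_i, d) holding the samples of each class.
    Returns the unit vector w maximizing between-class over within-class
    scatter, i.e. w proportional to S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)          # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)          # within-class scatter of class 2
    Sw = S1 + S2                          # total within-class scatter matrix
    w = np.linalg.solve(Sw, m1 - m2)      # S_W^{-1} (m1 - m2)
    return w / np.linalg.norm(w)

# Illustrative two-class data
X1 = np.random.randn(40, 4) + 1.0
X2 = np.random.randn(40, 4) - 1.0
w = lda_direction(X1, X2)
z1, z2 = X1 @ w, X2 @ w                   # 1-D projections of each class
```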
This document provides an overview of dimensionality reduction techniques including PCA and manifold learning. It discusses the objectives of dimensionality reduction such as eliminating noise and unnecessary features to enhance learning. PCA and manifold learning are described as the two main approaches, with PCA using projections to maximize variance and manifold learning assuming data lies on a lower dimensional manifold. Specific techniques covered include LLE, Isomap, MDS, and implementations in scikit-learn.
The document summarizes object and face detection techniques including:
- Lowe's SIFT descriptor for specific object recognition using histograms of edge orientations.
- Viola and Jones' face detector which uses boosted classifiers with Haar-like features and an attentional cascade for fast rejection of non-faces.
- Earlier face detection work including eigenfaces, neural networks, and distribution-based methods.
This document provides an overview of regression analysis and linear regression. It explains that regression analysis estimates relationships among variables to predict continuous outcomes. Linear regression finds the best fitting line through minimizing error. It describes modeling with multiple features, representing data in vector and matrix form, and using gradient descent optimization to learn the weights through iterative updates. The goal is to minimize a cost function measuring error between predictions and true values.
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
This document provides an overview of machine learning concepts including:
1. It defines data science and machine learning, distinguishing machine learning's focus on letting systems learn from data rather than being explicitly programmed.
2. It describes the two main areas of machine learning - supervised learning which uses labeled examples to predict outcomes, and unsupervised learning which finds patterns in unlabeled data.
3. It outlines the typical machine learning process of obtaining data, cleaning and transforming it, applying mathematical models, and using the resulting models to make predictions. Popular models like decision trees, neural networks, and support vector machines are also briefly introduced.
Machine Learning can often be a daunting subject to tackle much less utilize in a meaningful manner. In this session, attendees will learn how to take their existing data, shape it, and create models that automatically can make principled business decisions directly in their applications. The discussion will include explanations of the data acquisition and shaping process. Additionally, attendees will learn the basics of machine learning - primarily the supervised learning problem.
The #IncomeTaxBill 2025 simplifies capital gains taxation by removing exemptions, restricting tax benefits to long-term gains, and limiting indexation. Here is our detailed analysis of the proposed changes.
International Journal on Cloud Computing: Services and Architecture (IJCCSA), by ijccsa
Cloud computing helps enterprises transform business and technology. Companies have begun to look for solutions that would help reduce their infrastructure costs and improve profitability. Cloud computing is becoming a foundation for benefits well beyond IT cost savings. Yet, many business leaders are concerned about cloud security, privacy, availability, and data protection. To discuss and address these issues, we invite researchers who focus on cloud computing to shed more light on this emerging field. This peer-reviewed open access journal aims to bring together researchers and practitioners in all security aspects of cloud-centric and outsourced computing, including (but not limited to):
The digital landscape is constantly evolving. Understanding what works in SEO and content marketing today will be critical for success in 2025 and beyond. Let's delve into the key trends shaping the future of online content.
9th Edition of International Research Awards |28-29 March 2025 | San Francisco, United States
The International Research Awards recognize exceptional research contributions, innovation, and excellence across various fields. This prestigious award honors outstanding researchers, scientists, and scholars who have made significant impacts in their respective disciplines, fostering a culture of innovation and discovery.
This chapter proposes to explore the intersection of virtual reality (VR), augmented reality (AR), data analytics, and marketing in the context of the tourism and events industry. As technology continues to advance, organizations within this sector are seeking innovative and transformative ways to engage and attract tourists and event attendees. VR and AR have emerged as powerful tools for creating immersive experiences, while data analytics provides insights into consumer behavior and preferences. This chapter will examine how these technologies can be integrated into marketing strategies to enhance engagement, visitor experiences, and decision-making processes. By discussing real-world case studies, ethical considerations, and future trends, this chapter aims to provide a comprehensive overview of the subject and offer practical insights for professionals and researchers in the field.
Key Benefits of Implementing Contify's M&CI Platform, by Contify
Contify's Market & Competitive Intelligence platform empowers businesses with real-time insights, automated intelligence gathering, and AI-driven analytics. It enhances decision-making, streamlines competitor tracking, and delivers personalized intelligence reports. With Contify, organizations gain a strategic edge by identifying trends, mitigating risks, and seizing growth opportunities in dynamic market landscapes.
For more information, please visit https://www.contify.com/platform/
My experimental studies on information cascades and herding behavior, comparing Bayesian belief updating with other behavioral heuristics that account for the observed behaviors
Plant diseases pose a significant threat to agricultural productivity. Early detection and classification of plant diseases can help mitigate losses. This project focuses on building a Plant Disease Prediction system using Convolutional Neural Networks (CNNs). The system leverages NumPy, TensorFlow, and Streamlit to develop a model and deploy a web-based application. The final model is also containerized using Docker for efficient deployment.
Analyzing Consumer Spending Trends and Purchasing Behavior, by omololaokeowo1
This project explores consumer spending patterns using Kaggle-sourced data to uncover key trends in purchasing behavior. The analysis involved cleaning and preparing the data, performing exploratory data analysis (EDA), and visualizing insights using Excel. Key focus areas included customer demographics, product performance, seasonal trends, and pricing strategies. The project provided actionable insights into consumer preferences, helping businesses optimize sales strategies and improve decision-making.
2. Singular Value Decomposition
[Figure: 2-D points and their projection onto the first right singular vector]
Singular Value Decomposition (SVD) is also called Spectral Decomposition.
Instead of using two coordinates (x, y) to describe point locations, let's use only one coordinate z.
A point's position is its location along the first right singular vector v1.
How to choose v1? Minimize the reconstruction error.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
3. Singular Value Decomposition
Goal: minimize the sum of reconstruction errors:
$$\sum_{i=1}^{N} \sum_{j=1}^{D} \left( x_{ij} - z_{ij} \right)^2$$
where $x_{ij}$ are the old coordinates and $z_{ij}$ are the new coordinates.
SVD gives the best axis to project on: "best" meaning the one that minimizes the reconstruction errors. In other words, minimum reconstruction error.
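A minimal NumPy sketch of the idea on these two slides, assuming a handful of toy 2-D points: project onto the first right singular vector and measure the sum of squared reconstruction errors.

```python
import numpy as np

# Toy 2-D points (illustrative); each row of A is one point
A = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                      # first right singular vector: the best axis to project on
z = A @ v1                      # one coordinate per point: its position along v1
A_hat = np.outer(z, v1)         # rank-1 reconstruction of the points
err = ((A - A_hat) ** 2).sum()  # sum of squared reconstruction errors (minimized by v1)
print(v1, err)
```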
5. Singular Value Decomposition
$A = U \Sigma V^T$ (the full decomposition)
$B = U \Sigma V^T$, keeping only the largest singular values in $\Sigma$
B is the best approximation of A.
How Many Singular Values Should We Retain?
A useful rule of thumb is to retain enough singular values to make up 90% of the energy in $\Sigma$: the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values.
Example: the total energy is (12.4)² + (9.5)² + (1.3)² = 245.70, while the retained energy is (12.4)² + (9.5)² = 244.01, so we have retained over 99% of the energy. However, were we to also eliminate the second singular value, 9.5, the retained energy would be only (12.4)²/245.70, or about 63%.
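A small NumPy sketch of this rule of thumb, reusing the slide's example singular values; the helper name is an assumption for illustration.

```python
import numpy as np

def num_to_retain(singular_values, threshold=0.90):
    """Smallest k such that the top-k singular values hold >= threshold of the energy."""
    s = np.asarray(singular_values, dtype=float)
    energy = s ** 2
    frac = np.cumsum(energy) / energy.sum()          # cumulative energy fraction
    return int(np.searchsorted(frac, threshold) + 1)

s = [12.4, 9.5, 1.3]                                 # the slide's example
k = num_to_retain(s)                                 # k = 2: keeping 12.4 and 9.5 retains ~99% of 245.70
```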
6. Relation to Eigen-decomposition
SVD gives us: $A = U \Sigma V^T$
Eigen-decomposition: $A = X \Lambda X^T$ (for symmetric A)
U, V, X are orthonormal ($U^T U = I$); $\Lambda$ and $\Sigma$ are diagonal.
Now let's calculate:
$$A A^T = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma V^T (V \Sigma^T U^T) = U \Sigma \Sigma^T U^T$$
$$A^T A = V \Sigma^T U^T (U \Sigma V^T) = V \Sigma \Sigma^T V^T$$
Each product has the eigen-decomposition form $X \Lambda^2 X^T$, with $\Lambda^2 = \Sigma \Sigma^T$.
This shows how to compute the SVD using eigenvalue decomposition!
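A quick NumPy check of these identities, using a random matrix as an illustrative assumption: the eigenvalues of A^T A should be the squared singular values of A.

```python
import numpy as np

A = np.random.rand(6, 4)                       # illustrative matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eigvals, eigvecs = np.linalg.eigh(A.T @ A)     # A^T A = V (Sigma Sigma^T) V^T
eigvals = eigvals[::-1]                        # eigh returns ascending order; flip to descending

print(np.allclose(eigvals, s ** 2))            # True: eigenvalues equal the squared singular values
```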
9. Why do we need dimensionality reduction?
You need to visualize the data for non-technical board members who are probably not familiar with terms like cosine similarity.
You need to compress the data based on a constraint, such as preserving 80% of the variance in the data.
You need to reduce the data you have, and any new data as it arrives. Which method would you choose?
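For the "preserve 80%" style of constraint above, scikit-learn's PCA accepts a fractional n_components; a minimal sketch, with random data standing in for a real dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 30)                    # illustrative high-dimensional data

# Keep as many components as needed to preserve 80% of the variance
pca = PCA(n_components=0.80)
Z = pca.fit_transform(X)                       # reduced representation of the training data
Z_new = pca.transform(np.random.rand(5, 30))   # new data mapped with the same components

print(Z.shape[1], pca.explained_variance_ratio_.sum())
```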
10. Non-Linear Dimensionality Reduction
Given a low-dimensional surface embedded non-linearly in high-dimensional space, such a surface is called a manifold.
A good way to represent data points is by their low-dimensional coordinates.
The low-dimensional representation of the data should capture information about high-dimensional pairwise distances.
Non-linear dimensionality reduction is also called manifold learning.
Idea: to recover the low-dimensional surface.
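A minimal scikit-learn sketch of manifold learning on the classic Swiss-roll surface; the dataset, neighborhood size, and target dimension are illustrative choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A classic manifold example: a 2-D surface rolled up in 3-D space
X, color = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Recover low-dimensional coordinates that preserve pairwise (geodesic) distances
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)             # 1000 x 2 embedding of the manifold
```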
14. Stochastic Neighbor Embedding (SNE)
High-dimensional space: the similarity of point $x_j$ to point $x_i$ is the conditional probability
$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
Low-dimensional space (2-D): the corresponding similarity is
$$q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}$$
Minimization function:
$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
1. A large $p_{j|i}$ modeled by a low $q_{j|i}$ gives a high cost.
2. A small $p_{j|i}$ modeled by a high $q_{j|i}$ gives a low cost.
1. SNE is not symmetric, whereas t-SNE is symmetric.
2. Symmetry makes t-SNE fast.
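A minimal NumPy sketch of the quantities above, using a single fixed sigma rather than a per-point sigma_i chosen by perplexity search; that simplification, the helper names, and the random data are assumptions for illustration.

```python
import numpy as np

def conditional_probabilities(X, sigma=1.0):
    """p_{j|i}: Gaussian neighbor probabilities with a single bandwidth sigma."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point never picks itself as a neighbor
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)      # normalize each row over k != i

def sne_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) = sum_{i,j} p_{j|i} log(p_{j|i} / q_{j|i})."""
    return float((P * np.log((P + eps) / (Q + eps))).sum())

X = np.random.rand(50, 10)                       # high-dimensional points
Y = np.random.rand(50, 2)                        # candidate 2-D embedding
P = conditional_probabilities(X, sigma=1.0)
Q = conditional_probabilities(Y, sigma=np.sqrt(0.5))   # exponent becomes -||y_i - y_j||^2
print(sne_cost(P, Q))
```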
15. T-Stochastic Neighbor Embedding (t-SNE)
Low-dimensional space (2-D): t-SNE uses a Student-t kernel,
$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
High-dimensional space:
$$p_{ij} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\lVert x_k - x_l \rVert^2 / 2\sigma^2\right)}$$
1. The t-distribution has longer tails, so it can embed more points from the high-dimensional space into the low-dimensional space.
2. There are some heuristics underlying t-SNE.
3. It develops an intuition for what's going on in the high-dimensional data.
4. It finds structure where other dimensionality-reduction algorithms cannot.
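In practice, t-SNE is usually run through a library; a minimal scikit-learn sketch, where the data, perplexity, and initialization are illustrative choices:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 50)          # high-dimensional data (illustrative)

# Map to 2-D with t-SNE; perplexity controls the effective neighborhood size
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)            # 500 x 2 embedding for visualization
```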