25. Corpus, Collocation, and Bag of Words
Corpus (plural: corpora): a large, structured collection of texts used as the basis for language research and analysis
Collocation: a sequence of words that co-occur more often than would be expected by chance
The meaning of a word can be inferred from the words that appear around it (the distributional hypothesis)
Bag of words: bag {a, a, b, c, c, c} = bag {c, a, c, b, a, c}; word order is ignored, only counts matter
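The multiset equality in the example can be checked directly with Python's `collections.Counter`:

```python
from collections import Counter

# A bag (multiset) keeps word counts but discards word order,
# so the two orderings from the slide compare equal.
bag1 = Counter("a a b c c c".split())
bag2 = Counter("c a c b a c".split())
print(bag1 == bag2)  # True
```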
The idea goes back to John Rupert Firth
(June 17, 1890 in Keighley, Yorkshire – December 14, 1960
in Lindfield, West Sussex):
"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)
40. Word Embedding
B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong
https://ronxin.github.io/wevi/
42. Word Embedding
King - Man + Woman ≈ Queen
(the vector offset from King to Queen matches the offset from Man to Woman)
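The analogy can be sketched with hand-picked toy vectors (a minimal illustration, not trained embeddings; the two dimensions stand for "royalty" and "maleness"):

```python
# Toy 2-D "embeddings" (dims: [royalty, maleness]) chosen by hand
vec = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}
# king - man + woman, computed per dimension
result = [k - m + w for k, m, w in
          zip(vec["king"], vec["man"], vec["woman"])]
# Nearest vocabulary word by squared Euclidean distance
nearest = min(vec, key=lambda w: sum((a - b) ** 2
                                     for a, b in zip(vec[w], result)))
print(nearest)  # queen
```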
43. Word Embedding
B. Word2Vec demo
https://ronxin.github.io/wevi/
Blue dots are input vectors; orange dots are output vectors.
44. Word Embedding
B. Word2Vec demo
https://ronxin.github.io/wevi/, https://github.com/ronxin/wevi
{"hidden_size": 8, "random_state": 1, "learning_rate": 0.2}
Training data (context|target):
apple|drink^juice,
orange|eat^apple,
rice|drink^juice,
juice|drink^milk,
milk|drink^rice,
water|drink^milk,
juice|orange^apple,
juice|apple^drink,
milk|rice^drink,
drink|milk^water,
drink|water^juice,
drink|juice^water
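The training pairs above can be parsed with a few lines of Python (assuming, per the slide's "context|target" header, that "|" separates the context word from the "^"-joined target words):

```python
# Parse wevi-style pairs: "context|word1^word2" -> (context, [word1, word2])
raw = ("apple|drink^juice,orange|eat^apple,rice|drink^juice,"
       "juice|drink^milk,milk|drink^rice,water|drink^milk")
pairs = [(item.split("|")[0], item.split("|")[1].split("^"))
         for item in raw.split(",")]
print(pairs[0])  # ('apple', ['drink', 'juice'])
```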
45. Word Embedding
B. Word2Vec demo
king|kingdom, queen|kingdom, king|palace, queen|palace, king|royal, queen|royal, king|George, queen|Mary, man|rice, woman|rice, man|farmer, woman|farmer, man|house, woman|house, man|George, woman|Mary
https://ronxin.github.io/wevi/
You can see the infamous analogy: "king - queen = man - woman"
https://www.youtube.com/watch?v=D-ekE-Wlcds&feature=youtu.be
48. Word Embedding - Exercise 1 – Kaggle Movie Review Data
https://www.kaggle.com/belayati/word2vec-tutorial-suite
https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
52. Word Embedding - Exercise 1
from gensim.models import word2vec

# Second run: load the model trained and saved earlier
model = word2vec.Word2Vec.load("300features_40minwords_10context")

# Note: in gensim >= 4 these methods live on model.wv
# (e.g. model.wv.doesnt_match, model.wv.most_similar)
print("Q1. doesnt_match [man woman child kitchen]")
print(model.doesnt_match("man woman child kitchen".split()))
print("Q2. doesnt_match [france england germany berlin]")
print(model.doesnt_match("france england germany berlin".split()))
print("Q3. doesnt_match [paris berlin london china]")
print(model.doesnt_match("paris berlin london china".split()))
print("Q4. most_similar [man]")
print(model.most_similar("man"))
print("Q5. most_similar [queen]")
print(model.most_similar("queen"))
print("Q6. most_similar [terrible]")
print(model.most_similar("terrible"))
Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews
2017-10-02 22:24:42,461 : INFO : collecting all words and their counts
2017-10-02 22:24:45,961 : INFO : collected 123504 word types from a corpus of 17798082 raw words and 795538 sentences
2017-10-02 22:24:46,110 : INFO : estimated required memory for 16490 words and 300 dimensions: 47821000 bytes
2017-10-02 22:24:46,390 : INFO : training model with 6 workers on 16490 vocabulary and 300 features, using sg=0 hs=0 sample=1e-05 negative=5 window=10
2017-10-02 22:25:43,623 : INFO : saved 300features_40minwords_10context
63. (DATA ANALYSIS) Document Clustering
step one: extract keywords from each document
step two: build a word2vec model from the keywords
step three: cluster the documents (K-Means, K = 20)
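Step three can be sketched with a minimal hand-rolled K-Means over stand-in document vectors (in practice the vectors would come from the per-document word2vec keywords, and a library implementation such as scikit-learn's KMeans would be used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "document vectors": 200 docs x 300 dims, K = 20 clusters
docs = rng.normal(size=(200, 300))
K = 20

# Minimal K-Means: assign each doc to its nearest centroid,
# then recompute each centroid as the mean of its members
centroids = docs[rng.choice(len(docs), K, replace=False)]
for _ in range(10):
    dists = ((docs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(1)
    for k in range(K):
        if (labels == k).any():
            centroids[k] = docs[labels == k].mean(0)

print(len(set(labels.tolist())))  # number of non-empty clusters (<= 20)
```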