2017.10.18
???, Jason Min
"AI LAWYER" ???? ??? - I
Artificial Intelligence Lawyer 1
(???? ?? ??)
??? ?? ????
???? ?? ? ??? ?? ?????/ ?? ???? ?? ? ?? ??? ????.
??? ?? ?? ???? ???/ ?? ???? ???./ ??? ??? ???/
??? ???? ??? ?? ????.
??? ??? ?? ? ? ???/ ? ?? ??? ?? ??? ?? ?? ???.
?? ?? ?? ??? ????./ ??? ? ??? ??? ? ??? ??/ ??? ??? ???.
??? ?? ??? ?? ????? ???/ ??? ?? ??????.
??? ??? ??? ??? ?? ????./ ??? ????? ??? ?? ????/
????? ? ?? ?? ?? ???/ ????? ??? ??.
- ???? -
???? ??? ??
[?? ??]
- ?? ???
[???? ??]
- ???? ???
[????? ??]
- ??, ???, ??????? ?
[?? ??]
- ?? : Python (Anaconda, Gensim, NLTK ?)
[?? ??]
- ?? ?? ??? ?? - ? ??? 4? ? ?? ?? ? ??? ??
- ??? ??(???), ?? ?
- KISA ?? ?
[?? ?? - ????]
(????)
1957 TF-IDF (?? ?? ?, http://www.bloter.net/archives/264262; Luhn, "A statistical approach to mechanized encoding and searching of literary information," IBM Journal of Research and Development, 1(4), 309-317)
(Topic Modeling)
2003 LDA(?? ???? ??)
(Word Embedding)
2013 Word2Vec (??)
[?? ?? – ????]
[??????? ??]
1. ? ???? ????(20171018)
- ? ?? ?? ??? ?? ??
2. ???
- ???
3. ? ??? (QnA) ??? ???
???? ??? ??  1? - Ai lawyer ??? ?? ??
'????? ?? ????? ????'? ???? ????? ???? ?? ??? ▲??? ?? ?? ??? ?? ▲??·?? ??
▲??? ??? ?? ??? ???? ?? ?? ??? ???
"????? ??·?? ?? ???? ??? ???"??? "?? ??? ???? ?? ?? ???? ??? ?? ????? ?? ?
?? ???? ??"? ??? ??. ?? "?? ?? ??? ??? ??? ???? ??? ?? ??"??? "????? ??? ??
???? ??? ??? ? ??"? ??
?? ?????? 2013? ??? '??? ??(The Future of Employment)'?
? ??? ???, ???? ??? ?? ??? ???? ?? ??? ????
??? ?????? 94%, ?? ???? 50%? ?? ?? ??? ????
COIN : COntract INtelligence
https://www.atriumlts.com/,http://www.venturesquare.net/753779
[?? ??]
- ?? ???
???
Dept. of Law
Mar. 01.2015~ Feb. 09. 2017
2015.3.1 ~ 2017.2.09
2015.4.2 / 2015.3.6
2017.8.22
[???? ??]
- ???? ???
2017.4.11 / 2016.3.06
[????? ??]
- Computer Science, ??, ???, ??????? ?
2015.3.13 2017.3.18
2015.5.22
http://www.nms.kcl.ac.uk/icail2017/cfcoliee.php
http://webdocs.cs.ualberta.ca/~miyoung2/COLIEE2017/
https://easychair.org/publications/paper/350845
Competition on Legal Information Extraction and Entailment ??
- ???? ?? ? ?? ?? (JAPAN) - * 2017?, 4?
https://easychair.org/publications/volume/COLIEE_2017
https://sites.google.com/site/ntcir11riteval/ Overview_of_COLIEE_2017.pdf International Conference on Artificial Intelligence and Law (ICAIL)
Legal Question Answering Data Corpus The corpus of legal questions is
drawn from Japanese Legal Bar exams, and the relevant Japanese Civil
Law articles have also been provided.
https://sites.google.com/site/ntcir11riteval/ Overview_of_COLIEE_2017.pdf International Conference on Artificial Intelligence and Law (ICAIL)
Latent Dirichlet allocation (LDA)
https://sites.google.com/site/ntcir11riteval/ Overview_of_COLIEE_2017.pdf International Conference on Artificial Intelligence and Law (ICAIL)
http://thrillfighter.tistory.com/466
conda search python
conda create -n py35 python=3.5.3 anaconda
activate py35
deactivate py35
[?? ??]
- ?? : Python (Anaconda, Gensim, NLTK ?)
https://www.lucypark.kr/slides/2015-pyconkr/#2
http://radimrehurek.com/gensim/install.html
????, ???? ??
???(Corpus or Corpora ???, ???) : ??? ??? ??? ??? ??? ??? ??
??? ???
??(collocation) : ?? ?? ??? ?? ????? ??
??? ?? ??? ? ?? ?? ??? ??
Bag of words : bag {a, a, b, c, c, c} = bag {c, a, c, b, a, c}, ??? ??
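The bag equality above is just multiset equality; in Python it can be checked directly with collections.Counter, which keeps counts and ignores order (a minimal illustration, not from the original slides):

```python
from collections import Counter

# A bag (multiset) records how many times each item occurs, but not the order,
# so bag {a, a, b, c, c, c} equals bag {c, a, c, b, a, c}.
bag1 = Counter(["a", "a", "b", "c", "c", "c"])
bag2 = Counter(["c", "a", "c", "b", "a", "c"])
print(bag1 == bag2)  # True
```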
? ??? ?? John Rupert Firth
(June 17, 1890 in Keighley, Yorkshire – December 14, 1960
in Lindfield, West Sussex)
You shall know a word by the company
it keeps (Firth, J. R. 1957:11)
https://m.blog.naver.com/2feelus/220384206922
?????? ???????
(????? ???? )
? ?
? ?
9890
????? : 2010. 11. 12.
? ? ? : ???????????
???????????
???????????
??? ??(10?)
???? ? ????
???? ???? ???? ??? ??? ??? ???? ??? ????, ?? ???? ??? ???? ??? ????? ?
? ?? ??? ??? ? 6? ??? ?? ?? ???? ???? ??? ??? ???? ?? ???? ???? ?? ??? ??
?? ? ????? ????? ??? ??? ? ?? ??. ??? ????? ??? ??? ??? ? 8? ??? ????? ??(? ?
63??2??4?).
…
? ·??????
? ?
? ? ?
?63?(??) ① (? ?)
?63?(??) ① (??? ??)
② ???? ?? ? ?? ? ② -------------------------
? ??? ???? ??? ? ----------------------------
?? ??? ????? ?? ----------------------------
? ?? ? ??. ??, ?4? -------------.---------------
? ???? ?????? ? ----------------------------
?? ??? ??? ??? ? ----------------------------
?? ???? ??. -------------.
1. ~ 3. (? ?)
1. ~ 3. (??? ??)
?? ??? - konlpy.corpus import kobill, 1809890.txt
??????(????? 1?)
http://konlpy-ko.readthedocs.io/ko/v0.4.3/morph/
http://konlpy-ko.readthedocs.io/ko/v0.4.3/api/konlpy.tag/
https://m.blog.naver.com/2feelus/220384206922
??????(????? 1?)
?? ??? - from konlpy.tag import Twitter ,
morphs ??(????? ¨C ????? ????)
tokens_ko
['??????', '??', '??', '??', '?', '(', '???', '??', '??', '??', ')', '?', '?', '?', '?', '9890', '?
?', '???', ':', '2010', '.', '11', '.', '12', '.', '?', '?', '?', ':', '???', '?', '???', '?', '???', '???', '?', '
???', '?', '???', '???',
…
'???', '?', '?', '?', '(', '02', '-', '788', '-', '4649', ',', 'tanzania@assembly.go.kr', ')', '-', '11', '-']
??????(????? 1?)
?? ??? - import nltk
ko = nltk.Text(tokens_ko)
tokens_ko
['??????', '??', '??', '??', '?', '(', '???', '??', '??', '??', ')', '?', '
?', '?', '?', '9890', '??', '???', ':', '2010', '.', '11', '.', '12', '.', '?', '?', '?', ':', '
???', '?', '???', '?', '???', '???', '?', '???', '?', '???', '???',
…
'???', '?', '?', '?', '(', '02', '-', '788', '-', '4649', ',', 'tanzania@assembly.go.kr',
')', '-', '11', '-']
#4. ?? ??? ?? ?? ?? ????
print(len(ko.tokens)) # returns number of tokens (document length)
print(len(set(ko.tokens))) # returns number of unique tokens
?? ?? ? : 1707
?? ?? ? : 476
??????(????? 1?)
?? ??? - import nltk
ko = nltk.Text(tokens_ko)
?? ??? ??? ??? ????
print("???? ??? : " + str(ko.count(str('????'))))
print("?? ??? : " + str(ko.count(str('??'))))
???? ??? : 38
?? ??? : 7
??????(????? 1?)
?? ??? - import nltk
import pylab
ko = nltk.Text(tokens_ko)
print("??????")
pylab.show = lambda: pylab.savefig('1809890_dispersion_plot.jpg')
ko.dispersion_plot(['????', '????', '???'])
??????(????? 1?)
?? ??? - import nltk
ko = nltk.Text(tokens_ko)
print ("???? (????? ?? ???? ??? ??? ?????)")
#ko.concordance(unicode('?'))
ko.concordance('?')
Displaying 4 of 4 matches:
?? ?? ? ?? ? ) ? ?? ? " ? ? ? . ? ? ? ? ? ?? ? ? ?? ?? ? ? . - 3 - ? · ?? ???
? ?? ?? ? 1 . ?? ?? ?? ??? ?? ? ????? ? ? ? 71 ?? 2 ?? 4 ? ? ????? ? ?? ?? ??
? 6 ? ?? ?? ? 8 ? ?? ? ?? , ? ?? ? ? ? ? ? ? 63 ?? 2 ?? 4 ? ? ?? ??? ? ???? ??
? ? 6 ? ?? ?? ? 8 ? ?? ? ?? , ? ?? ??? ? ? ? 44 ?? 1 ? ? 7 ? ? ?? ??? ? ???? ??
??? ??
????????(Tagging and chunking) 10.1 ??? ???(POS tagging)
from konlpy.tag import Twitter
t = Twitter()
tags_ko = t.pos('?? ?? ???? ????? ???')
print(tags_ko)
??? ??? ??? ? ??? ????.
???? ? ??? ??? ????,
???? ??? ??? ???? ??? ?? ? ?? ???.
https://datascienceschool.net/view-notebook/6927b0906f884a67b0da9310d3a581ee/
http://dalpo0814.tistory.com/13
??????(????? 1?)
?? ??? - from gensim import corpora
dictionary_ko = corpora.Dictionary(texts_ko)
dictionary_ko.save('ko.dict') # save dictionary to file for future use
… (dump of the saved ko.dict: morpheme/POS entries such as ??/Noun, ???/Verb paired with integer ids; the saved file is a binary pickle, so most of it is not human-readable) …
from gensim import corpora
print('\n\nencode tokens to integers')
dictionary_ko = corpora.Dictionary(texts_ko)
dictionary_ko.save('ko.dict') # save dictionary to file for future use
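What corpora.Dictionary does above can be seen without gensim: it assigns an integer id to each unique token, and doc2bow then maps a token list to (id, count) pairs. A stdlib-only sketch of that behavior (the function names and toy English tokens are illustrative stand-ins for texts_ko, not gensim's code):

```python
from collections import Counter

def build_dictionary(texts):
    """Assign a stable integer id to every unique token (what corpora.Dictionary does)."""
    token2id = {}
    for doc in texts:
        for tok in doc:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def doc2bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

texts_ko = [["law", "contract", "law"], ["contract", "court"]]
token2id = build_dictionary(texts_ko)
print(doc2bow(texts_ko[0], token2id))  # [(0, 2), (1, 1)]
```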
??????(Natural Language Processing)
????(Distributional Hypothesis), ??????(Vector Space Models)
???? : '??'? '??'? ??? ? ??
* ???? ???? Harris (1954), Firth (1957)? ?? ??(Distributional Hypothesis)? ??? ?? ???? ??
?? ??(???)? ??? ??? ??*?? ????? ??? ??? ? ??.
????? ??? ??? ??? ??? ??? ???? ??? ????.
distributional hypothesis : ??? ??? ???? ???? ??? ??? ??? ??? ??.
statistical semantics hypothesis : ?? ??? ??? ??? ???? ???? ?? ???? ? ?? ? ??.
bag of words hypothesis : ?? ??? ??? ???? ??? ??? ??? ???? ???? ??? ??.
?? ??? ?? ??? ??? ???? ? ??? ????.
Latent relation hypothesis : ??? ???? ??? ???? ???? ??? ??? ??? ??? ??? ??.
* ?? ?? ???? ?? ? ? ??
??? : ??-????(Term-Document Matrix),
??-????(Word-Context Matrix),
??-????(Pair-Pattern Matrix),
Word2Vec, Glove, Fasttext ?
Word Embedding : ??? ??? ??
'one-hot encoding'?? : ?? ??? ??? ??? ??(Bag of Words)
S1. "I am a boy"
S2. "I am a girl"
["I": 0, "am": 1, "a": 2, "boy": 3, "girl": 4]
S1 OHE : [1, 1, 1, 1, 0]
S2 OHE : [1, 1, 1, 0, 1]
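The S1/S2 vectors above are binary bag-of-words codes over the fixed five-word vocabulary; a minimal sketch that reproduces them (the helper name is illustrative):

```python
vocab = {"I": 0, "am": 1, "a": 2, "boy": 3, "girl": 4}

def encode(sentence, vocab):
    """Binary bag-of-words: 1 if the vocabulary word occurs in the sentence, else 0."""
    tokens = sentence.split()
    return [1 if word in tokens else 0 for word in vocab]

print(encode("I am a boy", vocab))   # [1, 1, 1, 1, 0]
print(encode("I am a girl", vocab))  # [1, 1, 1, 0, 1]
```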
SVD(?????, Singular Value Decomposition), PCA(?????, Principal Component Analysis)
-> LSA(??????, Latent Semantic Analysis) -> NNLM, Word2Vec, Glove, Fasttext ?
- Word Embedding
???? ???? ???
Unsupervised Learning ???
? ???? ????? ??~
????? ???? ??
(Feature)? ?? ???? ??
????. ??? ????? ?
??? ??? ?? ?????
? ??? ????? ??
https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/
Deerwester et al. (1990)? Landauer and Dumais (1997)? ? ??? ???? ??? ?? ?? ???? ??(latent/hidden
meaning)? ????? ??? ? ?? ? ????? ?? ? ??? ?? ? ??? ?? ??? ??
Rapp(2003)? ?? ???? ??? ??, Vozalis and Margaritis(2003)? ?????? sparsity? ??? ??
https://datascienceschool.net/view-notebook/6927b0906f884a67b0da9310d3a581ee/
Word Embedding ???
A. Neural Network Language Model(NNLM), Bengio(2003)
P(w_t | w_{t-4}, w_{t-3}, w_{t-2}, w_{t-1})
?????? ????? ???? ??
???? ??? n?1? ???? n?? ??? ??? N-gram ??? ? ??
ex : '?', '??', '??', '??' ? ? ??? '??'? ???? ?
?? ???? ???? ???? ?? ??? ???? ???? ??
"A Neural Probabilistic Language Model", Bengio, et al. 2003
http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Neural Network Language
Model(NNLM, Bengio)
Word2Vec
(Google Mikolov)
2003 2013
CBOW Skip-Gram
GloVe
(???? ??)
2014
Fasttext
(????)
2016
http://nlp.stanford.edu/projects/glove/
https://research.fb.com/projects/fasttext/
https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/30/word2vec/
Word Embedding
B. Word2Vec ??
Distributional Hypothesis ? ??? ???
"Efficient Estimation of Word Representations in Vector Space", Mikolov, et al. 2013
https://arxiv.org/pdf/1301.3781v3.pdf
"word2vec Parameter Learning Explained", Xin Rong,
http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf
Neural Network Language
Model(NNLM, Bengio)
Word2Vec
(Google Mikolov)
2003 2013
Neural Network Language Model(NNLM)? ?????? ?? ??? ??? ????? ????
CBOW(Continuous Bag of Words)? Skip-Gram ? ?? ??? ??
- ??? ??? ?? ???? ??? ??? ?? ??? ??? ?? ?? : ?? ____ ? ???.
- ??? ??? ?? ??? ?? ??? ???? ?? ?? : _____ ??__ ______
???(???, Corpus or Corpora) : ?? ?? ???. ?? ??? ???. ?? ??? ???.
CBOW Skip-Gram
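The two training setups above differ only in what predicts what: CBOW uses the surrounding window to predict the center word, and Skip-gram inverts this. A sketch of how (input, target) pairs are generated from a token sequence (the window size, tokens, and function name are illustrative, not from the slides):

```python
def training_pairs(tokens, window=1, skip_gram=False):
    """Generate (input, target) training pairs from a token sequence.

    CBOW:      (tuple of context words, center word)
    Skip-gram: (center word, one context word) per context position
    """
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if skip_gram:
            pairs.extend((center, c) for c in context)
        else:
            pairs.append((tuple(context), center))
    return pairs

tokens = ["the", "cat", "sat"]
print(training_pairs(tokens, window=1, skip_gram=False))
# [(('cat',), 'the'), (('the', 'sat'), 'cat'), (('cat',), 'sat')]
print(training_pairs(tokens, window=1, skip_gram=True))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```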
https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/30/word2vec/
Word Embedding
B. Word2Vec ??
??? ??? ???? ???? ? ??? ??? ???? ??
Word2Vec(Skip-Gram)? ?? ?? ????? ? ??? ?
O : surrounding word
C : center (context) word
p(o|c) : the probability of observing the surrounding word o given the center word c
? ?? ????? ?? ????? ????? ? ???? ??
'?????'? ???? ? '??'?? ??? ?? ???? ??? ??
u? v? ?????
?? : '?????'?? ???? ??? vc, '??'?? ???? ??? uo
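In the standard Skip-gram objective (Mikolov et al. 2013), this probability is modeled as a softmax over inner products of the two vector sets:

```latex
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```

where v_c is the input vector of the center word c, u_o the output vector of the surrounding word o, and V the vocabulary; training maximizes this probability for observed (center, surrounding) pairs.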
https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/11/embedding/
[?? ??]
???? ???? ? ?? ? ?(window)? ????, Word2Vec? ???? window ??? ????,
?????? ?????? ?? ? ??? ???? ???? ????? ??? ???????? ???
??? ???
* Word2Vec? window ?? ???? ?? ??? ???? ??? ???? ??? ??????? ?????
(??? ???), ???? ???? ??? ???? ??? ??????(??? ???) ??? ?
?? ??? ? ? ??? ??? ??? ???? ??? ???? ??? ??? ??? '??-????'?
?? ???, Word2Vec? ?? count ?? ????? ?? ?? ???? ???? ??(Co-occurrence)?
??
* Word2Vec? ????? ?? count ??? ???? ??? ??? ?? ??
Neural Word Embedding as Implicit Matrix Factorization, Levy and Goldberg (2014)
Word Embedding
B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong
https://ronxin.github.io/wevi/
Word Embedding
B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong
https://ronxin.github.io/wevi/
Word Embedding
B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong
https://ronxin.github.io/wevi/
(King -> Queen) ≈ (Man -> Woman), i.e. King - Man + Woman ≈ Queen
Word Embedding
B. Word2Vec ??
https://ronxin.github.io/wevi/
blue dots are input vectors
orange dots are output vectors.
Word Embedding
B. Word2Vec ??
https://ronxin.github.io/wevi/, https://github.com/ronxin/wevi
{"hidden_size":8,"random_state":1,
"learning_rate":0.2}
Training data (context|target):
apple|drink^juice,
orange|eat^apple,
rice|drink^juice,
juice|drink^milk,
milk|drink^rice,
water|drink^milk,
juice|orange^apple,
juice|apple^drink,
milk|rice^drink,
drink|milk^water,
drink|water^juice,
drink|juice^water
Word Embedding
B. Word2Vec ??
king|kingdom,queen|kingdom,king|palace,queen|palace,king|royal,queen|royal,king|George,queen|Mary,man|rice,woman|rice,man|farmer,woman|farmer,man|house,woman|house,man|George,woman|Mary
https://ronxin.github.io/wevi/
you can see the famous analogy: "king - queen = man - woman"
https://www.youtube.com/watch?v=D-ekE-Wlcds&feature=youtu.be
https://code.google.com/archive/p/word2vec/
Word Embedding - Exercise 1 ¨C Kaggle Movie Review Data
https://www.kaggle.com/belayati/word2vec-tutorial-suite
https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
Word Embedding - Exercise 1
Word Embedding - Exercise 1
Word Embedding - Exercise 1
Word Embedding - Exercise 1
# For the second time
from gensim.models import word2vec
model = word2vec.Word2Vec.load("300features_40minwords_10context")
print("Q1. doesnt_match [man woman child kitchen]")
print(model.doesnt_match("man woman child kitchen".split()))
print("Q2. doesnt_match [france england germany berlin]")
print(model.doesnt_match("france england germany berlin".split()))
print("Q3. doesnt_match [paris berlin london china]")
print(model.doesnt_match("paris berlin london china".split()))
print("Q4. most_similar [man]")
print(model.most_similar("man"))
print("Q5. most_similar [queen]")
print(model.most_similar("queen"))
print("Q6. most_similar [terrible]")
print(model.most_similar("terrible"))
Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews
2017-10-02 22:24:42,461 : INFO : collecting all words and their counts
2017-10-02 22:24:45,961 : INFO : collected 123504 word types from a corpus of 17798082 raw words and 795538
sentences
2017-10-02 22:24:46,110 : INFO : estimated required memory for 16490 words and 300 dimensions: 47821000 bytes
2017-10-02 22:24:46,390 : INFO : training model with 6 workers on 16490 vocabulary and 300 features, using sg=0
hs=0 sample=1e-05 negative=5 window=10
2017-10-02 22:25:43,623 : INFO : saved 300features_40minwords_10context
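most_similar in the queries above ranks words by cosine similarity between their 300-dimensional vectors; a stdlib sketch of that measure (the toy 3-d vectors are illustrative, not the trained model's):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = u.v / (|u||v|), the score gensim's most_similar sorts by."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```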
Word Embedding - Exercise 1
[ ?? ]
Q1. doesnt_match [man woman child kitchen]
kitchen
Q2. doesnt_match [france england germany berlin]
berlin
Q3. doesnt_match [paris berlin london china]
china
Q4. most_similar [man]
[('murderer', 0.9646390676498413), ('seeks', 0.9643397927284241), ('priest', 0.9583220481872559),
('obsessed', 0.9529876708984375), ('patient', 0.9518172144889832), ('accused', 0.9511740207672119),
('prostitute', 0.9504649043083191), ('determined', 0.9503258466720581), ('lonely', 0.94843989610672),
('learns', 0.9481915235519409)]
Q5. most_similar [queen]
[('preston', 0.9884580373764038), ('duke', 0.9870703220367432), ('belle', 0.9855383634567261),
('princess', 0.9838896989822388), ('sally', 0.9837985038757324), ('karl', 0.9834704399108887),
('marshall', 0.9831289649009705), ('cole', 0.9830288887023926), ('virginia', 0.9829562306404114),
('veronica', 0.9828841686248779)]
Q6. most_similar [terrible]
[('horrible', 0.9897130131721497), ('awful', 0.976392924785614), ('lame', 0.9656811952590942), ('horrid',
0.9629285335540771), ('alright', 0.9597908854484558), ('boring', 0.9586129188537598), ('mess',
0.9550553560256958), ('cringe', 0.9532005786895752), ('badly', 0.9410943388938904), ('horrendous',
0.9403845071792603)]
https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
Word Embedding - Exercise 1
16490 x 300
model.wv.save_word2vec_format('300features_40minwords_10context.txt', binary=False)
https://www.kaggle.com/cherishzhang/clustering-on-papers
https://github.com/benhamner/nips-2015-papers/blob/master/src/download_papers.py
Word Embedding - Exercise 2
https://www.kaggle.com/cherishzhang/clustering-on-papers
https://github.com/benhamner/nips-2015-papers/blob/master/src/download_papers.py
Word Embedding - Exercise 2
step one: extract keywords from Title, Abstract and PaperText based on tf-idf
step two: keywords are used to build the word2vec model
step three: from keywords to paper document, average the top-n keywords vector to represent the whole paper
papers_data['Title_clean'] = papers_data['Title'].apply(lambda x:clean_text(x))
papers_data['Abstract_clean'] = papers_data['Abstract'].apply(lambda x:clean_text(x))
papers_data['PaperText_clean'] = papers_data['PaperText'].apply(lambda x: clean_text(x))
#title2kw = extract_tfidf_keywords(papers_data['Title_clean'],3)
abstract2kw = extract_tfidf_keywords(papers_data['Abstract_clean'], 20)
text2kw = extract_tfidf_keywords(papers_data['PaperText_clean'],100)
print ("[abstract2kw]", abstract2kw)
print ("[text2kw]", text2kw)
(TFIDF) 403?? ??-> abstract, papertext-> ? 20?, 100? ? ??? ??
[
['possibl', 'onli', 'data', 'qualiti', 'involv', 'reduct', 'multipl', 'label', 'popular', 'address',
'lower', 'fast', 'challeng', 'form', 'machin learn', 'low', 'obtain', 'rate', 'natur', 'make'],
['loss', 'convex', 'robust', 'classif', 'strong', 'solut', 'prove', 'ani', 'paper propos', 'result',
'label', 'nois', 'limit', 'standard', 'make', 'random', 'howev', 'class', 'experi', 'linear'],
…
]
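The keyword lists above come from TF-IDF scoring; a minimal stdlib sketch of the weighting itself, using the common log-IDF variant (the function name and toy documents are illustrative, not from the Kaggle kernel):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF = raw term count within the document; IDF = log(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["contract", "law", "law"], ["contract", "tort"]]
w = tf_idf(docs)
# "law" appears only in the first document, so it gets a positive weight;
# "contract" appears in every document, so its IDF (and weight) is zero here.
```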
https://www.kaggle.com/cherishzhang/clustering-on-papers
https://github.com/benhamner/nips-2015-papers/blob/master/src/download_papers.py
Word Embedding - Exercise 2
https://www.kaggle.com/cherishzhang/clustering-on-papers
https://github.com/benhamner/nips-2015-papers/blob/master/src/download_papers.py
Word Embedding - Exercise 2
"""
k-means clustering and wordcloud(it can combine topic-models
to give somewhat more interesting visualizations)
"""
num_clusters = 10
km = KMeans(n_clusters=num_clusters)
km.fit(doc2vecs)
clusters = km.labels_.tolist()
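The doc2vecs matrix clustered above comes from step three of the earlier recipe: averaging each paper's top-n keyword vectors into one document vector. A stdlib-only sketch of that averaging (the tiny 2-d vectors, word list, and function name are illustrative stand-ins for the trained word2vec embeddings):

```python
def average_vectors(keywords, word_vectors):
    """Average the vectors of a document's top-n keywords into one document vector."""
    vecs = [word_vectors[w] for w in keywords if w in word_vectors]
    if not vecs:
        return None  # no known keywords: no document vector
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

word_vectors = {"loss": [1.0, 0.0], "convex": [0.0, 1.0], "robust": [1.0, 1.0]}
doc_vec = average_vectors(["loss", "convex", "robust", "unknown"], word_vectors)
print(doc_vec)  # out-of-vocabulary keywords are skipped; mean is [2/3, 2/3]
```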
http://better.fsc.go.kr/user/extra/fsc/123/fsc_lawreq/view/jsp/LayOutPage.do?lawreqIdx=1612
[CASE1 : ?? ?? ??? ??] - ? ???, ??? ??(???), ?? ?
(??) WEB Page -> HTML FILES
(????) #subContent > div > div
(DATA) CSV FILE ?? (?? 2238? ??/?? ???)
(DATA ANALYSIS) ????
step one: extract keywords from Document
step two: keywords are used to build the word2vec model
step three: Document Clustering (K-Means, 20)
(DATA ANALYSIS) ?? ?? ?? ??
Category 3 : ???
Category 6 : ????
(DATA ANALYSIS) ?? ?? ?? ??
Category 19 : ????
Category 9 : ????
"?? ??? ??? ??? ??? ?? ????? ?? ????
???? ?? ? ???? ?? ??? ???? ???."
?? ???(Gwynne Shotwell, SpaceX President & COO)
?????
(facebook.com/sangshik, mikado22001@yahoo.co.kr)
  • 1. 2017.10.18 ???, Jason Min 1 ¡°AI LAWYER¡± ???? ??? - I Artificial Intelligence Lawyer 1 (???? ?? ??)
  • 2. ??? ?? ???? ???? ?? ? ??? ?? ?????/ ?? ???? ?? ? ?? ??? ????. ??? ?? ?? ???? ???/ ?? ???? ???./ ??? ??? ???/ ??? ???? ??? ?? ????. ??? ??? ?? ? ? ???/ ? ?? ??? ?? ??? ?? ?? ???. ?? ?? ?? ??? ????./ ??? ? ??? ??? ? ??? ??/ ??? ??? ???. ??? ?? ??? ?? ????? ???/ ??? ?? ??????. ??? ??? ??? ??? ?? ????./ ??? ????? ??? ?? ????/ ????? ? ?? ?? ?? ???/ ????? ??? ??. - ???? -
  • 3. ???? ??? ?? [?? ??] - ?? ??? [???? ??] - ???? ??? [????? ??] - ??, ???, ??????? ? [?? ??] - ?? : Python (Anaconda, Gensim, NLTK ?) [?? ??] - ?? ?? ??? ?? - ? ??? 4? ? ?? ?? ? ??? ?? - ??? ??(???), ?? ? - KISA ?? ? [?? ?? - ????] (????) 1957 TF-IDF (?? ?? ?, http://www.bloter.net/archives/264262, A statistical approach to mechanized encoding and searching of literary information. IBM Journal of research and development, 1(4), 309-317) (Topic Modeling) 2003 LDA(?? ???? ??) (Word Embedding) 2013 Word2Vec (??) [?? ?? ¨C ????] [??????? ??] 1. ? ???? ????(20171018) - ? ?? ?? ??? ?? ?? 2. ??? - ??? 3. ? ??? (QnA) ??? ???
  • 5. ¡®????? ?? ????? ????¡¯? ???? ????? ???? ?? ??? ¡ø??? ?? ?? ??? ?? ¡ø??¡¤?? ?? ¡ø??? ??? ?? ??? ???? ?? ?? ??? ??? ¡°????? ??¡¤?? ?? ???? ??? ???¡±??? ¡°?? ??? ???? ?? ?? ???? ??? ?? ????? ?? ? ?? ???? ??¡±? ??? ??. ?? ¡°?? ?? ??? ??? ??? ???? ??? ?? ??¡±??? ¡°????? ??? ?? ???? ??? ??? ? ??¡±? ?? ?? ?????? 2013? ??? ¡®??? ??(The Future of Employment)¡¯? ? ??? ???, ???? ??? ?? ??? ???? ?? ??? ???? ??? ?????? 94%, ?? ???? 50%? ?? ?? ??? ????
  • 7. COIN : COntract INtelligence
  • 10. [?? ??] - ?? ??? ??? Dept. of Law Mar. 01.2015~ Feb. 09. 2017 2015.3.1 ~ 2017.2.09 2015.4.22015.3.6 2017.8.22
  • 11. [???? ??] - ???? ??? 2017.4.112016.3.06
  • 12. [????? ??] - Computer Science, ??, ???, ??????? ? 2015.3.13 2017.3.18 2015.5.22
  • 17. https://sites.google.com/site/ntcir11riteval/ Overview_of_COLIEE_2017.pdf International Conference on Artificial Intelligence and Law (ICAIL) Legal Question Answering Data Corpus The corpus of legal questions is drawn from Japanese Legal Bar exams, and the relevant Japanese Civil Law articles have been also provided.
  • 18. https://sites.google.com/site/ntcir11riteval/ Overview_of_COLIEE_2017.pdf International Conference on Artificial Intelligence and Law (ICAIL) Latent Dirichlet allocation (LDA)
  • 22. http://thrillfighter.tistory.com/466 conda search python conda create -n py35 python=3.5.3 anaconda activate py35 deactivate py35 [?? ??] - ?? : Python (Anaconda, Gensim, NLTK ?)
  • 25. ????, ???? ?? ???(Corpus or Corpora ???, ???) : ??? ??? ??? ??? ??? ??? ?? ??? ??? ??(collocation) : ?? ?? ??? ?? ????? ?? ??? ?? ??? ? ?? ?? ??? ?? Bag of words : bag {a, a, b, c, c, c} = bag {c, a, c, b, a, c}, ??? ?? ? ??? ?? John Rupert Firth (June 17, 1890 in Keighley, Yorkshire ¨C December 14, 1960 in Lindfield, West Sussex) You shall know a word by the company it keeps (Firth, J. R. 1957:11)
  • 26. https://m.blog.naver.com/2feelus/220384206922 ?????? ??????? (????? ???? ) ? ? ? ? 9890 ????? : 2010. 11. 12. ? ? ? : ??????????? ??????????? ??????????? ??? ??(10?) ???? ? ???? ???? ???? ???? ??? ??? ??? ???? ??? ????, ?? ???? ??? ???? ??? ????? ? ? ?? ??? ??? ? 6? ??? ?? ?? ???? ???? ??? ??? ???? ?? ???? ???? ?? ??? ?? ?? ? ????? ????? ??? ??? ? ?? ??. ??? ????? ??? ??? ??? ? 8? ??? ????? ??(? ? 63??2??4?). ¡­ ¡­ ? ¡¤?????? ? ? ? ? ? ?63?(??) ¢Ù (? ?) ?63?(??) ¢Ù (??? ??) ¢Ú ???? ?? ? ?? ? ¢Ú ------------------------- ? ??? ???? ??? ? ---------------------------- ?? ??? ????? ?? ---------------------------- ? ?? ? ??. ??, ?4? -------------.--------------- ? ???? ?????? ? ---------------------------- ?? ??? ??? ??? ? ---------------------------- ?? ???? ??. -------------. 1. ¡« 3. (? ?) 1. ¡« 3. (??? ??) ?? ??? - konlpy.corpus import kobill, 1809890.txt ??????(????? 1?)
  • 27. http://konlpy- ko.readthedocs.io/ko/v0.4.3/morph/ http://konlpy- ko.readthedocs.io/ko/v0.4.3/api/konlpy.tag/ https://m.blog.naver.com/2feelus/220384206922 ??????(????? 1?) ?? ??? - from konlpy.tag import Twitter , morphs ??(????? ¨C ????? ????) tokens_ko ['??????', '??', '??', '??', '?', '(', '???', '??', '??', '??', ')', '?', '?', '?', '?', '9890', '? ?', '???', ':', '2010', '.', '11', '.', '12', '.', '?', '?', '?', ':', '???', '?', '???', '?', '???', '???', '?', ' ???', '?', '???', '???', ¡­ ¡­ ¡­ '???', '?', '?', '?', '(', '02', '-', '788', '-', '4649', ',', 'tanzania@assembly.go.kr', ')', '-', '11', '-']
  • 28. http://konlpy- ko.readthedocs.io/ko/v0.4.3/morph/ http://konlpy- ko.readthedocs.io/ko/v0.4.3/api/konlpy.tag/ https://m.blog.naver.com/2feelus/220384206922 ??????(????? 1?) ?? ??? - import nltk ko = nltk.Text () tokens_ko ['??????', '??', '??', '??', '?', '(', '???', '??', '??', '??', ')', '?', ' ?', '?', '?', '9890', '??', '???', ':', '2010', '.', '11', '.', '12', '.', '?', '?', '?', ':', ' ???', '?', '???', '?', '???', '???', '?', '???', '?', '???', '???', ¡­ ¡­ ¡­ '???', '?', '?', '?', '(', '02', '-', '788', '-', '4649', ',', 'tanzania@assembly.go.kr', ')', '-', '11', '-'] #4. ?? ??? ?? ?? ?? ???? print(len(ko.tokens)) # returns number of tokens (document length) print(len(set(ko.tokens))) # returns number of unique tokens ?? ?? ? : 1707 ?? ?? ? : 476
  • 29. http://konlpy- ko.readthedocs.io/ko/v0.4.3/morph/ http://konlpy- ko.readthedocs.io/ko/v0.4.3/api/konlpy.tag/ https://m.blog.naver.com/2feelus/220384206922 ??????(????? 1?) ?? ??? - import nltk ko = nltk.Text () ?? ??? ??? ??? ???? print("???? ??? : " + str(ko.count(str('????')))) print("?? ??? : " + str(ko.count(str('??')))) ???? ??? : 38 ?? ??? : 7
  • 30. http://konlpy- ko.readthedocs.io/ko/v0.4.3/morph/ http://konlpy- ko.readthedocs.io/ko/v0.4.3/api/konlpy.tag/ https://m.blog.naver.com/2feelus/220384206922 ??????(????? 1?) ?? ??? - import nltk ko = nltk.Text () print ("??????") pylab.show = lambda: pylab.savefig('1809890_dispersion_plot.jpg') ko.dispersion_plot(['????', '????', '???'])
  • 31. http://konlpy- ko.readthedocs.io/ko/v0.4.3/morph/ http://konlpy- ko.readthedocs.io/ko/v0.4.3/api/konlpy.tag/ https://m.blog.naver.com/2feelus/220384206922 ??????(????? 1?) ?? ??? - import nltk ko = nltk.Text () print ("???? (????? ?? ???? ??? ??? ?????)") #ko.concordance(unicode('?')) ko.concordance('?') Displaying 4 of 4 matches: ?? ?? ? ?? ? ) ? ?? ? ¡± ? ? ? . ? ? ? ? ? ?? ? ? ?? ?? ? ? . - 3 - ? ¡¤ ?? ??? ? ?? ?? ? 1 . ?? ?? ?? ??? ?? ? ????? ? ? ? 71 ?? 2 ?? 4 ? ? ????? ? ?? ?? ?? ? 6 ? ?? ?? ? 8 ? ?? ? ?? , ? ?? ? ? ? ? ? ? 63 ?? 2 ?? 4 ? ? ?? ??? ? ???? ?? ? ? 6 ? ?? ?? ? 8 ? ?? ? ?? , ? ?? ??? ? ? ? 44 ?? 1 ? ? 7 ? ? ?? ??? ? ???? ??
  • 32. https://m.blog.naver.com/2feelus/220384206922 ??? ?? ????????(Tagging and chunking) 10.1 ??? ???(POS tagging) from konlpy.tag import Twitter; t = Twitter() tags_ko = t.pos('?? ?? ???? ????? ???') print(tags_ko) ??? ??? ??? ? ??? ????. ???? ? ??? ??? ????, ???? ??? ??? ???? ??? ?? ? ?? ???.
  • 33. https://m.blog.naver.com/2feelus/220384206922 https://datascienceschool.net/view-notebook/6927b0906f884a67b0da9310d3a581ee/ http://dalpo0814.tistory.com/13 ??????(????? 1?) ?? ??? - from gensim import corpora dictionary_ko = corpora.Dictionary(texts_ko) dictionary_ko.save('ko.dict') # save dictionary to file for future use ¡­ ¡­ ??/NounrM?X ??/NounrK{X ??/NounrMX????/VerbrM|X???/VerbrM*X ??/NounrMJX ??/NounrMKX????/VerbrMX???/AdjectiverM?X???/NounrMRX???/NounrM?X ??/NounrM?X ??/NounrM?X ??/NounrM?X?/NounrMZX?/NounrM'X???/NounrM)X???/NounrM?X 172/Numberr M?X ??/Nounr!M?X?/Eomir"K?X???/Nounr#M?X ??/Nounr$K?X???/Verbr%MbX?/Determinerr&K?X?/Nounr'M?X?/Nounr(KjX ??/Nounr)M?X???/Nounr*MfX ??/Nounr+M?X?/Nounr,M?Xí—µÚ/Foreignr-M)X ??/Nounr.M6X ??/Nounr/M?X????/Verbr0M?X ¡­ ¡­ ... from gensim import corpora print('nnencode tokens to integers') dictionary_ko = corpora.Dictionary(texts_ko) dictionary_ko.save('ko.dict') # save dictionary to file for future use
  • 35. ??????(Natural Language Processing) ????(Distributional Hypothesis), ??????(Vector Space Models) ???? : ¡®??¡¯? ¡®??¡¯? ??? ? ?? * ???? ???? Harris (1954), Firth (1957)? ?? ??(Distributional Hypothesis)? ??? ?? ???? ?? ?? ??(???)? ??? ??? ??*?? ????? ??? ??? ? ??. ????? ??? ??? ??? ??? ??? ???? ??? ????. distributional hypothesis : ??? ??? ???? ???? ??? ??? ??? ??? ??. statistical semantics hypothesis : ?? ??? ??? ??? ???? ???? ?? ???? ? ?? ? ??. bag of words hypothesis : ?? ??? ??? ???? ??? ??? ??? ???? ???? ??? ??. ?? ??? ?? ??? ??? ???? ? ??? ????. Latent relation hypothesis : ??? ???? ??? ???? ???? ??? ??? ??? ??? ??? ??. * ?? ?? ???? ?? ? ? ?? ??? : ??-????(Term-Document Matrix), ??-????(Word-Context Matrix), ??-????(Pair-Pattern Matrix), Word2Vec, Glove, Fasttext ?
  • 36. Word Embedding : ??? ??? ?? ¡®one-hot encoding¡¯?? : ?? ??? ??? ??? ??(Bag of Words) S1. "I am a boy" S2. "I am a girl" ["I": 0, "am": 1, "a": 2, "boy": 3, "girl": 4] S1 OHE : [11110] S2 OHE : [11101] SVD(?????, Singular Value Decomposition), PCA(?????, Principal Component Analysis) -> LSA(??????, Latent Sematic Analysis)-> NNLM, Word2Vec, Glove, Fasttext ? - Word Embedding ???? ???? ??? Unsupervised Learning ??? ? ???? ????? ??~ ????? ???? ?? (Feature)? ?? ???? ?? ????. ??? ????? ? ??? ??? ?? ????? ? ??? ????? ?? https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/ Deerwester et ak.(1990)? Landauer and Dumais(1997)? ? ??? ???? ??? ?? ?? ???? ??(latent/hidden meaning)? ????? ??? ? ?? ? ????? ?? ? ??? ?? ? ??? ?? ??? ?? Rapp(2003)? ?? ???? ??? ??, Vozalis and Margaritis(2003)? ?????? sparsity? ??? ??
  • 37. https://datascienceschool.net/view-notebook/6927b0906f884a67b0da9310d3a581ee/ Word Embedding ??? A. Neural Network Language Model(NNLM), Bengio(2003) P(??P(??|?,??,??,??)?,??,??,??) ?????? ????? ???? ?? ???? ??? n?1? ???? n?? ??? ??? N-gram ??? ? ?? ex : ¡®?¡¯, ¡®??¡¯, ¡®??¡¯, ¡®??¡¯ ? ? ??? ¡®??¡¯? ???? ? ?? ???? ???? ???? ?? ??? ???? ???? ?? "A Neural Probabilistic Language Model", Bengio, et al. 2003 http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf Neural Network Language Model(NNLM, Bengio) Word2Vec (Google Mikolov) 2003 2013 CBOW Skip-Gram GloVe (???? ??) 2014 Fasttext (????) 2016 http://nlp.stanford.edu/projects/glove/ https://research.fb.com/projects/fasttext/
  • 38. https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/30/word2vec/ Word Embedding B. Word2Vec ?? Distributional Hypothesis ? ??? ??? "Efficient Estimation of Word Representations in Vector Space", Mikolov, et al. 2013 https://arxiv.org/pdf/1301.3781v3.pdf "word2vec Parameter Learning Explained", Xin Rong, http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf Neural Network Language Model(NNLM, Bengio) Word2Vec (Google Mikolov) 2003 2013 Neural Network Language Model(NNLM)? ?????? ?? ??? ??? ????? ???? CBOW(Continuous Bag of Words)? Skip-Gram ? ?? ??? ?? - ??? ??? ?? ???? ??? ??? ?? ??? ??? ?? ?? : ?? ____ ? ???. - ??? ??? ?? ??? ?? ??? ???? ?? ?? : _____ ??__ ______ ???(???, Corpus or Corpora) : ?? ?? ???. ?? ??? ???. ?? ??? ???. CBOW Skip-Gram
  • 39. https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/30/word2vec/ Word Embedding B. Word2Vec ?? ??? ??? ???? ???? ? ??? ??? ???? ?? Word2Vec(Skip-Gram)? ?? ?? ????? ? ??? ? O : ????(surrounding word) C : ????(context word) p(o|c) : ????(c)? ???? ? ????(o)? ??? ????? ? ?? ????? ?? ????? ????? ? ???? ?? ¡®?????¡¯? ???? ? ¡®??¡¯?? ??? ?? ???? ??? ?? u? v? ????? ?? : ¡®?????¡¯?? ???? ??? vc, ¡®??¡¯?? ???? ??? uo https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/03/11/embedding/ [?? ??] ???? ???? ? ?? ? ?(window)? ????, Word2Vec? ???? window ??? ????, ?????? ?????? ?? ? ??? ???? ???? ????? ??? ???????? ??? ??? ??? * Word2Vec? window ?? ???? ?? ??? ???? ??? ???? ??? ??????? ????? (??? ???), ???? ???? ??? ???? ??? ??????(??? ???) ??? ? ?? ??? ? ? ??? ??? ??? ???? ??? ???? ??? ??? ??? ¡®??-????¡®? ?? ???, Word2Vec? ?? count ?? ????? ?? ?? ???? ???? ??(Co-occurrence)? ?? * Word2Vec? ????? ?? count ??? ???? ??? ??? ?? ?? Neural Word Embedding as Implicit Matrix Factorization, Omer and Yoav(2014)
  • 40. Word Embedding B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong https://ronxin.github.io/wevi/
  • 41. Word Embedding B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong https://ronxin.github.io/wevi/
  • 42. Word Embedding B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong https://ronxin.github.io/wevi/ The offset (King -> Queen) parallels (Man -> Woman): king - man + woman ≈ queen.
  • 43. Word Embedding B. Word2Vec demo https://ronxin.github.io/wevi/ Blue dots are input vectors; orange dots are output vectors.
  • 44. Word Embedding B. Word2Vec demo https://ronxin.github.io/wevi/, https://github.com/ronxin/wevi
{"hidden_size": 8, "random_state": 1, "learning_rate": 0.2}
Training data (context|target): apple|drink^juice, orange|eat^apple, rice|drink^juice, juice|drink^milk, milk|drink^rice, water|drink^milk, juice|orange^apple, juice|apple^drink, milk|rice^drink, drink|milk^water, drink|water^juice, drink|juice^water
  • 45. Word Embedding B. Word2Vec demo
king|kingdom, queen|kingdom, king|palace, queen|palace, king|royal, queen|royal, king|George, queen|Mary, man|rice, woman|rice, man|farmer, woman|farmer, man|house, woman|house, man|George, woman|Mary
https://ronxin.github.io/wevi/
you can see the famous analogy: "king - queen = man - woman"
https://www.youtube.com/watch?v=D-ekE-Wlcds&feature=youtu.be
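The analogy arithmetic itself is just vector addition plus a nearest-neighbor search by cosine similarity. A sketch with hand-made toy embeddings (the two-dimensional vectors below are assumptions chosen so one axis encodes "royalty" and the other "gender", making the arithmetic exact):

```python
import numpy as np

# Toy embeddings (assumed for illustration, not trained values).
emb = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "palace": np.array([1.0,  0.0]),   # distractor word
}

def analogy(a, b, c):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), inputs excluded."""
    target = emb[a] - emb[b] + emb[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return max((w for w in emb if w not in {a, b, c}),
               key=lambda w: cos(emb[w], target))

print(analogy("king", "man", "woman"))  # queen
```

This is the same computation gensim performs in `most_similar(positive=["king", "woman"], negative=["man"])`, only over a real trained vocabulary.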
  • 48. Word Embedding - Exercise 1 – Kaggle Movie Review Data https://www.kaggle.com/belayati/word2vec-tutorial-suite https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
  • 49. Word Embedding - Exercise 1
  • 50. Word Embedding - Exercise 1
  • 51. Word Embedding - Exercise 1
  • 52. Word Embedding - Exercise 1
# For the second time
model = word2vec.Word2Vec.load("300features_40minwords_10context")
print("Q1. doesnt_match [man woman child kitchen]")
print(model.doesnt_match("man woman child kitchen".split()))
print("Q2. doesnt_match [france england germany berlin]")
print(model.doesnt_match("france england germany berlin".split()))
print("Q3. doesnt_match [paris berlin london china]")
print(model.doesnt_match("paris berlin london china".split()))
print("Q4. most_similar [man]")
print(model.most_similar("man"))
print("Q5. most_similar [queen]")
print(model.most_similar("queen"))
print("Q6. most_similar [terrible]")
print(model.most_similar("terrible"))
Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews
2017-10-02 22:24:42,461 : INFO : collecting all words and their counts
2017-10-02 22:24:45,961 : INFO : collected 123504 word types from a corpus of 17798082 raw words and 795538 sentences
2017-10-02 22:24:46,110 : INFO : estimated required memory for 16490 words and 300 dimensions: 47821000 bytes
2017-10-02 22:24:46,390 : INFO : training model with 6 workers on 16490 vocabulary and 300 features, using sg=0 hs=0 sample=1e-05 negative=5 window=10
2017-10-02 22:25:43,623 : INFO : saved 300features_40minwords_10context
  • 53. Word Embedding - Exercise 1 [ Results ]
Q1. doesnt_match [man woman child kitchen] -> kitchen
Q2. doesnt_match [france england germany berlin] -> berlin
Q3. doesnt_match [paris berlin london china] -> china
Q4. most_similar [man]
[('murderer', 0.9646390676498413), ('seeks', 0.9643397927284241), ('priest', 0.9583220481872559), ('obsessed', 0.9529876708984375), ('patient', 0.9518172144889832), ('accused', 0.9511740207672119), ('prostitute', 0.9504649043083191), ('determined', 0.9503258466720581), ('lonely', 0.94843989610672), ('learns', 0.9481915235519409)]
Q5. most_similar [queen]
[('preston', 0.9884580373764038), ('duke', 0.9870703220367432), ('belle', 0.9855383634567261), ('princess', 0.9838896989822388), ('sally', 0.9837985038757324), ('karl', 0.9834704399108887), ('marshall', 0.9831289649009705), ('cole', 0.9830288887023926), ('virginia', 0.9829562306404114), ('veronica', 0.9828841686248779)]
Q6. most_similar [terrible]
[('horrible', 0.9897130131721497), ('awful', 0.976392924785614), ('lame', 0.9656811952590942), ('horrid', 0.9629285335540771), ('alright', 0.9597908854484558), ('boring', 0.9586129188537598), ('mess', 0.9550553560256958), ('cringe', 0.9532005786895752), ('badly', 0.9410943388938904), ('horrendous', 0.9403845071792603)]
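For intuition, `doesnt_match` can be understood as picking the word farthest from the group's average direction. A rough sketch of that idea with hand-made toy vectors (the embeddings are assumptions for illustration, not the trained IMDB model):

```python
import numpy as np

# Toy embeddings: three "person" words pointing one way, one outlier.
emb = {"man": np.array([1.0, 0.1]), "woman": np.array([0.9, 0.2]),
       "child": np.array([0.8, 0.0]), "kitchen": np.array([-0.1, 1.0])}

def doesnt_match(words):
    """Return the word least cosine-similar to the mean of the group."""
    unit = lambda v: v / np.linalg.norm(v)
    mean = unit(np.mean([unit(emb[w]) for w in words], axis=0))
    return min(words, key=lambda w: unit(emb[w]) @ mean)

print(doesnt_match(["man", "woman", "child", "kitchen"]))  # kitchen
```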
  • 54. https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/ Word Embedding - Exercise 1
Embedding matrix: 16490 x 300 (vocabulary size x dimensions)
model.wv.save_word2vec_format('300features_40minwords_10context.txt', binary=False)
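The file written by `save_word2vec_format(..., binary=False)` uses the plain word2vec text format: a header line `vocab_size dimensions` (here 16490 300), then one `word v1 ... vd` line per word. A tiny hand-written example of that format (the three words and values are assumptions), parsed with plain Python:

```python
# Minimal word2vec text-format file and parser.
text = """3 2
man 0.1 0.9
woman 0.2 -0.8
queen 0.7 -0.6
"""

lines = text.strip().split("\n")
vocab_size, dim = map(int, lines[0].split())   # header: "3 2"
vectors = {}
for line in lines[1:]:
    parts = line.split()
    vectors[parts[0]] = [float(x) for x in parts[1:]]

print(vectors["queen"])  # [0.7, -0.6]
```

In practice the same file is reloaded with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`.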
  • 56. https://www.kaggle.com/cherishzhang/clustering-on-papers https://github.com/benhamner/nips-2015-papers/blob/master/src/download_papers.py Word Embedding - Exercise 2
step one: extract keywords from Title, Abstract and PaperText based on tf-idf
step two: keywords are used to build the word2vec model
step three: from keywords to paper document, average the top-n keyword vectors to represent the whole paper
papers_data['Title_clean'] = papers_data['Title'].apply(lambda x: clean_text(x))
papers_data['Abstract_clean'] = papers_data['Abstract'].apply(lambda x: clean_text(x))
papers_data['PaperText_clean'] = papers_data['PaperText'].apply(lambda x: clean_text(x))
#title2kw = extract_tfidf_keywords(papers_data['Title_clean'], 3)
abstract2kw = extract_tfidf_keywords(papers_data['Abstract_clean'], 20)
text2kw = extract_tfidf_keywords(papers_data['PaperText_clean'], 100)
print("[abstract2kw]", abstract2kw)
print("[text2kw]", text2kw)
(TF-IDF) From 403 papers, extract the top 20 keywords per abstract and the top 100 per paper text:
[ ['possibl', 'onli', 'data', 'qualiti', 'involv', 'reduct', 'multipl', 'label', 'popular', 'address', 'lower', 'fast', 'challeng', 'form', 'machin learn', 'low', 'obtain', 'rate', 'natur', 'make'], ['loss', 'convex', 'robust', 'classif', 'strong', 'solut', 'prove', 'ani', 'paper propos', 'result', 'label', 'nois', 'limit', 'standard', 'make', 'random', 'howev', 'class', 'experi', 'linear'], … ]
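Step one can be sketched without pandas or sklearn. Below is a minimal pure-Python TF-IDF keyword extractor over two made-up token lists; the documents and scores are illustrative only, and the notebook's `extract_tfidf_keywords` / `clean_text` helpers (not shown in the slides) would do the equivalent at scale:

```python
import math
from collections import Counter

# Two toy "documents" as token lists (assumed for illustration).
docs = [
    "convex loss robust classification convex solution".split(),
    "word embedding vector representation embedding".split(),
]

# Document frequency: in how many documents each word appears.
df = Counter(w for doc in docs for w in set(doc))
N = len(docs)

def top_keywords(doc, n):
    """Score words by tf * idf and return the n highest-scoring ones."""
    tf = Counter(doc)
    scores = {w: tf[w] * math.log(N / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:n]

print(top_keywords(docs[1], 1))  # ['embedding']
```

Words that are frequent within one document but rare across the collection score highest, which is exactly what makes them useful keywords.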
  • 58. https://www.kaggle.com/cherishzhang/clustering-on-papers https://github.com/benhamner/nips-2015-papers/blob/master/src/download_papers.py Word Embedding - Exercise 2
""" k-means clustering and wordcloud (it can combine topic models to give somewhat more interesting visualizations) """
num_clusters = 10
km = KMeans(n_clusters=num_clusters)
km.fit(doc2vecs)
clusters = km.labels_.tolist()
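The `doc2vecs` matrix fed to KMeans comes from step three: each paper is represented by the average of its top keyword vectors. A toy sketch of that averaging, with assumed embedding values:

```python
import numpy as np

# Toy keyword embeddings (assumed values for illustration).
emb = {"convex": np.array([1.0, 0.0]), "loss": np.array([0.8, 0.2]),
       "embedding": np.array([0.0, 1.0]), "vector": np.array([0.1, 0.9])}

def doc_vector(keywords):
    """Mean of the keyword vectors: one fixed-size vector per document."""
    return np.mean([emb[w] for w in keywords], axis=0)

# Two "papers", each summarized by two keywords.
doc2vecs = np.vstack([doc_vector(["convex", "loss"]),
                      doc_vector(["embedding", "vector"])])
print(doc2vecs.shape)  # (2, 2): num_documents x embedding_dim
```

Averaging discards word order but yields a fixed-length vector per document, which is what clustering algorithms like KMeans require.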
  • 60. (??) WEB Page -> HTML FILES
  • 61. (????) #subContent > div > div
  • 62. (DATA) CSV FILE ?? (?? 2238? ??/?? ???)
  • 63. (DATA ANALYSIS) ???? step one: extract keywords from Document step two: keywords are used to build the word2vec model step three: Document Clustering (K-Means, 20)
  • 64. (DATA ANALYSIS) ???? step one: extract keywords from Document step two: keywords are used to build the word2vec model step three: Document Clustering (K-Means, 20)
  • 65. (DATA ANALYSIS) ???? step one: extract keywords from Document step two: keywords are used to build the word2vec model step three: Document Clustering (K-Means, 20)
  • 66. (DATA ANALYSIS) ?? ?? ?? ?? Category 3 : ??? Category 6 : ????
  • 67. (DATA ANALYSIS) ?? ?? ?? ?? Category 19 : ???? Category 9 : ????
  • 69.
  • 70. "?? ??? ??? ??? ??? ?? ????? ?? ???? ???? ?? ? ???? ?? ??? ???? ???." - Gwynne Shotwell (SpaceX President & COO)
  • 72.
  • 73.