25. Corpus, Collocation, and Bag of Words
Corpus (plural: corpora): a large, structured collection of texts used as the basis for language research and analysis
Collocation: a sequence of words that co-occur more often than would be expected by chance
The meaning of a word can be inferred from the words that appear around it (the distributional hypothesis)
Bag of words: bag {a, a, b, c, c, c} = bag {c, a, c, b, a, c}; word order is ignored, only counts matter
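The multiset equality in the example can be checked directly with Python's `collections.Counter`:

```python
from collections import Counter

# A bag (multiset) keeps word counts but discards word order,
# so the two orderings from the slide compare equal.
bag1 = Counter("a a b c c c".split())
bag2 = Counter("c a c b a c".split())
print(bag1 == bag2)  # True
```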
The idea goes back to John Rupert Firth
(June 17, 1890 in Keighley, Yorkshire – December 14, 1960
in Lindfield, West Sussex):
"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)
40. Word Embedding
B. Word2Vec - word2vec Parameter Learning Explained, Xin Rong
https://ronxin.github.io/wevi/
42. Word Embedding
King - Man + Woman ≈ Queen
(the vector offset from King to Queen matches the offset from Man to Woman)
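The analogy can be sketched with hand-picked toy vectors (a minimal illustration, not trained embeddings; the two dimensions stand for "royalty" and "maleness"):

```python
# Toy 2-D "embeddings" (dims: [royalty, maleness]) chosen by hand
vec = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}
# king - man + woman, computed per dimension
result = [k - m + w for k, m, w in
          zip(vec["king"], vec["man"], vec["woman"])]
# Nearest vocabulary word by squared Euclidean distance
nearest = min(vec, key=lambda w: sum((a - b) ** 2
                                     for a, b in zip(vec[w], result)))
print(nearest)  # queen
```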
43. Word Embedding
B. Word2Vec demo
https://ronxin.github.io/wevi/
Blue dots are input vectors; orange dots are output vectors.
44. Word Embedding
B. Word2Vec demo
https://ronxin.github.io/wevi/, https://github.com/ronxin/wevi
{"hidden_size": 8, "random_state": 1, "learning_rate": 0.2}
Training data (context|target):
apple|drink^juice,
orange|eat^apple,
rice|drink^juice,
juice|drink^milk,
milk|drink^rice,
water|drink^milk,
juice|orange^apple,
juice|apple^drink,
milk|rice^drink,
drink|milk^water,
drink|water^juice,
drink|juice^water
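The training pairs above can be parsed with a few lines of Python (assuming, per the slide's "context|target" header, that "|" separates the context word from the "^"-joined target words):

```python
# Parse wevi-style pairs: "context|word1^word2" -> (context, [word1, word2])
raw = ("apple|drink^juice,orange|eat^apple,rice|drink^juice,"
       "juice|drink^milk,milk|drink^rice,water|drink^milk")
pairs = [(item.split("|")[0], item.split("|")[1].split("^"))
         for item in raw.split(",")]
print(pairs[0])  # ('apple', ['drink', 'juice'])
```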
45. Word Embedding
B. Word2Vec demo
king|kingdom, queen|kingdom, king|palace, queen|palace, king|royal, queen|royal, king|George, queen|Mary, man|rice, woman|rice, man|farmer, woman|farmer, man|house, woman|house, man|George, woman|Mary
https://ronxin.github.io/wevi/
You can see the infamous analogy: "king - queen = man - woman"
https://www.youtube.com/watch?v=D-ekE-Wlcds&feature=youtu.be
48. Word Embedding - Exercise 1 – Kaggle Movie Review Data
https://www.kaggle.com/belayati/word2vec-tutorial-suite
https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/
52. Word Embedding - Exercise 1
from gensim.models import word2vec

# Second run: load the model trained and saved earlier
model = word2vec.Word2Vec.load("300features_40minwords_10context")

# Note: in gensim >= 4 these methods live on model.wv
# (e.g. model.wv.doesnt_match, model.wv.most_similar)
print("Q1. doesnt_match [man woman child kitchen]")
print(model.doesnt_match("man woman child kitchen".split()))
print("Q2. doesnt_match [france england germany berlin]")
print(model.doesnt_match("france england germany berlin".split()))
print("Q3. doesnt_match [paris berlin london china]")
print(model.doesnt_match("paris berlin london china".split()))
print("Q4. most_similar [man]")
print(model.most_similar("man"))
print("Q5. most_similar [queen]")
print(model.most_similar("queen"))
print("Q6. most_similar [terrible]")
print(model.most_similar("terrible"))
Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews
2017-10-02 22:24:42,461 : INFO : collecting all words and their counts
2017-10-02 22:24:45,961 : INFO : collected 123504 word types from a corpus of 17798082 raw words and 795538 sentences
2017-10-02 22:24:46,110 : INFO : estimated required memory for 16490 words and 300 dimensions: 47821000 bytes
2017-10-02 22:24:46,390 : INFO : training model with 6 workers on 16490 vocabulary and 300 features, using sg=0 hs=0 sample=1e-05 negative=5 window=10
2017-10-02 22:25:43,623 : INFO : saved 300features_40minwords_10context
63. (DATA ANALYSIS) Document Clustering
step one: extract keywords from each document
step two: build a word2vec model from the keywords
step three: cluster the documents (K-Means, K = 20)
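Step three can be sketched with a minimal hand-rolled K-Means over stand-in document vectors (in practice the vectors would come from the per-document word2vec keywords, and a library implementation such as scikit-learn's KMeans would be used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "document vectors": 200 docs x 300 dims, K = 20 clusters
docs = rng.normal(size=(200, 300))
K = 20

# Minimal K-Means: assign each doc to its nearest centroid,
# then recompute each centroid as the mean of its members
centroids = docs[rng.choice(len(docs), K, replace=False)]
for _ in range(10):
    dists = ((docs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(1)
    for k in range(K):
        if (labels == k).any():
            centroids[k] = docs[labels == k].mean(0)

print(len(set(labels.tolist())))  # number of non-empty clusters (<= 20)
```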