Two sentences are tokenized and encoded by a BERT model. The first sentence describes two kids playing with a green crocodile float in a swimming pool. The second sentence describes two kids pushing an inflatable crocodile around in a pool. The tokenized sentences are passed through the BERT model, which outputs the encoded representations of the token sequences.
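This encoding step can be sketched with the Hugging Face `transformers` library — a minimal sketch, assuming the `transformers` and `torch` packages and the public `bert-base-uncased` checkpoint (the sentences below paraphrase the ones described above):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "Two kids are playing with a green crocodile float in a swimming pool.",
    "Two kids push an inflatable crocodile around in a pool.",
]
# tokenize both sentences into one padded batch of token ids
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
# one contextual vector per token: (batch_size, seq_len, hidden_size)
embeddings = outputs.last_hidden_state
```

The `last_hidden_state` tensor holds the encoded representation of every token in both sequences, which is what downstream token-level comparisons operate on.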
1) Canonical correlation analysis (CCA) is a statistical method that analyzes the correlation relationship between two sets of multidimensional variables.
2) CCA finds linear transformations of the two sets of variables so that their correlation is maximized. This can be formulated as a generalized eigenvalue problem.
3) The number of dimensions of the transformed variables is determined using Bartlett's test, which tests the eigenvalues against a chi-squared distribution.
This document discusses designing lessons for the "Information I and II" courses based on Peirce's theory of inquiry stages. It first provides background on the goals and structure of the Information courses according to the Ministry of Education's curriculum guidelines. It then reviews previous work on structuring education and on problem-solving and Peirce's theory of abduction, deduction, and induction. Finally, it proposes mapping the problem-solving methods covered in Information (information design, programming, data utilization) to Peirce's three stages of inquiry.
See https://github.com/saireya/thesis/tree/master/2021jaeis-peirce for details.
In the common subject "Information", each course has an introductory unit, and the remaining units each teach an individual problem-solving method. However, because the relationship between the contents of the units is unclear, there is little rational explanation for why these particular methods, out of the many approaches to problem solving, are the ones covered. This paper organizes the characteristics of each unit based on Peirce's classification of inference and his theory of the stages of inquiry, and proposes guidelines for systematically developing lessons in the common subject "Information".
3. Gensim
Gensim
A library that makes topic models (pLSA, LDA) and deep learning (word2vec) easy to use [2][3]
The tutorial on the official site is somewhat hard to follow
Usage is covered in detail in [4] and [1]
Figure: Mentioned by the author:)
3 / 11
4. Gensim
[Flow diagram] documents (e.g. "System and human system ...") → morphological analysis → texts (e.g. ['system', 'and', 'human']) → dic = corpora.Dictionary() builds the token-to-id dictionary ({'and': 19, 'minors': 37, ...}), saved with dic.save() to dict.dic → dic.doc2bow() maps each document to (token id, tf) pairs, giving the corpus ([(10, 2), (19, 1), (3, 1), ...]), serialized with MmCorpus.serialize() to corpus.mm → models (tf-idf, LSA, LDA, HDP, RP, log entropy, word2vec) are trained on the corpus and saved with model.save() to lda.model (later reloaded via dic.load(), MmCorpus(), and model.load()) → similarities judges document similarity, and model.show_topics() extracts the topics of a document.
Figure: An example of a processing pipeline using Gensim
5. Gensim
Step0. documents
Prepare the original documents as a list:

# original documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"]
6. Gensim
Step1. Morphological analysis

def parse(doc):
    # for Japanese text, run morphological analysis here
    # remove stopwords
    stoplist = set('for a of the and to in'.split())
    text = [word for word in doc.lower().split() if word not in stoplist]
    return text

texts = [parse(doc) for doc in documents]
print(texts)
''' [
['human', 'machine', 'interface', ...],
['survey', 'user', 'opinion', ...],
...] '''