
MM text team
蔡捷恩
莊文立
溫鈺瑋
2015@Delta Research Center

Fully automatic F/T matrix
analysis from patent data
蔡捷恩

Function/Technology Matrix
Using keyword “ ”
“The Patent-Classification Technology/Function Matrix - A Systematic Method for Design Around”, Cheng et al. Mar-2013, CSIR

Problem reduction
• Detecting problem/solution pairs in a patent document
“Automatic Discovery of Technology Trends from Patent”, Y. Kim et al. 2009, ACM SAC

Problem term detection
• Step 1. Finding key frames
• Step 2. Feature extraction
– Unsupervised feature
– Supervised feature
• Step 3. Classifier training

Step 1. Key frame detection
• We define key frames to be “ ”

Step 2 – unsupervised feature (language model)
• The model:
Maximum likelihood estimation (MLE)
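The slide does not reproduce the model's formula, so the following is only a minimal sketch under an assumption: a unigram language model whose probabilities are MLE estimates, P(w) = count(w) / N, used to score a candidate key frame by its average log-probability. The function names, the toy corpus, and the floor value for unseen tokens are hypothetical.

```python
import math
from collections import Counter

def train_unigram_mle(sentences):
    """Estimate P(w) = count(w) / N by maximum likelihood."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def avg_log_prob(model, tokens, floor=1e-9):
    """Average log-probability of a token sequence.

    Unseen tokens get a small floor probability (an assumption,
    not a smoothing scheme taken from the slides).
    """
    if not tokens:
        return 0.0
    return sum(math.log(model.get(t, floor)) for t in tokens) / len(tokens)

# Toy corpus of tokenized key frames (hypothetical data).
corpus = [["reduce", "power", "consumption"], ["reduce", "noise"]]
model = train_unigram_mle(corpus)
```

The resulting score can serve as one real-valued unsupervised feature per key frame.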

Step 2 – supervised feature (linguistic model)
• Based on part-of-speech (POS) statistics over labeled patents

Step 2 – supervised feature (linguistic model)
• The model:
Delta function = 1 only when the current key frame matches the given pattern, and 0 otherwise
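The delta function above can be sketched as an indicator over POS-tag patterns. This is an illustration only: the pattern list and tag names are hypothetical, and in practice the patterns would come from the POS statistics on labeled patents.

```python
def delta_feature(frame_pos_tags, pattern):
    """Return 1 if the key frame's POS sequence equals the pattern, else 0."""
    return 1 if tuple(frame_pos_tags) == tuple(pattern) else 0

def pattern_features(frame_pos_tags, patterns):
    """One indicator per pattern, in a fixed order."""
    return [delta_feature(frame_pos_tags, p) for p in patterns]

# Hypothetical patterns, not the ones used in the paper.
PATTERNS = [("VB", "NN"), ("VB", "JJ", "NN"), ("NN", "NN")]

features = pattern_features(["VB", "JJ", "NN"], PATTERNS)
```

Each key frame then contributes a fixed-length 0/1 vector, one entry per pattern.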

Step 3. Classifier training
• Simply concatenate the features mentioned above => LIBSVM
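As a rough sketch of this step, the feature groups can be concatenated and written in LIBSVM's sparse text format (`<label> <index>:<value> ...`, with 1-based indices and zero entries omitted). The helper names and the sample feature values are hypothetical.

```python
def concat_features(*groups):
    """Concatenate several feature groups into one flat vector of floats."""
    out = []
    for group in groups:
        out.extend(float(x) for x in group)
    return out

def to_libsvm_line(label, features):
    """Format one example as a LIBSVM sparse-format line."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

# Hypothetical values: one language-model score plus three pattern indicators.
lm_score = [-1.5]
pattern_hits = [0, 1, 0]
line = to_libsvm_line(1, concat_features(lm_score, pattern_hits))
```

A file of such lines, one per key frame, is the kind of input LIBSVM's `svm-train` tool reads.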

Solution term detection
• Step 1. Key frame detection
• Step 2. Feature extraction
– Unsupervised feature
– Supervised feature: based on problem terms
• Step 3. Classifier training

Problems
• Lack of labeled data? => the linguistic model proposed in the paper seems general enough => adopt it directly, with Porter stemming

Further improvement
• Coreference resolution
– “the method solves the problem of overfitting.”
• Semantic-based clustering
– Okapi BM25: “The Probabilistic Relevance Framework: BM25 and Beyond”, Robertson et al., 2009
– Word vector: “Efficient Estimation of Word Representations in Vector Space”, T. Mikolov, ICLR, 2013
– Document vector: “Distributed Representations of Words and Phrases and their Compositionality”, NIPS, 2013
In my opinion: Okapi > word vector > document vector
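For reference, Okapi BM25 can be sketched in a few lines. This is a generic textbook-style version, not the team's implementation: the IDF variant with `+ 1.0` inside the logarithm, the defaults `k1 = 1.2` and `b = 0.75`, and the toy documents are all assumptions. It expects a non-empty list of tokenized documents.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical problem-term documents.
docs = [
    ["reduce", "power", "consumption"],
    ["improve", "heat", "dissipation"],
    ["reduce", "noise"],
]
scores = bm25_scores(["power", "consumption"], docs)
```

Pairwise BM25 scores between extracted terms or key frames could then feed a clustering step.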

Chinese domain terminology extraction
溫鈺瑋

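Examples
× 目前 此車 铣 設備 由 绮 發 機械 提供
✓ 目前 此 車铣 設備 由 绮發機械 提供
(“This turning-milling equipment is currently supplied by 绮發機械.”)
× L 固定板會有擺動過大疑慮
✓ L固定板會有擺動過大疑慮
(“There is a concern that the L fixing plate may swing too much.”)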

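Method
• Collocation
– Use mutual information (MI) to estimate how likely a character-character or word-character pair is to form a word, i.e. the internal cohesion of the word
– Example: c = “自然語言處理”, a = “自然語言處”, b = “然語言處理”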
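A minimal sketch of such a cohesion score, assuming a pointwise-MI form log(P(c) / (P(a) P(b))) with a = c without its last character and b = c without its first character, as in the example above. The slides do not give the exact formula, so the normalization by total character count is an assumption, and the toy corpus is hypothetical.

```python
import math
from collections import Counter

def substring_counts(corpus, max_len):
    """Count every substring of length 1..max_len in the corpus."""
    counts = Counter()
    for text in corpus:
        for i in range(len(text)):
            for n in range(1, max_len + 1):
                if i + n <= len(text):
                    counts[text[i:i + n]] += 1
    return counts

def cohesion(candidate, counts, total_chars):
    """PMI-style cohesion of a candidate term c with a = c[:-1], b = c[1:]."""
    c = counts[candidate]
    a = counts[candidate[:-1]]
    b = counts[candidate[1:]]
    if c == 0 or a == 0 or b == 0:
        return float("-inf")
    return math.log(c * total_chars / (a * b))

corpus = ["機械提供設備", "機械很重", "提供服務"]  # hypothetical toy corpus
counts = substring_counts(corpus, max_len=4)
total = sum(len(t) for t in corpus)
```

Here "機械" (always seen together) scores higher than "械提" (an accidental adjacency), which is the behaviour a term extractor wants.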

Method
• Adaptation
目前此車铣設備由绮發機械提供
b e b e s b e s s s b e b e   (CKIP, Stanford, jieba…)
Manual adjustment
b e s b e b e s b m m e b e
=> CRF-based DELTA word segmentor
Input : L 固定板會有擺動過大疑慮
Output : L固定板會有擺動過大疑慮
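The tag sequences above can be turned back into words with a small decoder. This sketch assumes the usual reading of the tags (b = begin, m = middle, e = end, s = single-character word), which the slide does not spell out; the function name is hypothetical.

```python
def decode_bmes(text, tags):
    """Group characters into words according to b/m/e/s tags."""
    if len(text) != len(tags):
        raise ValueError("need exactly one tag per character")
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("e", "s"):  # a word closes on 'e' or 's'
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without 'e' or 's'
        words.append(current)
    return words

sentence = "目前此車铣設備由绮發機械提供"
baseline = decode_bmes(sentence, "b e b e s b e s s s b e b e".split())
adjusted = decode_bmes(sentence, "b e s b e b e s b m m e b e".split())
```

The baseline tags split the company name 绮發機械 into three pieces, while the manually adjusted tags keep it as one word.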

Knowledge extraction from Delta data
莊文立

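Information Extraction
• Named Entity Recognition (NER)
– Identify and classify proper nouns
  • Companies, people, products, locations, etc.
• Relation Extraction (RE)
– Find the relations between named entities in text, for example
  • Competition
  • Cooperation
  • Customer
  • Upstream supplier
– Usually represented as (subject, relation, object) triples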

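Sales visit record:
“Regarding the pricing of project BV3418, the response from Manager Yao (姚經理) of 欣特協寶 was that General Manager Zhou (周總) considers Delta's (台達) price to be higher than that of the Siemens (西門子) 808 low-end NC controller.”
• NER
  • Siemens (西門子) / Organization
  • 欣特協寶 / Organization
  • Delta (台達) / Organization
  • Manager Yao (姚經理) / Person
  • General Manager Zhou (周總) / Person
• RE
#   Subject            Relation        Object
1   Delta (台達)       COMPETE_WITH    Siemens (西門子)
2   Delta (台達)       IS_VENDOR       欣特協寶
3   Siemens (西門子)   IS_VENDOR       欣特協寶
4   欣特協寶           SUBORDINATE     Manager Yao (姚經理)
5   欣特協寶           SUBORDINATE     General Manager Zhou (周總)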
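The (subject, relation, object) triples in the table map naturally onto a small record type. A minimal sketch follows; the class and helper names are hypothetical, and the entity strings are kept as they appear in the original visit record.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    subject: str
    relation: str
    obj: str

triples = [
    Relation("台達", "COMPETE_WITH", "西門子"),    # Delta vs. Siemens
    Relation("台達", "IS_VENDOR", "欣特協寶"),
    Relation("西門子", "IS_VENDOR", "欣特協寶"),
    Relation("欣特協寶", "SUBORDINATE", "姚經理"),  # Manager Yao
    Relation("欣特協寶", "SUBORDINATE", "周總"),    # General Manager Zhou
]

def relations_of(entity, triples):
    """All triples in which the entity appears as subject or object."""
    return [t for t in triples if entity in (t.subject, t.obj)]

about_customer = relations_of("欣特協寶", triples)
```

Storing triples this way makes it easy to query everything known about one company or person.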

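Named Entity Recognition
• Data processing
– Chinese requires good word segmentation results
– Manual labeling
• Model: Conditional Random Fields (CRF)
– Learn the usage patterns of proper nouns from the features of each token
  • The word itself and its POS
  • Context words and their POS
  • Syntactic parse tree
  • Collocations
  • Titles and surnames
  • Proper-noun database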
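Several of the features listed above can be sketched as a per-token feature dictionary, the form commonly fed to CRF toolkits. Everything here is illustrative: the title, surname, and gazetteer lists and the POS tags are hypothetical, and the parse-tree and collocation features are left out for brevity.

```python
TITLES = {"經理", "總"}           # hypothetical title list
SURNAMES = {"姚", "周"}           # hypothetical surname list
GAZETTEER = {"台達", "西門子"}     # hypothetical proper-noun database

def token_features(tokens, pos_tags, i):
    """Features for token i: the word itself, its context, and lexicon lookups."""
    word = tokens[i]
    last = len(tokens) - 1
    return {
        "word": word,
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "prev_pos": pos_tags[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < last else "<EOS>",
        "next_pos": pos_tags[i + 1] if i < last else "<EOS>",
        "starts_with_surname": word[:1] in SURNAMES,
        "ends_with_title": any(word.endswith(t) for t in TITLES),
        "in_gazetteer": word in GAZETTEER,
    }

tokens = ["欣特協寶", "姚經理", "給出", "回應"]
pos_tags = ["N", "N", "V", "N"]  # simplified, hypothetical tags
feats = token_features(tokens, pos_tags, 1)
```

For 姚經理 both the surname and the title features fire, which is the kind of evidence that points to a Person entity.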

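Relation Extraction
• Does it still require manual labeling?
• Deep Learning!
– Let the machine discover the most suitable representation by itself
• Recursive Neural Network
– “Climb” up along the syntactic parse tree
– Each word is represented by a matrix + a vector
  • The vector represents the word's own meaning
  • The matrix represents context information
– The vector output at the node where the two named entities meet is fed into a classifier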
[Figure: recursive neural network built over the parse tree, with the top output vector fed into a Classifier]
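One step of “climbing” the parse tree can be sketched as follows, assuming a matrix-vector style composition: for children (a, A) and (b, B), the parent vector is p = tanh(W [Ba ; Ab]) and the parent matrix is P = W_M [A ; B]. The slides do not give these equations, so this form is an assumption, and the weights below are hand-picked for illustration rather than trained.

```python
import math

def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product on plain Python lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def compose(left, right, W, W_M):
    """Combine two (vector, matrix) children into a parent (vector, matrix)."""
    (a, A), (b, B) = left, right
    stacked_vec = matvec(B, a) + matvec(A, b)            # [Ba ; Ab], length 2n
    p = [math.tanh(x) for x in matvec(W, stacked_vec)]   # W is n x 2n
    stacked_mat = A + B                                  # [A ; B], 2n x n
    P = matmul(W_M, stacked_mat)                         # W_M is n x 2n
    return p, P

# Toy example with n = 2.
I2 = [[1, 0], [0, 1]]
left = ([1, 0], I2)
right = ([0, 1], I2)
W = [[1, 0, 0, 0], [0, 0, 0, 1]]
W_M = [[1, 0, 0, 0], [0, 1, 0, 0]]
p, P = compose(left, right, W, W_M)
```

Applying `compose` bottom-up along the tree yields a vector at the node joining the two named entities, which is what the classifier would consume.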

Future work
• Cross-sentence
• Cross-document
• Cross-language


Multimedia-text team report_2015-07-31
