Documents as a Bag of Maximal Substrings: An Unsupervised Feature Extraction for Document Clustering
1. Documents as a Bag of Maximal Substrings: An Unsupervised Feature Extraction for Document Clustering - Tomonari MASADA, Nagasaki University [email_address]
2. Example - For “abracadabra”: “a” (5 occurrences) and “abra” (2) are maximal substrings, but “abr” (2) is not, since extending it to “abra” does not reduce its count. We consider only the maximal substrings appearing more than once.
3. Maximal Substrings (1/2) [Okanohara et al. 09] - A maximal substring is a substring whose number of occurrences decreases whenever even a single character is added to its head or tail. We use maximal substrings as “words,” because word extraction is not trivial for some languages.
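The definition above can be checked with a naive enumeration. This is only an illustrative sketch, not the linear-time algorithm of [Okanohara et al. 09]: it counts every substring and keeps those whose one-character extensions all occur strictly fewer times.

```python
from collections import Counter

def maximal_substrings(s, min_count=2):
    """Naively enumerate maximal substrings of s (O(n^3) space/time,
    for illustration only). A substring is maximal if adding any single
    character to its head or tail strictly decreases its occurrence count."""
    # Count every substring, with overlaps.
    counts = Counter(s[i:j] for i in range(len(s))
                     for j in range(i + 1, len(s) + 1))
    result = {}
    for sub, c in counts.items():
        if c < min_count:
            continue
        # Best occurrence count among all one-character extensions.
        left_max = max((counts[ch + sub] for ch in set(s)), default=0)
        right_max = max((counts[sub + ch] for ch in set(s)), default=0)
        if left_max < c and right_max < c:
            result[sub] = c
    return result

print(maximal_substrings("abracadabra"))  # → {'a': 5, 'abra': 2}
```

This reproduces the example on the slide: “abr” is filtered out because its extension “abra” occurs just as often.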
4. “Bag of Words” - Represent each document j as a vector (n_j1, n_j2, ..., n_jW) of word frequencies: the “Bag of Words” representation. Words are the elementary features of documents. However, for some languages word extraction is non-trivial and requires a supervised method (CRF, HMM).
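The vector representation can be sketched as follows, assuming tokens have already been extracted (the example documents are hypothetical):

```python
from collections import Counter

def bag_of_words(docs):
    """Represent each document as a count vector (n_j1, ..., n_jW)
    over a shared vocabulary of W features."""
    vocab = sorted(set(w for doc in docs for w in doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

docs = [["economy", "market", "economy"], ["sports", "market"]]
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['economy', 'market', 'sports']
print(vecs)   # [[2, 1, 0], [0, 1, 1]]
```

Token ordering is discarded; only the per-document counts survive, which is exactly what the clustering model later consumes.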
5. Maximal Substrings (2/2) - Efficient, unsupervised extraction [Okanohara et al. 09] based on a suffix array and the BWT, in time linear in the string length.
12. Our Aim Compare the effectiveness of maximal substrings with that of the words extracted by a supervised method in document clustering
13. Comparison Procedure - From the same document set, extract maximal substrings on one side and supervised words on the other; build document vectors from each feature set, then run document clustering on both and compare.
15. BWT (Burrows-Wheeler Transform) - Sort all suffixes of “abracadabra$” lexicographically ($, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$); the character preceding each sorted suffix gives the BWT: “ard$rcaaaabb”.
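The transform can be sketched directly from its definition via sorted rotations (with a unique sentinel, sorting rotations and sorting suffixes coincide); this is a quadratic toy version, not the linear-time construction used in practice:

```python
def bwt(s):
    """Burrows-Wheeler Transform: sort all rotations of s and take the
    last column. s must end with a unique sentinel such as '$'."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)

print(bwt("abracadabra$"))  # → ard$rcaaaabb
```

The BWT groups occurrences of repeated substrings into runs, which is what makes the maximal-substring enumeration efficient.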
17. Extracting Maximal Substrings - Concatenate the documents, separated by the special character ‘#’. Extract all maximal substrings (MS), then remove every MS containing a special character: ‘#’, white space, or punctuation.
18. Frequency-based Selection - n_L: lowest frequency; remove all features whose frequency is smaller than n_L. n_H: highest frequency; remove all features whose frequency is larger than n_H. Specify n_H as n_H = c_H × n_1, where n_1 is the frequency of the most frequent feature.
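The selection rule amounts to a simple band-pass filter on feature frequencies; the feature counts below are hypothetical:

```python
def select_features(freqs, n_low, c_high):
    """Keep features whose frequency n satisfies n_low <= n <= n_high,
    where n_high = c_high * n_1 and n_1 is the largest frequency."""
    n1 = max(freqs.values())
    n_high = c_high * n1
    return {f: n for f, n in freqs.items() if n_low <= n <= n_high}

freqs = {"a": 1000, "abra": 40, "abracad": 2, "xyz": 1}
print(select_features(freqs, n_low=2, c_high=0.1))
# → {'abra': 40, 'abracad': 2}
```

Very frequent features (near-stopwords) and singletons are both discarded, leaving the mid-frequency features that carry topical signal.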
19. Supervised Word Extraction - Korean: KLT, a dictionary-based morphological analyzer [Gang 09]; part-of-speech tagging is not required for our experiment. Chinese: a CRF-based word segmenter implemented by us, an L1-regularized linear CRF [Tsuruoka et al. 09]; SIGHAN Bakeoff 2005 [Tseng et al. 05] scores: 0.943 (AS), 0.941 (HK), 0.929 (PK), 0.960 (MSR).
20. Multinomial Mixtures - Multinomial distributions model documents as word frequency histograms, ignoring word token ordering. A mixture of multinomials uses one multinomial per document cluster.
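A minimal EM sketch for clustering with a mixture of multinomials; this is my own illustration, not the authors' implementation, and the smoothing constants and toy count matrix are assumptions:

```python
import numpy as np

def mixture_of_multinomials(X, K, n_iter=50, seed=0):
    """EM for a mixture of multinomials.
    X: (D, W) array of per-document feature counts.
    K: number of clusters. Returns soft assignments r of shape (D, K)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)                 # mixing weights
    phi = rng.dirichlet(np.ones(W), size=K)  # per-cluster feature distributions
    for _ in range(n_iter):
        # E-step: cluster posteriors, computed in the log domain for stability.
        log_r = np.log(pi) + X @ np.log(phi).T          # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters, with small smoothing priors.
        pi = (r.sum(axis=0) + 1e-3) / (D + K * 1e-3)
        phi = r.T @ X + 1e-2
        phi /= phi.sum(axis=1, keepdims=True)
    return r

# Toy data: docs 0-1 dominated by feature 0, docs 2-3 by feature 1.
X = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 1], [1, 5, 0]])
r = mixture_of_multinomials(X, K=2)
print(r.argmax(axis=1))  # docs 0-1 and docs 2-3 typically land in different clusters
```

Hard cluster labels are read off as the argmax of the soft assignments, matching the slide's "one multinomial per cluster" view.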
24. Document Sets - SEOUL (in Korean): Web Seoul Newspaper, Jan 1, 2008 to Sep 30, 2009; 52,730 docs; categories: Economy, Local issues, Politics, Sports. XINHUA (in Chinese): Xinhua Net, May 8, 2009 to Dec 17, 2009; 20,127 docs; categories: Economy, International, Politics.
30. Previous Works (1/2) - Unsupervised segmentation: [Poon et al. 09] exhaustive enumeration of segmentation patterns. [Mochihashi et al. 09] Bayesian nonparametrics (nested Pitman-Yor), but intricate implementation. [Okanohara et al. 09] maximal substrings. We adopt this approach!
31. Previous Works (2/2) - Document classification: [Okanohara et al. 09]. Document clustering: [Zhang et al. 06] uses a special subset of substrings; [Zhang et al. 04] gives no quantitative evaluation; [Li et al. 08] uses WordNet for feature selection; [Chumwatana et al. 10] uses a small document set.
32. Conclusions - Maximal substrings serve as elementary features of documents. Their extraction is unsupervised, with an efficient extraction algorithm, and yields acceptable performance in document clustering.
33. Future Work - Further improvement: document models customized for maximal substrings (e.g., “word” probability distributions), noisy feature removal, and dimensionality reduction.