Documents as a Bag of Maximal Substrings: An Unsupervised Feature Extraction for Document Clustering
1. Documents as a Bag of Maximal Substrings: An Unsupervised Feature Extraction for Document Clustering - Tomonari MASADA, Nagasaki University [email_address]
2. Example - For “abracadabra”: “a” (5 occurrences) and “abra” (2) are maximal substrings, but “abr” (2) is not, since extending it to “abra” does not reduce its count. We consider only the maximal substrings appearing more than once.
3. Maximal Substrings (1/2) [Okanohara et al. 09] - A maximal substring is a substring whose number of occurrences decreases whenever even a single character is added to its head or tail. We use maximal substrings as “words,” because word extraction is not trivial for some languages.
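The definition above can be checked with a naive enumeration. This is only an illustrative sketch, not the linear-time algorithm of [Okanohara et al. 09]: it counts every substring and keeps those whose one-character extensions all occur strictly fewer times.

```python
from collections import Counter

def maximal_substrings(s, min_count=2):
    """Naively enumerate maximal substrings of s (O(n^3) space/time,
    for illustration only). A substring is maximal if adding any single
    character to its head or tail strictly decreases its occurrence count."""
    # Count every substring, with overlaps.
    counts = Counter(s[i:j] for i in range(len(s))
                     for j in range(i + 1, len(s) + 1))
    result = {}
    for sub, c in counts.items():
        if c < min_count:
            continue
        # Best occurrence count among all one-character extensions.
        left_max = max((counts[ch + sub] for ch in set(s)), default=0)
        right_max = max((counts[sub + ch] for ch in set(s)), default=0)
        if left_max < c and right_max < c:
            result[sub] = c
    return result

print(maximal_substrings("abracadabra"))  # → {'a': 5, 'abra': 2}
```

This reproduces the example on the slide: “abr” is filtered out because its extension “abra” occurs just as often.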
4. “Bag of Words” - Represent each document j as a vector (n_j1, n_j2, ..., n_jW) of word frequencies: the “Bag of Words” representation. Words are the elementary features of documents. However, for some languages word extraction is non-trivial and requires a supervised method (CRF, HMM).
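The vector representation can be sketched as follows, assuming tokens have already been extracted (the example documents are hypothetical):

```python
from collections import Counter

def bag_of_words(docs):
    """Represent each document as a count vector (n_j1, ..., n_jW)
    over a shared vocabulary of W features."""
    vocab = sorted(set(w for doc in docs for w in doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

docs = [["economy", "market", "economy"], ["sports", "market"]]
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['economy', 'market', 'sports']
print(vecs)   # [[2, 1, 0], [0, 1, 1]]
```

Token ordering is discarded; only the per-document counts survive, which is exactly what the clustering model later consumes.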
5. Maximal Substrings (2/2) - Efficient, unsupervised extraction [Okanohara et al. 09] based on a suffix array and the BWT, in time linear in the string length.
12. Our Aim Compare the effectiveness of maximal substrings with that of the words extracted by a supervised method in document clustering
13. Comparison Procedure - From the same document set, extract maximal substrings on one side and supervised words on the other; build document vectors from each feature set, then run document clustering on both and compare.
15. BWT (Burrows-Wheeler Transform) - Sort all suffixes of “abracadabra$” lexicographically ($, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$); the character preceding each sorted suffix gives the BWT: “ard$rcaaaabb”.
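The transform can be sketched directly from its definition via sorted rotations (with a unique sentinel, sorting rotations and sorting suffixes coincide); this is a quadratic toy version, not the linear-time construction used in practice:

```python
def bwt(s):
    """Burrows-Wheeler Transform: sort all rotations of s and take the
    last column. s must end with a unique sentinel such as '$'."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)

print(bwt("abracadabra$"))  # → ard$rcaaaabb
```

The BWT groups occurrences of repeated substrings into runs, which is what makes the maximal-substring enumeration efficient.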
17. Extracting Maximal Substrings - Concatenate the documents, separated by the special character ‘#’. Extract all maximal substrings (MS), then remove every MS containing a special character: ‘#’, white space, or punctuation.
18. Frequency-based Selection - n_L: lowest frequency; remove all features whose frequency is smaller than n_L. n_H: highest frequency; remove all features whose frequency is larger than n_H. Specify n_H as n_H = c_H × n_1, where n_1 is the frequency of the most frequent feature.
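The selection rule amounts to a simple band-pass filter on feature frequencies; the feature counts below are hypothetical:

```python
def select_features(freqs, n_low, c_high):
    """Keep features whose frequency n satisfies n_low <= n <= n_high,
    where n_high = c_high * n_1 and n_1 is the largest frequency."""
    n1 = max(freqs.values())
    n_high = c_high * n1
    return {f: n for f, n in freqs.items() if n_low <= n <= n_high}

freqs = {"a": 1000, "abra": 40, "abracad": 2, "xyz": 1}
print(select_features(freqs, n_low=2, c_high=0.1))
# → {'abra': 40, 'abracad': 2}
```

Very frequent features (near-stopwords) and singletons are both discarded, leaving the mid-frequency features that carry topical signal.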
19. Supervised Word Extraction - Korean: KLT, a dictionary-based morphological analyzer [Gang 09]; part-of-speech tagging is not required for our experiment. Chinese: a CRF-based word segmenter implemented by us, an L1-regularized linear CRF [Tsuruoka et al. 09]; SIGHAN Bakeoff 2005 [Tseng et al. 05] scores: 0.943 (AS), 0.941 (HK), 0.929 (PK), 0.960 (MSR).
20. Multinomial Mixtures - Multinomial distributions model documents as word frequency histograms, ignoring word token ordering. A mixture of multinomials uses one multinomial per document cluster.
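A minimal EM sketch for clustering with a mixture of multinomials; this is my own illustration, not the authors' implementation, and the smoothing constants and toy count matrix are assumptions:

```python
import numpy as np

def mixture_of_multinomials(X, K, n_iter=50, seed=0):
    """EM for a mixture of multinomials.
    X: (D, W) array of per-document feature counts.
    K: number of clusters. Returns soft assignments r of shape (D, K)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    pi = np.full(K, 1.0 / K)                 # mixing weights
    phi = rng.dirichlet(np.ones(W), size=K)  # per-cluster feature distributions
    for _ in range(n_iter):
        # E-step: cluster posteriors, computed in the log domain for stability.
        log_r = np.log(pi) + X @ np.log(phi).T          # (D, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters, with small smoothing priors.
        pi = (r.sum(axis=0) + 1e-3) / (D + K * 1e-3)
        phi = r.T @ X + 1e-2
        phi /= phi.sum(axis=1, keepdims=True)
    return r

# Toy data: docs 0-1 dominated by feature 0, docs 2-3 by feature 1.
X = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 1], [1, 5, 0]])
r = mixture_of_multinomials(X, K=2)
print(r.argmax(axis=1))  # docs 0-1 and docs 2-3 typically land in different clusters
```

Hard cluster labels are read off as the argmax of the soft assignments, matching the slide's "one multinomial per cluster" view.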
24. Document Sets - SEOUL (in Korean): Web Seoul Newspaper, Jan 1, 2008 to Sep 30, 2009; 52,730 docs; categories: Economy, Local issues, Politics, Sports. XINHUA (in Chinese): Xinhua Net, May 8, 2009 to Dec 17, 2009; 20,127 docs; categories: Economy, International, Politics.
30. Previous Works (1/2) - Unsupervised segmentation: [Poon et al. 09] exhaustive enumeration of segmentation patterns. [Mochihashi et al. 09] Bayesian nonparametrics (nested Pitman-Yor), but intricate implementation. [Okanohara et al. 09] maximal substrings. We adopt this approach!
31. Previous Works (2/2) - Document classification: [Okanohara et al. 09]. Document clustering: [Zhang et al. 06] uses a special subset of substrings; [Zhang et al. 04] gives no quantitative evaluation; [Li et al. 08] uses WordNet for feature selection; [Chumwatana et al. 10] uses a small document set.
32. Conclusions - Maximal substrings serve as elementary features of documents. Their extraction is unsupervised, with an efficient extraction algorithm, and yields acceptable performance in document clustering.
33. Future Work - Further improvement: document models customized for maximal substrings (e.g., “word” probability distributions), noisy feature removal, and dimensionality reduction.