Automatic term extraction of dynamically 
updated text collections for sentiment 
classification into three classes 
Yuliya Rubtsova 
The A.P. Ershov Institute of Informatics Systems 
(IIS)
Applied problems which can be solved
with sentiment classification
• analysis of consumer reviews of commercial products for businesses;
• recommender systems;
• human-machine interfaces that adapt the system's behavior to the current emotional state of the person;
• psychological and medical diagnosis;
• safety control by analyzing the behavior of mass gatherings;
• assistance in carrying out investigative measures.
Most common sentiment analysis approaches
• Supervised machine learning
• Dictionaries and rules
• Combined method
Existing corpora
• Corpora of reviews that contain user ratings
• Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)
• News corpora (few emotional texts)
Filtration
The following were removed from the corpus:
• Texts containing both positive and negative emotions;
• Uninformative tweets (fewer than 40 characters long);
• Copied texts and retweets.
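
The three filtration rules can be sketched as a small filter. This is only an illustration under stated assumptions: the slides do not say how mixed emotions or retweets were detected, so the emoticon lists and the "RT " prefix check below are hypothetical stand-ins, and the function names are invented for the example.

```python
# Hypothetical emotion markers; the slides do not list the real ones.
POSITIVE_MARKERS = (":)", ":-)", ":D")
NEGATIVE_MARKERS = (":(", ":-(")

MIN_LENGTH = 40  # threshold taken from the slide


def keep_tweet(text, seen):
    """Return True if the tweet passes all three filtration rules."""
    stripped = text.strip()
    if len(stripped) < MIN_LENGTH:      # uninformative tweet
        return False
    if stripped.startswith("RT "):      # retweet (assumed convention)
        return False
    if stripped in seen:                # verbatim copy of an earlier text
        return False
    has_pos = any(m in stripped for m in POSITIVE_MARKERS)
    has_neg = any(m in stripped for m in NEGATIVE_MARKERS)
    if has_pos and has_neg:             # mixed emotions
        return False
    seen.add(stripped)
    return True


def filter_tweets(tweets):
    """Apply keep_tweet to every tweet, remembering texts already kept."""
    seen = set()
    return [t for t in tweets if keep_tweet(t, seen)]
```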
Corpus of short texts consists of 
114 991 – positive texts 
111 923 – negative texts 
107 990 – neutral texts
Corpus of short texts
Collection type      Number of words   Number of unique words
Positive messages    1 559 176         150 720
Negative messages    1 445 517         191 677
Neutral messages     1 852 995         105 239
Unique terms distribution depending on the number of tweets
[Figure: number of unique terms (y-axis, 0 to 400 000) versus number of texts (x-axis, 53 to 282 350)]
Uniformity of used collections 
Words frequency distribution
Most common approaches to N-gram extraction
• Manually, using a thesaurus.
• Term extraction based on the significance of the term for a collection.
Data set characteristics
• The entire data set is known
• The entire data set is available
• The entire data set is static (it cannot change during calculation)
When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalculation. For N documents in a data stream, the computational complexity is O(N^2).
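
One way to avoid recalculating every weight on every arrival is to store only raw counts and derive weights on demand. The sketch below is an assumption about how such bookkeeping could look, not the author's module; the class name `IncrementalDF` and the plain `log(N / df)` weighting are illustrative choices.

```python
import math
from collections import Counter


class IncrementalDF:
    """Keeps document frequencies up to date as documents stream in."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()  # term -> number of documents containing it

    def add(self, tokens):
        # Cost is proportional to the size of the new document only;
        # no previously stored weight has to be touched.
        self.n_docs += 1
        self.df.update(set(tokens))

    def idf(self, term):
        # The weight is computed lazily from the current counts.
        if self.df[term] == 0:
            return 0.0
        return math.log(self.n_docs / self.df[term])
```

Because weights are derived at query time, adding a document never triggers a pass over the earlier documents.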
Human speech is constantly 
changing => there is a need to 
update emotional dictionaries
Change in vocabulary and topics discussed
Percentage of references to the Olympic theme in all posts
February: 12%
August: 0.50%
Change in vocabulary and topics discussed
Percentage of references to the vacation theme in all posts
February: 0.06%
August: 0.12%
Change in vocabulary and topics discussed
Percentage of posts using the term “Sebyashka” (Russian for “selfie”)
February: 0.00%
August: 0.02%
Filtration
• Punctuation: commas, colons, and quotation marks were removed (exclamation marks, question marks, and ellipses were retained);
• References to significant personalities and events were removed;
• Proper names were removed;
• Numerals were removed;
• All links were replaced with the word "Link" and were taken into consideration as a whole;
• Runs of multiple dots were replaced with an ellipsis.
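
A minimal sketch of the mechanical part of this preprocessing, assuming regular expressions are an acceptable approximation. The patterns and the function name `normalize` are illustrative, not taken from the presentation; removal of proper names and of references to personalities and events is left out because it needs a lexicon or a named-entity tagger.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # links
DOTS_RE = re.compile(r"\.{2,}")        # two or more dots in a row
NUM_RE = re.compile(r"\b\d+\b")        # standalone numerals
PUNCT_RE = re.compile(r"[,:\"«»“”]")   # commas, colons, quotation marks


def normalize(text: str) -> str:
    """Apply the filtration rules that can be expressed as substitutions."""
    text = URL_RE.sub("Link", text)    # must run before colons are removed
    text = DOTS_RE.sub("…", text)      # many dots -> a single ellipsis
    text = NUM_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)     # "!", "?" and "…" are kept
    return " ".join(text.split())      # collapse leftover whitespace
```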
TF-ICF
C – the number of categories,
cf – the number of categories in which the weighted term is found
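
The formula itself did not survive in this transcript. The sketch below assumes one commonly used form of category-based weighting, tf · log(1 + C / cf); the exact expression on the slide may differ (for example, log(C / cf) without the added 1), so treat the function as illustrative only.

```python
import math


def tf_icf(tf: float, n_categories: int, cf: int) -> float:
    """Term frequency times inverse category frequency (assumed form).

    tf           -- frequency of the term in the collection
    n_categories -- C, the number of categories
    cf           -- number of categories in which the term is found
    """
    if cf <= 0:
        raise ValueError("cf must be positive for a term that occurs")
    return tf * math.log(1 + n_categories / cf)
```

With three categories, a term found in only one of them gets the factor log(4), while a term found in all three gets log(2), so class-specific terms are weighted higher.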
TF-IDF
tf – the frequency of term occurrence in the collection (positive or negative tweets),
T – the total number of messages in the collections,
T(t_i) – the number of messages in the positive and negative collections that contain the term
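
The slide's formula is missing from this transcript. Using the symbols defined above, the sketch assumes the textbook form tf · log(T / T(t_i)); this is an assumption, and the author's exact variant (logarithm base, smoothing) is not known from the slides.

```python
import math


def tf_idf(tf: float, total_messages: int, messages_with_term: int) -> float:
    """Textbook TF-IDF weight (assumed form).

    tf                 -- frequency of the term in the collection
    total_messages     -- T, total number of messages in the collections
    messages_with_term -- T(t_i), number of messages containing the term
    """
    if messages_with_term <= 0:
        raise ValueError("the term must occur in at least one message")
    return tf * math.log(total_messages / messages_with_term)
```

A term that appears in every message gets weight 0, which is why rare, class-specific terms dominate the resulting dictionary.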
Experiments
Corpus of News texts consists of 
46 339 – positive news 
46 337 – negative news 
46 340 – neutral news
ROMIP mixed collection consists of
reviews of books, movies, and digital cameras from blogs
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts
Short text collection
             TF-IDF        TF-ICF
Accuracy, %  95.5981       95.0664
Precision    0.958092631   0.953112184
Recall       0.955204837   0.94984672
F-Measure    0.956646554   0.95147665

News collection
             TF-IDF        TF-ICF
Accuracy, %  69.8619       58.1397
Precision    0.709246342   0.61278022
Recall       0.698624505   0.581402868
F-Measure    0.703895355   0.596679322

ROMIP collection
             TF-IDF        TF-ICF
Accuracy, %  53.9773       57.9545
Precision    0.561341047   0.558902611
Recall       0.5311636     0.535790598
F-Measure    0.545835539   0.547102625
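
The reported F-measure values are consistent with the harmonic mean of precision and recall (F1). The short check below recomputes it from the TF-IDF columns; the function name is illustrative, and the assumption is that no other weighting of precision and recall was used.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for TF-IDF, copied from the tables above.
tf_idf_results = {
    "Short texts": (0.958092631, 0.955204837),
    "News": (0.709246342, 0.698624505),
    "ROMIP": (0.561341047, 0.5311636),
}

for name, (p, r) in tf_idf_results.items():
    print(f"{name}: F = {f_measure(p, r):.4f}")
```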
Results
Experimental results in terms of F-measure (%)
          Short texts   News    ROMIP
TF-IDF    95.66         70.39   54.58
TF-ICF    95.15         59.68   54.71
The program module makes it possible to
• dynamically update the unigram dictionary and recalculate term weights depending on which collection the terms belong to;
• take lexical changes in speech over time into account;
• investigate new terms entering the active vocabulary.
Thank you! 
Presentation: http://www.slideshare.net/mokoron 
Yuliya Rubtsova 
yu.rubtsova@gmail.com 
study.mokoron.com


Editor's Notes

  1. When the document set is small, the unique term count keeps climbing as the number of documents increases. However, this growth slows sharply as the number of documents becomes very large. This observation indicates that if the document collection is sufficiently large, adding more documents can be expected to bring in very few new words.
  2. References to significant personalities and events: the attitude towards them may vary over time, but a classifier trained on "old texts" will not be able to adapt quickly.