This work presents an automatic term extraction approach for building a vocabulary that is constantly updated. The prepared dictionary is used for sentiment classification into three classes (positive, neutral, negative). The results of sentiment classification are described, and the accuracy of methods based on different weighting schemes is compared. The work also examines the computational complexity of generating representations for N dynamic documents depending on the weighting scheme used.
Automatic term extraction of dynamically updated text collections for sentiment classification into three classes
1. Automatic term extraction of dynamically
updated text collections for sentiment
classification into three classes
Yuliya Rubtsova
The A.P. Ershov Institute of Informatics Systems
(IIS)
6. Applied problems which can be solved
with sentiment classification
• study of consumer reviews of commercial products
for businesses;
• recommender systems;
• human-machine interface of a computer system
which is responsible for adapting the system's
behavior to the current emotional state of the
person;
7. Human Machine Interface of a computer system which
is responsible for adapting the system's behavior to the
current emotional state of the person
• psychological and medical diagnosis;
• safety control by analyzing the behavior of mass
gatherings;
• assistance in carrying out investigative measures.
8. Most common sentiment
analysis approaches
• Supervised machine learning
• Dictionaries and rules
• Combined method
9. Existing corpora
• Corpora of reviews which contain user ratings
• Each belongs to a single subject domain (movie reviews,
book reviews, gadget reviews)
• Corpora of news (few emotional texts)
10. Filtration
• Texts containing both positive and negative emotions;
• Uninformative tweets (fewer than 40 characters long);
• Copied texts and retweets.
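The three filters above can be sketched as follows. This is a minimal illustration, not the author's implementation: the slides do not say how mixed emotions, copies or retweets were detected, so the emoticon lists, the "RT " prefix check and the exact-match duplicate test are all assumptions, and `filter_tweets` is a hypothetical name.

```python
# Assumed sentiment markers; the slides do not list the ones actually used.
POSITIVE_MARKS = (":)", ":-)", ":D")
NEGATIVE_MARKS = (":(", ":-(")


def filter_tweets(tweets, min_length=40):
    """Return the tweets that pass the three filters listed on the slide."""
    seen = set()
    kept = []
    for text in tweets:
        stripped = text.strip()
        has_positive = any(mark in stripped for mark in POSITIVE_MARKS)
        has_negative = any(mark in stripped for mark in NEGATIVE_MARKS)
        if has_positive and has_negative:
            continue  # both positive and negative emotions
        if len(stripped) < min_length:
            continue  # uninformative: fewer than 40 characters
        if stripped.startswith("RT "):
            continue  # retweet (assumed to be marked with an "RT " prefix)
        key = " ".join(stripped.lower().split())
        if key in seen:
            continue  # copied text (assumed: exact match after normalization)
        seen.add(key)
        kept.append(stripped)
    return kept
```

Near-duplicate detection (for example, copies that differ only by a link) would need a fuzzier comparison than the exact match used here.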
11. Corpus of short texts consists of
114 991 – positive texts
111 923 – negative texts
107 990 – neutral texts
12. Corpus of short texts
Collection type   | Number of words | Number of unique words
Positive messages | 1 559 176       | 150 720
Negative messages | 1 445 517       | 191 677
Neutral messages  | 1 852 995       | 105 239
13. Distribution of unique terms depending on
the number of tweets
[Figure: number of unique terms plotted against the number of texts]
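A curve of unique terms against the number of texts can be computed in a single pass over the collection. The sketch below assumes lowercasing and whitespace tokenization, which the slides do not specify; `vocabulary_growth` is a hypothetical name.

```python
def vocabulary_growth(texts, step=1):
    """Return (number of texts seen, number of unique terms) pairs."""
    vocabulary = set()
    points = []
    for count, text in enumerate(texts, start=1):
        # Assumed tokenization: lowercase, split on whitespace.
        vocabulary.update(text.lower().split())
        if count % step == 0:
            points.append((count, len(vocabulary)))
    return points
```

Plotting the second element of each pair against the first gives the growth curve; `step` only controls how often a point is recorded.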
15. Most common approaches
used for N-gram extraction
• Manually, using a thesaurus.
• Term extraction, based on the significance of the term
for a collection
16. Data set characteristics
• The entire data set is known
• The entire data set is available
• The entire data set is static (cannot change during calculation)
When a new document is added, the document frequency of many terms
must be updated, and all previously generated
term weights need recalibration. For N documents in a data
stream, the computational complexity is O(N²).
17. Human speech is constantly
changing => there is a need to
update emotional dictionaries
18. Change in vocabulary and
topics discussed
Percentage of references to the Olympic theme across all posts
[Figure: bar chart for February and August; data labels in extraction order: 12%, 0.50%]
19. Change in vocabulary and
topics discussed
Percentage of references to the vacation theme across all posts
[Figure: bar chart for February and August; data labels in extraction order: 0.06%, 0.12%]
20. Change in vocabulary and
topics discussed
Percentage of posts using the term “Sebyashka” (Russian for “selfie”)
[Figure: bar chart for February and August; data labels in extraction order: 0.00%, 0.02%]
21. Filtration
• Punctuation: commas, colons, quotation marks
(exclamation marks, question marks and ellipses were
retained);
• References to significant personalities and events;
• Proper names;
• Numerals;
• All links were replaced with the word "Link" and were treated
as a single term;
• Runs of dots were replaced with an ellipsis.
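The mechanical parts of this preprocessing can be sketched with regular expressions. The patterns are assumptions chosen for illustration, not the author's rules, and `normalize` is a hypothetical name. Removing proper names and references to personalities and events is not covered, since the slides do not say how they were identified.

```python
import re

ELLIPSIS = "\u2026"

# Assumed patterns; the slides do not give the exact rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
DOTS_RE = re.compile(r"\.{2,}")
NUMERAL_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
REMOVED_PUNCT_RE = re.compile(r'[,:"«»“”]')


def normalize(text):
    """Apply the link, ellipsis, numeral and punctuation rules to one text."""
    text = URL_RE.sub("Link", text)         # links first, so "://" survives until here
    text = DOTS_RE.sub(ELLIPSIS, text)      # runs of dots become one ellipsis
    text = NUMERAL_RE.sub(" ", text)        # drop numerals
    text = REMOVED_PUNCT_RE.sub(" ", text)  # drop commas, colons, quotation marks
    return " ".join(text.split())           # collapse leftover whitespace
```

Exclamation marks and question marks are left untouched, in line with the slide. Links are replaced before punctuation is stripped so that the colon inside a URL does not split it.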
22. TF-ICF
C – the number of categories,
cf – the number of categories in which the weighted term is found
23. TF-IDF
tf – the frequency of term occurrence in the collection (positive or
negative tweets),
T – the total number of messages in the collections,
T(t_i) – the number of messages in the positive and negative
collections that contain the term
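The formulas themselves did not survive in this transcript. As an illustration only, the sketch below assumes the common forms tf · log(T / T(t_i)) for TF-IDF and tf · log(C / cf) for TF-ICF; the exact variants used in the work may differ, and the function and parameter names are hypothetical.

```python
import math


def tf_idf(tf, total_messages, messages_with_term):
    """Assumed form: tf * log(T / T(t_i))."""
    return tf * math.log(total_messages / messages_with_term)


def tf_icf(tf, total_categories, categories_with_term):
    """Assumed form: tf * log(C / cf)."""
    return tf * math.log(total_categories / categories_with_term)
```

Under these forms, a term found in all C categories gets a TF-ICF weight of zero. Because C is fixed, adding a message changes cf only when the term appears in a category for the first time, whereas T changes with every added message.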
25. Corpus of News texts consists of
46 339 – positive news
46 337 – negative news
46 340 – neutral news
26. ROMIP mixed collection consists of
Reviews of books, movies, or digital cameras from
blogs
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts
29. Experimental results in terms of F-measure
[Figure: bar chart of F-measure for TF-IDF and TF-ICF on the Short texts, News and ROMIP collections; data labels in extraction order: 95.66, 70.39, 54.58, 95.15, 59.68, 54.71]
30. The program module makes it possible to
• dynamically update the unigram dictionary and
recalculate the weights of terms depending on
their membership in the collection;
• take into account lexical changes in speech over time;
• investigate new terms entering the active
vocabulary.
show that when the document set size is small, the unique term count continues to climb up as the number of documents increases. However, this growth of the unique term count is reduced sharply as the number of documents becomes very large. This observation indicates that if the document collection is sufficiently large, we can expect to see very few new words by adding more documents.
References to significant personalities and events – the attitude towards them may vary over time, but a classifier trained on "old texts" will not be able to adapt quickly;