Automatic term extraction of dynamically 
updated text collections for sentiment 
classification into three classes 
Yuliya Rubtsova 
The A.P. Ershov Institute of Informatics Systems 
(IIS)
Applied problems which can be solved with sentiment classification
- analysis of consumer reviews of commercial products for businesses;
- recommender systems;
- human-machine interfaces that adapt a computer system's behavior to the current emotional state of the person;
- psychological and medical diagnosis;
- safety control by analyzing the behavior of mass gatherings;
- assistance in carrying out investigative measures.
Most common sentiment analysis approaches
- Supervised machine learning
- Dictionaries and rules
- Combined method
Existing corpora
- Corpora of reviews that carry user ratings
- Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)
- News corpora (few emotional texts)
Filtration
- Texts containing both positive and negative emotions;
- Uninformative tweets (fewer than 40 characters long);
- Copied texts and retweets.
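
The three filtering rules above can be sketched as follows. This is a minimal illustration, not the author's implementation: the emoticon sets, the "RT" retweet convention, and all function names are assumptions, since the slides do not say how emotions or retweets were detected.

```python
import re

# Hypothetical emoticon sets; the slides do not say how emotions were detected.
POSITIVE_MARKERS = (":)", ":-)", ":D", "=)")
NEGATIVE_MARKERS = (":(", ":-(", "=(")
MIN_LENGTH = 40  # threshold from the slide: shorter tweets count as uninformative


def has_mixed_emotions(text):
    """True if the text carries both a positive and a negative marker."""
    positive = any(marker in text for marker in POSITIVE_MARKERS)
    negative = any(marker in text for marker in NEGATIVE_MARKERS)
    return positive and negative


def is_retweet(text):
    """Assumed convention: retweets start with 'RT' followed by whitespace or '@'."""
    return re.match(r"^RT[\s@]", text) is not None


def filter_tweets(tweets):
    """Drop short, retweeted, mixed-emotion, and copied texts; keep input order."""
    seen = set()
    kept = []
    for text in tweets:
        if len(text) < MIN_LENGTH:
            continue
        if is_retweet(text) or has_mixed_emotions(text):
            continue
        # Copies are detected after collapsing whitespace and case (an assumption).
        normalized = " ".join(text.split()).lower()
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(text)
    return kept
```

Calling `filter_tweets` on a list of raw tweet strings returns only the texts that pass all three rules, in their original order.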
Corpus of short texts consists of
114 991 positive texts
111 923 negative texts
107 990 neutral texts
Corpus of short texts
Collection type      Number of words   Number of unique words
Positive messages    1 559 176         150 720
Negative messages    1 445 517         191 677
Neutral messages     1 852 995         105 239
Unique terms distribution depending on the number of tweets
[Figure: number of unique terms plotted against number of texts]
Uniformity of the collections used
Word frequency distribution
Most common approaches to N-gram extraction
- Manual extraction, using a thesaurus.
- Term extraction based on the significance of the term for a collection
Data set characteristics
- The entire data set is known
- The entire data set is available
- The entire data set is static (it cannot change during calculation)
When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N^2).
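
The quadratic cost follows from simple counting. The sketch below assumes that every arrival triggers one full pass over all documents seen so far; the function name is hypothetical and the code only counts document visits rather than computing any weights.

```python
def documents_touched_by_full_recalculation(n_documents):
    """Count document visits when all term weights are recomputed from scratch
    after every new document arrives (assumption: one pass over the whole
    collection per arrival)."""
    visits = 0
    for collection_size in range(1, n_documents + 1):
        visits += collection_size  # re-scan everything seen so far
    return visits


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, documents_touched_by_full_recalculation(n))
```

The total is 1 + 2 + ... + N = N(N + 1)/2 visits, which grows as N^2.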
Human speech is constantly 
changing => there is a need to 
update emotional dictionaries
Change in vocabulary and topics discussed
Share of all posts, February versus August:
Topic or term                                         February   August
References to the Olympic theme                       12%        0.50%
References to the vacation theme                      0.06%      0.12%
Use of the term "Sebyashka" ("selfie" in Russian)     0.00%      0.02%
Filtration
- Punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);
- References to significant personalities and events;
- Proper names;
- Numerals;
- All links were replaced with the word "Link" and were treated as a single token;
- Runs of dots were replaced with an ellipsis.
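
The purely mechanical steps of this filtration can be sketched with regular expressions. This is an illustration under assumptions: the patterns, the exact set of quotation marks, and the function name are hypothetical, and removing proper names or references to personalities and events would need a name list or tagger that is not shown here.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
DOT_RUN_PATTERN = re.compile(r"\.{2,}")
NUMERAL_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")
# Commas, colons and several quotation marks (the exact set is an assumption).
REMOVED_PUNCTUATION = re.compile(r'[,:"«»“”„]')
ELLIPSIS = "\u2026"


def normalize(text):
    """Apply the link, dot-run, numeral and punctuation rules, in that order."""
    text = URL_PATTERN.sub("Link", text)        # links become the token "Link"
    text = DOT_RUN_PATTERN.sub(ELLIPSIS, text)  # "...." becomes a single ellipsis
    text = NUMERAL_PATTERN.sub(" ", text)       # numerals are dropped
    text = REMOVED_PUNCTUATION.sub(" ", text)   # "!" and "?" are left untouched
    return " ".join(text.split())               # collapse leftover whitespace
```

Links are replaced first so that the colons, dots and digits inside a URL are not processed by the later rules.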
TF-ICF
C is the number of categories,
cf is the number of categories in which the weighted term is found
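
The formula itself did not survive on this slide. As an assumption, and not a reconstruction of the author's exact expression, a commonly used form of inverse category frequency mirrors IDF, with C and cf as defined above and tf the frequency of the term in the category:

```latex
\mathrm{TF\text{-}ICF}(t_i) = tf(t_i) \cdot \log\frac{C}{cf(t_i)}
```

Under this assumed form with C = 3, a term found in only one category is scaled by log 3, while a term found in all three categories is scaled by log 1 = 0 and so carries no weight.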
TF-IDF
tf is the frequency of term occurrence in the collection (positive or negative tweets),
T is the total number of messages in the collections,
T(t_i) is the number of messages in the positive and negative collections that contain the term
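
The formula image is missing from this slide, so the sketch below assumes the standard product tf * log(T / T(t_i)) built from the three symbols defined above. The function and variable names are hypothetical, and messages are assumed to be already tokenized.

```python
import math
from collections import Counter


def tf_idf_weights(collection, all_collections):
    """Weight each term of one collection (e.g. the positive tweets).

    collection      -- list of tokenized messages from one class
    all_collections -- list of such lists, one per class
    Assumed form: weight = tf * log(T / T(t_i)).
    """
    # tf: how often the term occurs in this collection
    tf = Counter(token for message in collection for token in message)

    # T: total number of messages across the collections
    messages = [message for coll in all_collections for message in coll]
    total_messages = len(messages)

    # T(t_i): number of messages that contain the term at least once
    message_frequency = Counter()
    for message in messages:
        message_frequency.update(set(message))

    return {
        term: count * math.log(total_messages / message_frequency[term])
        for term, count in tf.items()
    }
```

A term that occurs in every message gets weight zero under this form, since log(T / T) = 0.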
Experiments
Corpus of news texts consists of
46 339 positive news texts
46 337 negative news texts
46 340 neutral news texts
ROMIP mixed collection consists of
Reviews of books, movies, or digital cameras from blogs
543 positive blog texts
236 negative blog texts
103 neutral blog texts
Short text collection
               TF-IDF        TF-ICF
Accuracy (%)   95.5981       95.0664
Precision      0.958092631   0.953112184
Recall         0.955204837   0.94984672
F-Measure      0.956646554   0.95147665
News collection
               TF-IDF        TF-ICF
Accuracy (%)   69.8619       58.1397
Precision      0.709246342   0.61278022
Recall         0.698624505   0.581402868
F-Measure      0.703895355   0.596679322
ROMIP collection
               TF-IDF        TF-ICF
Accuracy (%)   53.9773       57.9545
Precision      0.561341047   0.558902611
Recall         0.5311636     0.535790598
F-Measure      0.545835539   0.547102625
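
The reported F-measures are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below recomputes it from the TF-IDF precision and recall values in the tables; the function name and the dictionary layout are illustrative only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) for TF-IDF, copied from the tables.
reported = {
    "short texts": (0.958092631, 0.955204837, 0.956646554),
    "news": (0.709246342, 0.698624505, 0.703895355),
    "ROMIP": (0.561341047, 0.5311636, 0.545835539),
}

for name, (precision, recall, reported_f) in reported.items():
    print(name, round(f_measure(precision, recall), 6), reported_f)
```

For all three collections the recomputed value agrees with the reported one to about six decimal places.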
Results
Experimental results in terms of F-measure (%)
          Short texts   News    ROMIP
TF-IDF    95.66         70.39   54.58
TF-ICF    95.15         59.68   54.71
The program module makes it possible to
- dynamically update the unigram dictionary and recalculate term weights depending on which collection a term belongs to;
- take into account lexical changes in speech over time;
- investigate new terms entering the active vocabulary.
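
A dictionary of this kind can be sketched as a set of per-category counters that are updated one message at a time. This is not the author's module: the class and method names are hypothetical, and the weight uses an assumed TF-ICF form, tf * log(C / cf), computed on demand so that nothing has to be recalibrated when a message arrives.

```python
import math
from collections import Counter


class DynamicUnigramDictionary:
    """Per-category unigram counts that can be updated one message at a time."""

    def __init__(self, categories):
        self.categories = tuple(categories)
        self.term_counts = {category: Counter() for category in self.categories}

    def category_frequency(self, term):
        """cf: number of categories in which the term occurs."""
        return sum(1 for counts in self.term_counts.values() if term in counts)

    def add_message(self, category, tokens):
        """Fold a new message into its category; return terms never seen before."""
        new_terms = [t for t in set(tokens) if self.category_frequency(t) == 0]
        self.term_counts[category].update(tokens)
        return sorted(new_terms)

    def weight(self, term, category):
        """Assumed TF-ICF form: tf * log(C / cf), computed on demand."""
        tf = self.term_counts[category][term]
        cf = self.category_frequency(term)
        if tf == 0 or cf == 0:
            return 0.0
        return tf * math.log(len(self.categories) / cf)
```

The list returned by `add_message` gives a simple way to watch for new terms entering the vocabulary, and the weight of a term changes automatically once it starts appearing in other categories.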
Thank you! 
Presentation: http://www.slideshare.net/mokoron 
Yuliya Rubtsova 
yu.rubtsova@gmail.com 
study.mokoron.com



Editor's Notes

  • #14 (unique terms distribution): The plot shows that when the document set is small, the unique term count keeps climbing as the number of documents increases. However, the growth of the unique term count slows sharply as the number of documents becomes very large. This observation indicates that if the document collection is sufficiently large, we can expect to see very few new words by adding more documents.
  • #22 (filtration): References to significant personalities and events: the attitude towards them may vary over time, but a classifier trained on "old texts" will not be able to adapt quickly.