This presentation covers sentiment analysis with natural language processing and machine learning. It introduces the task and its applications to texts such as reviews, comments, and surveys, then outlines challenges such as sarcasm and complex expressions of opinion. It proposes clustering similar short texts together to reduce feature sparsity, training a naive Bayes classifier on the clustered texts, and classifying new texts with a two-stage-merging method. On the Stanford Twitter Sentiment Corpus, the reported average precision and accuracy are 1.7% and 1.3% higher than the baseline.
2. • "unbelievably disappointing"
• "Full of zany characters and richly applied satire, and some great plot twists"
• "this is the greatest screwball comedy ever filmed"
• "It was pathetic. The worst part about it was the boxing scenes."
• Sentiment Analysis
  • Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit
  • Sometimes called opinion mining, although the emphasis in this case is on extraction
  • Other names: opinion extraction, sentiment mining, subjectivity analysis
4. • Movies: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
5. • People express opinions in complex ways
• In opinion texts, lexical content alone can be misleading
• Intra-textual and sub-sentential reversals, negation, and topic changes are common
• Rhetorical devices and modes such as sarcasm, irony, and implication are frequent
6. • Tokenization
• Feature extraction: n-grams, semantic features, syntactic features, etc.
• Classification using different classifiers
  • Naïve Bayes
  • MaxEnt
  • SVM
• Drawbacks
  • Sparsity: a short text activates only a few dimensions of the feature vector
    S1: I really like this movie
    [... 0 0 1 1 1 1 1 0 0 ...]
  • Context independence: a word maps to the same feature whatever its sense ("good" below)
    S1: This phone has a good keypad
    S2: He will move and leave her for good
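The two drawbacks can be sketched in a few lines. This is a minimal illustration, assuming binary bag-of-unigram features and whitespace tokenization; `binary_bow` is a hypothetical helper introduced here, not something taken from the slides.

```python
def binary_bow(text, vocab):
    """Binary bag-of-unigrams: 1 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

s1 = "This phone has a good keypad"
s2 = "He will move and leave her for good"

# Vocabulary built from just these two sentences. A real corpus has
# thousands of entries, which makes each short-text vector far sparser.
vocab = sorted(set(s1.lower().split()) | set(s2.lower().split()))
v1 = binary_bow(s1, vocab)
v2 = binary_bow(s2, vocab)

# Context independence: "good" maps to the same feature in both sentences,
# although it is an adjective in s1 and part of the idiom "for good" in s2.
good = vocab.index("good")
print(v1[good], v2[good])

# Sparsity: fraction of non-zero features in each sentence vector.
print(sum(v1) / len(vocab), sum(v2) / len(vocab))
```

Even with this tiny vocabulary, roughly half of each vector is zero, and the classifier has no way to tell the two uses of "good" apart.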
7. • Use a clustering algorithm to aggregate short texts into long clusters, in which each cluster shares the same topic and the same sentiment polarity, to reduce the sparsity of the short-text representation while keeping it interpretable.
S1: it works perfectly! Love this product
S2: very pleased! Super easy to, I love it
S3: I recommend it
Vocabulary: it works perfectly love this product very pleased super easy to I recommend
S1: [1 1 1 1 1 1 0 0 0 0 0 0 0]
S2: [1 0 0 1 0 0 1 1 1 1 1 1 0]
S3: [1 0 0 0 0 0 0 0 0 0 0 1 1]
S1+S2+S3: [... 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 ...]
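The aggregation on this slide can be reproduced with a short sketch. It assumes binary features over the 13-word vocabulary listed on the slide and models the cluster vector as the element-wise OR of its members; that aggregation rule is an assumption made for illustration, since the slides do not spell it out.

```python
import re

# Vocabulary in the order given on the slide (lowercased).
VOCAB = "it works perfectly love this product very pleased super easy to i recommend".split()

def binary_bow(text):
    """Binary bag-of-unigrams over VOCAB, ignoring case and punctuation."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in tokens else 0 for word in VOCAB]

def density(vector):
    """Fraction of non-zero entries."""
    return sum(vector) / len(vector)

texts = [
    "it works perfectly! Love this product",
    "very pleased! Super easy to, I love it",  # contains "it", so entry 0 is 1
    "I recommend it",
]
vectors = [binary_bow(t) for t in texts]

# Assumed aggregation: a cluster feature is on if any member has it.
cluster = [max(column) for column in zip(*vectors)]

for name, vec in zip(["S1", "S2", "S3"], vectors):
    print(name, vec, round(density(vec), 2))
print("S1+S2+S3", cluster, density(cluster))
```

Each single text fills fewer than two thirds of the vocabulary slots, while the merged cluster fills all of them, which is the sparsity reduction the slide describes.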
8. • Training data labeled with positive and negative polarity
• The K-means clustering algorithm is used to cluster positive and negative texts separately
• K-means, KNN, LDA, …
Labeled training texts:
works perfectly! Love this product
completely useless, return policy
very pleased! Super easy to, I am pleased
was very poor, it has failed
highly recommend it, high recommended!
it totally unacceptable, is so bad
Topical clusters:
Positive:
works perfectly! Love this product
very pleased! Super easy to, I am pleased
highly recommend it, high recommended!
Negative:
completely useless, return policy
was very poor, it has failed
it totally unacceptable, is so bad
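A minimal sketch of clustering each polarity separately is shown below. It uses a plain Lloyd-style K-means on word-count vectors, written in pure Python so that it is self-contained. The short texts, the helper names, the choice of k = 2, and seeding the centroids with the first k texts are all assumptions made for illustration, not details from the slides.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def vectorize(texts):
    """Word-count vectors over the vocabulary of the given texts."""
    vocab = sorted({w for t in texts for w in tokenize(t)})
    vectors = []
    for t in texts:
        counts = Counter(tokenize(t))
        vectors.append([counts[w] for w in vocab])
    return vectors

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=10):
    """Lloyd's algorithm. The first k vectors seed the centroids, a
    deterministic choice made only to keep this sketch reproducible."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def cluster_texts(texts, k):
    """Cluster short texts and concatenate each cluster into one long text."""
    assignment = kmeans(vectorize(texts), k)
    clusters = [[] for _ in range(k)]
    for text, a in zip(texts, assignment):
        clusters[a].append(text)
    return [" ".join(c) for c in clusters if c]

# Hypothetical labeled short texts, clustered separately per polarity.
positive = ["it works perfectly", "very pleased",
            "works perfectly love this product", "super easy very pleased"]
negative = ["completely useless", "was very poor",
            "useless return policy", "very poor it has failed"]

pos_clusters = cluster_texts(positive, k=2)
neg_clusters = cluster_texts(negative, k=2)
print(pos_clusters)
print(neg_clusters)
```

Because the two polarities are never mixed, every resulting cluster inherits a single sentiment label, and the concatenated cluster text gives the classifier a denser training example.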
9. • Topical consistency: the texts in each cluster share a similar topic
• Reduced sparsity: the representation of a topical cluster is denser than that of a single text
• The idea is easy to apply to other areas
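10. Classifier: Multinomial Naive Bayes
Probabilistic classifier: estimates the probability of a label given a clustered text

$$
\hat{s} = \arg\max_{s \in S} P(s \mid C_i) = \arg\max_{s \in S} P(s) \prod_{1 \le j \le N_{C_i}} P(C_{i,j} \mid s)
$$

(by Bayes' theorem and the independence assumption)

$$
P(s) = \frac{N_s}{N}
$$

$$
P(C_{i,j} \mid s) = \frac{N(C_{i,j}, s) + 1}{\sum_{x \in V} N(x, s) + |V|}
$$

where $C_{i,j}$ is the $j$-th word of clustered text $C_i$, $N_{C_i}$ is its length, $V$ is the vocabulary, and $N(x, s)$ is the count of word $x$ in texts labeled $s$.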
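The classifier above can be sketched in pure Python. This is an illustrative implementation of multinomial naive Bayes with add-one smoothing under stated assumptions: the training clusters are taken from the examples on the clustering slide (lowercased, punctuation removed), the function names are hypothetical, and words outside the vocabulary are simply skipped at prediction time.

```python
import math
from collections import Counter

def train_mnb(clusters):
    """clusters: list of (tokens, label). Returns priors, word counts, vocabulary."""
    label_counts = Counter(label for _, label in clusters)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in clusters:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {s: n / len(clusters) for s, n in label_counts.items()}  # P(s) = N_s / N
    return priors, word_counts, vocab

def log_posterior(tokens, s, priors, word_counts, vocab):
    """log P(s) + sum_j log P(C_ij | s), with add-one smoothing."""
    total = sum(word_counts[s].values())
    score = math.log(priors[s])
    for w in tokens:
        if w in vocab:  # assumption: out-of-vocabulary words are ignored
            score += math.log((word_counts[s][w] + 1) / (total + len(vocab)))
    return score

def classify(tokens, priors, word_counts, vocab):
    return max(priors, key=lambda s: log_posterior(tokens, s, priors, word_counts, vocab))

training = [
    ("works perfectly love this product".split(), "pos"),
    ("very pleased super easy to i am pleased".split(), "pos"),
    ("highly recommend it high recommended".split(), "pos"),
    ("completely useless return policy".split(), "neg"),
    ("was very poor it has failed".split(), "neg"),
    ("it totally unacceptable is so bad".split(), "neg"),
]
priors, word_counts, vocab = train_mnb(training)

print(classify("love it highly recommend".split(), priors, word_counts, vocab))
print(classify("completely useless".split(), priors, word_counts, vocab))
```

Working in log space avoids numerical underflow when the product runs over the many words of a long cluster.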
11. • Given an unlabeled text $x_j$, we use Euclidean distance to find the most similar positive cluster $C_m^+$ and the most similar negative cluster $C_n^-$
• The sentiment of $x_j$ is estimated from the change in probability of the two clusters when each is merged with $x_j$ (vs. KNN)
• This operation is called the two-stage-merging method, because each unlabeled text is merged twice

$$
f(x_j) = \begin{cases} 0, & |P(NC_m^+) - P(C_m^+)| \;?\; |P(NC_n^-) - P(C_n^-)| \\ 1, & \text{otherwise} \end{cases}
$$

where $NC_m^+$ and $NC_n^-$ denote $C_m^+$ and $C_n^-$ after merging with $x_j$ (the comparison operator between the two absolute differences is missing in the source).
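The merging procedure can be sketched as follows. Several parts are assumptions made only for illustration and are not confirmed by the slides: merging is modeled as adding word-count vectors, the probability function `p_pos` is a toy stand-in for a trained classifier, and the decision rule assigns the polarity whose cluster probability changes less. The slides do not state the direction of the comparison or which label corresponds to 0 and 1, so the sketch returns a string label instead.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(x, clusters):
    """Return the cluster vector closest to x by Euclidean distance."""
    return min(clusters, key=lambda c: euclidean(x, c))

def merge(cluster, x):
    """Assumed merge operation: add the word counts of x to the cluster."""
    return [a + b for a, b in zip(cluster, x)]

def two_stage_merge(x, pos_clusters, neg_clusters, prob):
    """Merge x into its nearest positive and nearest negative cluster and
    compare the absolute change in prob(cluster) caused by each merge."""
    c_pos = nearest(x, pos_clusters)
    c_neg = nearest(x, neg_clusters)
    delta_pos = abs(prob(merge(c_pos, x)) - prob(c_pos))
    delta_neg = abs(prob(merge(c_neg, x)) - prob(c_neg))
    # Assumed direction: x takes the polarity of the cluster it disturbs less.
    label = "positive" if delta_pos <= delta_neg else "negative"
    return delta_pos, delta_neg, label

# Toy setup over a hypothetical 4-word vocabulary: [love, great, bad, poor].
def p_pos(vec):
    """Stand-in probability: smoothed share of positive-word counts."""
    return (vec[0] + vec[1] + 1) / (sum(vec) + 2)

pos_clusters = [[3, 2, 0, 0], [0, 5, 0, 0]]
neg_clusters = [[0, 0, 3, 2]]
x = [1, 1, 0, 0]  # unlabeled text containing "love" and "great"

delta_pos, delta_neg, label = two_stage_merge(x, pos_clusters, neg_clusters, p_pos)
print(round(delta_pos, 4), round(delta_neg, 4), label)
```

Unlike KNN, which votes over the labels of the nearest neighbors, this procedure looks at how much each candidate cluster's probability shifts once the new text is merged in.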
12. • Dataset: Stanford Twitter Sentiment Corpus (STS)
• Baseline: bag-of-unigrams and bigrams without clustering
• Evaluation metrics: accuracy, precision, recall
• The average precision and accuracy are 1.7% and 1.3% higher than the baseline method, respectively.

Methods      Accuracy  Precision  Recall
Our Method   0.816     0.82       0.813
Bigrams      0.805     0.807      0.802
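For reference, the three evaluation metrics can be computed from gold and predicted labels as in the sketch below. The label lists are made-up toy data, not results from the STS experiments, and `evaluate` is a hypothetical helper.

```python
def evaluate(gold, predicted, positive="pos"):
    """Accuracy, precision, and recall with respect to the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels for illustration only.
gold      = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"]

accuracy, precision, recall = evaluate(gold, predicted)
print(accuracy, precision, recall)
```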
13. • We introduce a clustering-based method to reduce the sparsity problem in sentiment classification of short texts
• The idea can be applied to other areas
• The method is a prototype, and several components can be improved, including the clustering algorithm, the use of distributed representations, and the two-stage-merging method
• Future work:
  • Extend the model to use the top-n most similar clusters
  • Use distributed representations
  • Try deep learning models