This document proposes a new "Topic Consensus" measure to evaluate the interpretability of statistical topics discovered via topic modeling. It uses Amazon Mechanical Turk workers to gauge their understanding of topics by asking them to assign scientific abstracts to topics, and compares this to the topic assignments from LDA. It finds that Topic Consensus correlates well with existing automated measures of topic quality, and that those measures can be used to predict the consensus value. The measure provides a new perspective on evaluating discovered topics compared to existing methods.
Text, Topics, and Turkers: A Consensus Measure for Statistical Topics
1. Text, Topics, and Turkers. Hypertext 2015 1
Text, Topics, and Turkers:
A Consensus Measure for Statistical Topics
Fred Morstatter, Jürgen Pfeffer,
Katja Mayer*, Huan Liu
Arizona State University
Tempe, Arizona, USA
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
*University of Vienna
Vienna, Austria
2. Text, Topics, and Turkers. Hypertext 2015 2
Text
Text is everywhere in research.
Text is huge:
Too much data to read.
How can we understand what is going on in
big text data?
Source            Size
Wikipedia         36 million pages
World Wide Web    100+ billion static web pages
Social Media      500 million new tweets/day
3. Text, Topics, and Turkers. Hypertext 2015 3
Topics
Topic Modeling
Latent Dirichlet Allocation (LDA)
Most commonly-used topic modeling algorithm
Discovers topics within a corpus
Corpus -> LDA (K topics)

Topic ID    Words
Topic 1     cat, dog, horse, ...
Topic 2     ball, field, player, ...
...         ...
Topic K     red, green, blue, ...

              Topic 1    Topic 2    ...    Topic K
Document 1    0.2        0.1        ...    0.01
Document 2    0.7        0.02       ...    0.1
...
Document n    0.1        0.3        ...    0.01
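To make the two outputs above concrete, here is a minimal sketch of LDA fitted with a toy collapsed Gibbs sampler in plain Python. The slides do not say which inference method or library was used, so the sampler, the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four-document corpus are all illustrative assumptions.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]     # tokens of word v in topic k
    topic_total = [0] * num_topics                        # all tokens in topic k
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    # Resample each token's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = assign[d][i], word_id[w]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Document-topic table: one probability row per document.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    # Topic-word table: the three most frequent words per topic.
    top_words = [
        [vocab[v] for v in sorted(range(V), key=lambda v: -row[v])[:3]]
        for row in topic_word
    ]
    return top_words, theta

# Hypothetical toy corpus, echoing the example words on the slide.
docs = [
    "cat dog horse cat dog".split(),
    "dog horse cat horse".split(),
    "ball field player ball".split(),
    "player field ball field".split(),
]
top_words, theta = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words, start=1):
    print(f"Topic {k}:", ", ".join(words))
for d, row in enumerate(theta, start=1):
    print(f"Document {d}:", [round(p, 2) for p in row])
```

Each row of `theta` sums to 1, matching the document-by-topic table on the slide; a production run would normally rely on an established LDA implementation instead.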
4. Text, Topics, and Turkers. Hypertext 2015 4
Topics
LDA
K = 10
Topic ID    Words
Topic 1     river, lake, island, mountain, area, park, antarctic, south, mountains, dam
Topic 2     relay, athletics, metres, freestyle, hurdles, ret, divisão, athletes, bundesliga, medals
...         ...
Topic 10    courcelles, centimeters, mattythewhite, wine, stamps, oko, perennial, stubs, ovate, greyish

              Topic 1    Topic 2    ...    Topic 10
Document 1    0.2        0.1        ...    0.01
Document 2    0.7        0.02       ...    0.1
...
Document n    0.1        0.3        ...    0.01
5. Text, Topics, and Turkers. Hypertext 2015 5
Topics
How can we measure the quality of statistical
topics?
We don't know how well humans can
interpret topics.
Problem: Does their understanding match
what is going on in the corpus?
6. Text, Topics, and Turkers. Hypertext 2015 6
Turkers
One Solution: Crowdsourcing
Example: Amazon's Mechanical Turk
Show LDA results to Turkers
Gauge their understanding
How to effectively measure understanding?
7. Text, Topics, and Turkers. Hypertext 2015 7
Turkers
Previous Work: Chang et al. 2009
Word Intrusion
Topic Intrusion
8. Text, Topics, and Turkers. Hypertext 2015 8
Word Intrusion
Show the Turker 6 words in random order
Top 5 words from topic
1 Intruded word
Ask Turker to choose Intruded word
Topic i: cat dog bird truck horse snake
[Chang et al. 2009]
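The task described above can be sketched as follows. This is a minimal illustration, not the original experimental code: the function names, the rule for drawing the intruder (a random word from another topic that is absent from this one), and the example Turker answers are assumptions. The score shown is the fraction of Turkers who pick the intruder, which Chang et al. call model precision.

```python
import random

def build_word_intrusion_task(topic_words, other_topic_words, rng):
    """Return six shuffled words (top 5 + 1 intruder) and the intruder."""
    top5 = topic_words[:5]
    # Assumption: the intruder is any word from another topic
    # that does not appear among this topic's words.
    candidates = [w for w in other_topic_words if w not in topic_words]
    intruder = rng.choice(candidates)
    shown = top5 + [intruder]
    rng.shuffle(shown)
    return shown, intruder

def model_precision(choices, intruder):
    """Fraction of Turker choices that match the true intruder."""
    return sum(c == intruder for c in choices) / len(choices)

rng = random.Random(0)
animals = ["cat", "dog", "bird", "horse", "snake", "fish"]   # hypothetical topic
vehicles = ["truck", "car", "bus"]                           # hypothetical other topic
shown, intruder = build_word_intrusion_task(animals, vehicles, rng)
print("Shown to Turker:", shown)

# Hypothetical answers from four Turkers for an intruder "truck".
print(model_precision(["truck", "truck", "cat", "truck"], "truck"))
```

A topic whose intruder is easy to spot is taken to be coherent, so higher precision suggests a more interpretable topic.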
9. Text, Topics, and Turkers. Hypertext 2015 9
Topic Intrusion
Show the Turker a document
Show the Turker 4 topics
3 most probable topics
1 Intruded topic
Ask Turker to choose Intruded Topic
Document i
Topic A Topic B Topic C Topic D
[Chang et al. 2009]
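As a rough sketch of how such a task could be assembled from one row of the document-topic table: the function name, the uniform choice of intruder among the remaining topics, and the example probabilities are assumptions made for illustration.

```python
import random

def build_topic_intrusion_task(doc_topic_probs, rng, num_top=3):
    """Return four shuffled topic indices (3 most probable + 1 intruder)."""
    ranked = sorted(
        range(len(doc_topic_probs)),
        key=lambda k: doc_topic_probs[k],
        reverse=True,
    )
    top = ranked[:num_top]
    # Assumption: the intruder is drawn uniformly from the remaining,
    # less probable topics for this document.
    intruder = rng.choice(ranked[num_top:])
    shown = top + [intruder]
    rng.shuffle(shown)
    return shown, intruder

rng = random.Random(0)
# Hypothetical topic probabilities for one document (six topics).
probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
shown_topics, intruder_topic = build_topic_intrusion_task(probs, rng)
print("Topics shown:", shown_topics, "intruder:", intruder_topic)
```

If Turkers reliably single out the intruder, the model's topic assignment for that document agrees with human judgment.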
10. Text, Topics, and Turkers. Hypertext 2015 10
New Measure: Topic Consensus
Complements existing framework
Measures topic quality with respect to the corpus.
Topic Consensus
11. Text, Topics, and Turkers. Hypertext 2015 11
Topic Consensus: Intuition
Measures the agreement between topics and
sections they come from.
[Figure: LDA distribution compared with Turker distribution]
12. Text, Topics, and Turkers. Hypertext 2015 12
Topic Consensus: Calculation
We are comparing probability distributions.
Jensen-Shannon Divergence.
[Figure: Turker distribution compared with LDA distribution]
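The comparison named on this slide can be sketched in a few lines. Jensen-Shannon divergence is computed here with base-2 logarithms so that it falls between 0 and 1; the choice of base and the two example distributions over three sections are assumptions, not values from the study.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; 0 * log(0) is taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions of one topic over three corpus sections.
turker_dist = [0.75, 0.25, 0.0]
lda_dist = [0.6, 0.3, 0.1]
print(round(js_divergence(turker_dist, lda_dist), 4))
```

Because this is a divergence, identical distributions score 0 and lower values mean closer agreement between Turkers and LDA.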
13. Text, Topics, and Turkers. Hypertext 2015 13
Dataset
Scientific Abstracts
All available abstracts
since 2007.
Classified into three areas:
Social Sciences & Humanities (SH)
Life Sciences (LS)
Physical Sciences (PE)
Ran LDA on this dataset:
K = [10, 25, 50, 100]
185 topics; 4 topic sets.
14. Text, Topics, and Turkers. Hypertext 2015 14
Turkers
One task:
Turkers have 3 + 1 options.
Each task solved 8 times.
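One way the eight answers for a task could be turned into a Turker distribution is sketched below. The slide does not spell out what the "+ 1" option is or how it is treated; this sketch assumes it is a "N/A" style answer that is dropped before normalizing, and the function name and example responses are hypothetical.

```python
from collections import Counter

SECTIONS = ["SH", "LS", "PE"]   # the three areas named on the Dataset slide

def turker_distribution(responses, sections=SECTIONS):
    """Share of answers per section, ignoring answers outside `sections`."""
    counts = Counter(r for r in responses if r in sections)
    total = sum(counts.values())
    if total == 0:
        # Assumption: with no usable answers, fall back to a uniform distribution.
        return [1 / len(sections)] * len(sections)
    return [counts[s] / total for s in sections]

# Hypothetical answers from the 8 Turkers who solved one task.
responses = ["SH", "SH", "LS", "SH", "N/A", "SH", "LS", "SH"]
print(turker_distribution(responses))
```

Treating the extra option as its own category instead would be an equally plausible reading and would change the resulting distribution.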
16. Text, Topics, and Turkers. Hypertext 2015 16
Other Topic Sets
LDA Topics
Use New York Times dataset from one day.
25 topics, 1 topic set
Hand-Picked Topics
Pure Social Science & Humanities
Sampled words that occur only in these documents.
11 topics, 1 topic set
Random Topics
Randomly choose topics according to word distribution
of corpus.
25 topics, 1 topic set
17. Text, Topics, and Turkers. Hypertext 2015 17
Results
[Figure: results by topic set: ERC-10, ERC-25, ERC-50, ERC-100, NYT-25, RAND-25, SH-25]
18. Text, Topics, and Turkers. Hypertext 2015 18
Overview of the Process
Topic Consensus can reveal new information
about the topics being studied.
Can measure topics from a new perspective.
Can help reveal topic confusion.
Drawbacks:
Expensive
Time Consuming
Scalability
19. Text, Topics, and Turkers. Hypertext 2015 19
Automated Measures
1. Topic Size: Number of tokens assigned to the
topic.
2. Topic Coherence: Probability that the top
words co-occur in documents in the corpus.
3. Topic Coherence Significance: Significance of
Topic Coherence compared to other topics.
4. Normalized Pointwise Mutual Information:
Measures the association between the top
words in the topics.
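As an illustration of the fourth measure, the sketch below computes normalized pointwise mutual information from document-level co-occurrence and averages it over all pairs of top words. The averaging, the use of the same corpus for the co-occurrence counts, the handling of the two edge cases, and the toy documents are assumptions rather than details given on the slide.

```python
import math
from itertools import combinations

def npmi_score(top_words, documents):
    """Mean NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)   # limiting value: the pair never co-occurs
        elif p_ab == 1.0:
            scores.append(0.0)    # assumption: words in every document carry no signal
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)

# Hypothetical toy corpus.
documents = [
    ["cat", "dog"],
    ["cat", "dog"],
    ["ball", "field"],
    ["ball", "player"],
]
print(npmi_score(["cat", "dog"], documents))
print(npmi_score(["cat", "ball"], documents))
```

Scores range from -1 (words never appear together) to 1 (words only appear together), so higher values point to more strongly associated top words.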
20. Text, Topics, and Turkers. Hypertext 2015 20
Measures
Herfindahl-Hirschman Index (HHI)
Measures concentration of a market.
Used to find monopolies.
Viewed from two perspectives:
5. Word Probability HHI
6. ERC Section HHI (sections: Social Sciences, Physical Sciences, Life Sciences)
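The Herfindahl-Hirschman Index is the sum of squared shares, so it can be sketched in a few lines. Reading the two perspectives as "concentration of a topic's word probabilities" and "concentration of a topic across the three sections" is an interpretation of the slide, and the example values are made up for illustration.

```python
def hhi(shares):
    """Herfindahl-Hirschman Index: sum of squared shares after normalizing to 1."""
    total = sum(shares)
    return sum((s / total) ** 2 for s in shares)

# Hypothetical word probabilities of one topic (perspective 5).
word_probs = [0.4, 0.3, 0.2, 0.1]
# Hypothetical share of one topic in each section: SH, PE, LS (perspective 6).
section_shares = [0.8, 0.1, 0.1]

print(hhi(word_probs))
print(hhi(section_shares))
```

The index equals 1 when everything sits in a single word or section (a "monopoly") and 1/n when n entries share the mass equally.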
22. Text, Topics, and Turkers. Hypertext 2015 22
Results - Prediction
Build classifier to predict actual Topic
Consensus value.
Build linear regression model:
Takes automated measures.
Predicts Topic Consensus.
RMSE: 0.12 ± 0.02.
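A stripped-down sketch of this prediction step is shown below. It fits ordinary least squares with a single automated measure as input, whereas the slide describes a model over several measures, and it scores the fit on the training points rather than on held-out data. The function names and the numbers are synthetic placeholders, not results from the study.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one input feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(predicted, actual):
    """Root mean squared error between predictions and true values."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Synthetic placeholder values: one automated measure per topic and
# the Topic Consensus value it should predict.
measure = [0.1, 0.2, 0.3, 0.4]
consensus = [0.5, 0.4, 0.3, 0.2]

slope, intercept = fit_line(measure, consensus)
predictions = [slope * x + intercept for x in measure]
print("RMSE on the toy data:", rmse(predictions, consensus))
```

An error reported with a spread, as on the slide, would typically come from repeating the fit over several train/test splits and averaging the held-out RMSE.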
23. Text, Topics, and Turkers. Hypertext 2015 23
Acknowledgements
Members of the DMML lab
Office of Naval Research through grant
N000141410095
LexisNexis and HPCC Systems
24. Text, Topics, and Turkers. Hypertext 2015 24
Conclusion
Introduced a new method for evaluating the
interpretability of statistical topics.
Demonstrated this measure on a real-world
dataset.
Automated this measure for scalability.
25. Text, Topics, and Turkers. Hypertext 2015 25
Future Work
How sensitive are measures to top words?
Word Intrusion uses 5
Topic Intrusion uses 5
Topic Consensus uses 25
How do measures fare on different datasets?
Other measures that can reveal quality topics?
#4: Topic modeling --- text summarization
These algorithms are widely used for
#6: Why do I need to measure these topics?
Finding quality topics
Setting value of K in LDA
Choosing the best topic model (LDA, ...)
#7: We need objective measures to evaluate the quality of topics.
#10: Each document gets a score. Can aggregate to get a sense of the model.
This is a measure of the model, by looking at the document.
#11: The previous measures are good.
Specifically, we are looking at properties of the corpus.
#12: Sections can be like newspaper
Blue is SPORTS Red is BUSINESS
In reality, no topic is going to be purely sports or business. Topics are mixtures over these sections.
We want to know how humans can interpret these mixtures.
Sections can be like Twitter
Blue is protest
Red is
This slide just illustrates the process; I'll get into more details later.
This is a TC calculation for ONE TOPIC
#13: Topic Consensus is calculated as...
KL is the Kullback-Leibler divergence; M is the midpoint (average) of the two distributions.
One side effect of using this measure is that lower scores indicate a better consensus.
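For reference, the standard definition of Jensen-Shannon divergence, which uses the Kullback-Leibler divergence and the midpoint distribution M mentioned in this note, is given below. Labeling P as the Turker distribution and Q as the LDA distribution is an assumption about how the slide's formula was written.

```latex
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q),
\qquad \mathrm{KL}(P \,\|\, M) = \sum_{i} P(i) \log \frac{P(i)}{M(i)}
```

The divergence is 0 when P and Q are identical, which is why a lower score indicates better consensus.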
#16: If you want good topics, you might choose 100... If you want a good model, you might choose 25...
The worst from TC are often stopword topics
Connection to Word Intrusion
Are they really good topics?
#18: Each bar is a group of topics
Bar in the middle is the median
SH does the best ... This is good!
Random does worse ... This is also good!
NYT does the worst ... Why?
#19: Is it possible to find a way to address all of these drawbacks?
Explain the remainder of this paper here.
#20: These are methods used throughout the literature to measure topic quality, we repeat them here.