
Text, Topics, and Turkers. Hypertext 2015
Text, Topics, and Turkers:
A Consensus Measure for Statistical Topics
Fred Morstatter†, Jürgen Pfeffer‡,
Katja Mayer*, Huan Liu†
†Arizona State University
Tempe, Arizona, USA
‡Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
*University of Vienna
Vienna, Austria

Text
• Text is everywhere in research.
• Text is huge:
• Too much data to read.
• How can we understand what is going on in
big text data?
Source            Size
Wikipedia         36 million pages
World Wide Web    100+ billion static web pages
Social Media      500 million new tweets/day

Topics
• Topic Modeling
• Latent Dirichlet Allocation (LDA)
– Most commonly-used topic modeling algorithm
– Discovers “topics” within a corpus
Corpus → LDA (K topics) → two outputs:

Topic ID    Words
Topic 1     cat, dog, horse, ...
Topic 2     ball, field, player, ...
...         ...
Topic K     red, green, blue, ...

             Topic 1   Topic 2   ...   Topic K
Document1    0.2       0.1             0.01
Document2    0.7       0.02            0.1
...
Documentn    0.1       0.3             0.01
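
To make the two LDA outputs on this slide concrete, here is a minimal sketch of how they might be read. All probabilities and the helper names `top_words` and `dominant_topic` are invented for illustration; they are not taken from the talk.

```python
# Hypothetical toy outputs of an LDA run with K = 3 topics (all values invented).
topic_word = {
    "Topic 1": {"cat": 0.30, "dog": 0.25, "horse": 0.20, "ball": 0.01},
    "Topic 2": {"ball": 0.35, "field": 0.25, "player": 0.20, "cat": 0.01},
    "Topic 3": {"red": 0.40, "green": 0.30, "blue": 0.20, "dog": 0.01},
}
doc_topic = {
    "Document1": [0.2, 0.1, 0.7],
    "Document2": [0.7, 0.2, 0.1],
}


def top_words(word_probs, n=3):
    """Return the n most probable words of one topic, most probable first."""
    return sorted(word_probs, key=word_probs.get, reverse=True)[:n]


def dominant_topic(topic_probs):
    """Return the 1-based index of the most probable topic for one document."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__) + 1
```

The first table is what a human reads to interpret a topic; the second ties each topic back to the documents of the corpus.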

Topics
LDA, K = 10

Topic ID    Words
Topic 1     river, lake, island, mountain, area, park, antarctic, south, mountains, dam
Topic 2     relay, athletics, metres, freestyle, hurdles, ret, divisão, athletes, bundesliga, medals
...         ...
Topic 10    courcelles, centimeters, mattythewhite, wine, stamps, oko, perennial, stubs, ovate, greyish

             Topic 1   Topic 2   ...   Topic 10
Document1    0.2       0.1             0.01
Document2    0.7       0.02            0.1
...

Topics
• How can we measure the quality of statistical
topics?
• We don’t know how well humans can
interpret topics.
• Problem: Does their understanding match
what is going on in the corpus?

Turkers
• One Solution: Crowdsourcing
• Example: Amazon’s Mechanical Turk
– Show LDA results to Turkers
– Gauge their understanding
– How to effectively measure understanding?

Turkers
• Previous Work: Chang et al. 2009
– “Word Intrusion”
– “Topic Intrusion”
[Diagram: Corpus → LDA (K) → topic word lists and document-topic matrix, evaluated by “Word Intrusion” and “Topic Intrusion”]

Word Intrusion
• Show the Turker 6 words in random order
– Top 5 words from topic
– 1 “Intruded” word
– Ask Turker to choose “Intruded” word
Topic i: cat dog bird truck horse snake
[Chang et al. 2009]
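
The task on this slide can be sketched as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, the intruder is supplied by the caller (how it is selected is not covered on the slide), and scoring by the fraction of correct answers is an assumption.

```python
import random


def word_intrusion_task(topic_top_words, intruder, rng=None):
    """Build one question: the topic's top 5 words plus 1 intruder, shuffled."""
    rng = rng or random.Random()
    top_five = list(topic_top_words[:5])
    if intruder in top_five:
        raise ValueError("intruder must not be one of the topic's top words")
    words = top_five + [intruder]
    rng.shuffle(words)  # random order, so position gives nothing away
    return words


def fraction_correct(answers, intruder):
    """Share of Turker answers that picked the intruded word (assumed scoring)."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(answer == intruder for answer in answers) / len(answers)


# The example from the slide: "truck" is the intruder among animal words.
task = word_intrusion_task(["cat", "dog", "bird", "horse", "snake"], "truck",
                           rng=random.Random(0))
```

If Turkers reliably spot "truck", the remaining five words form a topic that humans find coherent.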

Topic Intrusion
• Show the Turker a document
• Show the Turker 4 topics
– 3 most probable topics
– 1 “Intruded” topic
– Ask Turker to choose “Intruded” Topic
Document i
Topic A Topic B Topic C Topic D
[Chang et al. 2009]
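
A minimal sketch of assembling one such question, under stated assumptions: the function name is hypothetical, and the intruder is drawn uniformly from the topics outside the document's top 3, which the slide does not specify.

```python
import random


def topic_intrusion_task(doc_topic_probs, rng=None):
    """Return (shuffled topic ids, intruder id) for one document.

    doc_topic_probs maps a topic id to that topic's probability in the document.
    """
    rng = rng or random.Random()
    ranked = sorted(doc_topic_probs, key=doc_topic_probs.get, reverse=True)
    if len(ranked) < 4:
        raise ValueError("need at least 4 topics to add an intruder")
    top_three = ranked[:3]
    intruder = rng.choice(ranked[3:])  # assumption: uniform pick among the rest
    options = top_three + [intruder]
    rng.shuffle(options)
    return options, intruder


probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.04, "E": 0.01}
options, intruder = topic_intrusion_task(probs, rng=random.Random(1))
```

The Turker sees the document and the four shuffled topics and is asked which topic does not belong.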

New Measure: Topic Consensus
• Complements the existing framework.
• Measures topic quality with respect to the corpus.
[Diagram: Corpus → LDA (K) → topic word lists and document-topic matrix, evaluated by “Word Intrusion”, “Topic Intrusion”, and the new “Topic Consensus”]

Topic Consensus: Intuition
• Measures the agreement between topics and the
“sections” they come from.
[Figure: LDA Distribution vs. Turker Distribution]

Topic Consensus: Calculation
• We are comparing two probability distributions: the Turker
distribution and the LDA distribution.
• Compared using the Jensen-Shannon divergence.
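
A minimal sketch of this comparison. The Jensen-Shannon divergence itself is standard; everything around it is an assumption: the section labels follow the three areas named on the Dataset slide, answers outside those three (the "+1" option) are simply dropped, and the log base is 2 so that the result lies in [0, 1].

```python
import math
from collections import Counter

SECTIONS = ["SH", "LS", "PE"]  # the three areas from the Dataset slide


def turker_distribution(answers, sections=SECTIONS):
    """Turn Turker answers into a distribution over sections.

    Assumption: answers that are not one of the sections are ignored.
    """
    counts = Counter(a for a in answers if a in sections)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no answer names a section")
    return [counts[s] / total for s in sections]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

For example, `jensen_shannon_divergence(turker_distribution(answers), lda_dist)` is 0 when the Turkers' answers exactly mirror the LDA distribution and grows as the two disagree; `answers` and `lda_dist` are placeholders for one topic's data.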

Dataset
• Scientific Abstracts
• All available abstracts since 2007.
• Classified into three areas:
– Social Sciences & Humanities (SH)
– Life Sciences (LS)
– Physical Sciences (PE)
• Ran LDA on this dataset:
– K = [10, 25, 50, 100]
– 185 topics; 4 topic sets.

Turkers
• One task:
• Turkers have 3 + 1 options.
• Each task solved 8 times.

Results
[Figure: results per topic set (ERC-10, ERC-25, ERC-50, ERC-100). Topics called out in the figure: “new, group, results, plan, class, ...” and “selection, variation, population, genetic, natural, ...”]

Other Topic Sets
• LDA Topics
– Use a New York Times dataset from one day: 25 topics, 1 topic set.
• Hand-Picked Topics
– Pure “Social Science & Humanities”
• Sampled words that occur only in these documents.
– Random Topics
• Randomly choose topics according to the word distribution
of the corpus.

Results
[Figure: results per topic set (ERC-10, ERC-25, ERC-50, ERC-100, NYT-25, RAND-25, SH-25)]

Overview of the Process
• Topic Consensus can reveal new information
about the topics being studied.
– Can measure topics from a new perspective.
– Can help reveal topic confusion.
• Drawbacks:
– Expensive
– Time Consuming
– Scalability

Automated Measures
1. Topic Size: Number of tokens assigned to the
topic.
2. Topic Coherence: Probability that the top
words co-occur in documents in the corpus.
3. Topic Coherence Significance: Significance of
Topic Coherence compared to other topics.
4. Normalized Pointwise Mutual Information:
Measures the association between the top
words in the topics.
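
As one illustration of these measures, here is a minimal sketch of Normalized Pointwise Mutual Information averaged over pairs of a topic's top words. The NPMI formula is standard; counting co-occurrence at the document level, the handling of the two edge cases, and the function names are assumptions, since the slide does not give these details.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """NPMI of two words, using document-level co-occurrence.

    documents is a list of sets of words. Result lies in [-1, 1].
    """
    n = len(documents)
    p_a = sum(word_a in d for d in documents) / n
    p_b = sum(word_b in d for d in documents) / n
    p_ab = sum(word_a in d and word_b in d for d in documents) / n
    if p_ab == 0:
        return -1.0  # convention: the words never co-occur
    if p_ab == 1:
        return 0.0   # assumption: both words are in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(top_words, documents):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    if not pairs:
        raise ValueError("need at least two top words")
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)
```

A topic whose top words tend to appear in the same documents scores close to 1; a topic that mixes unrelated words scores close to -1.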

Measures
• Herfindahl-Hirschman Index (HHI)
– Measures concentration of a market.
– Used to find monopolies.
– Viewed from two perspectives:
5. Word Probability HHI
6. ERC Section HHI (Social Sciences, Physical Sciences, Life Sciences)
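
The index itself is the sum of squared shares. The sketch below assumes the shares are first normalized to sum to 1, and the example inputs are invented; applying it to a topic's word probabilities or to its spread across the three sections follows the two perspectives named on the slide.

```python
def hhi(shares):
    """Herfindahl-Hirschman Index: sum of squared (normalized) shares.

    Returns 1.0 when one item holds everything and 1/n when n items
    hold equal shares.
    """
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must have a positive sum")
    return sum((s / total) ** 2 for s in shares)


# Invented examples: a topic concentrated in one section vs. spread evenly.
concentrated = hhi([0.9, 0.05, 0.05])
spread_out = hhi([1 / 3, 1 / 3, 1 / 3])
```

A high value means the topic is dominated by a few words (or by one section); a low value means its mass is spread thinly.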

Results - Correlation
Automated Measure                          Correlation
Topic Size                                 -0.532
Topic Coherence                            -0.584
Topic Coherence Significance               -0.788
Normalized Pointwise Mutual Information    -0.774
HHI (Word Probability)                     -0.885
HHI (ERC Section)                          -0.478
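
The slide does not say which correlation coefficient was used. As a sketch only, and assuming Pearson's r, the correlation between one automated measure and the per-topic consensus values could be computed like this (the function name and inputs are hypothetical):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

Here `xs` would hold one automated measure per topic and `ys` the corresponding consensus values; a value near -1 means the two move in opposite directions.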

Results - Prediction
• Build a predictor for the actual Topic
Consensus value.
• Build linear regression model:
– Takes automated measures.
– Predicts Topic Consensus.
• RMSE: 0.12 ± 0.02.
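
A deliberately simplified sketch of this setup. The model on the slide takes several automated measures as input; to stay dependency-free, the code below fits ordinary least squares on a single feature, which is an assumption made for illustration. The function names and any data passed in are hypothetical, and the evaluation protocol behind the reported RMSE is not reproduced here.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature is constant; slope is undefined")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

In use, the line would be fitted on some topics and `rmse` computed on held-out topics, comparing predicted with Turker-derived consensus values.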

Acknowledgements
• Members of the DMML lab
• Office of Naval Research through grant
N000141410095
• LexisNexis and HPCC Systems

Conclusion
• Introduced a new method for evaluating the
interpretability of statistical topics.
• Demonstrated this measure on a real-world
dataset.
• Automated this measure for scalability.

Future Work
• How sensitive are measures to top words?
– Word Intrusion uses 5
– Topic Intrusion uses 5
– Topic Consensus uses 25
• How do measures fare on different datasets?
• Other measures that can reveal quality topics?

Auxiliary Slides

User Demographics
[Figure panels: Sex, Education, Age, First Language, Country of Origin]

Results – Confusion Matrix

Dataset Statistics
