Identifying Content for Planned Events Across Social Media Sites
Hila Becker, Dan Iter, Mor Naaman, Luis Gravano
Event Content in Social Media

[Image: MIKE CLARKE/AFP/Getty Images]

[Image source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns]
Event Content in Social Media

Challenges:
  - Wide variety of topics, not all related to events (e.g., personal status updates, everyday mundane conversations)
  - Unconventional text: abbreviations, typos
  - Large-scale, rapidly produced content

Opportunities:
  - Content generated in real-time, as events happen
  - Rich context features (e.g., time, location)
  - Users' perspective
Event Content in Social Media

[Diagram: Content Discovery, divided into Known and Unknown identification scenarios]

A Planned Event is a real-world occurrence with a corresponding published event record consisting of:
  - Title, describing the subject of the event
  - The time at which the event is planned to occur
Identifying Content for Planned Events

Identify planned event documents given known event information:
  - User-contributed planned event records (LastFM Events, EventBrite, Facebook Events)
  - Structured features (e.g., title, time, location)

Challenging identification scenario:
  - Known event information is often inaccurate or incomplete
  - Social media documents are brief and noisy
Planned Event Record

  - Title
  - Description
  - Date/Time
  - Venue
  - City
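To make the record structure concrete, here is a minimal sketch (in Python) of how such a record might be represented; the field names simply mirror the list above and are illustrative, not the authors' actual schema:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class EventRecord:
        title: str            # e.g., "Celebrate Brooklyn! Opening Night Gala & Concert with Andrew Bird"
        description: str      # free-text description of the event
        start_time: datetime  # the time at which the event is planned to occur
        venue: str            # the venue hosting the event
        city: str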
Approach for Planned Event Content Identification

Two-step query formulation strategy:
  - Precision-oriented queries using known event features
  - Recall-oriented queries using retrieved content from precision-oriented queries

Leverage cross-site content:
  - Identify event documents on each site individually
  - Use event documents on one site to retrieve additional event documents on a different site
Query Formulation Strategies: Precision-oriented Queries

Combined event record features:
  - Phrase, bag-of-words, stop-word elimination
  - Examples: [title+venue], [title-no-stopwords+city]

Restricted document creation time

Why is this hard?
  - Specific titles: "Celebrate Brooklyn! Opening Night Gala & Concert with Andrew Bird"
  - General titles: "Opening Night Concert"
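A minimal sketch of this formulation step, reusing the EventRecord sketch above; the stop-word list and the exact feature combinations are assumptions for illustration:

    STOPWORDS = {"a", "an", "and", "the", "of", "with", "at", "in", "on", "&"}

    def strip_stopwords(text: str) -> str:
        return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

    def precision_queries(event: EventRecord) -> list[str]:
        """Combine event-record features into precision-oriented queries.
        Retrieved documents would additionally be restricted to a creation-time
        window around event.start_time (not shown)."""
        return [
            f'"{event.title}" {event.venue}',                # [title+venue], title as a phrase
            f'{strip_stopwords(event.title)} {event.city}',  # [title-no-stopwords+city]
        ]

For the Andrew Bird example above, the second query would become the bag of words "Celebrate Brooklyn! Opening Night Gala Concert Andrew Bird" plus the city.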
Query Formulation Strategies: Precision-oriented Queries Demo
Query Formulation Strategies: Recall-oriented Queries

Generated using high-precision results from precision-oriented queries.

Frequency Analysis:
  - Frequent terms in the event's retrieved content
  - Infrequent terms in Web documents
  - Limited to 100 candidate queries

Term Extraction:
  - Identify meaningful event-related concepts
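A sketch of the frequency-analysis idea, assuming a background lookup of Web document frequencies (e.g., from a Web n-gram corpus); the scoring function is illustrative:

    from collections import Counter

    def recall_query_candidates(event_docs: list[str], web_doc_freq: dict, limit: int = 100) -> list[str]:
        """Rank terms that are frequent in the event's retrieved content but
        infrequent in Web documents; keep up to `limit` candidates."""
        counts = Counter(w.lower() for doc in event_docs for w in doc.split())
        scored = sorted(counts,
                        key=lambda t: counts[t] / (1 + web_doc_freq.get(t, 0)),
                        reverse=True)
        return scored[:limit]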
Query Selection Strategies

Problem: potentially large set of generated queries.

Select top candidate queries by:
  - Specificity: favor longer queries
  - Temporal profile

[Chart: document frequency per day, 6/7/11 through 6/13/11, for the queries [andrew bird concert] and [state farm insurance]]
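A sketch of the temporal-profile criterion (this mirrors the TR/RTR idea described in the experimental settings: document frequency around the event time relative to a larger window); the window sizes are assumptions:

    from datetime import datetime, timedelta

    def temporal_ratio(doc_times: list[datetime], event_time: datetime,
                       near: timedelta = timedelta(days=1),
                       wide: timedelta = timedelta(days=7)) -> float:
        """Ratio of a query's matching documents near the event time to its
        matches in a wider window. A bursty, event-specific query such as
        [andrew bird concert] scores high; a temporally flat query such as
        [state farm insurance] scores low."""
        near_count = sum(1 for t in doc_times if abs(t - event_time) <= near)
        wide_count = sum(1 for t in doc_times if abs(t - event_time) <= wide)
        return near_count / wide_count if wide_count else 0.0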
Leveraging Cross-Site Content

  - Build precision-oriented queries using planned event features
  - Use precision-oriented queries to retrieve data from: Twitter, Flickr, YouTube
  - Build recall-oriented queries using data from: each site individually, all sites collectively
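Tying the sketches above together, a hypothetical end-to-end flow; `search(site, query)` stands in for each site's query API and is an assumption, not a real client:

    SITES = ["twitter", "flickr", "youtube"]

    def identify_event_content(event: EventRecord, search, web_doc_freq: dict) -> dict:
        """Two-step, cross-site retrieval sketch. `search(site, query)` is an
        assumed callable returning document texts from that site."""
        # Step 1: precision-oriented queries, issued on every site.
        per_site = {s: [d for q in precision_queries(event) for d in search(s, q)]
                    for s in SITES}
        # Step 2: recall-oriented queries built from one site's results (or from
        # all sites pooled) can be issued against the other sites as well.
        pooled = [d for docs in per_site.values() for d in docs]
        for q in recall_query_candidates(pooled, web_doc_freq, limit=10):
            for s in SITES:
                per_site[s].extend(search(s, q))
        return per_site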
Experimental Settings

60 planned events from EventBrite, LastFM, LinkedIn, and Facebook.

Corresponding social media documents:
  - Retrieved from Twitter, Flickr, and YouTube
  - Ranked according to similarity to the event record

Techniques:
  - Precision: only precision-oriented queries
  - MS: precision- and recall-oriented queries selected using the Microsoft n-gram probability score
  - TR/RTR: precision- and recall-oriented queries selected using the ratio of document frequency around the time of the event to document frequency in a larger time window
Evaluation

  - How do our queries compare with human-generated queries for the event?
  - How good are our queries?
  - How good are the results retrieved by our queries?
How good are our queries?

Would the query match documents related to the event? (1 = not likely, 5 = certainly)

[Chart: average ratings on the 1-5 scale for queries targeting Twitter, Flickr, YouTube, and all sites, comparing the MS, TR, RTR, MS-TR, MS-RTR, and Precision strategies]
Can our queries retrieve relevant results?

Rank retrieved results:
  - Based on similarity to the event record
  - Using a multi-feature similarity metric (Becker et al., WSDM'10)

Evaluate relevance of documents:
  - NDCG
  - Averaged over all events that had some retrieved results

Consider event coverage.
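For reference, a standard NDCG@k computation over graded relevance judgments (a sketch of the metric itself, not the authors' evaluation code):

    import math

    def ndcg_at_k(relevances: list[float], k: int) -> float:
        """NDCG@k: discounted cumulative gain of the ranking, normalized by the
        gain of the ideal (relevance-sorted) ranking."""
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
        ideal = sorted(relevances, reverse=True)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
        return dcg / idcg if idcg > 0 else 0.0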
NDCG Performance on Twitter

[Chart: NDCG scores for top-k Twitter documents (k = 5 to 20) retrieved by precision-oriented queries (Precision) and by query strategies using Twitter data (Twitter-RTR, Twitter-MS)]
Cross-Site NDCG Performance

[Chart: NDCG scores for top-k YouTube documents (k = 5 to 20) retrieved by precision-oriented queries (Precision) and by query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube-MS)]
Conclusions

  - Developed a two-step query-oriented solution for planned event content identification
      - User-contributed event records
      - Multiple social media sites
  - Identified diverse event content: photos, videos, and tweets
  - Showed how event content from one site can be used to enhance event content identification on other sites
Future Work

  - Leverage explicit links:
      - From event records to documents
      - Between documents from different social media sites
  - Sub-event content analysis
  - Event timeline construction


Editor's Notes

  1. Users often share information about events in a variety of forms on different social media sites.
  2. Social media provides many challenges and opportunities for identifying event information.
  3. Explain that we work in real-time (for the most part) and say we divide the space into unknown and known identification scenarios, then mention the type of event we focus on for each. Also briefly mention that, as we discuss in the thesis, these are not disjoint.
  4. On average, queries generated by this strategy are expected to retrieve some results for their associated event.
  5. This is averaged over all events that had some results. How many events had some results? Precision: 22% of events; Twitter RTR: 76% of events.