The document describes a two-step approach for identifying content related to planned events across multiple social media sites. First, precision-oriented queries are formulated using exact event details. Then recall-oriented queries are generated using terms from retrieved precision query results. Experiments show this approach can identify diverse event-related content from sites like Twitter, Flickr and YouTube. Content from one site also helps retrieve more content from other sites. The approach provides an effective way to discover event content at scale from various social media sources.
6. Event Content in Social Media
Challenges:
Wide variety of topics, not all related to events
(e.g., personal status updates, everyday mundane
conversations)
Unconventional text: abbreviations, typos
Large-scale, rapidly produced content
Opportunities:
Content generated in real-time, as events happen
Rich context features (e.g., time, location)
Users' perspectives
7. Event Content in Social Media
A planned event is a real-world occurrence with a
corresponding published event record consisting of:
Title, describing the subject of the event
The time at which the event is planned to occur
[Diagram: Content Discovery, split into Known and Unknown]
8. Identifying Content for Planned Events
Identify planned event documents given
known event information
User-contributed planned event records
LastFM Events
EventBrite
Facebook Events
Structured features (e.g., title, time, location)
Challenging identification scenario
Known event information is often inaccurate or
incomplete
Social media documents are brief and noisy
10. Approach for Planned Event Content Identification
Two-step query formulation strategy
Precision-oriented queries using known event
features
Recall-oriented queries using retrieved content
from precision-oriented queries
Leverage cross-site content
Identify event documents on each site
individually
Use event documents on one site to retrieve
additional event documents on a different site
11. Query Formulation Strategies: Precision-oriented Queries
Combined event record features
Phrase, bag-of-words, stop word elimination
Examples: [title+venue], [title-no-stopwords+city]
Restricted document creation time
Why is this hard?
Specific titles: Celebrate Brooklyn! Opening Night
Gala & Concert with Andrew Bird
General titles: Opening Night Concert
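The precision-oriented strategies above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop word list, the event record keys (`title`, `venue`, `city`), the quoting convention for phrase matches, and the 7-day creation-time window are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta

# Hypothetical stop word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}


def remove_stop_words(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def precision_queries(event):
    """Build precision-oriented queries from an event record.

    `event` is a dict with the assumed keys 'title', 'venue', and 'city'.
    Quoted strings stand for phrase matches; unquoted terms are bag-of-words.
    """
    title = event["title"]
    title_no_stop = remove_stop_words(title)
    return {
        "title+venue": f'"{title}" "{event["venue"]}"',
        "title+city": f'"{title}" "{event["city"]}"',
        "title-no-stopwords+city": f'{title_no_stop} "{event["city"]}"',
    }


def within_window(created_at, event_time, days=7):
    """Restrict documents to those created near the event time.

    The 7-day default is an assumption; the slides give no window size.
    """
    return abs(created_at - event_time) <= timedelta(days=days)
```

With a specific title such as the Celebrate Brooklyn! example, the phrase query is already selective. With a general title such as "Opening Night Concert", the venue or city term and the time restriction carry most of the precision.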
13. Query Formulation Strategies: Recall-oriented Queries
Generated using high-precision results
from precision-oriented queries
Frequency Analysis
Frequent terms in the event's retrieved content
Infrequent terms in Web documents
Limited to 100 candidate queries
Term Extraction
Identify meaningful event-related concepts
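The frequency analysis step can be illustrated with a small sketch. It is a simplification under stated assumptions: it works on single terms rather than n-grams, `web_doc_freq` is a hypothetical lookup table giving the fraction of Web documents that contain each term, and both thresholds are invented for the example. Only the cap of 100 candidates comes from the slide.

```python
import re
from collections import Counter


def candidate_terms(event_docs, web_doc_freq,
                    min_event_freq=2, max_web_freq=0.01, limit=100):
    """Pick terms that are frequent in the event's retrieved documents
    but infrequent in general Web documents.

    event_docs   -- texts retrieved by the precision-oriented queries
    web_doc_freq -- hypothetical dict: term -> fraction of Web docs with it
    """
    doc_counts = Counter()
    for doc in event_docs:
        # Count each term once per document (document frequency).
        doc_counts.update(set(re.findall(r"[a-z0-9#@]+", doc.lower())))

    candidates = [
        (term, count)
        for term, count in doc_counts.items()
        if count >= min_event_freq
        and web_doc_freq.get(term, 0.0) <= max_web_freq
    ]
    # Most frequent first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return [term for term, _ in candidates[:limit]]
```

Terms missing from `web_doc_freq` are treated as rare on the Web, which favors event-specific names and hashtags. That default is a design choice of this sketch, not something the slides state.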
15. Leveraging Cross-Site Content
Build precision-oriented
queries using planned event
features
Use precision-oriented
queries to retrieve data
from:
Twitter
Flickr
YouTube
Build recall-oriented queries
using data from:
Each site individually
All sites collectively
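The flow on this slide can be sketched as a small pipeline. The two callables are hypothetical stand-ins: `search(site, query)` represents whatever per-site retrieval is available, and `build_recall_queries(docs)` represents the recall-oriented query generation step. Neither name comes from the slides.

```python
def two_step_retrieval(precision_queries, search, build_recall_queries,
                       sites=("twitter", "flickr", "youtube")):
    """Sketch of the two-step, cross-site query flow.

    search(site, query) -> list of documents      (hypothetical callable)
    build_recall_queries(docs) -> list of queries (hypothetical callable)
    """
    # Step 1: run every precision-oriented query on every site.
    precision_results = {
        site: [doc for q in precision_queries for doc in search(site, q)]
        for site in sites
    }

    # Step 2a: recall-oriented queries built from each site individually.
    per_site_queries = {
        site: build_recall_queries(docs)
        for site, docs in precision_results.items()
    }

    # Step 2b: recall-oriented queries built from all sites collectively.
    pooled_docs = [doc for docs in precision_results.values() for doc in docs]
    pooled_queries = build_recall_queries(pooled_docs)

    return precision_results, per_site_queries, pooled_queries
```

Queries in `per_site_queries["twitter"]` can then be issued against YouTube or Flickr, which is how content from one site helps retrieve content on another.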
16. Experimental Settings
60 planned events from
EventBrite, LastFM, LinkedIn, and Facebook
Corresponding social media documents
Retrieved from Twitter, Flickr, and YouTube
Ranked according to similarity to event record
Techniques
Precision: only precision-oriented queries
MS: precision- and recall-oriented queries selected
using Microsoft n-gram probability score
TR/RTR: precision- and recall-oriented queries selected
using ratio of document frequency around the time of
the event to document frequency in larger time
window
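The time-based ratio described for TR/RTR can be sketched as below. The slide gives only the idea (document frequency near the event divided by document frequency in a larger window), so the window sizes and the handling of an empty wide window are assumptions of this example.

```python
from datetime import datetime, timedelta


def time_ratio(doc_times, event_time, near_days=1, wide_days=30):
    """Ratio of documents matching a query near the event time to
    documents matching it in a larger window around the event.

    doc_times -- creation times of the documents the query matched
    near_days, wide_days -- assumed window half-widths, in days
    """
    near = timedelta(days=near_days)
    wide = timedelta(days=wide_days)
    near_count = sum(1 for t in doc_times if abs(t - event_time) <= near)
    wide_count = sum(1 for t in doc_times if abs(t - event_time) <= wide)
    if wide_count == 0:
        return 0.0  # no evidence either way; an assumption of this sketch
    return near_count / wide_count
```

A query whose matches cluster around the event time scores close to 1, while a query that matches steadily all month scores low, which suggests it is not specific to the event.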
17. Evaluation
How do our queries compare with human-
generated queries for the event?
How good are our queries?
How good are the results retrieved by our
queries?
18. How good are our queries?
Would the query match documents related
to the event? 1 = not likely, 5 = certainly
[Figure: query quality scores (y-axis, 1 to 5) for the MS, TR, RTR, MS-TR, MS-RTR, and Precision strategies; x-axis labels: Twitter, Flickr, YouTube, All, Precision.]
19. Can our queries retrieve relevant results?
Rank retrieved results
Based on similarity to event record
Using multi-feature similarity metric (Becker et al.,
WSDM'10)
Evaluate relevance of documents
NDCG
Averaged over all events that had some retrieved
results
Consider event coverage
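For reference, NDCG at rank k and the per-event averaging can be sketched as follows. This uses a common formulation with a log2 discount and an ideal ranking taken from the judged retrieved documents; the slides do not state which variant was used, so treat both choices as assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance judgments."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


def mean_ndcg(per_event_relevances, k):
    """Average NDCG@k over events that had some retrieved results.

    Events with an empty result list are skipped, as on the slide.
    """
    scores = [ndcg_at_k(rels, k) for rels in per_event_relevances if rels]
    return sum(scores) / len(scores) if scores else 0.0
```

Because events with no results are excluded from the average, NDCG alone can flatter a strategy that retrieves results for few events. This is why event coverage is considered alongside it.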
20. NDCG Performance on Twitter
[Figure: NDCG (y-axis, 0.6 to 1) against number of documents k (x-axis, 5 to 20) for Twitter-MS, Twitter-RTR, and Precision.]
NDCG scores for top-k Twitter documents retrieved by
Precision-oriented queries (Precision), and query strategies
using Twitter data (Twitter-RTR, Twitter-MS).
21. Cross-Site NDCG Performance
[Figure: NDCG (y-axis, 0 to 1.1) against number of documents k (x-axis, 0 to 25) for Precision, Twitter-MS, and YouTube-MS.]
NDCG scores for top-k YouTube documents retrieved by
Precision-oriented queries (Precision), and query strategies
using data from Twitter (Twitter-MS) and YouTube (YouTube-MS).
22. Conclusions
Developed a two-step query-oriented
solution for planned event content
identification
User-contributed event records
Multiple social media sites
Identified diverse event content:
photos, videos, and tweets
Showed how event content from one site
can be used to enhance event content
identification on other sites
23. Future Work
Leverage explicit links
From event records to documents
Between documents from different social media
sites
Sub-event content analysis
Event timeline construction
Editor's Notes
Users often share information about events in a variety of forms on different social media sites.
Social media provides many challenges and opportunities for identifying event information.
Explain that we work in real time (for the most part) and say we divide the space into unknown and known identification scenarios, then mention the type of event we focus on for each. Also briefly mention that, as we discuss in the thesis, these are not disjoint.
On average, queries generated by this strategy are expected to retrieve some results for their associated event.
This is averaged over all events that had some results. How many events had some results? Precision: 22% of events; Twitter-RTR: 76% of events.