Identifying Content for Planned Events Across Social Media Sites
Hila Becker, Dan Iter, Mor Naaman, Luis Gravano
Event Content in Social Media

[Image: MIKE CLARKE/AFP/Getty Images]

[Image source: Tweets from Tahrir, edited by Nadia Idle and Alex Nunns]
Event Content in Social Media

Challenges:
  - Wide variety of topics, not all related to events (e.g., personal status updates, everyday mundane conversations)
  - Unconventional text: abbreviations, typos
  - Large-scale, rapidly produced content

Opportunities:
  - Content generated in real-time, as events happen
  - Rich context features (e.g., time, location)
  - Users' perspective
Event Content in Social Media

[Diagram: Content Discovery, divided into Known and Unknown identification scenarios]

A Planned Event is a real-world occurrence with a corresponding published event record consisting of:
  - Title, describing the subject of the event
  - The time at which the event is planned to occur
Identifying Content for Planned Events

Identify planned event documents given known event information:
  - User-contributed planned event records (LastFM Events, EventBrite, Facebook Events)
  - Structured features (e.g., title, time, location)

Challenging identification scenario:
  - Known event information is often inaccurate or incomplete
  - Social media documents are brief and noisy
Planned Event Record

  - Title
  - Description
  - Date/Time
  - Venue
  - City
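To make the record structure concrete, here is a minimal sketch (in Python) of how such a record might be represented; the field names simply mirror the list above and are illustrative, not the authors' actual schema:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class EventRecord:
        title: str            # e.g., "Celebrate Brooklyn! Opening Night Gala & Concert with Andrew Bird"
        description: str      # free-text description of the event
        start_time: datetime  # the time at which the event is planned to occur
        venue: str            # the venue hosting the event
        city: str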
Approach for Planned Event Content Identification

Two-step query formulation strategy:
  - Precision-oriented queries using known event features
  - Recall-oriented queries using retrieved content from precision-oriented queries

Leverage cross-site content:
  - Identify event documents on each site individually
  - Use event documents on one site to retrieve additional event documents on a different site
Query Formulation Strategies: Precision-oriented Queries

Combined event record features:
  - Phrase, bag-of-words, stop-word elimination
  - Examples: [title+venue], [title-no-stopwords+city]

Restricted document creation time

Why is this hard?
  - Specific titles: "Celebrate Brooklyn! Opening Night Gala & Concert with Andrew Bird"
  - General titles: "Opening Night Concert"
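A minimal sketch of this formulation step, reusing the EventRecord sketch above; the stop-word list and the exact feature combinations are assumptions for illustration:

    STOPWORDS = {"a", "an", "and", "the", "of", "with", "at", "in", "on", "&"}

    def strip_stopwords(text: str) -> str:
        return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

    def precision_queries(event: EventRecord) -> list[str]:
        """Combine event-record features into precision-oriented queries.
        Retrieved documents would additionally be restricted to a creation-time
        window around event.start_time (not shown)."""
        return [
            f'"{event.title}" {event.venue}',                # [title+venue], title as a phrase
            f'{strip_stopwords(event.title)} {event.city}',  # [title-no-stopwords+city]
        ]

For the Andrew Bird example above, the second query would become the bag of words "Celebrate Brooklyn! Opening Night Gala Concert Andrew Bird" plus the city.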
Query Formulation Strategies: Precision-oriented Queries Demo
Query Formulation Strategies: Recall-oriented Queries

Generated using high-precision results from precision-oriented queries.

Frequency Analysis:
  - Frequent terms in the event's retrieved content
  - Infrequent terms in Web documents
  - Limited to 100 candidate queries

Term Extraction:
  - Identify meaningful event-related concepts
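A sketch of the frequency-analysis idea, assuming a background lookup of Web document frequencies (e.g., from a Web n-gram corpus); the scoring function is illustrative:

    from collections import Counter

    def recall_query_candidates(event_docs: list[str], web_doc_freq: dict, limit: int = 100) -> list[str]:
        """Rank terms that are frequent in the event's retrieved content but
        infrequent in Web documents; keep up to `limit` candidates."""
        counts = Counter(w.lower() for doc in event_docs for w in doc.split())
        scored = sorted(counts,
                        key=lambda t: counts[t] / (1 + web_doc_freq.get(t, 0)),
                        reverse=True)
        return scored[:limit]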
Query Selection Strategies

Problem: potentially large set of generated queries.

Select top candidate queries by:
  - Specificity: favor longer queries
  - Temporal profile

[Chart: document frequency per day, 6/7/11 through 6/13/11, for the queries [andrew bird concert] and [state farm insurance]]
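A sketch of the temporal-profile criterion (this mirrors the TR/RTR idea described in the experimental settings: document frequency around the event time relative to a larger window); the window sizes are assumptions:

    from datetime import datetime, timedelta

    def temporal_ratio(doc_times: list[datetime], event_time: datetime,
                       near: timedelta = timedelta(days=1),
                       wide: timedelta = timedelta(days=7)) -> float:
        """Ratio of a query's matching documents near the event time to its
        matches in a wider window. A bursty, event-specific query such as
        [andrew bird concert] scores high; a temporally flat query such as
        [state farm insurance] scores low."""
        near_count = sum(1 for t in doc_times if abs(t - event_time) <= near)
        wide_count = sum(1 for t in doc_times if abs(t - event_time) <= wide)
        return near_count / wide_count if wide_count else 0.0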
Leveraging Cross-Site Content

  - Build precision-oriented queries using planned event features
  - Use precision-oriented queries to retrieve data from: Twitter, Flickr, YouTube
  - Build recall-oriented queries using data from: each site individually, all sites collectively
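Tying the sketches above together, a hypothetical end-to-end flow; `search(site, query)` stands in for each site's query API and is an assumption, not a real client:

    SITES = ["twitter", "flickr", "youtube"]

    def identify_event_content(event: EventRecord, search, web_doc_freq: dict) -> dict:
        """Two-step, cross-site retrieval sketch. `search(site, query)` is an
        assumed callable returning document texts from that site."""
        # Step 1: precision-oriented queries, issued on every site.
        per_site = {s: [d for q in precision_queries(event) for d in search(s, q)]
                    for s in SITES}
        # Step 2: recall-oriented queries built from one site's results (or from
        # all sites pooled) can be issued against the other sites as well.
        pooled = [d for docs in per_site.values() for d in docs]
        for q in recall_query_candidates(pooled, web_doc_freq, limit=10):
            for s in SITES:
                per_site[s].extend(search(s, q))
        return per_site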
Experimental Settings

60 planned events from EventBrite, LastFM, LinkedIn, and Facebook.

Corresponding social media documents:
  - Retrieved from Twitter, Flickr, and YouTube
  - Ranked according to similarity to the event record

Techniques:
  - Precision: only precision-oriented queries
  - MS: precision- and recall-oriented queries selected using the Microsoft n-gram probability score
  - TR/RTR: precision- and recall-oriented queries selected using the ratio of document frequency around the time of the event to document frequency in a larger time window
Evaluation

  - How do our queries compare with human-generated queries for the event?
  - How good are our queries?
  - How good are the results retrieved by our queries?
How good are our queries?

Would the query match documents related to the event? (1 = not likely, 5 = certainly)

[Chart: average ratings on the 1-5 scale for queries targeting Twitter, Flickr, YouTube, and all sites, comparing the MS, TR, RTR, MS-TR, MS-RTR, and Precision strategies]
Can our queries retrieve relevant results?

Rank retrieved results:
  - Based on similarity to the event record
  - Using a multi-feature similarity metric (Becker et al., WSDM'10)

Evaluate relevance of documents:
  - NDCG
  - Averaged over all events that had some retrieved results

Consider event coverage.
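For reference, a standard NDCG@k computation over graded relevance judgments (a sketch of the metric itself, not the authors' evaluation code):

    import math

    def ndcg_at_k(relevances: list[float], k: int) -> float:
        """NDCG@k: discounted cumulative gain of the ranking, normalized by the
        gain of the ideal (relevance-sorted) ranking."""
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
        ideal = sorted(relevances, reverse=True)
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
        return dcg / idcg if idcg > 0 else 0.0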
NDCG Performance on Twitter

[Chart: NDCG scores for top-k Twitter documents (k = 5 to 20) retrieved by precision-oriented queries (Precision) and by query strategies using Twitter data (Twitter-RTR, Twitter-MS)]
Cross-Site NDCG Performance

[Chart: NDCG scores for top-k YouTube documents (k = 5 to 20) retrieved by precision-oriented queries (Precision) and by query strategies using data from Twitter (Twitter-MS) and YouTube (YouTube-MS)]
Conclusions

  - Developed a two-step query-oriented solution for planned event content identification
      - User-contributed event records
      - Multiple social media sites
  - Identified diverse event content: photos, videos, and tweets
  - Showed how event content from one site can be used to enhance event content identification on other sites
Future Work

  - Leverage explicit links:
      - From event records to documents
      - Between documents from different social media sites
  - Sub-event content analysis
  - Event timeline construction


Editor's Notes

  1. Users often share information about events in a variety of forms on different social media sites.
  2. Social media provides many challenges and opportunities for identifying event information.
  3. Explain that we work in real-time (for the most part) and say we divide the space into unknown and known identification scenarios, then mention the type of event we focus on for each. Also briefly mention that, as we discuss in the thesis, these are not disjoint.
  4. On average, queries generated by this strategy are expected to retrieve some results for their associated event.
  5. This is averaged over all events that had some results. How many events had some results? Precision: 22% of events; Twitter RTR: 76% of events.