

Event Identification in Social Media

  • 1. Event Identification in Social Media. Hila Becker, Luis Gravano (Columbia University); Mor Naaman (Rutgers University)
  • 2. Social Media Sites Host Many Event Documents. Photo-sharing: Flickr; video-sharing: YouTube; social networking: Facebook. Event = something that occurs at a certain time in a certain place [Yang et al. 99]. Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade. Smaller events, without traditional news coverage: local food drive, street fair. Image caption: social media documents for All Points West festival, Liberty State Park, New Jersey, 8/8/08.
  • 3. Identifying Events and Associated Social Media Documents. Applications: event search and browsing; local search. General approach: group similar documents via clustering. Each cluster corresponds to one event and its associated social media documents.
  • 4. Event Identification: Challenges. Uneven data quality: missing, short, uninformative text, but revealing structured context is available (tags, date/time, geo-coordinates). Scalability: dynamic data stream of event information. Unknown number of events: necessary for many clustering algorithms, but difficult to estimate.
  • 5. Clustering Social Media Documents (outline). Social media document representation; social media document similarity; social media document clustering. Clustering task: definition. Ensemble algorithm: combining multiple clustering results. Preliminary evaluation.
  • 6. Social Media Document Representation: Title, Description, Tags, Date/Time, Location, All-Text
  • 7. Social Media Document Similarity. Text: tf-idf weights, cosine similarity. Location: geo-coordinate proximity. Time: proximity in minutes. Features shown: Title, Description, Tags, All-Text, Date/Time-Keywords, Date/Time-Proximity, Location-Keywords, Location-Proximity. [Diagram: documents A and B compared feature by feature.]
  • 8. Social Media Document Clustering Framework. Diagram labels: social media documents, document feature representation, event clusters.
  • 9. Clustering: Ensemble Algorithm. Consensus function: combine ensemble similarities. The clusterers C_title, C_tags, C_time and their weights W_title, W_tags, W_time (learned in a training step) feed the consensus function f(C, W), which produces the ensemble clustering solution.
  • 10. Clustering: Measuring Quality. Goals: homogeneous clusters; complete clusters. Metric: Normalized Mutual Information (NMI), the shared information between the clustering solution and the ground truth.
  • 11. Experimental Setup. Data: >270K Flickr photos; event labels from Yahoo!'s Upcoming event database; split into 3 parts for training/validation/testing. Clusterers: single-pass algorithm with centroid similarity. Weighting scheme: Normalized Mutual Information (NMI) scores on the validation set. Consensus function: weighted average of the clusterers' binary predictions. Final prediction step: single-pass clustering algorithm.
  • 12. Preliminary Evaluation Results. Individual clusterer performance: highest NMI for Tags and All-Text; lowest NMI for Description and Title. Ensemble performance, compared against all individual clusterers: highest overall performance in terms of NMI; more homogeneous clusters; each event is spread over fewer clusters. Details in paper.
  • 13. Future Work: Alternative Choices. Document similarity metric. Ensemble approach: weight assignment; choice of clusterers. Train a classifier to predict document similarity: features correspond to similarity scores (all-text, title, tags, time, location, etc.) with numeric values in [0,1]. State-of-the-art classifiers: SVM, Logistic Regression, ...
  • 14. Future Work: Alternative Choices. Final clustering step: apply graph partitioning algorithms (requires estimating the number of clusters). Evaluation metrics: beyond NMI. Datasets: Flickr, LastFM, YouTube. Exploit social network connections.
  • 15. Conclusions. Identified events and their corresponding social media documents. Proposed a clustering solution: leveraged different representations of social media documents; employed various social media similarity metrics; developed a weighted ensemble clustering approach. Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs.
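
As a rough illustration of slides 7 and 11, the sketch below combines tf-idf weighting, cosine similarity, and a single-pass clustering step that compares each document with the existing cluster centroids. It is not the authors' implementation: the function names, the smoothed idf formula, the threshold value, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_vectors(docs):
    """Turn tokenized documents into tf-idf weighted sparse vectors."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed idf (an assumption) so that common terms keep a nonzero weight.
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
         for t, tf in Counter(d).items()}
        for d in docs
    ]


def single_pass(vectors, threshold):
    """Assign each vector to its most similar cluster, or open a new one."""
    sums, labels = [], []  # sums[i] is the summed vector of cluster i
    for v in vectors:
        best, best_sim = -1, threshold
        for i, s in enumerate(sums):
            # Cosine is scale-invariant, so the sum stands in for the centroid.
            sim = cosine(v, s)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            sums.append(dict(v))
            labels.append(len(sums) - 1)
        else:
            for t, w in v.items():
                sums[best][t] = sums[best].get(t, 0.0) + w
            labels.append(best)
    return labels


# Hypothetical tag lists for four photos from two events.
docs = [
    ["festival", "music", "park"],
    ["music", "festival", "stage"],
    ["food", "drive", "charity"],
    ["charity", "food", "donation"],
]
labels = single_pass(tfidf_vectors(docs), threshold=0.3)
```

Because the number of clusters is never specified, this kind of pass fits the "unknown number of events" constraint from slide 4; the threshold takes over the role of that parameter and would need tuning on held-out data.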

Editor's Notes

  • #3: Social media sites host a variety of event information. We use the traditional event definition from the event detection literature, stating that an event is ... In particular, we consider events that range from ...
  • #4: Our goal is to ... This could facilitate applications such as ... As can be seen in this image, similar to news aggregation sites but for events, including a variety of rich media content. Our approach for event identification uses clustering to group similar event documents, such that ...
  • #5: Social media data quality is uneven. When developing our approach, we had to consider the scalability of our algorithms, as there exists a vast amount of social media event data on the web.
  • #6: Define and motivate the event identification task
  • #8: We have different notions of similarity for different types of features. We have to come up with a principled way to combine these different notions into a single similarity.
  • #10: We can cluster our document collection according to the variety of feature representations discussed; each would have its own ...
  • #12: Mention briefly where the event IDs came from
  • #14: I'll add a note on where these experiments stand
  • #15: Backup slide
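
Slides 9 to 11 describe weighting each clusterer by its NMI score and combining the clusterers' binary predictions through a weighted average. The sketch below shows one way this could look. It is a simplified illustration under stated assumptions, not the method from the paper: the pairwise form of the vote, the NMI normalization 2·I/(H(C)+H(E)), the function names, and the toy labelings are all hypothetical.

```python
import math
from collections import Counter


def nmi(pred, truth):
    """Normalized mutual information, here normalized as 2*I / (H(pred) + H(truth))."""
    n = len(pred)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    count_pred, count_truth = Counter(pred), Counter(truth)
    joint = Counter(zip(pred, truth))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_pred[p] * count_truth[t]))
        for (p, t), c in joint.items()
    )
    denom = entropy(pred) + entropy(truth)
    # Two trivial one-cluster solutions are treated as identical (assumption).
    return 2.0 * mutual_info / denom if denom else 1.0


def consensus_score(i, j, labelings, weights):
    """Weighted average of binary votes on whether documents i and j co-cluster."""
    total = sum(weights.values())
    agree = sum(
        w for name, w in weights.items()
        if labelings[name][i] == labelings[name][j]
    )
    return agree / total


# Hypothetical validation data: ground-truth events and three clusterers.
truth = [0, 0, 1, 1]
labelings = {
    "title": [0, 1, 2, 2],
    "tags": [0, 0, 1, 1],
    "time": [0, 0, 0, 1],
}
# Each clusterer's weight is its NMI against the validation ground truth.
weights = {name: nmi(labels, truth) for name, labels in labelings.items()}
score_same_event = consensus_score(0, 1, labelings, weights)
score_diff_event = consensus_score(1, 2, labelings, weights)
```

In this toy setup the tags clusterer matches the ground truth exactly and therefore receives the largest weight, so its vote dominates the consensus score for each document pair.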