Hila Becker, Mor Naaman, Luis Gravano, "Event Identification in Social Media", in Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB '09), 2009.
Event Identification in Social Media
1. EVENT IDENTIFICATION IN SOCIAL MEDIA. Hila Becker, Luis Gravano (Columbia University); Mor Naaman (Rutgers University).
2. Social Media Sites Host Many Event Documents. Photo-sharing: Flickr. Video-sharing: YouTube. Social networking: Facebook. Event = something that occurs at a certain time in a certain place [Yang et al. 99]. Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade. Smaller events, without traditional news coverage: local food drive, street fair. Example: social media documents for the All Points West festival, Liberty State Park, New Jersey, 8/8/08.
3. Identifying Events and Associated Social Media Documents. Applications: event search and browsing; local search. General approach: group similar documents via clustering. Each cluster corresponds to one event and its associated social media documents.
4. Event Identification: Challenges. Uneven data quality: missing, short, or uninformative text, but revealing structured context is available (tags, date/time, geo-coordinates). Scalability: dynamic data stream of event information. Unknown number of events: necessary for many clustering algorithms, yet difficult to estimate.
5. Clustering Social Media Documents. Social media document representation. Social media document similarity. Social media document clustering. Clustering task: definition. Ensemble algorithm: combining multiple clustering results. Preliminary evaluation.
6. Social Media Document Representation. Title, Description, Tags, Date/Time, Location, All-Text.
7. Social Media Document Similarity. Text (Title, Description, Tags, All-Text, Date/Time-Keywords, Location-Keywords): tf-idf weights, cosine similarity. Time (Date/Time-Proximity): proximity in minutes. Location (Location-Proximity): geo-coordinate proximity.
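The three similarity notions on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slides name tf-idf with cosine similarity, proximity in minutes, and geo-coordinate proximity, but not the exact formulas. The smoothed idf, the linear decay, the haversine distance, and the parameters `window_minutes` and `max_km` are hypothetical choices made here so that every score falls in [0, 1].

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) keeps weights positive for very common terms.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def time_similarity(minutes_a, minutes_b, window_minutes=1440.0):
    """1.0 for identical timestamps, decaying linearly to 0.0 at window_minutes apart."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window_minutes)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))

def location_similarity(point_a, point_b, max_km=50.0):
    """1.0 for identical coordinates, decaying linearly to 0.0 at max_km apart."""
    return max(0.0, 1.0 - haversine_km(*point_a, *point_b) / max_km)
```

Each function returns a score in [0, 1], which makes the per-feature similarities comparable when they are later combined.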
8. Social Media Document Clustering Framework. Document feature representation. Social media documents. Event clusters.
9. Clustering: Ensemble Algorithm. Clusterers C_title, C_tags, C_time with weights W_title, W_tags, W_time (learned in a training step). Consensus function f(C, W): combine ensemble similarities. Result: ensemble clustering solution.
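A minimal sketch of one possible consensus function f(C, W), assuming each clusterer casts a binary vote on whether two documents share a cluster and that the votes are averaged using the clusterer weights. The function names, the dictionary layout, and the toy weights are hypothetical and serve only as illustration.

```python
def same_cluster_votes(assignments, doc_a, doc_b):
    """Binary vote per clusterer: 1 if it puts both documents in the same cluster."""
    return {name: int(labels[doc_a] == labels[doc_b])
            for name, labels in assignments.items()}

def consensus_similarity(votes, weights):
    """Weighted average of the binary votes; the result lies in [0, 1]."""
    total = sum(weights[name] for name in votes)
    if total == 0:
        return 0.0
    return sum(weights[name] * vote for name, vote in votes.items()) / total

# Hypothetical cluster assignments from three clusterers for documents "a" and "b".
assignments = {
    "title": {"a": 0, "b": 0},
    "tags": {"a": 0, "b": 1},
    "time": {"a": 2, "b": 2},
}
weights = {"title": 0.25, "tags": 0.5, "time": 0.25}  # toy values, not learned
score = consensus_similarity(same_cluster_votes(assignments, "a", "b"), weights)
```

The resulting score can serve as a single document-pair similarity for a final clustering step.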
10. Clustering: Measuring Quality. Goals: homogeneous clusters, complete clusters. Metric: Normalized Mutual Information (NMI), the shared information between the clustering solution and the ground truth.
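NMI can be computed from two label lists as sketched below. The slides do not state which normalization was used; this sketch assumes the mutual information is divided by the arithmetic mean of the two entropies, which is one common variant.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(predicted, truth):
    """Mutual information normalized by the mean of the two entropies (an assumption)."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))
    pred_counts, true_counts = Counter(predicted), Counter(truth)
    mutual_info = sum(
        (c / n) * math.log(c * n / (pred_counts[p] * true_counts[t]))
        for (p, t), c in joint.items()
    )
    denom = (entropy(predicted) + entropy(truth)) / 2
    # Both labelings are a single cluster: treat them as trivially identical.
    return mutual_info / denom if denom > 0 else 1.0
```

A clustering that matches the ground truth up to a renaming of cluster ids scores 1, while a clustering that is statistically independent of the ground truth scores 0.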
11. Experimental Setup. Data: >270K Flickr photos. Event labels from Yahoo!'s Upcoming event database. Split into 3 parts for training/validation/testing. Clusterers: single-pass algorithm with centroid similarity. Weighting scheme: Normalized Mutual Information (NMI) scores on validation set. Consensus function: weighted average of clusterers' binary predictions. Final prediction step: single-pass clustering algorithm.
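A single-pass clusterer with centroid similarity can be sketched as follows. It assumes sparse vector inputs, cosine similarity, a fixed similarity `threshold` (hypothetical; the slides do not give one), and a running-mean centroid update. Documents are processed once, in order, so the number of clusters never needs to be specified in advance.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_pass_cluster(vectors, threshold, similarity=cosine):
    """Assign each vector to its most similar centroid, or start a new cluster
    when no centroid reaches the threshold. Returns one cluster id per vector."""
    centroids, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = similarity(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(dict(vec))
            sizes.append(1)
            labels.append(len(centroids) - 1)
        else:
            size, centroid = sizes[best], centroids[best]
            # Running mean keeps the centroid equal to the average of its members.
            for term in set(centroid) | set(vec):
                centroid[term] = (centroid.get(term, 0.0) * size + vec.get(term, 0.0)) / (size + 1)
            sizes[best] = size + 1
            labels.append(best)
    return labels
```

Because each document is compared only against the current centroids, the approach suits a stream of incoming documents, though the result depends on document order and on the chosen threshold.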
12. Preliminary Evaluation Results. Individual clusterer performance: highest NMI for Tags and All-Text; lowest NMI for Description and Title. Ensemble performance, compared against all individual clusterers: highest overall performance in terms of NMI; more homogeneous clusters: each event is spread over fewer clusters. Details in paper.
13. Future Work: Alternative Choices. Document similarity metric. Ensemble approach: weight assignment, choice of clusterers. Train a classifier to predict document similarity: features correspond to similarity scores (all-text, title, tags, time, location, etc.), with numeric values in [0,1]. State-of-the-art classifiers: SVM, Logistic Regression.
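The classifier idea can be illustrated with a small logistic regression trained by gradient descent, where each training example is a pair of documents described by its per-feature similarity scores in [0, 1] and labeled 1 if both documents belong to the same event. This is a sketch of the general technique, not the authors' method; the function names, learning rate, epoch count, and toy data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_same_event(weights, bias, x):
    """Estimated probability that a document pair belongs to the same event."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Toy pairs: (text similarity, time similarity) -> 1 if same event, else 0.
pairs = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
same_event = [1, 1, 0, 0]
weights, bias = train_logistic(pairs, same_event)
```

The predicted probability could then stand in for a hand-weighted consensus score as the pairwise document similarity.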
14. Future Work: Alternative Choices. Final clustering step: apply graph partitioning algorithms; requires estimating the number of clusters. Evaluation metrics: beyond NMI. Datasets: Flickr, LastFM, YouTube. Exploit social network connections.
15. Conclusions. Identified events and their corresponding social media documents. Proposed a clustering solution: leveraged different representations of social media documents; employed various social media similarity metrics; developed a weighted ensemble clustering approach. Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs.
Editor's Notes
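#3: Social media sites host a variety of event information. We use the traditional event definition from the event detection literature, stating that an event is something that occurs at a certain time in a certain place. In particular, we consider events that range from popular, widely known events to smaller events without traditional news coverage.
#4: Our goal is to identify events and their associated social media documents. This could facilitate applications such as event search and browsing, as can be seen in this image, similar to news aggregation sites but for events, including a variety of rich media content. Our approach for event identification uses clustering to group similar event documents, such that each cluster corresponds to one event and its associated social media documents.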
#5: Social media data quality is uneven. When developing our approach, we had to consider the scalability of our algorithms, as there exists a vast amount of social media event data on the web.
#6: Define and motivate the event identification task
#8: We have different notions of similarity for different types of features. We have to come up with a principled way to combine these different notions into a single similarity.
#10: We can cluster our document collection according to the variety of feature representations discussed; each would have its own
#12: Mention briefly where the event IDs came from
#14: I'll add a note on where these experiments stand