
EVENT IDENTIFICATION IN SOCIAL MEDIA
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)

Social Media Sites Host Many “Event” Documents
- Photo-sharing: Flickr
- Video-sharing: YouTube
- Social networking: Facebook
- “Event” = something that occurs at a certain time in a certain place [Yang et al. ’99]
- Popular, widely known events: Presidential Inauguration, Thanksgiving Day Parade
- Smaller events, without traditional news coverage: local food drive, street fair, …
- Social media documents for the “All Points West” festival, Liberty State Park, New Jersey, 8/8/08

Identifying Events and Associated Social Media Documents
- Applications: event search and browsing, local search, …
- General approach: group similar documents via clustering
- Each cluster corresponds to one event and its associated social media documents

Event Identification: Challenges
- Uneven data quality
  - Missing, short, uninformative text
  - … but revealing structured context is available: tags, date/time, geo-coordinates
- Scalability
- Dynamic data stream of event information
- Unknown number of events
  - The number of clusters is a required input for many clustering algorithms
  - Difficult to estimate

Clustering Social Media Documents
- Social media document representation
- Social media document similarity
- Social media document clustering
  - Clustering task: definition
  - Ensemble algorithm: combining multiple clustering results
- Preliminary evaluation

Social Media Document Representation
- Features: Title, Description, Tags, Date/Time, Location, All-Text

Social Media Document Similarity
- Text: tf-idf weights, cosine similarity
- Time: proximity in minutes
- Location: geo-coordinate proximity
- Features shown in the figure: Title, Description, Tags, All-Text, Date/Time-Keywords, Date/Time-Proximity, Location-Keywords, Location-Proximity
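The three similarity types on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the one-day time window, the 100 km distance cap, and the linear decay used to map proximity into [0, 1] are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def time_similarity(minutes_a, minutes_b, window_minutes=24 * 60):
    """Assumed mapping: 1 at identical times, decaying linearly to 0 at the window."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window_minutes)

def geo_similarity(lat1, lon1, lat2, lon2, max_km=100.0):
    """Assumed mapping: haversine distance, decaying linearly to 0 at max_km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    km = 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))
    return max(0.0, 1.0 - km / max_km)

# Toy tag lists, made up for illustration only.
docs = [["apw", "festival", "liberty"],
        ["apw", "festival", "radiohead"],
        ["food", "drive"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Each function returns a score in [0, 1], so the per-feature similarities are directly comparable.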

Social Media Document Clustering Framework
- Figure labels: social media documents, document feature representation, event clusters

Clustering: Ensemble Algorithm
- Clusterers: C_title, C_tags, C_time
- Weights: W_title, W_tags, W_time (learned in a training step)
- Consensus function f(C, W): combine ensemble similarities
- Output: ensemble clustering solution
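One way to read the consensus function f(C, W) is as a weighted vote over the clusterers, which matches the "weighted average of clusterers' binary predictions" described in the experimental setup. The sketch below assumes each clusterer reports 1 if it placed two documents in the same cluster and 0 otherwise; the function name, the weight values, and the 0.5 decision threshold are hypothetical.

```python
def consensus_same_event(votes, weights, threshold=0.5):
    """Weighted average of binary same-cluster votes, one vote per clusterer.

    votes:   dict mapping clusterer name -> 0 or 1
    weights: dict mapping clusterer name -> non-negative weight
    Returns (score, decision); the threshold is an assumption of this sketch.
    """
    total = sum(weights.values())
    score = sum(weights[name] * votes[name] for name in weights) / total
    return score, score >= threshold

# Illustrative weights only; in the slides they are learned in a training step.
weights = {"title": 0.2, "tags": 0.5, "time": 0.3}
votes = {"title": 0, "tags": 1, "time": 1}
score, same_event = consensus_same_event(votes, weights)
print(round(score, 2), same_event)
```

A clusterer with a higher weight moves the score more, so a reliable clusterer can outvote several weak ones.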

Clustering: Measuring Quality
- Homogeneous clusters
- Complete clusters
- Metric: Normalized Mutual Information (NMI)
  - Shared information between clustering solution and “ground truth”
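NMI can be computed directly from the two label assignments. The slide does not say which normalization is used; the sketch below assumes the common variant that divides the mutual information by the mean of the two entropies, NMI = 2·I(C; E) / (H(C) + H(E)).

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy of a label distribution given as counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def nmi(labels_true, labels_pred):
    """Normalized mutual information, arithmetic-mean normalization (assumed)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
             for (t, p), c in joint.items())
    h_true = entropy(count_true, n)
    h_pred = entropy(count_pred, n)
    if h_true + h_pred == 0.0:
        return 1.0  # both sides put everything in a single cluster
    return 2.0 * mi / (h_true + h_pred)

# Cluster ids are arbitrary names, so a relabeled perfect solution still scores 1.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
# A solution that carries no information about the ground truth scores 0.
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```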

Experimental Setup
- Data: >270K Flickr photos
  - Event labels from Yahoo!’s “upcoming” event database
  - Split into 3 parts for training/validation/testing
- Clusterers: single-pass algorithm with centroid similarity
- Weighting scheme: Normalized Mutual Information (NMI) scores on validation set
- Consensus function: weighted average of clusterers’ binary predictions
- Final prediction step: single-pass clustering algorithm
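A single-pass clusterer with centroid similarity can be sketched as follows. Each document is compared against the centroids of the existing clusters and either joins the most similar one or starts a new cluster, so the number of events never has to be fixed in advance. The threshold value and the running-mean centroid update are assumptions of this sketch; the slides do not give those details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_pass_cluster(vectors, threshold):
    """Assign each vector, in arrival order, to the closest centroid or a new cluster."""
    clusters = []  # each cluster: {"centroid": dict, "members": [indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
            continue
        # Update the centroid as the running mean of its members.
        k = len(best["members"])
        terms = set(best["centroid"]) | set(vec)
        best["centroid"] = {t: (best["centroid"].get(t, 0.0) * k + vec.get(t, 0.0)) / (k + 1)
                            for t in terms}
        best["members"].append(i)
    return [c["members"] for c in clusters]

# Toy vectors, made up for illustration only.
vectors = [{"apw": 1.0, "festival": 1.0},
           {"apw": 1.0, "radiohead": 1.0},
           {"food": 1.0, "drive": 1.0}]
print(single_pass_cluster(vectors, threshold=0.3))
```

Because every document is seen once, the result depends on arrival order, which is the usual trade-off for a method that can keep up with a data stream.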

Preliminary Evaluation Results
- Individual clusterer performance
  - Highest NMI: Tags, All-Text
  - Lowest NMI: Description, Title
- Ensemble performance, compared against all individual clusterers
  - Highest overall performance in terms of NMI
  - More homogeneous clusters: each event is spread over fewer clusters
- Details in paper

Future Work: Alternative Choices
- Document similarity metric
  - Ensemble approach
    - Weight assignment
    - Choice of clusterers
  - Train a classifier to predict document similarity
    - Features correspond to similarity scores: all-text, title, tags, time, location, etc.
    - Numeric values in [0,1]
    - State-of-the-art classifiers: SVM, Logistic Regression, …
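The classifier idea can be illustrated with a small logistic regression over per-feature similarity scores. This is a hedged sketch of the general technique, not the authors' planned method: the training pairs are made-up toy values, the feature order (text, time, location) is an assumption, and plain batch gradient descent stands in for whatever optimizer a real library would use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_same_event(w, b, x):
    """Estimated probability that a document pair belongs to the same event."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Toy pairs: [text_sim, time_sim, location_sim], each in [0, 1]; label 1 = same event.
pairs = [[0.90, 0.95, 0.90], [0.80, 0.90, 0.85], [0.70, 0.99, 0.95],
         [0.10, 0.20, 0.10], [0.20, 0.10, 0.30], [0.05, 0.40, 0.20]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(pairs, labels)
print(predict_same_event(w, b, [0.85, 0.90, 0.90]) > 0.5)
```

The learned weights play the same role as the ensemble weights, but they are fitted jointly rather than assigned per clusterer.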

Future Work: Alternative Choices
- Final clustering step
  - Apply graph partitioning algorithms
  - Requires estimating the number of clusters
- Evaluation metrics: beyond NMI
- Datasets: Flickr, LastFM, YouTube
- Exploit social network connections

Conclusions
- Identified events and their corresponding social media documents
- Proposed a clustering solution
  - Leveraged different representations of social media documents
  - Employed various social media similarity metrics
  - Developed a weighted ensemble clustering approach
- Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs
