The document describes an architecture for aggregating news articles and associated comments from multiple sources. It involves crawling RSS feeds to track new articles, fetching comments for each article from APIs like Facebook and Disqus, and collecting related tweets. The data is stored and processed to enable querying comments by topic, inferring hot topics, and analyzing sentiment about news across comment sources. Real-time Twitter data would also be filtered and classified.
2. Recognizing I have a problem
Addicted to news
My typical browsing pattern:
Google News
Hey that's interesting
Read one article
Read all the comments about the news in all the newspapers, especially those where I know I'll disagree with the general opinion
12. Thanks
Platynereis dumerilii
- PhD in statistics (Cambridge University)
- Master's in CS (French Grande École)
- Master's in bioengineering (French Grande École)
Jean-Baptiste Pettit
13. Why is it interesting
Every comment is associated with the topics extracted from its article's title
Possibility to query all the comments for a particular topic
Possibility to infer hot topics
Possibility to estimate people's mood about the news
Mixing different data sources
Comments will be queried on a regular basis
Twitter feed will be streamed
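The topic association and querying described on this slide could be sketched as follows. This is only a minimal illustration, not the architecture actually used: the stopword list, the `title_topics` helper, and the `CommentIndex` class are hypothetical names, and topic extraction is reduced to taking the non-stopword words of the title.

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "as"}


def title_topics(title):
    """Naive topic extraction: lowercased title words minus stopwords."""
    words = (w.strip(".,:;!?\"'()").lower() for w in title.split())
    return {w for w in words if w and w not in STOPWORDS}


class CommentIndex:
    """Maps each topic to the comments of every article whose title mentions it."""

    def __init__(self):
        self.by_topic = defaultdict(list)

    def add(self, title, comment):
        # Associate the comment with every topic found in the article title.
        for topic in title_topics(title):
            self.by_topic[topic].append(comment)

    def query(self, topic):
        # Return all comments for one topic (empty list if unknown).
        return list(self.by_topic.get(topic.lower(), []))

    def hot_topics(self, n=3):
        # "Hot" here simply means: most comments collected so far.
        counts = Counter({t: len(c) for t, c in self.by_topic.items()})
        return [t for t, _ in counts.most_common(n)]


index = CommentIndex()
index.add("Election results in France", "Did not expect that.")
index.add("France wins the final", "What a match!")
index.add("Election turnout rises", "Good news.")
index.add("France announces budget", "Hmm.")
```

Calling `index.query("France")` returns the comments from all three France-related articles, regardless of which source they came from, and `index.hot_topics(1)` returns the topic with the most comments.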
14. The data
Keep up with the latest articles by crawling RSS feeds (XML, but that's OK)
For new articles, get comments for 1 day via:
- Facebook comments API (JSON)
- Disqus API (JSON)
Twitter feed associated with the articles (streaming API, JSON)
Data is easy to engineer if needed
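The RSS crawling step above can be illustrated with a small sketch that detects articles not seen in earlier passes. It uses only the Python standard library; the inline `SAMPLE_FEED`, the `new_articles` function, and the `example.com` links are made up for illustration, and fetching the feed over HTTP as well as the comment and Twitter APIs are left out.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed standing in for a real newspaper feed (assumption).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/news/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def new_articles(feed_xml, seen_links):
    """Return the feed items whose link has not been seen yet.

    `seen_links` is a set that is updated in place, so repeated
    crawls of the same feed only report each article once.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        title = item.findtext("title", default="")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({"title": title, "link": link})
    return fresh


seen = set()
first_pass = new_articles(SAMPLE_FEED, seen)   # both items are new
second_pass = new_articles(SAMPLE_FEED, seen)  # nothing new on a re-crawl
```

Each article returned by a pass would then be the trigger for collecting its comments for one day.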