Data Science in the Newsroom
Geetu Ambwani
Principal Data Scientist
geetu.ambwani@huffingtonpost.com
MLconf NYC, April 2016
What is the Huffington Post?
Founded: May 2005
Ranking among digital-only news websites: 1
Cross-platform monthly unique visitors: over 187 million
Number of articles per day: over 500
Number of international editions: 15
Bloggers: over 100,000
News Industry - Trends
HuffPost has consistently been an innovator in the digital publishing space.
Massive Blogging Network:
More than 100K bloggers across the globe
Google Site Rank
Biggest Social publisher
News Industry - Challenges
How Can Data Help?
Ad campaigns
International editions
Social media promotion
Editors
User-experience
Blog moderators
Reporters
HuffPost Studio
Content Lifecycle
Creation, Distribution, Consumption
Content Creation: How Can Data Help?
 Tools to help surface and discover trends in different parts of the web
 Content enhancement with multimedia based on semantic matching (images, slideshows, videos)
 Optimizing headlines and images (RobinHood platform)
Content Gap: Production Versus Consumption
Content Consumption: How Can Data Help?
Know Your Audience
 User cohorts:
 Social-traffic visitors and front-page clickers consume different content
 Desktop vs. mobile consumption
 Recommendations/personalization
 Can we use data to inform product design and interface?
 Rearrange share buttons based on traffic origin (Facebook vs. Pinterest)
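The share-button idea above can be sketched in a few lines. This is only an illustration: the button names, the default order, and the `order_share_buttons` helper are hypothetical and not taken from the talk.

```python
# Hypothetical default order of share buttons (an assumption, not HuffPost's).
DEFAULT_ORDER = ["facebook", "twitter", "pinterest", "email"]


def order_share_buttons(referrer_host, buttons=DEFAULT_ORDER):
    """Move the button matching the visitor's referring network to the front.

    If the referrer matches none of the buttons, the default order is kept.
    """
    host = referrer_host.lower()
    for name in buttons:
        if name in host:
            # Promote the matching network, keep the others in their order.
            return [name] + [b for b in buttons if b != name]
    return list(buttons)


from_pinterest = order_share_buttons("www.pinterest.com")
from_search = order_share_buttons("news.google.com")
```

A visitor arriving from Pinterest would see the Pinterest button first, while a visitor from an unrecognized referrer would see the default order.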
Content Distribution: Can Data Help?
 People's attention is increasingly concentrated on social streams
 Social sends more traffic to publishers than any other source
 Are distributed platforms the new home page?
 Facebook Instant Articles, Apple News, Snapchat Discover, Google AMP
 Messenger bots
 You need to be where your audience is:
 Identify the content mix that is maximally engaging on an external platform
 Can we use data to seed these distribution networks? (Facebook HuffPost Pages, Snapchat Discover)
Content Distribution: Can Data Help?
 HuffPost produces 1000 articles a day. Which of these do we promote?
 Article PVs (page views) follow a very skewed distribution of success
 Only 1% of our articles get more than 100k PVs
 Content performs differently on different networks.
 Can we predict in advance which articles will get traction, so that we can:
 Optimally seed multiple distribution channels (Facebook HuffPost Pages, Snapchat Discover)
 Target them for premium/high-value ads to maximize revenue
 Populate recommendation widgets
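To make the skew concrete, here is a minimal sketch of how articles might be labeled as successes against a page-view threshold. The 100k threshold comes from the slide; the function names and the sample counts are made up for illustration.

```python
def label_articles(page_views, threshold=100_000):
    """Label each article 1 if its page views exceed the threshold, else 0."""
    return [1 if pv > threshold else 0 for pv in page_views]


def success_rate(page_views, threshold=100_000):
    """Return the fraction of articles whose page views exceed the threshold."""
    if not page_views:
        return 0.0
    return sum(label_articles(page_views, threshold)) / len(page_views)


# Made-up page-view counts for ten articles (not real HuffPost data).
sample_pvs = [1_200, 800, 150_000, 4_300, 950, 20_000, 310, 7_700, 2_100, 600]
labels = label_articles(sample_pvs)
rate = success_rate(sample_pvs)
```

Changing `threshold` gives a different success criterion, which is one way to read "multiple success criteria" later in the deck.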
Content Distribution: Can Data Help?
Challenges
 Histogram of traffic distribution is highly skewed.
 The very act of promoting something causes a bump in traffic.
 Data normalization: how long do we want to wait before predicting?
 Very imbalanced data set
Our Approach
 Random Forest classifier.
 Multiple success criteria
 Historical examples of positive (+) and negative (-) articles, with downsampling.
 Different normalization thresholds
 Feature engineering: traffic growth ratios, initial organic social traffic per minute, distinct referrers.
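The feature and downsampling steps might look roughly like the sketch below. The talk does not show its implementation, so everything here is an assumption: the event format `(minute, referrer, is_social)`, the exact definition of the growth ratio, and the function names are all hypothetical. The events are assumed to contain organic traffic only, since promoted traffic would bias the features.

```python
import random


def extract_features(events, window_minutes):
    """Build early-traffic features from (minute, referrer, is_social) events.

    Only events in the first `window_minutes` after publication are used;
    the window length plays the role of a normalization threshold.
    """
    early = [e for e in events if e[0] < window_minutes]
    half = window_minutes / 2
    first_half = sum(1 for e in early if e[0] < half)
    second_half = len(early) - first_half
    return {
        # Traffic in the second half of the window relative to the first half;
        # the +1 avoids division by zero (an assumed definition).
        "growth_ratio": second_half / (first_half + 1),
        "social_per_minute": sum(1 for e in early if e[2]) / window_minutes,
        "distinct_referrers": len({e[1] for e in early}),
    }


def downsample(rows, labels, seed=0):
    """Randomly drop negatives until they no more than match the positives."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(neg), len(pos)))
    data = [(r, 1) for r in pos] + [(r, 0) for r in kept_neg]
    rng.shuffle(data)
    return data


# Made-up events for one article: (minutes since publish, referrer, is_social).
events = [
    (0, "facebook.com", True),
    (1, "google.com", False),
    (3, "facebook.com", True),
    (4, "twitter.com", True),
    (12, "facebook.com", True),  # outside the 6-minute window, ignored
]
features = extract_features(events, window_minutes=6)
balanced = downsample(list(range(10)), [1, 0, 0, 0, 0, 0, 0, 0, 0, 1])
```

The balanced `(features, label)` pairs could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`; the slides do not say which library was used.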
Slackbot for the social promotion team
 20% lift in PVs per predicted article
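The slides do not say how the lift was measured. One common reading, shown here only as an assumption, is the relative difference in mean page views between articles promoted on the model's recommendation and a comparison group. The `relative_lift` helper and the numbers are hypothetical.

```python
def relative_lift(treated_pvs, baseline_pvs):
    """Return mean(treated) / mean(baseline) - 1, e.g. about 0.2 for a 20% lift."""
    treated_mean = sum(treated_pvs) / len(treated_pvs)
    baseline_mean = sum(baseline_pvs) / len(baseline_pvs)
    return treated_mean / baseline_mean - 1.0


# Made-up page-view counts, chosen so the lift comes out near 20%.
lift = relative_lift([1_300, 1_100], [1_000, 1_000])
```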
Conclusion
A data-driven newsroom today means:
 More than just keeping track of clicks and shares
 Using predictive analytics to drive product and content placement
Machine learning will be a key driver of success with the advent of distributed content
Thanks!
MachineLearning@HuffPost
