This document discusses how the Huffington Post uses data science and machine learning to improve its content creation, consumption, and distribution. It gives an overview of the Huffington Post's size and of trends in digital publishing. It then explains how data helps with content creation by discovering trends, enhancing content, and optimizing headlines and images. Data also helps the newsroom understand its audiences in order to improve recommendations and personalization. Finally, it discusses how machine learning can help predict and promote the articles most likely to gain traction on distribution channels such as Facebook, to maximize views and revenue.
Data science in the newsroom
1. Data Science in the Newsroom
Geetu Ambwani
Principal Data Scientist
geetu.ambwani@huffingtonpost.com
MLconf NYC, April 2016
2. What is the Huffington Post?
Founded: May 2005
Ranking among digital-only news websites: 1
Cross-platform monthly unique visitors: over 187 million
Articles per day: over 500
International editions: 15
Bloggers: over 100,000
3. News Industry - Trends
HuffPost has consistently been an innovator in the digital publishing space.
Massive Blogging Network:
More than 100K bloggers across the globe
4. News Industry - Trends
Google Site Rank
5. News Industry - Trends
Biggest Social publisher
10. Content Creation: How Can Data Help?
Tools to surface and discover trends across different parts of the web
Content Enhancement with multimedia based on semantic matching (images, slideshows, videos)
Optimizing headlines/images (RobinHood Platform)
12. Content Consumption: How Can Data Help?
Know Your Audience
User Cohorts:
Visitors arriving from social and front-page clickers consume different content
Desktop vs. mobile consumption
Recommendations/Personalization
Can we use data to inform product design and interface?
Rearrange share buttons based on traffic origin (Facebook vs Pinterest)
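The share-button idea above can be sketched as a lookup from the referrer to a button order. This is only an illustration: the button names, the default order, and the `REFERRER_TO_BUTTON` mapping are all assumptions, not details taken from the talk.

```python
from urllib.parse import urlparse

# Hypothetical default order of share buttons.
DEFAULT_ORDER = ["facebook", "twitter", "pinterest", "email"]

# Hypothetical mapping from referrer domain to the button shown first.
REFERRER_TO_BUTTON = {
    "facebook.com": "facebook",
    "pinterest.com": "pinterest",
    "twitter.com": "twitter",
    "t.co": "twitter",
}


def share_button_order(referrer_url):
    """Return the button order, led by the network the visitor came from."""
    host = urlparse(referrer_url).netloc.lower()
    for domain, button in REFERRER_TO_BUTTON.items():
        # Match the domain itself or any subdomain (www., m., ...).
        if host == domain or host.endswith("." + domain):
            return [button] + [b for b in DEFAULT_ORDER if b != button]
    # Unknown or missing referrer: fall back to the default order.
    return list(DEFAULT_ORDER)
```

A visitor arriving from Pinterest would see the Pinterest button first, while direct traffic keeps the default order.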
14. Content Distribution: Can Data Help?
People's attention is increasingly concentrated on social streams
More traffic reaches publishers from social than from any other source
Are distributed platforms the new home page?
Facebook Instant Articles, Apple News, Snapchat Discover, Google AMP
Messenger Bots
You need to be where your audience is:
Identify the content mix that is maximally engaging on an external platform
Can we use data to seed these distribution networks? (Facebook HuffPost Pages, Snapchat Discover)
15. Content Distribution: Can Data Help?
HuffPost produces 1000 articles a day. Which of these do we promote?
Article page views (PVs) follow a very skewed distribution of success
Only 1% of our articles get more than 100k PVs
Content performs differently on different networks.
Can we predict in advance which articles will get traction, so that we can:
Optimally seed multiple distribution channels (Facebook HP Pages, Snapchat Discover)
Target premium/high-value ads to maximize revenue
Populate Recommendation Widgets
16. Content Distribution: Can Data Help?
Challenges
Histogram of traffic distribution - highly skewed.
The very act of promoting something causes a bump in traffic.
Data normalization - how long do we want to wait before predicting?
Very imbalanced data set
Our Approach
Random Forest classifier.
Multiple success criteria
Historical examples of (+) and (-) articles. Downsampling.
Different normalization thresholds
Feature engineering: traffic growth ratios, initial organic social traffic per minute, distinct referrers
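As a rough illustration of the feature engineering and downsampling steps listed above, here is a minimal standard-library sketch. The `PageView` record, the 30-minute observation window, the `SOCIAL_DOMAINS` list, and the exact feature definitions are assumptions made for the example; the slides do not say how the features were actually computed. The resulting rows could then be passed to any random forest implementation.

```python
import random
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumed list of referrer domains that count as "social" (hypothetical).
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "pinterest.com"}


@dataclass
class PageView:
    minute: float    # minutes since publication (assumed unit)
    referrer: str    # full referrer URL, "" for direct traffic
    promoted: bool   # True if the view came from a promotion, not organically


def referrer_domain(url):
    """Return the referrer host without a leading 'www.', or '' if none."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def extract_features(views, window_minutes=30.0):
    """Compute three features from views inside the observation window."""
    early = [v for v in views if v.minute < window_minutes]
    half = window_minutes / 2
    first = sum(1 for v in early if v.minute < half)
    second = len(early) - first
    # Growth ratio: second-half views over first-half views.
    # Adding 1 to both sides avoids division by zero.
    growth_ratio = (second + 1) / (first + 1)
    organic_social = sum(
        1 for v in early
        if not v.promoted and referrer_domain(v.referrer) in SOCIAL_DOMAINS
    )
    domains = {referrer_domain(v.referrer) for v in early if v.referrer}
    return {
        "growth_ratio": growth_ratio,
        "organic_social_per_min": organic_social / window_minutes,
        "distinct_referrers": len(domains),
    }


def downsample(rows, labels, seed=0):
    """Randomly drop negatives until they match the number of positives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept = sorted(pos + rng.sample(neg, min(len(neg), len(pos))))
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

Excluding promoted views from the social-traffic feature is one way to keep the promotion bump mentioned under Challenges from leaking into the prediction. Changing `window_minutes` corresponds to trying different normalization thresholds.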
17. Slackbot for the social promotion team
20% lift in PVs per predicted article
19. Conclusion
A data-driven newsroom today means:
More than just keeping track of clicks and shares
Using predictive analytics to drive product and content placement
Machine learning will be a key driver of success with the advent of distributed content