@joosterman
Diverse recommendations
from a vast archive
Jasper Oosterman, Data Scientist
“Making quality journalism
accessible to everyone”
the COMPANY
2014
2015
2016
2019
202(0|1)
the NUMBERS
160 issues, 100+ NL
10 million euro to publishers
12M articles, 4000/day
80M articles read
1400 hours
40 employees
the TEAM BACKGROUND
● MSc Delft University of Technology
○ Web Information Systems
● PhD Delft University of Technology
○ Crowdsourced Knowledge Generation
● Data Scientist @ Sanoma (3Y)
○ Nu.nl, Viva, Donald Duck
● Data Scientist @ Blendle (2Y)
○ Alexander Klöpping
HYLKE
ALEXIS
JASPER
the SYSTEM
Goal: One article format
[Diagram labels: XML, PDF, RSS, Manual clipping, Convert]
Goal: Understanding articles
● Subject matter
○ Topics (economy, health, ...)
○ Named entities (Bitcoin, Max Verstappen, ...)
● Article style
○ Complexity (easy, complex)
○ Feel (positive, negative)
○ Type (Interviews, Opinion Piece)
● Article usefulness
○ What would Editorial do
○ Evergreen
○ Newsiness
the SYSTEM
● Document representation
○ BoW
○ Stylometry
○ Provenance (author, issue)
● Models
○ Spacy for named entities
○ Mostly sklearn
○ Random Forests
● Enrichment process
○ Python workers on K8s
○ Kafka-based communication
○ Autoscaling inside K8s cluster
○ 1s after ingestion
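
The three representation ingredients listed above (BoW, stylometry, provenance) could be combined roughly as follows. This is only a minimal sketch using the Python standard library: the function names, the two style features, and the example article are illustrative assumptions, not the actual Blendle pipeline, which the slides describe as mostly sklearn.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts (plain BoW, no stemming or stop-word removal)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def stylometry(text):
    """Two toy style features: mean sentence length in words, mean word length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    return {
        "mean_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def represent(article):
    """Merge BoW, stylometry and provenance into one flat feature dict."""
    body = article["body"]
    features = {f"bow:{tok}": n for tok, n in bag_of_words(body).items()}
    features.update({f"style:{k}": v for k, v in stylometry(body).items()})
    # Provenance as one-hot style indicator features (hypothetical field names).
    features[f"author:{article['author']}"] = 1
    features[f"issue:{article['issue']}"] = 1
    return features


# Made-up article, for illustration only.
example = {
    "body": "Bitcoin rises again. Investors are cautious.",
    "author": "jane-doe",
    "issue": "example-daily-2020-09-01",
}
features = represent(example)
```

A flat dict like this is one convenient input shape for a vectorizer followed by a classifier such as a random forest; whether the real system is structured this way is not stated in the slides.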
the TRUTH
Getting more user value out of the content
Current: navigational suggestions
the CHALLENGE
Framing Aspects
Viewpoint Diversity
Reality is too complex to be fully understood. Therefore,
every article puts a specific frame on an issue.
Main problem (1)
We are heading for a second corona wave
Forces that create or contribute to problem (2) + Evaluation of these forces (3)
RIVM is stubborn about mouth masks (-) and
Mayors of Amsterdam and Rotterdam take responsibility (+)
Possible solutions to the problem (4)
There must be a national duty for mouth masks
the EXAMPLE
MSc thesis work of Mats Mulder
Conceptually:
1. Enrich each article with elements corresponding to each framing aspect
2. Calculate a pairwise distance matrix between the articles
3. Rerank using Maximal Marginal Relevance and take the top 3
the RECS
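
Steps 2 and 3 above can be sketched as a greedy reranker. This is a minimal illustration and not the thesis implementation: the function name `mmr_rerank`, the relevance scores, and the distance values are made up, and the diversity term uses the minimum distance to the already selected items, which is one common way to write MMR when a distance matrix rather than a similarity matrix is available.

```python
def mmr_rerank(candidates, relevance, distance, lam=0.5, k=3):
    """Greedy Maximal Marginal Relevance over a precomputed distance table.

    relevance: dict mapping candidate -> relevance score
    distance:  dict mapping frozenset({a, b}) -> distance between a and b
    lam:       1.0 means pure relevance, 0.0 means pure diversity
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Novelty is the distance to the closest already selected item.
            novelty = min(
                (distance[frozenset((c, s))] for s in selected), default=0.0
            )
            # Second tuple element breaks ties in favour of relevance.
            return (lam * relevance[c] + (1 - lam) * novelty, relevance[c])

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical candidates with made-up scores.
articles = ["a", "b", "c", "d"]
relevance = {"a": 0.9, "b": 0.85, "c": 0.5, "d": 0.4}
distance = {
    frozenset(("a", "b")): 0.1,
    frozenset(("a", "c")): 0.8,
    frozenset(("a", "d")): 0.9,
    frozenset(("b", "c")): 0.7,
    frozenset(("b", "d")): 0.9,
    frozenset(("c", "d")): 0.3,
}

most_similar = mmr_rerank(articles, relevance, distance, lam=1.0)  # ['a', 'b', 'c']
most_diverse = mmr_rerank(articles, relevance, distance, lam=0.0)  # ['a', 'd', 'c']
```

With `lam=1.0` the near-duplicate "b" is recommended right after "a"; with `lam=0.0` it is pushed out in favour of the more distant "d" and "c".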
● Datasets around 4 topics (Corona, Big Tech, BLM, U.S. Election), 50 articles
● Recs baseline: term-based relevance, i.e. most similar, lambda=1
● Recs variant: viewpoint diverse, i.e. most diverse, lambda=0
● Identical section title, articles not personalized
● Over 2000 (cherry-picked) users
● 12 days
● 24 recommendations
the EXPERIMENT
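
For reference, the role of lambda in the two experimental conditions can be read off the textbook form of Maximal Marginal Relevance. The symbols below are assumptions that do not appear on the slides: R is the candidate set, S the items already selected, Rel a relevance score and Sim a pairwise similarity. The exact variant used in the thesis may differ.

```latex
\[
d^{*} = \arg\max_{d \in R \setminus S}
\Bigl[\, \lambda \,\mathrm{Rel}(d) \;-\; (1-\lambda)\,\max_{s \in S} \mathrm{Sim}(d, s) \Bigr]
\]
\[
\lambda = 1:\; d^{*} = \arg\max_{d \in R \setminus S} \mathrm{Rel}(d)
\qquad
\lambda = 0:\; d^{*} = \arg\min_{d \in R \setminus S} \max_{s \in S} \mathrm{Sim}(d, s)
\]
```

So the baseline (lambda=1) reduces to plain relevance ranking, while the variant (lambda=0) ignores relevance and picks whatever is least similar to what has already been selected.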
● Did the diversity method work? YES
The average viewpoint diversity score across all topics increased from 0.55 to 0.79 as the
level of diversity in the MMR algorithm increased.
● Did users consume more or fewer recommendations? NO
We did not find a significant difference between the two user groups in terms of click-through
rate per recommended article. The same result holds per topic.
● Did users complete more or fewer of the recommended articles they opened? NO
We found no significant difference in terms of completion rate for the two user groups.
----
● Multiple presentation properties, such as the inclusion of a thumbnail image and the favourite
count, were shown to have a significant influence on the click-through rate of recommendations.
the RESULTS
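
The slides do not say which statistical test was used to compare the two user groups. As one hedged illustration of how a click-through-rate difference could be checked, here is a two-proportion z-test using only the standard library. The click and impression counts are invented for the example and are not the experiment's data.

```python
import math


def two_proportion_z_test(clicks_a, shown_a, clicks_b, shown_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a = clicks_a / shown_a
    p_b = clicks_b / shown_b
    # Pooled rate under the null hypothesis that both groups share one CTR.
    pooled = (clicks_a + clicks_b) / (shown_a + shown_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: baseline group vs. diverse group.
z, p_value = two_proportion_z_test(120, 4000, 130, 4000)
significant = p_value < 0.05
```

With these invented counts the difference is well within noise, which is the kind of outcome the "NO" answers above describe.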
the FUTURE
the QUESTIONS
