A talk I gave on using machine learning to solve quality problems at Quora. It was part of the "Be Nice, Be Respectful: Protecting online spaces with applied machine learning" workshop at Quora in September 2017.
Maintaining high quality user generated content through machine learning
1. Maintaining High Quality User-Generated Content
Through Machine Learning
Nikhil Dandekar
Quora: Nikhil-Dandekar
Twitter: @nikhilbd
Paula Griffin
Quora: Paula-Griffin-1
Twitter: @paulajgriffin
2. What is Quora?
Quora is a platform to ask
questions, get useful
answers, and share what
you know with the world.
4. Not everyone is Peter Norvig.
• Biggest challenges of any user-generated-content site are quality and moderation
• Two (mostly distinct) sets of users to deal with
  – Bad actors trying to cause harm
  – Well-meaning users who miss the mark
7. Growing challenges
• Millions of questions, answers, users, and topics
  – More incentives for bad actors
  – More users who aren't familiar with Quora norms
• Without active effort, quality gets worse as we scale
• We need solutions that get better as our content grows
9. Writing the rulebook
• First step: deciding what you want on your platform
• "Be Nice, Be Respectful" policy since before our public launch in 2010
  – No hate speech
  – No harassment
  – No retaliation
• Almost all other policies flow from "being helpful" to someone viewing the page
  – Don't write joke answers
  – Tag content with appropriate topics
10. Enforcing the rules
• Users can report content and users for violating Quora's policies
• Starting out: manual review of all reports
• Problems:
  – Many man-hours needed to review all reports
  – Low reporting rates
  – The worst part: someone actually has to see the bad content
11. Enforcing the rules at scale
• Heuristics and machine learning help us reduce the burden of handling user reports, and can proactively identify bad content
  – Deal with reported content faster and more cheaply
  – Catch spam, harassment, and other problems before other users see it
  – Automatically fix formatting and grammar in some cases
• Benefits of scale:
  – More content → more choice of good content
  – Ongoing feedback from human review systems
  – More data to train our models
13. ML Models for quality
• Questions: Adult detection, Question quality classification, Duplicate questions detector, Overly personal question detector, Question autocorrection, etc.
• Answers + Comments: Adult detection, Answer ranking for questions, Answer collapsing, BNBR classifier, Harassment classifier, Spam classifier, etc.
• Topics: Duplicate Topics detector, Bad Topic classifier, etc.
• Users: Bad actor detection, Bad user-credentials classifier, Fake name detection, User-topic bio classifier, etc.
• Classifiers on other content types, e.g. answer wikis.
14. Machine Learning for quality: Overview
Algorithms
• RNNs (LSTMs/GRUs) and other deep networks, Gradient Boosted Decision Trees, Random Forests, Logistic Regression, LambdaMART, k-means and other clustering techniques, k-NNs, PageRank, etc.
Libraries
• TensorFlow, Keras, scikit-learn, XGBoost, LightGBM, fastText, RankLib, NLTK, spaCy, etc.
15. Machine Learning model decision flow
Content → ML model → High-confidence decision?
• Yes: Take automatic action
• No: Ask a human to verify the action
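The decision flow above can be sketched as a simple threshold rule on the model's score. This is a minimal illustration, not Quora's implementation: the `route` function, the `Decision` type, the label names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    label: str        # "violation" or "ok" (hypothetical label names)
    automatic: bool   # True if the action is taken without human review


def route(score: float, high: float = 0.95, low: float = 0.05) -> Decision:
    """Route one piece of content given a model score in [0, 1].

    `high` and `low` are assumed thresholds, not values from the talk.
    """
    if score >= high:
        return Decision("violation", automatic=True)
    if score <= low:
        return Decision("ok", automatic=True)
    # Anything in between is not a high-confidence decision:
    # propose an action and ask a human to verify it.
    label = "violation" if score >= 0.5 else "ok"
    return Decision(label, automatic=False)


print(route(0.99))  # confident: automatic action
print(route(0.60))  # uncertain: sent to a human reviewer
```

For more nuanced or sensitive decisions, the gap between `low` and `high` would be widened so that more content goes to human reviewers.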
16. ML decision flow examples
• Some examples of this decision flow:
  – Spam detection
  – BNBR violation detection
  – Question quality classifier
  – Duplicate question detection
  – ...and more
• The more nuanced and sensitive the decision, the more the need for human verification
17. Machine Learning data feedback loop
Training data → Train models → Run model on content → User actions and human reviews → Training data
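One turn of this loop can be sketched as folding new labels back into the training set before the next training run. The sketch below is illustrative only: `merge_feedback`, the content ids, and the label names are hypothetical, and letting human review verdicts override labels derived from user actions is an assumed rule, not one stated in the talk.

```python
def merge_feedback(training, user_actions, human_reviews):
    """Return a new {content_id: label} dict with fresh feedback folded in.

    Human review verdicts overwrite user-action labels for the same item
    (an assumed precedence rule).
    """
    merged = dict(training)
    merged.update(user_actions)
    merged.update(human_reviews)
    return merged


training = {"q1": "ok"}
user_actions = {"q2": "spam", "q3": "spam"}   # e.g. labels derived from user reports
human_reviews = {"q3": "ok"}                  # a reviewer overturned the report on q3
training = merge_feedback(training, user_actions, human_reviews)
print(training)
```

The merged set would then be used to retrain the models, which are run on new content, producing more user actions and reviews.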
19. "Bad" questions on Quora
• Users often ask questions with grammatical and spelling errors
• Example:
  – Which coin/token is next big thing in crypto currencies? And why?
  – Which coin/token is the next big thing in cryptocurrencies? Why?
• These are good questions, but the lack of correct phrasing hurts them
  – Less likely to be answered by experts
  – Harder to catch duplicate questions
  – Can hurt the perception of "quality" of Quora
20. "Bad" questions on Quora
• Types of errors in questions
  – Grammatical errors, e.g., "How I can ..."
  – Spelling mistakes
  – Missing preposition or article
  – Wrong/missing punctuation
  – Wrong capitalization
  – etc.
• Can we use Machine Learning to automatically correct these questions?
• Started off as an "offroad" hack-week project
• Since shipped
22. Automatic question correction: Model
• Frame the problem like a machine translation problem: translate the bad question into the corrected question
• Final Model:
  – Sequence-to-sequence, character-level RNN (GRU) with attention
23. Automatic question correction: Model
• Model Details:
  – Sequence to sequence (encoder-decoder) model
  – Character-level
  – GRUs (Gated Recurrent Units)
  – Attention-based
  – Bidirectional
  – Beam search for decoding
• We tried solving the subproblems individually, but that didn't work as well
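Beam search, the decoding step listed above, keeps the best few partial outputs at each step instead of only the single most likely one. The sketch below is a generic, minimal version run on a hand-written toy probability table. It is not the talk's decoder: `beam_search`, `toy_step`, and `TABLE` are hypothetical stand-ins for a trained model's next-character distribution.

```python
import math


def beam_search(step_fn, start, end, beam_width=3, max_len=20):
    """Generic beam search.

    step_fn(prefix) returns {token: probability} for the next token.
    Returns the highest-scoring sequence found, as a list of tokens.
    """
    beams = [([start], 0.0)]          # (sequence, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the best `beam_width` hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])[0]


# Toy next-token table (hypothetical): the greedy first choice "a" leads
# to a less likely full sequence than starting with "b".
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def toy_step(seq):
    return TABLE[seq[-1]]


print(beam_search(toy_step, "<s>", "</s>", beam_width=2))
print(beam_search(toy_step, "<s>", "</s>", beam_width=1))  # greedy decoding
```

With a beam width of 1 the search is greedy and commits to "a" early. A width of 2 keeps "b" alive long enough to find the more probable full sequence.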
24. Automatic question correction: System Details
• Training
  – Training data: Pairs of [bad question, corrected question]
  – TensorFlow, on a single box with GPUs
  – Training time: 2-3 hours
• Serving:
  – TensorFlow, GPU-based serving
  – Latency: <500 ms p99
• Run on new questions added to Quora
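For a character-level model, each [bad question, corrected question] pair has to be turned into sequences of character ids. The sketch below shows one common way to do that, using the example pair from the earlier slide. The special tokens, the padding scheme, and the helper names `build_vocab` and `encode` are assumptions for illustration, not details from the talk.

```python
PAD, SOS, EOS = "<pad>", "<s>", "</s>"  # hypothetical special tokens


def build_vocab(pairs):
    """Map every character seen in the pairs to an integer id."""
    chars = sorted({ch for bad, good in pairs for ch in bad + good})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}


def encode(text, vocab, max_len):
    """Character ids wrapped in start/end markers, right-padded to max_len."""
    ids = [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


pairs = [
    ("Which coin/token is next big thing in crypto currencies? And why?",
     "Which coin/token is the next big thing in cryptocurrencies? Why?"),
]
vocab = build_vocab(pairs)
max_len = max(len(s) for pair in pairs for s in pair) + 2  # room for <s> and </s>
encoded = [(encode(bad, vocab, max_len), encode(good, vocab, max_len))
           for bad, good in pairs]
print(len(vocab), max_len)
```

The encoder would consume the first sequence of each pair and the decoder would be trained to emit the second.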
26. BNBR classification
• Checks for BNBR violations on questions, answers, and comments
• Binary classifier
• Training data:
  – Positive: Confirmed BNBR violations
  – Negative: False BNBR reports, other good content
• Model: NN with 1 hidden layer (fastText)
• Same ML decision flow as before
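The training data described above can be sketched as labeled text lines in fastText's supervised format, where each line starts with a `__label__` prefix. The function name, the label names `bnbr_violation` and `ok`, and the example texts are hypothetical. Only the positive/negative split comes from the slide.

```python
def to_fasttext_lines(confirmed_violations, false_reports, other_good):
    """Build labeled lines: confirmed violations are positive, the rest negative."""
    def line(label, text):
        # One example per line, so collapse newlines and repeated whitespace.
        return f"__label__{label} {' '.join(text.split())}"

    lines = [line("bnbr_violation", t) for t in confirmed_violations]
    lines += [line("ok", t) for t in false_reports + other_good]
    return lines


lines = to_fasttext_lines(
    confirmed_violations=["You are\nworthless."],
    false_reports=["I disagree with this answer."],
    other_good=["Great explanation, thanks!"],
)
print("\n".join(lines))
```

Written to a file, lines in this shape are what the fastText Python package's `train_supervised` function reads as input. The training settings themselves are not given in the talk.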
27. In conclusion
• Quality is one of the most important problems we face at Quora
• There are various systems to maintain quality, and we need to use all of them in order to keep up
• Machine Learning solutions help us maintain quality at scale
  – ...but you can't totally bypass human efforts