Maintaining High Quality User-Generated Content
Through Machine Learning
Nikhil Dandekar
Quora: Nikhil-Dandekar
Twitter: @nikhilbd
Paula Griffin
Quora: Paula-Griffin-1
Twitter: @paulajgriffin
What is Quora?
Quora is a platform to ask questions, get useful answers, and share what you know with the world.
Incredible answers from credible sources
Not everyone is Peter Norvig.
• Biggest challenges of any user-generated-content site are quality and moderation
• Two (mostly distinct) sets of users to deal with
  - Bad actors trying to cause harm
  - Well-meaning users who miss the mark
Bad actors
Well-meaning users
Growing challenges
• Millions of questions, answers, users, and topics
  - More incentives for bad actors
  - More users who aren’t familiar with Quora norms
• Without active effort, quality gets worse as we scale
• We need solutions that get better as our content grows
Solving these problems together
Writing the rulebook
• First step: deciding what you want on your platform
• “Be Nice, Be Respectful” (BNBR) policy since before our public launch in 2010
  - No hate speech
  - No harassment
  - No retaliation
• Almost all other policies flow from “being helpful” to someone viewing the page
  - Don’t write joke answers
  - Tag content with appropriate topics
Enforcing the rules
• Users can report content and users for violating Quora’s policies
• Starting out: manual review of all reports
• Problems:
  - Many man-hours needed to review all reports
  - Low reporting rates
  - The worst part: someone actually has to see the bad content
Enforcing the rules at scale
• Heuristics and machine learning help us reduce the burden of handling user reports, and can proactively identify bad content
  - Deal with reported content faster and more cheaply
  - Catch spam, harassment, and other problems before other users see it
  - Automatically fix formatting and grammar in some cases
• Benefits of scale:
  - More content → more choice of good content
  - Ongoing feedback from human review systems
  - More data to train our models
Maintaining high content quality using
Machine Learning
Machine Learning for quality: Overview
ML Models for quality
• Questions: Adult detection, Question quality classification, Duplicate questions detector, Overly personal question detector, Question autocorrection, etc.
• Answers + Comments: Adult detection, Answer ranking for questions, Answer collapsing, BNBR classifier, Harassment classifier, Spam classifier, etc.
• Topics: Duplicate Topics detector, Bad Topic classifier, etc.
• Users: Bad actor detection, Bad user-credentials classifier, Fake name detection, User-topic bio classifier, etc.
• Classifiers on other content types, e.g. answer wikis.
Algorithms
• RNNs (LSTMs/GRUs) and other deep networks, Gradient Boosted Decision Trees, Random Forests, Logistic Regression, LambdaMART, k-means and other clustering techniques, k-NN, PageRank, etc.
Libraries
• TensorFlow, Keras, scikit-learn, XGBoost, LightGBM, fastText, RankLib, NLTK, spaCy, etc.
Machine Learning model decision flow
Content → ML model → High-confidence decision?
• Yes: Take automatic action
• No: Ask a human to verify the action
ML decision flow examples
• Some examples of this decision flow:
  - Spam detection
  - BNBR violation detection
  - Question quality classifier
  - Duplicate question detection
  - ...and more
• The more nuanced and sensitive the decision, the greater the need for human verification
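The decision flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `route_content` function, the `Decision` record, the 0.95 threshold, and the keyword-based `toy_spam_model` are all hypothetical stand-ins, not Quora's actual models or thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Decision:
    label: str         # predicted class, e.g. "spam" or "ok"
    confidence: float  # model's confidence in that label
    route: str         # "automatic" or "human_review"


def route_content(
    text: str,
    model: Callable[[str], Tuple[str, float]],
    threshold: float = 0.95,  # hypothetical cut-off for "high confidence"
) -> Decision:
    """Act automatically on high-confidence predictions, else ask a human."""
    label, confidence = model(text)
    route = "automatic" if confidence >= threshold else "human_review"
    return Decision(label, confidence, route)


def toy_spam_model(text: str) -> Tuple[str, float]:
    """Stand-in for a trained classifier; returns (label, confidence)."""
    hits = sum(word in text.lower() for word in ("free", "click", "winner"))
    if hits >= 2:
        return "spam", 0.99
    if hits == 1:
        return "spam", 0.60
    return "ok", 0.97
```

One way to reflect "the more nuanced the decision, the greater the need for human verification" is to give sensitive classifiers a higher `threshold`, so that more of their predictions go to human review.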
Machine Learning data feedback loop
Training data → Train models → Run model on content → User actions and human reviews → (back to) Training data
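A minimal sketch of one pass around this feedback loop, assuming a toy word-count "model" in place of a real classifier. The names `train`, `predict`, and `feedback_round`, and the `human_review` callback, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def train(examples: List[Example]) -> Dict[str, Counter]:
    """Toy 'model': for each word, count how often it appears under each label."""
    counts: Dict[str, Counter] = {}
    for text, label in examples:
        for word in text.lower().split():
            counts.setdefault(word, Counter())[label] += 1
    return counts


def predict(model: Dict[str, Counter], text: str) -> str:
    """Sum the per-word label counts; default to "ok" for unseen words."""
    votes: Counter = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else "ok"


def feedback_round(
    training_data: List[Example],
    new_content: List[str],
    human_review: Callable[[str, str], str],
) -> Dict[str, Counter]:
    """Run the model on new content, collect verified labels, then retrain."""
    model = train(training_data)
    for text in new_content:
        predicted = predict(model, text)
        verified = human_review(text, predicted)  # reviewer confirms or corrects
        training_data.append((text, verified))
    return train(training_data)
```

The point of the sketch is only the data flow: every reviewed item becomes a new training example, so the model sees more labeled data as content grows.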
Case study: Question quality and automatic
question correction
“Bad” questions on Quora
• Users often ask questions with grammatical and spelling errors
• Example:
  - Original: Which coin/token is next big thing in crypto currencies? And why?
  - Corrected: Which coin/token is the next big thing in cryptocurrencies? Why?
• These are good questions, but the lack of correct phrasing hurts them
  - Less likely to be answered by experts
  - Harder to catch duplicate questions
  - Can hurt the perception of “quality” of Quora
• Types of errors in questions
  - Grammatical errors, e.g., “How I can ...”
  - Spelling mistakes
  - Missing preposition or article
  - Wrong/missing punctuation
  - Wrong capitalization
  - etc.
• Can we use Machine Learning to automatically correct these questions?
• Started off as an “offroad” hack-week project
• Since shipped
Automatic question correction: research
Automatic question correction: Model
• Frame the problem like machine translation: “translate” a badly phrased question into its corrected form
• Final Model:
  - Sequence-to-sequence, character-level RNN (GRU) with attention
• Model Details:
  - Sequence-to-sequence (encoder-decoder) model
  - Character-level
  - GRUs (Gated Recurrent Units)
  - Attention-based
  - Bidirectional
  - Beam search for decoding
• We tried solving the subproblems individually, but that didn’t work as well
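To illustrate the "beam search for decoding" item, here is a generic beam search over characters. It is a sketch, not the model from the talk: the `step` callable stands in for the decoder (in the real system that would be the attention-based GRU), and `toy_step` with its hand-written probability table is purely hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(
    step: Callable[[str], Dict[str, float]],  # prefix -> {next_char: log-prob}
    beam_width: int = 3,
    max_len: int = 20,
    eos: str = "\n",
) -> str:
    """Keep the `beam_width` best partial outputs at every decoding step."""
    beams: List[Tuple[float, str]] = [(0.0, "")]
    finished: List[Tuple[float, str]] = []
    for _ in range(max_len):
        candidates = [
            (score + logp, prefix + ch)
            for score, prefix in beams
            for ch, logp in step(prefix).items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq.endswith(eos):
                finished.append((score, seq[: -len(eos)]))  # strip end marker
            else:
                beams.append((score, seq))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


def toy_step(prefix: str) -> Dict[str, float]:
    """Hypothetical decoder: the best first character is not the best overall."""
    table = {
        "": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"\n": 1.0})
    return {ch: math.log(p) for ch, p in probs.items()}
```

With `beam_width=1` this reduces to greedy decoding, which commits to "a" and ends at a sequence with probability 0.3. A width of 2 keeps "b" alive long enough to find "bz" with probability 0.36.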
Automatic question correction: System Details
• Training:
  - Training data: Pairs of [bad question, corrected question]
  - TensorFlow, on a single box with GPUs
  - Training time: 2-3 hours
• Serving:
  - TensorFlow, GPU-based serving
  - Latency: <500 ms p99
• Run on new questions added to Quora
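As a rough illustration of what character-level training pairs look like before they reach a sequence-to-sequence model, the sketch below builds a character vocabulary and encodes one [bad question, corrected question] pair as padded integer ids. The special tokens, function names, and padding scheme are assumptions made for this example; the talk does not describe the actual preprocessing.

```python
from typing import Dict, List, Tuple

PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"  # hypothetical special tokens

# Example pair taken from the slides: (bad question, corrected question).
PAIRS: List[Tuple[str, str]] = [
    (
        "Which coin/token is next big thing in crypto currencies? And why?",
        "Which coin/token is the next big thing in cryptocurrencies? Why?",
    ),
]


def build_vocab(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Map every character seen in the pairs (plus special tokens) to an id."""
    chars = sorted({ch for src, tgt in pairs for ch in src + tgt})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}


def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn text into ids: <sos> chars... <eos>, right-padded to max_len.

    Characters missing from the vocabulary raise KeyError; how to handle
    unknown characters is left open in this sketch.
    """
    ids = [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]
    if len(ids) > max_len:
        raise ValueError("text does not fit in max_len")
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(PAIRS)
source_ids = encode(PAIRS[0][0], vocab, max_len=80)  # encoder input
target_ids = encode(PAIRS[0][1], vocab, max_len=80)  # decoder target
```

Working at the character level keeps the vocabulary tiny and lets the model repair spelling, spacing, and punctuation, which a word-level vocabulary would struggle to represent.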
Automatic question correction: Results
BNBR classification
• Checks for BNBR (“Be Nice, Be Respectful”) violations on questions, answers, and comments
• Binary classifier
• Training data:
  - Positive: Confirmed BNBR violations
  - Negative: False BNBR reports, other good content
• Model: NN with 1 hidden layer (fastText)
• Same ML decision flow as before
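The training data described above could be laid out for fastText roughly as follows. fastText's supervised mode reads one example per line with the label prefixed by `__label__`; the label names `violation` and `ok`, the helper functions, and the sample texts are assumptions for illustration only.

```python
from typing import Iterable, List


def to_fasttext_line(text: str, is_violation: bool) -> str:
    """Format one example as '__label__<name> <text>' on a single line."""
    label = "__label__violation" if is_violation else "__label__ok"
    cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"{label} {cleaned}"


def build_training_lines(
    confirmed_violations: Iterable[str],
    false_reports: Iterable[str],
    other_good_content: Iterable[str],
) -> List[str]:
    """Positives: confirmed violations. Negatives: false reports + good content."""
    lines = [to_fasttext_line(t, True) for t in confirmed_violations]
    lines += [to_fasttext_line(t, False) for t in false_reports]
    lines += [to_fasttext_line(t, False) for t in other_good_content]
    return lines


lines = build_training_lines(
    confirmed_violations=["You are\n an idiot"],
    false_reports=["I disagree with this answer"],
    other_good_content=["Great explanation, thanks!"],
)
# Writing `lines` to a file and passing its path to
# fasttext.train_supervised(input=...) would train a classifier; that step
# is omitted here so the sketch runs without the fasttext package installed.
```

Using false reports as negatives is useful because they are exactly the borderline content that users flag but reviewers decide is acceptable.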
In conclusion
• Quality is one of the most important problems we face at Quora
• There are various systems to maintain quality, and we need to use all of them in order to keep up
• Machine Learning solutions help us maintain quality at scale
  - ...but you can’t totally bypass human efforts
Thank you!
Nikhil Dandekar
Quora: Nikhil-Dandekar
Twitter: @nikhilbd
Paula Griffin
Quora: Paula-Griffin-1
Twitter: @paulajgriffin
