
Crowdsourcing a News Query Classification Dataset

  • 1. Crowdsourcing a News Query Classification Dataset (slide 0). Richard McCreadie, Craig Macdonald & Iadh Ounis.
  • 2. Introduction (slide 1). What is news query classification, and why would we build a dataset to examine it? It is a binary classification task performed by Web search engines, and up to 10% of queries may be news-related [Bar-Ilan et al., 2009]. We have workers judge Web search queries as news-related or not. [Diagram: a user's query, e.g. "gunman", goes to the Web search engine; news-related queries are answered with news results, non-news-related queries with ordinary Web search results.]
  • 3. Introduction (slide 2). But: news-relatedness is subjective.
  • 4. Workers can easily `game` the task.
  • 5. News queries change over time. A gaming worker can simply answer at random, e.g. `for query in task: return Random(Yes, No)`. [Example: for the query "Octopus", is the intent news-related (World Cup predictions?) or not (a sea creature?)] How can we overcome these difficulties to create a high-quality dataset for news query classification? (slide 2)
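A minimal runnable version of the gaming behaviour sketched in the pseudocode above; the function name and data shapes are our own illustration, not part of the deck.

```python
import random

def gaming_worker(queries):
    """A worker who never reads the queries and answers at random.

    With a binary label set, such a worker still matches any honest
    judge about half the time, which is why per-task validation
    (the honey-pot on slide 6) is needed.
    """
    return {query: random.choice(["Yes", "No"]) for query in queries}
```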
  • 6--10. Talk Outline (slide 3): Introduction (1--3); Dataset Construction Methodology (4--14); Research Questions and Setting (15--17); Experiments and Results (18--20); Conclusions (21).
  • 11. Dataset Construction Methodology (slide 4). How can we go about building a news query classification dataset? (1) Sample queries from the MSN May 2006 query log. (2) Create gold judgments to validate the workers. (3) Propose additional content to tackle the temporal nature of news queries, and prototype interfaces to evaluate this content on a small test set. (4) Create the final labels using the best setting and interface. (5) Evaluate in terms of agreement, and against `experts`.
  • 12. Dataset Construction Methodology (slide 5). Sampling queries: create two query sets, Poisson-sampled from the MSN May 2006 query log. One is for testing (testset, 91 queries), giving fast crowdsourcing turnaround time at very low cost; the other is for the final dataset (fullset, 1206 queries, roughly 10x as many), labelled only once. Example log entries: 2006-05-01 00:00:08 "What is May Day?"; 2006-05-08 14:43:42 "protest in Puerto rico".
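A minimal sketch of the kind of Poisson sampling described above, where each log line is included independently with a fixed probability; the rate, seed and log format are assumptions for illustration.

```python
import random

def poisson_sample(log_lines, rate=0.001, seed=42):
    """Include each query independently with probability `rate`.

    Because inclusion decisions are independent, the sample preserves
    the distribution of the underlying query log, which is why the
    literature considers it representative.
    """
    rng = random.Random(seed)
    return [line for line in log_lines if rng.random() < rate]
```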
  • 13. Dataset Construction Methodology (slide 6). How do we check that our workers are not `gaming` the system? Gold judgments (a honey-pot): a small set (5%) of `cherry-picked`, unambiguous queries, focused on news-related ones, to catch out bad workers early in the task. Multiple workers per query: 3 workers, taking the majority result. Example validation entries: 2006-05-01 00:00:08 "What is May Day?" -> No; 2006-05-08 14:43:42 "protest in Puerto rico" -> Yes.
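A sketch of the two validation mechanisms on this slide, majority voting and the gold-judgment honey-pot; the function names are our own, and the 70% cutoff is taken from the experimental setup on slide 16.

```python
from collections import Counter

def majority_label(worker_labels):
    """Resolve the 3 per-query worker labels, e.g. ["Yes", "Yes", "No"] -> "Yes"."""
    return Counter(worker_labels).most_common(1)[0][0]

def passes_honeypot(worker_answers, gold, cutoff=0.7):
    """Reject a worker whose accuracy on the gold (honey-pot) queries
    falls below the cutoff, catching bad workers early in the task."""
    hits = sum(worker_answers.get(query) == label for query, label in gold.items())
    return hits / len(gold) >= cutoff
```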
  • 14. Dataset Construction Methodology (slide 7). How do we counter the temporal nature of news queries? Workers need to know what the news stories of the time were, but are unlikely to remember the main stories from May 2006. Idea: add extra information to the interface (news headlines, news summaries, Web search results), prototype interfaces on the small testset to keep costs and turnaround time low, and see which works best.
  • 15. Interfaces: Basic (slide 8). [Screenshot: instructions clarifying news-relatedness and what the workers need to do; each task shows the query and its date with a binary labelling control.]
  • 16. Interfaces: Headline (slide 9). Adds 12 news headlines from the New York Times for the period. But will the workers bother to read these?
  • 17. Interfaces: HeadlineInline (slide 10). The same 12 New York Times headlines, placed inline with each query. But maybe headlines are not enough?
  • 18. Interfaces: HeadlineSummary (slide 11). Each news headline is paired with a news summary. [Example query: "Tigers of Tamil"?]
  • 19. Interfaces: LinkSupported (slide 12). Adds links to three major search engines; each link triggers a search containing the query and its date. This interface also gathers some additional feedback from workers.
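One plausible way to build such links, as a Python sketch. The deck does not say which three engines were used or give the URL format, so the endpoints and the `query + date` composition below are assumptions.

```python
from urllib.parse import quote_plus

# Hypothetical endpoints; the deck only says "three major search engines".
ENGINES = {
    "google": "https://www.google.com/search?q=",
    "bing":   "https://www.bing.com/search?q=",
    "yahoo":  "https://search.yahoo.com/search?p=",
}

def search_links(query, date):
    """Build one link per engine whose search contains the query and
    its date, steering results toward the period of the query log."""
    terms = quote_plus(f"{query} {date}")
    return {name: base + terms for name, base in ENGINES.items()}
```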
  • 20. Dataset Construction Methodology (slide 13). How do we evaluate the quality of our labels? (1) Agreement between the three workers per query: the more the workers agree, the more confident we can be that the resulting majority label is correct. (2) Comparison with `expert` (the author's own) judgments: see how many of the queries that the workers judged news-related match the ground truth. Example: 2006-05-05 07:31:23 "abcnews" -> Worker: Yes, Expert: No; 2006-05-08 14:43:42 "protest in Puerto rico" -> Worker: Yes, Expert: Yes.
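The expert comparison, restated as a small sketch computing precision, recall and accuracy of the majority labels against the expert ground truth; treating "Yes" (news-related) as the positive class is our reading of the slides.

```python
def compare_to_expert(worker, expert):
    """worker, expert: parallel lists of "Yes"/"No" labels per query."""
    pairs = list(zip(worker, expert))
    tp = sum(w == "Yes" and e == "Yes" for w, e in pairs)  # agreed news-related
    fp = sum(w == "Yes" and e == "No" for w, e in pairs)   # worker-only news-related
    fn = sum(w == "No" and e == "Yes" for w, e in pairs)   # missed news-related
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(w == e for w, e in pairs) / len(pairs)
    return precision, recall, accuracy
```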
  • 21--25. Talk Outline (slide 14): Introduction (1--3); Dataset Construction Methodology (4--14); Research Questions and Setting (15--17); Experiments and Results (18--20); Conclusions (21).
  • 26. Experimental Setup (slide 15). Research questions: How do our interface and setting affect the quality of our labels? What is the baseline quality, and how bad is it? How much can the honey-pot bring? What about our extra information {headlines, summaries, result rankings}? Can we create a good-quality dataset, in terms of agreement and versus the ground truth, on both the testset and the fullset?
  • 29. Judgments per query: 3.
  • 30. Experimental Setup (slide 16). Costs (per interface): Basic $1.30; Headline $4.59; HeadlineInline $4.59; HeadlineSummary $5.56; LinkSupported $8.78. Restrictions: USA workers only; 70% gold-judgment cutoff. Measures: comparison with the expert ground truth (precision, recall, accuracy), and worker agreement (free-marginal multirater kappa, κfree, and Fleiss multirater kappa, κfleiss).
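A sketch of the two agreement measures: κfree fixes chance agreement at 1/k for k categories, while κfleiss estimates it from the observed class distribution. The input format is our own: one (yes_count, no_count) pair per query from the 3 raters.

```python
def multirater_kappas(counts, k=2):
    """counts: per-query category counts, e.g. [(3, 0), (2, 1), ...],
    each row summing to the number of raters n (here 3)."""
    N = len(counts)
    n = sum(counts[0])
    # Observed agreement: fraction of agreeing rater pairs per query.
    p_obs = sum(c * (c - 1) for row in counts for c in row) / (N * n * (n - 1))
    # Chance agreement under the two models.
    p_free = 1.0 / k
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    p_fleiss = sum(p * p for p in p_j)
    k_free = (p_obs - p_free) / (1 - p_free)
    k_fleiss = (p_obs - p_fleiss) / (1 - p_fleiss)
    return k_free, k_fleiss
```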
  • 31--35. Talk Outline (slide 17): Introduction (1--3); Dataset Construction Methodology (4--14); Research Questions and Setting (15--17); Experiments and Results (18--20); Conclusions (21).
  • 36. Baseline and Validation (slide 18). What is the effect of validation?
  • 37. How good is our baseline (the Basic interface)? Validation is very important: 32% of judgments were rejected, and 20% of those were completed VERY quickly, suggesting bots and new users; watch out for bursty judging. As expected, the baseline is fairly poor: agreement between workers per query is low (25-50%). Measure definitions: Precision is the % of queries labelled as news-related that agree with our ground truth; Recall is the % of all news-related queries that the workers labelled correctly; Accuracy is a combined measure (it assumes that the workers labelled non-news-related queries correctly); κfree is kappa agreement assuming that workers would label randomly; κfleiss is kappa agreement assuming that workers label according to the class distribution.
  • 38. Adding Additional Information (slide 19). Does label quality increase when we provide additional news-related information, and which is the best interface? Answer: yes, as shown by the performance increase; the LinkSupported interface performs best. More information increases performance and agreement, so we can help workers by providing more information, and Web results provide just as much information as headlines. But placing the information with each query (HeadlineInline) causes workers to just match the text. [Chart comparing the Basic, Headline, HeadlineInline, HeadlineSummary and LinkSupported interfaces.]
  • 39. Labelling the Fullset (slide 20). We now label the fullset (1204 queries) using the best setting: gold judgments with the LinkSupported interface. Are the resulting labels of sufficient quality?
  • 40. High recall and agreement indicate that the labels are of high quality. Recall: the workers got all of the news-related queries right. Precision: lower, with workers finding other queries to be news-related. Agreement: high; the workers may be learning the task, and the majority of the work was done by 3 users. [Table comparing testset and fullset results.]
  • 41. Conclusions & Best Practices. Crowdsourcing is useful for building a news query classification dataset.
  • 42. We are confident that our dataset is reliable, since agreement is high. Best practices: (1) online worker validation is paramount; catch out bots and lazy workers to improve agreement. (2) Provide workers with additional information to help improve labelling quality. (3) Workers can learn: running large single jobs may allow workers to become better at the task. Questions? (slide 21)

Editor's Notes

  • #6: Poisson sampling; the literature says it is representative.
  • #16: At the beginning: overriding hypothesis.
  • #17: Too late; $2.
  • #19: Precision: the number of
  • #20: Worker overlap is only 1/3.
  • #21: Precision: the number of
  • #22: More detail, and link backwards. Graph missing. How confident are we in the test collection? It looks reliable.