User Intent and
Assessor Disagreement in
Web Search Evaluation
Gabriella Kazai, Emine Yilmaz, Nick Craswell, S.M.M. Tahaghoghi
Information Retrieval Evaluation
[Diagram: IR system, Evaluation]
Main Tests and Analysis
1. Test crowd judges and trained judges on inter-assessor agreement and user (i.e. click) agreement
   - Single judging UI
   - Pairwise judging UI

2. When clicks show a strong preference, analyse judge quality
3. When clicks indicate substitutability, analyse judge quality
• Click-based properties of web pages: Click Preference Strength, Intent Similarity
• Judge groups: Crowd, Trained Judges
• Relevance judgments: Single UI, Pairwise UI
• Evaluation measures: Inter-assessor Agreement, User-assessor Agreement*
*Called click-agreement in the paper
Click-based Properties of Web Pages
Click Preference Strength; Intent Similarity (Dupe)*

• Sample (q, u, v) where URLs u and v are adjacent, one or both are clicked, and we have seen both orders (uv and vu)
• Click Preference Strength
• Dupe score (Radlinski et al., WSDM 2011)

*The paper has two other intent similarity measures
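
The editor's notes describe click preference strength as the proportion of times only u is clicked minus the proportion of times only v is clicked, over impressions where u and v are adjacent and at least one is clicked. A minimal sketch of that calculation follows. The `PairClickCounts` container, the function name, and the choice of denominator (all sampled impressions, including those where both results were clicked) are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class PairClickCounts:
    """Click counts for one (q, u, v) tuple (hypothetical container).

    Only impressions where u and v were adjacent and at least one of
    them was clicked are counted.
    """
    only_u: int  # impressions where u was clicked and v was not
    only_v: int  # impressions where v was clicked and u was not
    both: int    # impressions where both u and v were clicked


def click_preference_strength(counts: PairClickCounts) -> float:
    """Share of impressions with only u clicked minus share with only v clicked.

    Assumption: the denominator is every sampled impression
    (only_u + only_v + both).
    """
    total = counts.only_u + counts.only_v + counts.both
    if total == 0:
        raise ValueError("no impressions with at least one click")
    return (counts.only_u - counts.only_v) / total
```

Under these assumptions the score lies in [-1, 1]: positive values mean users favour u, negative values mean they favour v, and swapping u and v flips the sign.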
Experiment Setup
• Judge groups: Crowd, Trained Judges
• Judging interfaces: Single UI, Pairwise UI
Method of Analysis

• Inter-assessor agreement
   - Fleiss' kappa
• User-assessor agreement
   - Based on directional agreement between judgment-based preference and click-based preference over pairs of URLs

Def          Example case
Agree        J_URL1 > J_URL2 & C_URL1 > C_URL2
Disagree     J_URL1 > J_URL2 & C_URL1 < C_URL2
Undetected   J_URL1 = J_URL2 & C_URL1 < C_URL2
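
Inter-assessor agreement is measured with Fleiss' kappa. As a rough sketch of the standard statistic (not the authors' code), the function below takes a table in which each row is one judged item and each column counts how many judges chose that label. The function name and input layout are assumptions, and every item is assumed to have the same number of judges.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    judges who assigned item i to category j."""
    if not ratings:
        raise ValueError("need at least one item")
    n_items = len(ratings)
    n_judges = sum(ratings[0])
    if n_judges < 2:
        raise ValueError("need at least two judges per item")
    if any(sum(row) != n_judges for row in ratings):
        raise ValueError("every item needs the same number of judges")
    n_categories = len(ratings[0])

    # Observed agreement: average over items of the fraction of judge
    # pairs that picked the same category.
    per_item = [
        (sum(c * c for c in row) - n_judges) / (n_judges * (n_judges - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_judges)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance.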
What is the relationship between inter-assessor agreement and
agreement with web users (click-agreement) for crowd and editorial
judges in different judging modes?

RESULTS 1
Results 1

Inter-assessor Agreement
                   Single UI   Pairwise UI
Crowd workers      0.24        0.29
Editorial judges   0.51        0.57

User-assessor Agreement (%), shown as Agree - Undetected - Disagree
                   Single UI      Pairwise UI
Crowd workers      45 - 27 - 28   56 - 24 - 20
Editorial judges   58 - 21 - 21   66 - 18 - 16

• Trained judges agree better with each other and with users than crowd
• Pairwise UI leads to better agreement than single UI
• Inter-assessor agreement does NOT mean user-assessor agreement
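
The Agree / Undetected / Disagree breakdown can be sketched as follows. This is an illustration under assumptions: each pair is represented by a judged score for u, a judged score for v, and a signed click preference (positive when users prefer u). How individual judges' labels are aggregated and how pairs with no click preference are treated are not stated in the slides, so the handling here is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_pair(judged_u: float, judged_v: float, click_pref_uv: float) -> str:
    """Compare the direction of the judgment-based preference with the
    click-based preference for one URL pair."""
    if click_pref_uv == 0:
        # Assumption: pairs without a click preference are excluded upstream.
        raise ValueError("click preference has no direction")
    if judged_u == judged_v:
        return "undetected"  # judges see a tie, clicks show a preference
    judges_prefer_u = judged_u > judged_v
    users_prefer_u = click_pref_uv > 0
    return "agree" if judges_prefer_u == users_prefer_u else "disagree"


def agreement_percentages(
    pairs: Iterable[Tuple[float, float, float]]
) -> Dict[str, float]:
    """Percentage of pairs in each of the three outcome classes."""
    counts = Counter(classify_pair(ju, jv, cp) for ju, jv, cp in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairs to classify")
    return {
        label: 100.0 * counts[label] / total
        for label in ("agree", "undetected", "disagree")
    }
```

Reporting all three percentages, as the slide does, keeps ties visible instead of folding them into agreement or disagreement.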
When web users show a strong preference for a result, do we see a
change in inter-assessor agreement or in click-agreement for editorial or
crowd judges?

RESULTS 2
Results 2a
[Plots: inter-assessor agreement (y axis) against click preference strength (x axis); panels for Editorial and Crowd judges under the Single and Pairwise UI]

• No relationship for crowd
• Positive trend for trained judges: they agree more with each other as Puv increases, esp. for high click volume URL pairs (50k, red line)
Results 2b
[Plots: user-assessor agreement (y axis) against click preference strength (x axis); panels for Crowd and Editorial judges under the Single and Pairwise UI]

• With higher Puv, all judges agree better with web users (positive trends)
• Pairwise judging induces judging patterns for crowd that are more similar to editorial judges'
When two documents are detected as satisfying similar intents, do we
see a change in inter-assessor agreement or click-agreement for editorial
or crowd judges?

RESULTS 3
Results 3a
[Plots: inter-assessor agreement (y axis) against intent similarity (Dupe) (x axis); panels for Editorial and Crowd judges under the Single and Pairwise UI]

• Positive trend, except PC (pairwise-crowd): judges agree with each other more on more redundant (dupe) pages
• Crowd judges' inter-assessor agreement has no clear relationship with Dupe score
Results 3b
[Plots: user-assessor agreement (y axis) against intent similarity (Dupe) (x axis); panels for Crowd and Editorial judges under the Single and Pairwise UI]

• Positive trend, except SC (single UI, crowd)
• Pairwise UI exposes properties of web pages that can improve judging quality when faced with more interchangeable documents, leading to better agreement with web users (even if not with other judges)
Conclusions

• Different assessment procedure, different properties
• Trained judges beat crowd judges
• Pairwise UI beats single UI on both inter-assessor and user-assessor agreement
• Note: Specific to our method of sampling adjacent URLs?
• Open issue: Optimizing your assessment procedure
[Diagram labels: Inter-assessor Agreement, User-assessor Agreement, Click Preference, Intent Similarity, Pairwise UI, Single UI, Trained Judges, Crowd]

Editor's Notes

  • #3:
      - A standard practice of evaluating IR effectiveness is…
      - Relevance labels are subjective, leading to assessor disagreements…
      - Recently, preference judging... …more natural user task …higher inter-assessor agreement levels …increased measurement sensitivity. …desirable with the increasing adoption of crowdsourcing
      - Various reports on assessor disagreement, but little work has been done on characterizing disagreement, e.g.:
          - Why does preference judging reduce assessor disagreement?
          - Does better agreement among assessors also mean better agreement with user satisfaction, …user clicks?
          - What is the role of user intent in judging behavior? E.g., judges may agree better when rating pairs of documents that satisfy more similar or more diverse intents.
  • #5: Experiment setup… examine the relationship between assessor disagreement and click-based measures (click preference strength and user intent similarity) for judgments collected from editorial judges and crowd workers using absolute and preference-based methods. Measuring both inter-assessor and user-assessor agreement. Note that user-assessor agreement is referred to as click-agreement in the paper.
  • #6: Click preference is defined as the proportion of times web users prefer one search result, URL u, over another, URL v, for a given query, by solely clicking on u even though both results are observed. In order to identify the cases where both u and v are observed by the user, we focus only on the cases where u and v are presented in consecutive rank positions (regardless of the order in which they are presented) and where at least one result (u or v) is clicked by the user. We then compute the click preference strength of u over v (Puv) as the proportion of times only u is clicked by the user minus the proportion of times only v is clicked. We use $c_{\hat{u}v}$ to denote the number of times when the two results were shown with u immediately above v (e.g., u at rank 2 and v at rank 3), where u was clicked and v was not clicked. Similarly, let $c_{u\hat{v}}$ be the number of times v was clicked and u was not, and $c_{\hat{u}\hat{v}}$ be the number of times when both results were clicked. Intent similarity is measured by 3 different metrics in the paper. Here we only present the dupe score.
  • #7: Data: We use three months of click logs (Sept - Nov 2011) and select pairs of URLs that were shown to users in adjacent rank positions, where both orderings appeared in different impressions, e.g., u ranked above v in some impressions and v ranked above u in others, and where at least one of the two results was clicked. Our final sample set consists of 1,068 (q, u, v) tuples with 830 unique queries, 1,757 unique URLs and 1,915 unique query-URL pairs. …pairwise judging experiments: randomly swap the two URLs …single judging task, we separate the 1,068 pairs of URLs in our sample set into 1,915 unique query-URL pairs… We collect relevance labels using three different HIT designs… sequential pairwise is not shown.
  • #10: Self-explanatory…
  • #13: X = click preference score, Y = kappa. Trend lines are the message: they show whether judges agree better with each other as click-preference strength increases, i.e., when there is stronger user preference for one URL. Crowd workers do not show any relationship. For trained judges, we see that they agree with each other more when users have a stronger preference for one URL, especially for pairs of URLs that have a lot of traffic.
  • #14: All judges agree better with user signals with increasing click preference score. The pairwise UI helps to boost judge-user agreement, and the relationship with preference strength is stronger. Trained judges demonstrate a stronger relationship, especially for high-traffic URLs.
  • #16: This is just the dupe score; see the paper for other intent similarity scores… Weak relationships, barely positive trend, except pairwise-crowd (PC). No clear relationship for crowd: they do not agree with each other any more when judging URLs that are near dupes of each other than when judging URLs that are not dupes.
  • #17: The dupe score highlights the difference between the single and pairwise judging methods: in singleUI-crowd (SC), the more interchangeable the pairs of web pages are, the more likely it is that crowd workers disagree with user clicks, while this trend flips for the pairwise UI. The pairwise UI exposes properties of the web pages that can then improve judging quality when faced with more interchangeable documents, leading to better agreement with web users (even if not with other assessors); see the previous slide.
  • #18: Sampling: URL pairs that were both shown to users next to each other in the ranking -> for informational these will be close calls, since ranked close
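
The sampling described in the notes for slides #7 and #18 (adjacent URLs, both orderings observed, at least one of the two clicked) can be sketched as a filter over impression logs. The `Impression` record and function name are hypothetical. Applying the click condition per impression before checking for both orderings is one reading of the notes, not a confirmed detail of the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Impression(NamedTuple):
    """One logged result page (hypothetical record layout)."""
    query: str
    ranked_urls: List[str]  # results in rank order, top first
    clicked: Set[str]       # URLs the user clicked


def sample_adjacent_pairs(
    impressions: Iterable[Impression],
) -> List[Tuple[str, str, str]]:
    """Return sorted (q, u, v) tuples, with u < v, whose URLs were shown
    adjacently in both orders with at least one of the two clicked."""
    orders_seen: Dict[Tuple[str, str, str], Set[Tuple[str, str]]] = defaultdict(set)
    for imp in impressions:
        # Walk over neighbouring rank positions.
        for upper, lower in zip(imp.ranked_urls, imp.ranked_urls[1:]):
            if upper == lower:
                continue
            if upper not in imp.clicked and lower not in imp.clicked:
                continue  # neither result clicked: not a usable impression
            u, v = sorted((upper, lower))
            orders_seen[(imp.query, u, v)].add((upper, lower))
    # Keep only pairs observed in both orders (uv and vu).
    return sorted(key for key, orders in orders_seen.items() if len(orders) == 2)
```

Requiring both orderings is what lets the click comparison be made regardless of which URL was ranked higher.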