
Kohei Shinden, Atsuki Maruta, Makoto P. Kato
University of Tsukuba
KASYS at the NTCIR-15 WWW-3 Task

Background
• NTCIR-15 WWW-3 Task
  • Ad-hoc document retrieval task for web documents
• Proposed search model using BERT (Birch)
  • Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
• BERT has been successfully applied to a broad range of NLP tasks, including document ranking.

Birch (Yilmaz et al., 2019)
• Applies a sentence-level relevance estimator, learned from QA and microblog search datasets, to ad-hoc document retrieval
1. The sentence-level relevance estimator is obtained by fine-tuning the pre-trained BERT model with QA and microblog search data.
2. Calculate the BM25 score of each document and the BERT score of each document sentence with respect to the query.
3. Take the weighted sum of the BM25 score and the score of the highest BERT-scoring sentence in the document.
[Figure: Birch overview. A pre-trained BERT model is fine-tuned on the datasets to obtain the BERT sentence-level relevance judgement model. Example labelled "Halloween Pictures": the document sentences "Trick or Treat..." (BERT score 0.7), "Children get candy..." (0.3), and "Pumpkin sweets..." (0.1); BM25 score 0.4; BERT + BM25 = 0.6.]

Details of Birch
• Weighted sum of the BM25 score and the score of the highest BERT-scoring sentence in the document
• Assumes that the most relevant sentences in a document are good indicators of document-level relevance [1]
  • S_BM25(d): the BM25 score of document d
  • S_BERT(p_i): the relevance of the i-th highest-scoring sentence, obtained by BERT
  • w_i: a hyper-parameter to be tuned with a validation set
[1] Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
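The weighted combination can be sketched in Python. The formula itself did not survive in the slide text, so this sketch assumes the interpolation form described in the Birch paper: a * BM25 + (1 - a) * (weighted sum of the top-k sentence scores). The function name `birch_score`, the parameter `a`, and all example values are illustrative assumptions, not the authors' settings.

```python
from typing import Sequence


def birch_score(bm25: float, sentence_scores: Sequence[float],
                weights: Sequence[float], a: float) -> float:
    """Interpolate a document's BM25 score with its top-k BERT sentence scores.

    Assumed form: a * bm25 + (1 - a) * sum(w_i * s_i), where s_i is the
    i-th highest sentence score and k = len(weights).
    """
    # Keep only the k highest-scoring sentences, best first.
    top_k = sorted(sentence_scores, reverse=True)[:len(weights)]
    evidence = sum(w * s for w, s in zip(weights, top_k))
    return a * bm25 + (1.0 - a) * evidence


# Hypothetical values: one document with three sentences, top-1 sentence only.
score = birch_score(0.4, [0.3, 0.7, 0.1], weights=[1.0], a=0.5)
print(round(score, 2))
```

In practice `a` and the weights would be tuned on a validation set, and passing two or three weights corresponds to the "Top 2 sentences" and "Top 3 sentences" settings.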

Preliminary Experiment Details
• Preliminary experiments to select datasets and hyper-parameters suitable for ranking web documents
[Table: datasets used by each model. Column headings: Train, Validation. Datasets shown: NTCIR-14 WWW-2 Test Collection (with its original qrels), Robust04, MS MARCO, TREC CAR, TREC MB.]
Model MB: ✓ ✓
Model CAR: ✓ ✓
Model MS MARCO: ✓ ✓
Model CAR → MB: ✓ ✓ ✓
Model MS MARCO → MB: ✓ ✓ ✓
The checkmarks represent the datasets used for training.

Preliminary Experiment Results & Discussion
• Evaluated the prediction results of the Birch models
  • Top k sentences: uses the k sentences with the highest BERT scores for ranking
• MS MARCO → MB is the best. Thus, we submitted runs based on MS MARCO → MB and CAR → MB.
[Figure: bar chart of nDCG@10 (axis from 0 to 0.5) for BM25, MB, CAR, MS MARCO, CAR → MB, and MS MARCO → MB. Series: Baseline, Top 1 sentence, Top 2 sentences, Top 3 sentences. Data labels shown: 0.3098, 0.3112, 0.3103 and 0.3266, 0.3312, 0.3318.]
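For reference, nDCG@k can be sketched as follows. This is a minimal sketch assuming the common log2-discounted formulation with the raw relevance grade used as the gain; the official NTCIR evaluation tools may use different gain values and discounting, so this code is not expected to reproduce the reported numbers. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: Sequence[float], all_gains: Sequence[float],
              k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_gains: relevance grades of the returned documents, in rank order.
    all_gains: relevance grades of every judged document for the topic.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

The score for a run is then the mean of `ndcg_at_k` over all topics.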

Official Evaluation Results & Discussion
• MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
• MS MARCO and TREC CAR probably give better results because they are web document retrieval datasets and contain a large amount of data.
• BERT is also valid for web document retrieval.
• Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
KASYS-E-CO-NEW-1:
- MS MARCO → MB
- Top 3 sentences
KASYS-E-CO-NEW-4:
- MS MARCO → MB
- Top 2 sentences
KASYS-E-CO-NEW-5:
- CAR → MB
- Top 3 sentences
[Figure: bar chart of nDCG, Q, ERR, and iRBU (axis from 0 to 1) for Baseline, KASYS-E-CO-NEW-1, KASYS-E-CO-NEW-4, and KASYS-E-CO-NEW-5. Data labels shown: 0.6935, 0.7123, 0.7959, 0.9389.]

Summary of NEW Runs
• Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
• The effectiveness of BERT in ad hoc web document retrieval tasks was verified.
• MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
• BERT is also valid for web document retrieval.
KASYS-E-CO-NEW-1:
- MS MARCO → MB
- Top 3 sentences
KASYS-E-CO-NEW-5:
- CAR → MB
- Top 3 sentences
KASYS-E-CO-NEW-4:
- MS MARCO → MB
- Top 2 sentences
[Figure: bar chart of nDCG, Q, ERR, and iRBU (axis from 0 to 1) for Baseline, KASYS-E-CO-NEW-1, KASYS-E-CO-NEW-4, and KASYS-E-CO-NEW-5. Data labels shown: 0.6935, 0.7123, 0.7959, 0.9389.]

Replicating and reproducing the THUIR runs at the NTCIR-14 WWW-2 Task

Abstract of REP runs
• Question: is the relative effectiveness of the models in our results consistent with that in the original results?
[Diagram: THUIR: BM25 < LambdaMART (learning-to-rank model). KASYS (ours): BM25 < LambdaMART (learning-to-rank model)?]

Replication Procedure 1
1. Input: WWW-2 and WWW-3 topics (e.g., "honda", "Pokemon", "ice age", "disney", "switch", "Canon", ...)
2. The ClueWeb collection is ranked by the BM25 algorithm.
3. Output: ranked web documents (e.g., 1st "Disney shop", 2nd "Tokyo Disney resort", 3rd "Disney official", ...)
4. Input: the ranked documents are passed to the feature extraction program, which extracts eight features (tf, idf, document length, BM25, and LMIR scores).
Up to this point: BM25. From here: LambdaMART.
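The BM25 ranking step can be illustrated with a small scoring function. This is a minimal sketch of one common BM25 variant; the Lucene-style IDF and the defaults k1 = 1.2 and b = 0.75 are assumptions, since the slides do not state which implementation or parameters were used. It runs over a toy in-memory collection (hypothetical documents), not ClueWeb.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document in a small tokenized collection against a query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


toy_docs = {
    "d1": "disney official site".split(),
    "d2": "tokyo disney resort tickets and hotels".split(),
    "d3": "canon camera shop".split(),
}
ranking = sorted(bm25_scores(["disney"], toy_docs).items(),
                 key=lambda item: -item[1])
print(ranking)
```

Sorting the scores in descending order gives the initial ranked list that the feature extraction step would then consume.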

Replication Procedure 2
1. Input: the ranked web documents are passed to the feature extraction program.
2. Output: features extracted from each document, for example:
   qid:001 1:0.2 ...
   qid:001 1:0.5 ...
   qid:001 1:0.1 ...
   qid:001 1:0.9 ...
3. Input: the extracted features are passed to LambdaMART (train: MQ Track; validate: WWW-1 test collection).
4. Output: re-ranked web documents (e.g., 1st "Disney official", 2nd "Disney shop", 3rd "Tokyo Disney resort", ...)
Note: MQ Track is a dataset of relevance judgments for topic-document pairs.
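The `qid:001 1:0.2 ...` lines resemble the LETOR/SVMlight-style text format that learning-to-rank tools commonly read (`<label> qid:<id> <index>:<value> ...`). The following is a minimal sketch of writing such lines, assuming that format; the helper name `to_letor_line`, the label, and the feature values are hypothetical.

```python
from typing import Sequence


def to_letor_line(label: int, qid: str, features: Sequence[float],
                  doc_id: str = "") -> str:
    """Format one query-document pair as '<label> qid:<qid> 1:<v1> 2:<v2> ...'.

    Feature indices start at 1; the optional doc_id is appended as a comment.
    """
    feats = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    line = f"{label} qid:{qid} {feats}"
    return f"{line} # {doc_id}" if doc_id else line


line = to_letor_line(2, "001", [0.2, 0.5], doc_id="doc-A")
print(line)
```

One such line is written per query-document pair, grouped by query ID.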

Implementation Details
• Features for learning to rank
  • TF, IDF, TF-IDF, document length, BM25 score, and three language-model-based IR scores
• Differences from the original paper
  • THUIR extracted the features from four fields (whole document, anchor text, title, and URL), whereas we extracted the features from the whole document only
  • We applied min-max normalization (using the maximum and minimum values), because the original paper did not describe how the features were normalized
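Min-max normalization of the features can be sketched as below. This assumes normalization is applied per query and per feature, which the slides do not specify; mapping constant features to 0.0 is an arbitrary choice made here. The function name and the example values are hypothetical.

```python
from typing import Dict, List


def min_max_normalize(
    features_by_query: Dict[str, List[List[float]]]
) -> Dict[str, List[List[float]]]:
    """Scale each feature to [0, 1] within each query's document list."""
    normalized = {}
    for qid, rows in features_by_query.items():
        columns = list(zip(*rows))  # one tuple per feature
        lows = [min(col) for col in columns]
        highs = [max(col) for col in columns]
        normalized[qid] = [
            # Constant features (hi == lo) are mapped to 0.0 to avoid dividing by zero.
            [(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows
        ]
    return normalized


example = min_max_normalize({"001": [[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]})
print(example)
```

Whether the minimum and maximum are taken per query or over the whole collection changes the resulting feature values, which is one reason the normalization method matters for replication.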

Preliminary Evaluation Results with Original WWW-2 qrels
[Figure: three bar charts (nDCG@10, Q@10, nERR@10) comparing LambdaMART and BM25 for our runs ("Ours") and the original runs ("Original").]
• Our results are lower than the original results
• LambdaMART results were above BM25 for all evaluation metrics
• Succeeded in reproducing the run

Official Evaluation Results
[Figure: two bar charts, "WWW-3 official result" and "WWW-2 official result", comparing LambdaMART and BM25 on nDCG, Q, ERR, and iRBU.]
• BM25 results were above LambdaMART for all evaluation metrics
• Failed to reproduce the run

Conclusion
• In the original paper, LambdaMART gave better results than BM25; in contrast, our BM25 result was better than our LambdaMART result
• We failed to replicate and reproduce the results of the original paper
Suggestions
• In web search tasks, it is likely more effective to extract features from all fields
• Papers should clearly describe the normalization method

Summary of All Runs
NEW runs
• Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
• The effectiveness of BERT in ad hoc web document retrieval tasks was verified.
• MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
• BERT is also valid for web document retrieval.
REP runs
• In the original paper, LambdaMART gave better results than BM25; in contrast, our BM25 result was better than our LambdaMART result
• We failed to replicate and reproduce the results of the original paper
