5. A simple example of an inverted index in information retrieval
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
i' 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
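The first table can be reproduced with a few lines of code. The following is a minimal sketch, not the slides' own tooling: the tokenizer regex and the helper names normalize and term_doc_pairs are assumptions made for illustration, and the pronoun "I" is left capitalized only to mirror the table.

```python
import re

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def normalize(token):
    # Assumption: lowercase everything except the pronoun "I", as in the table.
    return token if token == "I" else token.lower()

def term_doc_pairs(docs):
    """Emit one (term, doc ID) pair per token, in order of occurrence."""
    pairs = []
    for doc_id in sorted(docs):
        # Letters and apostrophes form a token, so "i'" stays one term.
        for token in re.findall(r"[A-Za-z']+", docs[doc_id]):
            pairs.append((normalize(token), doc_id))
    return pairs

pairs = term_doc_pairs(DOCS)
for term, doc_id in pairs:
    print(term, doc_id)
```

Each token yields one pair, so repeated words such as "killed" appear more than once at this stage.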
Term Doc # (sorted by term, then by doc #)
ambitious 2
be 2
brutus 1
brutus 2
caesar 1
caesar 2
caesar 2
capitol 1
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
was 1
was 2
with 2
you 2
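The sorting step can be sketched as follows. The pairs are copied from the first table; the sort is case-insensitive on the term and then numeric on the doc ID. The final grouping into postings lists goes one step beyond what the table shows and is included only as an assumed next step; sort_key and build_postings are hypothetical names.

```python
from itertools import groupby

# (term, doc ID) pairs in order of occurrence, copied from the first table.
PAIRS = [
    ("I", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
    ("I", 1), ("was", 1), ("killed", 1), ("i'", 1), ("the", 1),
    ("capitol", 1), ("brutus", 1), ("killed", 1), ("me", 1),
    ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("with", 2),
    ("caesar", 2), ("the", 2), ("noble", 2), ("brutus", 2), ("hath", 2),
    ("told", 2), ("you", 2), ("caesar", 2), ("was", 2), ("ambitious", 2),
]

def sort_key(pair):
    term, doc_id = pair
    return (term.lower(), doc_id)  # term first, doc ID as tie-breaker

sorted_pairs = sorted(PAIRS, key=sort_key)

def build_postings(pairs_sorted):
    """Assumed next step: collapse sorted pairs into term -> distinct doc IDs."""
    postings = {}
    for term, group in groupby(pairs_sorted, key=lambda p: p[0].lower()):
        postings[term] = sorted({doc_id for _, doc_id in group})
    return postings

postings = build_postings(sorted_pairs)
print(postings["caesar"])
```

Because equal terms end up adjacent after sorting, a single pass is enough to merge them into one postings list per term.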
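Query 1: Caesar
A brief look at how a single-term query is processed:
Phrase queries additionally require position information.
Dictionary file: searched by binary search or through a B-tree index.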
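A small sketch may make these points concrete: a sorted in-memory term list stands in for the dictionary file and is searched with binary search, and each posting keeps token positions so that a phrase query can check adjacency. All names here (build_positional_index, lookup, phrase_query) are hypothetical, the documents are pre-normalized by hand, and a B-tree dictionary is not shown.

```python
from bisect import bisect_left

DOCS = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

def build_positional_index(docs):
    """Return (sorted term list, parallel list of {doc_id: [positions]})."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    terms = sorted(index)
    return terms, [index[t] for t in terms]

def lookup(terms, postings, term):
    """Binary-search the sorted dictionary; return the term's postings or {}."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return {}

def phrase_query(terms, postings, phrase):
    """Return doc IDs in which the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    lists = [lookup(terms, postings, w) for w in words]
    hits = []
    for doc_id in sorted(set.intersection(*(set(pl) for pl in lists))):
        # Candidate start positions: occurrences of the first word.
        starts = set(lists[0][doc_id])
        for offset, plist in enumerate(lists[1:], start=1):
            starts &= {p - offset for p in plist[doc_id]}
        if starts:
            hits.append(doc_id)
    return hits

TERMS, POSTINGS = build_positional_index(DOCS)
print(sorted(lookup(TERMS, POSTINGS, "caesar")))
print(phrase_query(TERMS, POSTINGS, "caesar was"))
```

A single-term query needs only the doc IDs. The phrase query has to compare positions, which is why an index that stores only doc IDs cannot answer it.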
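#24: *.invlookup file
totalCount----total number of occurrences of the term in the collection
documentCount----number of documents in the collection that contain the term
length----length of data
segment----segment number
offset----offset of the term's inverted document list in the *.ivl file
SegmentOffset----offset within an individual file segment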
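*.ivl file
tid----term ID
df----document frequency: number of documents in the collection that contain the term
diff----lastid-begin
length----length of the data after compression
docid----document ID, sorted in descending order
tf----term frequency: number of times the term occurs in the document
location----positions at which the term occurs in the document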
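The slide says the postings data is stored compressed but does not name the scheme. As an illustration only, the sketch below swaps in a common technique: doc IDs and positions are stored as gaps and each number is written in variable-byte code. It is not the actual *.ivl layout. The function names are hypothetical, and doc IDs are taken in descending order as listed above.

```python
def vbyte_encode(n):
    """Encode a non-negative int in 7-bit groups; the last byte has the high bit set."""
    if n < 0:
        raise ValueError("variable-byte code needs a non-negative integer")
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the final byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Decode a byte string produced by concatenating vbyte_encode outputs."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers

def encode_postings(postings):
    """postings: list of (doc_id, positions), doc IDs descending, positions ascending."""
    out = bytearray()
    prev_doc = None
    for doc_id, positions in postings:
        out += vbyte_encode(doc_id if prev_doc is None else prev_doc - doc_id)
        out += vbyte_encode(len(positions))  # tf
        prev_pos = 0
        for pos in positions:
            out += vbyte_encode(pos - prev_pos)  # position gap
            prev_pos = pos
        prev_doc = doc_id
    return bytes(out)

def decode_postings(data):
    nums = vbyte_decode(data)
    postings, i, prev_doc = [], 0, None
    while i < len(nums):
        doc_id = nums[i] if prev_doc is None else prev_doc - nums[i]
        tf = nums[i + 1]
        positions, pos = [], 0
        for gap in nums[i + 2 : i + 2 + tf]:
            pos += gap
            positions.append(pos)
        postings.append((doc_id, positions))
        prev_doc = doc_id
        i += 2 + tf
    return postings

caesar = [(2, [5, 12]), (1, [4])]  # postings for "caesar" in the example docs
encoded = encode_postings(caesar)
print(len(encoded), decode_postings(encoded))
```

Gaps are usually much smaller than the raw IDs, so most of them fit in a single byte.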
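*.dtlookup file
offset----offset of the document's record in the *.dt file
len----document length, including stop words (if the countStopWords parameter is true)
totalLen----total document length, including stop words
Num----mgrid for terminfolist, df for docinfolist
record----24 bytes? offset is of type INT64 (8 bytes); with that field removed the record is 12 bytes, and with its type changed to int it is 16 bytes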
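The record-size arithmetic (24, 12, and 16 bytes) is consistent with C struct alignment: 8 + 4 + 4 + 4 = 20 bytes of fields, padded up to a multiple of 8 because of the 64-bit member. The sketch below checks this with ctypes. It assumes the record holds exactly these four fields and that the three non-offset fields are 4-byte integers; the class names are hypothetical, and the sizes depend on the platform ABI (the values in the comments are what common 64-bit platforms give).

```python
import ctypes

class DtLookupRecord(ctypes.Structure):
    # Assumed layout: one 8-byte offset followed by three 4-byte ints.
    _fields_ = [
        ("offset", ctypes.c_int64),
        ("len", ctypes.c_int32),
        ("totalLen", ctypes.c_int32),
        ("num", ctypes.c_int32),
    ]

class RecordWithoutOffset(ctypes.Structure):
    _fields_ = [
        ("len", ctypes.c_int32),
        ("totalLen", ctypes.c_int32),
        ("num", ctypes.c_int32),
    ]

class RecordWithIntOffset(ctypes.Structure):
    _fields_ = [
        ("offset", ctypes.c_int32),
        ("len", ctypes.c_int32),
        ("totalLen", ctypes.c_int32),
        ("num", ctypes.c_int32),
    ]

size_int64 = ctypes.sizeof(DtLookupRecord)         # 20 bytes of fields + 4 padding
size_no_offset = ctypes.sizeof(RecordWithoutOffset)
size_int32 = ctypes.sizeof(RecordWithIntOffset)
print(size_int64, size_no_offset, size_int32)
```

The 4 trailing padding bytes keep every element of an array of records aligned on an 8-byte boundary, which explains why shrinking offset to a 4-byte int saves 8 bytes rather than 4.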
*.dt file
did (4 bytes)----document ID
length (4 bytes)----document length, including stop words (if the countStopWords parameter is true)
len (4 bytes)----length of the compressed termlist
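How the two files could work together can be sketched as follows: the lookup table maps a document to an offset, and the record at that offset starts with the three 4-byte fields. Several details are assumptions, not facts from the slides: little-endian byte order, the compressed termlist following the header immediately, and an in-memory buffer and dict standing in for the *.dt and *.dtlookup files. write_record and read_record are hypothetical helpers.

```python
import io
import struct

# did, length, len: three 4-byte ints (little-endian is an assumption).
HEADER = struct.Struct("<iii")

def write_record(buf, did, length, termlist):
    """Append one record to the buffer and return the offset where it starts."""
    offset = buf.seek(0, io.SEEK_END)
    buf.write(HEADER.pack(did, length, len(termlist)))
    buf.write(termlist)  # assumed: compressed termlist follows the header
    return offset

def read_record(buf, offset):
    """Seek to a record and return (did, length, compressed termlist bytes)."""
    buf.seek(offset)
    did, length, n = HEADER.unpack(buf.read(HEADER.size))
    return did, length, buf.read(n)

dt = io.BytesIO()  # stands in for the *.dt file
dtlookup = {}      # doc ID -> offset, stands in for *.dtlookup

# Placeholder payload bytes; 14 and 15 are the token counts of Doc 1 and Doc 2.
dtlookup[1] = write_record(dt, 1, 14, b"\x01\x02\x03")
dtlookup[2] = write_record(dt, 2, 15, b"\x04\x05")

print(read_record(dt, dtlookup[2]))
```

Storing the termlist length in the header lets a reader fetch a whole record with one seek and two reads, without scanning earlier records.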
#25:
#define max_index 3
#define max_level 32                /* number of index block levels */
#define max_segment 127             /* max number of file segments */
#define max_data_in_index_lc 128    /* longest record that can go in index block */