6. 2. Constructing the Inverted Index
Collection → Parser → Modified tokens → Indexer → Inverted Index
Parser: a different parser is chosen for each document type. It comprises a tokenizer and linguistic modules, such as case folding, stop-word removal (stopper), and stemming (stemmer).
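The parsing stage above can be sketched in a few lines. This is a minimal illustration, not a real stopper or stemmer (a production system would use a full stop list and something like the Porter stemmer); the tiny stop list and the single suffix rule are assumptions made for the example.

```python
import re

# Illustrative stop list (assumption; real stop lists are much larger).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def tokenize(text):
    """Tokenizer: split raw text into word tokens."""
    return re.findall(r"[a-zA-Z']+", text)

def linguistic_modules(tokens):
    """Linguistic modules: case folding -> stopping -> crude stemming."""
    terms = []
    for tok in tokens:
        tok = tok.lower()                  # case folding
        if tok in STOP_WORDS:              # stop-word removal
            continue
        if tok.endswith("ing") and len(tok) > 5:
            tok = tok[:-3]                 # crude stand-in for a real stemmer
        terms.append(tok)
    return terms

print(linguistic_modules(tokenize("The Indexing of the Documents")))
# ['index', 'documents']
```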
Indexer: builds the inverted index, which consists of a dictionary file and a postings file; this includes a sort step. For large document collections it also involves merging intermediate files and compressing (index compression) the dictionary and postings files.
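An in-memory sketch of the indexer step just described: collect (term, docID) pairs, sort them, and group them into a dictionary whose entries point at postings lists. (Merging and compression, needed for large collections, are omitted here.)

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict of docID -> list of terms. Returns term -> postings list."""
    # 1. Emit one (term, docID) pair per token occurrence.
    pairs = [(term, doc_id)
             for doc_id, terms in docs.items()
             for term in terms]
    # 2. The sort step: order by term, then by docID.
    pairs.sort()
    # 3. Group into dictionary + postings, dropping duplicate docIDs.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)

index = build_index({1: ["caesar", "killed", "brutus"],
                     2: ["caesar", "noble", "brutus"]})
print(index["brutus"])   # [1, 2]
```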
7. 2. Constructing the Inverted Index (cont.)
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term Doc #
I 1
did 1
enact 1
julius 1
caesar 1
I 1
was 1
killed 1
i' 1
the 1
capitol 1
brutus 1
killed 1
me 1
so 2
let 2
it 2
be 2
with 2
caesar 2
the 2
noble 2
brutus 2
hath 2
told 2
you 2
caesar 2
was 2
ambitious 2
Term Doc #
ambitious 2
be 2
brutus 1
brutus 2
capitol 1
caesar 1
caesar 2
caesar 2
did 1
enact 1
hath 2
I 1
I 1
i' 1
it 2
julius 1
killed 1
killed 1
let 2
me 1
noble 2
so 2
the 1
the 2
told 2
you 2
was 1
was 2
with 2
A simple example of constructing an inverted index without term-position information:
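The sorted term–docID table above can be reproduced in a few lines: emit one (term, docID) pair per token and sort by term, then by docID. For simplicity every token is lowercased here, including "I" (the slide's table instead sorts case-insensitively).

```python
import re

doc1 = "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me."
doc2 = "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"

def pairs(text, doc_id):
    """One (term, docID) pair per token occurrence, case-folded."""
    return [(tok.lower(), doc_id) for tok in re.findall(r"[a-zA-Z']+", text)]

unsorted_pairs = pairs(doc1, 1) + pairs(doc2, 2)
table = sorted(unsorted_pairs)     # sort by term, then by docID
print(table[:3])
# [('ambitious', 2), ('be', 2), ('brutus', 1)]
```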
8. 3. Methods for Constructing the Inverted Index
Hardware basics recap:
Typical system parameters in 2007.
(Seek time is the time for the disk head to move from one position to another. Transfer time per byte is the time to move one byte from disk to main memory.)

symbol  statistic                       value
s       average seek time               5 ms = 5 × 10⁻³ s
b       transfer time per byte          0.02 μs = 2 × 10⁻⁸ s
        processor's clock rate          10⁹ s⁻¹
p       low-level operation time        0.01 μs = 10⁻⁸ s
        (e.g., compare & swap a word)
        size of main memory             several GB
        size of disk space              1 TB or more
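Some back-of-the-envelope arithmetic with the parameters above shows why index construction must avoid seeks: sequential transfer is cheap, seeks dominate. The 100,000-chunk scenario is an illustrative assumption, not from the slide.

```python
s = 5e-3    # average seek time, seconds
b = 2e-8    # transfer time per byte, seconds

one_gb = 10**9

# Reading 1 GB in one sequential pass costs only transfer time.
sequential = one_gb * b
print(f"1 GB sequential read: {sequential:.0f} s")    # 20 s

# Reading the same 1 GB as 100,000 scattered 10 KB chunks pays a seek each.
chunks = 100_000
scattered = chunks * (s + 10_000 * b)
print(f"1 GB via 100k seeks:  {scattered:.0f} s")     # 520 s
```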
20. Further Reading:
1. Steffen Heinz and Justin Zobel, "Efficient Single-Pass Index Construction for Text Databases", 2003.
2. Chapter 5 of "Managing Gigabytes: Compressing and Indexing Documents and Images", 2nd ed., Morgan Kaufmann, San Francisco, California, 1999.
#7: Token: an instance of a sequence of characters. Each such token is a candidate for an index entry, after further processing.
Term: a (normalized) word type, which is an entry in our IR system dictionary.
#10: Sort-based indexing
Naïve in-memory inversion
Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks!)
Single-Pass In-Memory Indexing
No global dictionary
Generate separate dictionary for each block
Don’t sort postings
Accumulate postings in postings lists as they occur
Distributed indexing using MapReduce
Dynamic indexing: Multiple indices, logarithmic merge
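The SPIMI properties listed above (no global dictionary, per-block dictionaries, postings accumulated in occurrence order rather than sorted) can be sketched for a single block pass as follows; the terms are sorted only once, when the block is written out, so blocks can later be merged in one pass.

```python
def spimi_invert(token_stream):
    """One SPIMI block pass. token_stream: iterable of (term, docID)."""
    dictionary = {}                          # per-block dictionary, not global
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # accumulate postings as they occur
    # Sort the terms once at block-write time; postings were never sorted.
    return sorted(dictionary.items())

block = spimi_invert([("caesar", 1), ("brutus", 1),
                      ("caesar", 2), ("brutus", 2)])
print(block)   # [('brutus', [1, 2]), ('caesar', [1, 2])]
```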
#14: (Two block files (the postings lists to be merged) are loaded from disk into memory, merged in memory, and the merged postings list is written back to disk. For readability, terms are shown instead of termIDs.)
In the figure, di denotes the i-th document. All block files are merged simultaneously: small read buffers are needed for reading the 10 block files, and one write buffer for writing the final merged index to disk. In each iteration, the smallest termID that has not yet been processed is selected, using a priority queue or a similar data structure. All postings lists for that termID are read and merged, and the merged postings list is written back to disk. Each read buffer is refilled from its file as needed.
But it is more efficient to do an n-way merge, where you read from all blocks simultaneously.
Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you are not killed by disk seeks.
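The n-way merge just described can be sketched with a heap-based merge over the sorted block runs, so every block is read once, sequentially. Here the blocks are in-memory lists for illustration; in a real system each would be a buffered reader over a block file.

```python
import heapq

def n_way_merge(blocks):
    """blocks: list of sorted [(term, postings), ...] runs, one per block."""
    merged = []
    # heapq.merge streams the runs in sorted order, reading each run once.
    for term, postings in heapq.merge(*blocks):
        if merged and merged[-1][0] == term:
            merged[-1][1].extend(postings)     # same term from another block
        else:
            merged.append((term, list(postings)))
    return merged

b1 = [("brutus", [1]), ("caesar", [1])]
b2 = [("brutus", [2]), ("caesar", [2])]
print(n_way_merge([b1, b2]))
# [('brutus', [1, 2]), ('caesar', [1, 2])]
```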