Building a Search Engine with Scrapy + Sphinx

银平 pkufranky@gmail.com
2010-06-07

Outline

• Overview
• Scrapy – a Python crawling framework
• Sphinx – a C++ full-text search engine
• Demo – a novel search engine built with Scrapy + Sphinx

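Overview - Types of search engines and crawlers

• Search engines
o General-purpose search engines
o Vertical search engines
o Resource-oriented vertical search engines
• Crawlers
o General-purpose crawlers
o Special-purpose crawlers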

Overview - Search engines

• Word segmentation (tokenization)
• Inverted index
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
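
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Sphinx builds its index: the whitespace `tokenize` function is a stand-in for a real word segmenter, and `build_inverted_index`, `search_all`, and the sample documents are hypothetical names and data made up for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real word segmenter)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return the IDs of documents that contain every query term (AND semantics)."""
    term_sets = [set(index.get(term, ())) for term in tokenize(query)]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(search_all(index, "home july"))
```

A query is answered by intersecting the posting lists of its terms, so the cost depends on the length of those lists rather than on the size of the whole collection.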

Scrapy – a Python crawling framework

• Architecture
• Built-in middlewares
• Extensions
• Extracting data from web pages

Architecture
• Components
o Scrapy Engine
o Scheduler
o Downloader
o Spider
o Item Pipeline
o Middlewares
• Event-driven networking: Twisted

Built-in middlewares

• Downloader middlewares
o DefaultHeadersMiddleware
o HttpAuthMiddleware
o HttpCacheMiddleware
o RedirectMiddleware
o RetryMiddleware
• Spider middlewares
o DepthMiddleware
o RefererMiddleware
• Scheduler middlewares
o DuplicatesFilterMiddleware
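
To make the role of a duplicates filter concrete, here is a minimal sketch of the idea: reduce each URL to a canonical fingerprint and refuse to schedule a fingerprint that has already been seen. It is not Scrapy's implementation; `fingerprint`, `SeenFilter`, and the canonicalization rules (lowercased host, sorted query parameters, dropped fragment) are assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(url):
    """Hash a canonical form of the URL so trivially different spellings collide."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment; both choices are assumptions.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", query, "")
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember the fingerprints of scheduled URLs and reject repeats."""

    def __init__(self):
        self._seen = set()

    def should_schedule(self, url):
        fp = fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenFilter()
first = seen.should_schedule("http://example.com/a?x=1&y=2")
repeat = seen.should_schedule("http://EXAMPLE.com/a?y=2&x=1#top")
other = seen.should_schedule("http://example.com/b")
print(first, repeat, other)
```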

Extensions

• Characteristics
o Ordinary classes that Scrapy loads at startup
o Listen for signals (engine_started, item_scraped, item_dropped)
• Built-in extensions
o CoreStats
o WebConsole
o …
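
The extension mechanism is essentially the observer pattern: a plain class subscribes handlers to named signals when it is constructed. The sketch below illustrates that pattern only. `SignalBus` and `ItemCounter` are hypothetical classes written for this example and are not Scrapy's API; only the signal names come from the slide.

```python
from collections import defaultdict


class SignalBus:
    """A tiny signal dispatcher: handlers register by signal name."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def connect(self, signal, handler):
        self._handlers[signal].append(handler)

    def send(self, signal, **kwargs):
        for handler in self._handlers[signal]:
            handler(**kwargs)


class ItemCounter:
    """An 'extension': an ordinary class that subscribes to signals on creation."""

    def __init__(self, bus):
        self.scraped = 0
        self.dropped = 0
        bus.connect("item_scraped", self.on_scraped)
        bus.connect("item_dropped", self.on_dropped)

    def on_scraped(self, item):
        self.scraped += 1

    def on_dropped(self, item):
        self.dropped += 1


bus = SignalBus()
counter = ItemCounter(bus)
bus.send("item_scraped", item={"title": "a"})
bus.send("item_scraped", item={"title": "b"})
bus.send("item_dropped", item={"title": "c"})
print(counter.scraped, counter.dropped)
```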

Extracting data from web pages

• CrawlSpider: Rule/Matcher/callback
• Extraction with XPath
• Scrapy shell
• Parsley: a selector language, a superset of XPath and CSS3 (leaks memory)
li.main>a/@href
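
The selector `li.main>a/@href` asks for the `href` of every `a` that is a direct child of an `li` with class `main`. A rough equivalent can be sketched with the Python standard library. Note the assumptions: `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so the attribute is read with `.get()` instead of `/@href`, and the `PAGE` snippet and `extract_main_links` name are made up for this example. Real pages usually need a forgiving HTML parser such as the one behind Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body><ul>
<li class="main"><a href="/book/1">Book One</a></li>
<li class="main"><a href="/book/2">Book Two</a></li>
<li class="side"><a href="/about">About</a></li>
</ul></body></html>"""


def extract_main_links(markup):
    """Return the href of each <a> directly under an <li class="main">."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//li[@class='main']/a")]


links = extract_main_links(PAGE)
print(links)
```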

Sphinx – a C++ full-text search engine

• Sphinx features
• Sphinx components
• Indexing
• Searching
• SphinxSE: a MySQL storage engine

Sphinx features
• high indexing speed (up to 10 MB/sec on modern CPUs);
• high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
• high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
• provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
• provides distributed searching capabilities;
• provides document excerpts generation;
• provides searching from within MySQL through pluggable storage engine;
• supports boolean, phrase, and word proximity queries;
• supports multiple full-text fields per document (up to 32 by default);
• supports multiple additional attributes per document (i.e. groups, timestamps, etc.);
• supports stopwords;
• supports both single-byte encodings and UTF-8;
• supports English stemming, Russian stemming, and Soundex for morphology;
• supports MySQL natively (MyISAM and InnoDB tables are both supported);
• supports PostgreSQL natively.

Sphinx components

• indexer (binary)
• searchd (binary)
• search (binary)
• sphinxapi (API libraries for PHP, Python, Perl, Ruby)
• spelldump
• indextool

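Indexing

• Data sources: databases, XML, and so on.
o Each table row is treated as one document
o The configuration specifies which columns are full-text indexed
• Attributes: some columns can be declared as document attributes; they are not full-text indexed but can be used for filtering and sorting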

Indexing (2)

Index configuration snippet

sql_query = SELECT id, title, content,
author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date

Filtering and sorting example

// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );

// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );

// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
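
What these three calls ask searchd to do can be simulated over an in-memory list. The sketch below is illustrative only and does not use the Sphinx API: `apply_filters`, `sort_attr_desc`, and the sample `posts` are hypothetical, and it assumes that multiple filters are combined with AND.

```python
# Hypothetical matches: attribute values as they would be stored per document.
posts = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": 1275868800},
    {"id": 2, "author_id": 456, "forum_id": 3, "post_date": 1275955200},
    {"id": 3, "author_id": 123, "forum_id": 7, "post_date": 1276041600},
    {"id": 4, "author_id": 123, "forum_id": 2, "post_date": 1276128000},
]


def apply_filters(matches, filters):
    """Keep a match only if, for every filter, its attribute is in the allowed set."""
    return [
        m for m in matches
        if all(m[attr] in allowed for attr, allowed in filters.items())
    ]


def sort_attr_desc(matches, attr):
    """Order matches by one attribute, largest value first."""
    return sorted(matches, key=lambda m: m[attr], reverse=True)


filtered = apply_filters(posts, {"author_id": {123}, "forum_id": {1, 3, 7}})
result_ids = [m["id"] for m in sort_attr_desc(filtered, "post_date")]
print(result_ids)
```

Because attributes are stored alongside the index instead of being full-text indexed, this kind of filtering and ordering needs no lookup back into the source database.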

Searching – Matching modes

Matching modes
o SPH_MATCH_ALL
o SPH_MATCH_ANY
o SPH_MATCH_PHRASE
o SPH_MATCH_BOOLEAN
o SPH_MATCH_EXTENDED2
SPH_MATCH_EXTENDED2 is the most flexible:
hello | world
hello | -world
@name hello @intro world
"hello world"
aaa << bbb << ccc
"hello world foo"~10
"the world is a wonderful place"/3
"hello world" @title "example program"~5 @body python -(php|perl) @* code

Searching – Sorting modes

• SPH_SORT_RELEVANCE
• SPH_SORT_EXTENDED
@weight DESC, price ASC, @id DESC

• SPH_SORT_EXPR
$cl->SetSortMode ( SPH_SORT_EXPR,
"@weight + ( user_karma + ln(pageviews) )*0.1" );
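
The expression above can be read as a per-match sort key. The following sketch evaluates the same formula in Python, assuming that matches with larger values rank first; `sort_key` and the sample `matches` are hypothetical and stand in for values searchd would compute itself.

```python
import math


def sort_key(match):
    """Mirror of "@weight + ( user_karma + ln(pageviews) )*0.1"."""
    return match["weight"] + (match["user_karma"] + math.log(match["pageviews"])) * 0.1


# Hypothetical matches with a relevance weight and two attributes each.
matches = [
    {"id": 1, "weight": 10, "user_karma": 0, "pageviews": 1},
    {"id": 2, "weight": 9, "user_karma": 20, "pageviews": 1},
    {"id": 3, "weight": 10, "user_karma": 5, "pageviews": 1},
]

ranked = [m["id"] for m in sorted(matches, key=sort_key, reverse=True)]
print(ranked)
```

Match 2 has the lowest relevance weight but the highest karma, so it moves to the front: the expression lets attributes outweigh pure text relevance.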

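Searching – Distributed search

• Partition the data horizontally and index each partition separately
• Configure a distributed index on the master searchd
• The master searchd sends the request to each agent searchd, merges the results they return, and returns the final result
• Every searchd in the cluster can act as the master, which allows load balancing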
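
The scatter-and-merge step can be sketched without any networking. In this illustration each shard is a plain dictionary and `search_shard` stands in for one agent searchd; `distributed_search`, the shard contents, and the precomputed weights are all hypothetical. The master merges the already-sorted per-shard lists and keeps the top results.

```python
import heapq
from itertools import islice


def search_shard(shard, term):
    """Stand-in for one agent searchd: return (weight, doc_id) pairs, best first."""
    hits = [
        (weight, doc_id)
        for doc_id, (text, weight) in shard.items()
        if term in text.split()
    ]
    return sorted(hits, reverse=True)


def distributed_search(shards, term, limit):
    """Query every shard, merge the sorted result lists, and keep the top `limit`."""
    per_shard = [search_shard(shard, term) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in islice(merged, limit)]


# Hypothetical shards: doc_id -> (text, precomputed relevance weight).
shards = [
    {1: ("hello world", 5), 2: ("hello there", 9)},
    {3: ("hello again", 7), 4: ("goodbye", 8)},
]
top = distributed_search(shards, "hello", 2)
print(top)
```

Since each shard returns its hits already ordered, the master only needs a streaming merge and never has to re-sort the full result set.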

Searching – SphinxQL: searching with SQL syntax

• searchd implements the MySQL network protocol
• searchd can be used like a MySQL server and accessed with a MySQL client

SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
SELECT * FROM test1 WHERE MATCH('"test doc"/3')
SELECT * FROM test WHERE MATCH('@title hello @body world') OPTION
ranker=bm25, max_matches=3000

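SphinxSE: a MySQL storage engine

Characteristics
• Like InnoDB and MyISAM, it must be compiled into MySQL
• It stores no data itself; it communicates with searchd to fetch results
Advantages
• Usable from any language, whereas the native API supports only a few languages
• More efficient when search results need further processing on the MySQL side (JOIN, MySQL-like filtering)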

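Sphinx vs. Xapian

Sphinx
• searchd provides the search service
• No need to write your own indexer or any C++ code; indexing and searching are set up through configuration alone
• Distributed search

Xapian
• Like Lucene, the API accesses the index files directly to search
• You must write your own indexer
• Highly customizable (Douban switched from Sphinx to Xapian)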

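Demo – a search engine built with Scrapy + Sphinx

This demo builds a novel search engine by crawling, indexing, and searching novels from Qidian (起点).

The demo code can be downloaded from GitHub: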

git clone git://github.com/pkufranky/sedemo-indexer.git
git clone git://github.com/pkufranky/sedemo-spider.git

• Implement the crawler with Scrapy
• Implement indexing and search with Sphinx
• Implement the search front end

For details, see http://pkufranky.heroku.com/2010/06/03/scrapysphinx/