Building a search engine with scrapy + sphinx

   银平 pkufranky@gmail.com
        2010-06-07
Outline

•   Overview
•   Scrapy – a Python crawling framework
•   Sphinx – a C++ full-text search engine
•   demo – a novel search engine built with scrapy + sphinx
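Overview - types of search engines and crawlers

• Search engines
  o General-purpose search engines
  o Vertical search engines
  o Resource-oriented vertical search engines
• Crawlers
  o General-purpose crawlers
  o Focused crawlers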
Overview - search engines

 • Word segmentation (tokenization)
 • Inverted index
http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
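
As a rough illustration of the two steps above, here is a minimal sketch of tokenizing documents and building an inverted index. The sample documents and the function names are assumptions made for this example. The regex tokenizer only splits on word boundaries, so Chinese text would need a real word segmenter in its place.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
```

A query is answered by intersecting posting lists, so `search_all(index, "home july")` only touches the lists for `home` and `july` instead of scanning every document.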
Scrapy – a Python crawling framework

•   Architecture
•   Built-in middlewares
•   Extensions
•   Extracting data from web pages
Architecture
• Components
  o Scrapy Engine
  o Scheduler
  o Downloader
  o Spider
  o Item Pipeline
  o Middlewares
• Event-driven networking: Twisted
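
To make the roles of these components concrete, here is a toy, synchronous sketch of the data flow between them. It is not Scrapy's API or implementation: real Scrapy runs asynchronously on Twisted and passes requests and responses through middlewares. The in-memory `PAGES` dictionary and all function names are assumptions for this example.

```python
from collections import deque


def run_crawl(start_urls, download, parse, pipeline):
    """Toy engine loop: scheduler -> downloader -> spider -> item pipeline."""
    scheduler = deque(start_urls)          # Scheduler: queue of pending URLs
    seen = set(start_urls)
    while scheduler:
        url = scheduler.popleft()
        response = download(url)           # Downloader: fetch the page
        for result in parse(url, response):  # Spider: yields items or URLs
            if isinstance(result, str):    # a new URL to follow
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                          # a scraped item
                pipeline(result)           # Item Pipeline: store or post-process


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "/": {"title": "index", "links": ["/a", "/b"]},
    "/a": {"title": "page a", "links": ["/b"]},
    "/b": {"title": "page b", "links": ["/"]},
}


def fake_download(url):
    return PAGES[url]


def parse(url, response):
    # Emit one item per page, then every outgoing link.
    yield {"url": url, "title": response["title"]}
    for link in response["links"]:
        yield link


items = []
run_crawl(["/"], fake_download, parse, items.append)
```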
Architecture (diagram)
Built-in middlewares

• Downloader middlewares
  o DefaultHeadersMiddleware
  o HttpAuthMiddleware
  o HttpCacheMiddleware
  o RedirectMiddleware
  o RetryMiddleware
• Spider middlewares
  o DepthMiddleware
  o RefererMiddleware
• Scheduler middlewares
  o DuplicatesFilterMiddleware
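
As an example of what one of these middlewares does conceptually, the sketch below filters duplicate URLs by remembering a fingerprint of each one. It illustrates the idea only and is not the code of DuplicatesFilterMiddleware. The class name, the URL normalization rule, and the sample URLs are assumptions.

```python
import hashlib


class SimpleDuplicatesFilter:
    """Remember a fingerprint of every URL seen and reject repeats."""

    def __init__(self):
        self.fingerprints = set()

    @staticmethod
    def fingerprint(url):
        # Hypothetical normalization: drop the fragment and a trailing slash.
        normalized = url.split("#", 1)[0].rstrip("/")
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def seen(self, url):
        """Return True if the URL was seen before; otherwise record it."""
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupefilter = SimpleDuplicatesFilter()
urls = [
    "http://example.com/book/1",
    "http://example.com/book/1#comments",
    "http://example.com/book/2",
    "http://example.com/book/1/",
]
to_crawl = [u for u in urls if not dupefilter.seen(u)]
```

Storing fixed-size fingerprints instead of full URLs keeps memory use predictable on large crawls.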
Extensions

• Characteristics
  o Plain classes loaded when Scrapy starts
  o Listen for signals (engine_started, item_scraped,
    item_dropped)
• Built-in extensions
  o CoreStats
  o WebConsole
  o …
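
The pattern of a plain class that subscribes to signals can be sketched without Scrapy. The `SignalManager` below is a tiny stand-in written for this example and is not Scrapy's signal API. Only the signal names come from the slide.

```python
from collections import defaultdict


class SignalManager:
    """Tiny stand-in for a signal dispatcher (not Scrapy's implementation)."""

    def __init__(self):
        self.receivers = defaultdict(list)

    def connect(self, receiver, signal):
        self.receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        for receiver in self.receivers[signal]:
            receiver(**kwargs)


class ItemCounter:
    """Extension-style class: counts scraped and dropped items."""

    def __init__(self, signals):
        self.scraped = 0
        self.dropped = 0
        signals.connect(self.item_scraped, signal="item_scraped")
        signals.connect(self.item_dropped, signal="item_dropped")

    def item_scraped(self, item):
        self.scraped += 1

    def item_dropped(self, item):
        self.dropped += 1


signals = SignalManager()
counter = ItemCounter(signals)
signals.send("item_scraped", item={"title": "a"})
signals.send("item_scraped", item={"title": "b"})
signals.send("item_dropped", item={"title": ""})
signals.send("engine_started")  # no receiver is connected, so nothing happens
```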
Extracting data from web pages

•  CrawlSpider: Rule/Matcher/callback
•  Extraction with XPath
•  Scrapy shell
•  Parsley: a selector language, a superset of XPath and CSS3
   (has memory leaks)
li.main>a/@href
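
The selector above reads roughly as "the href of every link directly inside an li with class main", which in plain XPath would be close to `//li[@class="main"]/a/@href`. A minimal sketch of the same extraction using only the Python standard library follows. The sample markup is an assumption, and Scrapy's own selectors are not used here. Note that `@class='main'` matches the exact attribute value, which is stricter than the CSS `.main` test.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
PAGE = """
<html><body>
  <ul>
    <li class="main"><a href="/book/1">Book one</a></li>
    <li class="side"><a href="/ad">Ad</a></li>
    <li class="main"><a href="/book/2">Book two</a></li>
  </ul>
</body></html>
"""

root = ET.fromstring(PAGE.strip())
# ElementTree supports only a subset of XPath, so select the <a> elements
# and read the attribute in Python instead of using /@href in the path.
links = [a.get("href") for a in root.findall(".//li[@class='main']/a")]
```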
Sphinx – a C++ full-text search engine

•   Sphinx features
•   Sphinx components
•   Indexing
•   Searching
•   SphinxSE: a MySQL storage engine
Sphinx features
• high indexing speed (up to 10 MB/sec on modern CPUs);
• high search speed (average query takes under 0.1 sec on 2-4 GB text collections);
• high scalability (up to 100 GB of text, up to 100 M documents on a single
  CPU);
• provides good relevance ranking through a combination of phrase proximity
  ranking and statistical (BM25) ranking;
• provides distributed searching capabilities;
• provides document excerpts generation;
• provides searching from within MySQL through a pluggable storage engine;
• supports boolean, phrase, and word proximity queries;
• supports multiple full-text fields per document (up to 32 by default);
• supports multiple additional attributes per document (e.g. groups, timestamps,
  etc.);
• supports stopwords;
• supports both single-byte encodings and UTF-8;
• supports English stemming, Russian stemming, and Soundex for morphology;
• supports MySQL natively (MyISAM and InnoDB tables are both supported);
• supports PostgreSQL natively.
Sphinx components

•   indexer (binary)
•   searchd (binary)
•   search (binary)
•   sphinxapi (API libraries for PHP, Python, Perl, Ruby)
•   spelldump
•   indextool
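Indexing

• Data sources: databases, XML, and so on.
  o Each table row is treated as one document
  o The configuration specifies which columns are full-text indexed
• Attributes: some columns can be declared as document attributes; they are
  not full-text indexed but can be used for filtering and sorting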
Indexing (2)

Index configuration fragment

sql_query = SELECT id, title, content, \
  author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date

Filtering and sorting example

// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );

// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );

// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
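
To show what these calls select, here is a small in-memory sketch of the same semantics, assuming that filters set on one client combine with AND. It does not talk to searchd. The sample rows and the `apply_filters` helper are assumptions for this example.

```python
from datetime import date

# Hypothetical rows with the attribute columns from the sql_query above.
posts = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": date(2010, 5, 1)},
    {"id": 2, "author_id": 123, "forum_id": 2, "post_date": date(2010, 5, 9)},
    {"id": 3, "author_id": 456, "forum_id": 3, "post_date": date(2010, 5, 3)},
    {"id": 4, "author_id": 123, "forum_id": 7, "post_date": date(2010, 6, 2)},
]


def apply_filters(rows, filters):
    """Keep rows whose attribute value is in the allowed set, for every filter."""
    return [
        row for row in rows
        if all(row[attr] in allowed for attr, allowed in filters.items())
    ]


# Same constraints as the SetFilter calls: author 123, sub-forums 1, 3 and 7.
filters = {"author_id": {123}, "forum_id": {1, 3, 7}}
matched = apply_filters(posts, filters)

# Same ordering as SPH_SORT_ATTR_DESC on post_date: newest first.
matched.sort(key=lambda row: row["post_date"], reverse=True)
result_ids = [row["id"] for row in matched]
```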
Searching – matching modes

Matching modes
     o   SPH_MATCH_ALL
     o   SPH_MATCH_ANY
     o   SPH_MATCH_PHRASE
     o   SPH_MATCH_BOOLEAN
     o   SPH_MATCH_EXTENDED2
SPH_MATCH_EXTENDED2, the most flexible mode
hello | world
hello | -world
@name hello @intro world
"hello world"
aaa << bbb << ccc
"hello world foo"~10
"the world is a wonderful place"/3
"hello world" @title "example program"~5 @body python -(php|perl) @* code
Searching – sorting modes

• SPH_SORT_RELEVANCE
• SPH_SORT_EXTENDED
@weight DESC, price ASC, @id DESC

• SPH_SORT_EXPR
$cl->SetSortMode ( SPH_SORT_EXPR,
  "@weight + ( user_karma + ln(pageviews) )*0.1" );
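Searching – distributed search

• Partition the data horizontally and index each partition separately
• Configure a distributed index on the master searchd
• The master searchd sends the query to each slave searchd, merges the
  returned results, and returns the final result
• Any searchd in the cluster can act as the master, which allows load balancing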
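
The merge step on the master can be sketched as a k-way merge of per-shard result lists. This illustrates the idea only and is not searchd's implementation. The shard contents and the function name are assumptions, and each shard is assumed to return its matches already sorted by weight in descending order.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit):
    """Merge per-shard result lists (each sorted by weight, descending)."""
    merged = heapq.merge(*shard_results, key=lambda m: m[1], reverse=True)
    return list(islice(merged, limit))


# Hypothetical (doc_id, weight) lists returned by three slave searchd nodes.
shard_a = [(11, 9.5), (12, 7.0), (13, 2.0)]
shard_b = [(21, 8.0), (22, 6.5)]
shard_c = [(31, 9.0)]

top = merge_shard_results([shard_a, shard_b, shard_c], limit=4)
```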
Searching – SphinxQL: searching with SQL syntax

• searchd implements the MySQL network protocol
• searchd can be used as a MySQL server and accessed with the mysql client

SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
SELECT * FROM test1 WHERE MATCH('"test doc"/3')
SELECT * FROM test WHERE MATCH('@title hello @body world') OPTION ranker=bm25, max_matches=3000
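SphinxSE: a MySQL storage engine

Characteristics
• Like InnoDB and MyISAM, it must be compiled into MySQL
• Stores no data itself; it talks to searchd to fetch results
Advantages
• Usable from any language, whereas the native API supports only a few languages
• More efficient when search results need further processing on the MySQL side (JOIN,
  mysql-like filtering)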
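Sphinx vs. Xapian

Sphinx
• searchd provides search as a service
• No need to implement your own indexer or write C++ code; indexing and
  search work through configuration alone
• Distributed search

Xapian
 • Similar to Lucene: the API accesses the index files directly to search
 • You have to implement the indexer yourself
 • Highly customizable (Douban switched from Sphinx to Xapian)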
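demo – a search engine built with scrapy + sphinx

The demo crawls, indexes, and searches novels from Qidian (起点) to build a novel search engine.

The demo code can be downloaded from GitHub:

git clone git://github.com/pkufranky/sedemo-indexer.git
git clone git://github.com/pkufranky/sedemo-spider.git

• Crawler implemented with scrapy
• Indexing and search implemented with sphinx
• Search front end

For details, see http://pkufranky.heroku.com/2010/06/03/scrapysphinx/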
