8. Built-in middlewares
• Downloader middlewares
o DefaultHeadersMiddleware
o HttpAuthMiddleware
o HttpCacheMiddleware
o RedirectMiddleware
o RetryMiddleware
• Spider middlewares
o DepthMiddleware
o RefererMiddleware
• Scheduler middlewares
o DuplicatesFilterMiddleware
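As a rough illustration of what a downloader middleware does, the sketch below mimics the idea behind DefaultHeadersMiddleware: a plain class whose process_request hook fills in missing headers before a request is downloaded. It deliberately does not import Scrapy. FakeRequest, DefaultHeadersSketch and the DEFAULT_HEADERS values are hypothetical stand-ins; a real middleware would receive Scrapy request objects and be enabled through the project settings.

```python
# Toy downloader middleware, modelled loosely on the idea of DefaultHeadersMiddleware.
# Assumption: FakeRequest stands in for a Scrapy request; it is not Scrapy's class.


class FakeRequest:
    def __init__(self, url, headers=None):
        self.url = url
        self.headers = dict(headers or {})


class DefaultHeadersSketch:
    """Fill in default headers without overwriting ones already set."""

    def __init__(self, default_headers):
        self.default_headers = dict(default_headers)

    def process_request(self, request, spider):
        for name, value in self.default_headers.items():
            # setdefault keeps any header the spider set explicitly.
            request.headers.setdefault(name, value)
        # Returning None means "carry on with the next middleware".
        return None


# Illustrative values only.
DEFAULT_HEADERS = {"Accept-Language": "en", "Accept": "text/html"}

middleware = DefaultHeadersSketch(DEFAULT_HEADERS)
request = FakeRequest("http://example.com/", headers={"Accept": "application/json"})
middleware.process_request(request, spider=None)
print(request.headers)
```

The header set by the caller survives, and only the missing one is filled in, which is the behaviour one would usually want from a "defaults" middleware.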
9. Extensions
• Features
o Plain classes that are loaded when Scrapy starts
o Listen for various signals (engine_started, item_scraped, item_dropped)
• Built-in extensions
o CoreStats
o WebConsole
o …
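The extension pattern described on this slide (a plain class that subscribes to signals) can be sketched without Scrapy itself. In the toy model below, SignalBus is a hypothetical stand-in for the framework's signal dispatcher and ItemCounter is an invented extension-like class; only the signal names item_scraped and item_dropped come from the slide.

```python
# Toy model of an extension: a plain class that subscribes to signals.
# Assumption: SignalBus is a hypothetical dispatcher, not Scrapy's API.
from collections import defaultdict


class SignalBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def connect(self, handler, signal):
        self._handlers[signal].append(handler)

    def send(self, signal, **kwargs):
        for handler in self._handlers[signal]:
            handler(**kwargs)


class ItemCounter:
    """Extension-like object that counts scraped and dropped items."""

    def __init__(self, bus):
        self.scraped = 0
        self.dropped = 0
        bus.connect(self.on_scraped, signal="item_scraped")
        bus.connect(self.on_dropped, signal="item_dropped")

    def on_scraped(self, item):
        self.scraped += 1

    def on_dropped(self, item):
        self.dropped += 1


bus = SignalBus()
counter = ItemCounter(bus)
bus.send("item_scraped", item={"title": "a"})
bus.send("item_scraped", item={"title": "b"})
bus.send("item_dropped", item={"title": "c"})
print(counter.scraped, counter.dropped)
```

The point of the design is that the counting code never touches the crawl loop: it only reacts to events, so it can be switched on or off without changing any spider.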
10. Extracting data from web pages
• CrawlSpider: Rule/Matcher/callback
• Extraction with XPath
• Scrapy shell
• Parsley: a selector language, a superset of XPath and CSS3 (memory leak)
li.main>a/@href
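To make the selector above concrete, here is a minimal sketch of the same selection ("href of every a that is a direct child of an li with class main") written with the limited XPath subset in Python's standard xml.etree.ElementTree. The PAGE markup is invented for the example. ElementTree only parses well-formed XML, so real-world HTML would normally need a more lenient parser or Scrapy's own selectors.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup.
PAGE = """
<html><body><ul>
  <li class="main"><a href="/news">News</a></li>
  <li class="side"><a href="/ads">Ads</a></li>
  <li class="main"><a href="/docs">Docs</a></li>
</ul></body></html>
"""

root = ET.fromstring(PAGE.strip())

# XPath counterpart of li.main>a: <a> children of <li class="main">.
anchors = root.findall(".//li[@class='main']/a")

# The trailing /@href step is done in Python, since ElementTree's
# XPath subset returns elements rather than attribute values.
links = [a.get("href") for a in anchors]
print(links)
```

Note that the predicate matches the class attribute exactly, so an element with class="main wide" would not be selected by this sketch.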
12. Sphinx features
• high indexing speed (up to 10 MB/sec on modern CPUs);
• high search speed (average query takes under 0.1 sec on 2-4 GB text collections);
• high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
• provides good relevance ranking through a combination of phrase proximity ranking and statistical (BM25) ranking;
• provides distributed searching capabilities;
• provides document excerpt generation;
• provides searching from within MySQL through a pluggable storage engine;
• supports boolean, phrase, and word proximity queries;
• supports multiple full-text fields per document (up to 32 by default);
• supports multiple additional attributes per document (e.g. groups, timestamps, etc.);
• supports stopwords;
• supports both single-byte encodings and UTF-8;
• supports English stemming, Russian stemming, and Soundex for morphology;
• supports MySQL natively (MyISAM and InnoDB tables are both supported);
• supports PostgreSQL natively.
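14. Indexing
• Data sources: databases, XML, and so on.
o Each table row is treated as one document
o The configuration specifies which columns are to be indexed
• Attributes: some table columns can be designated as document attributes; they are not full-text indexed, but can be used for filtering and sorting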
15. Indexing (2)
Index configuration fragment
sql_query = SELECT id, title, content, \
    author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date
Filtering and sorting example
// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );
// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
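What these calls mean can be modelled on a plain in-memory list. The sketch below is not the Sphinx client API: set_filter and sort_attr_desc are hypothetical helpers and the posts data is invented. It only shows the semantics assumed here, namely that each filter keeps documents whose attribute is in the given value list, successive filters combine with AND, and the survivors are ordered by an attribute in descending order.

```python
from datetime import datetime

# Invented sample documents carrying the attributes from the config fragment.
posts = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": datetime(2010, 1, 5)},
    {"id": 2, "author_id": 123, "forum_id": 2, "post_date": datetime(2010, 2, 1)},
    {"id": 3, "author_id": 456, "forum_id": 3, "post_date": datetime(2010, 3, 1)},
    {"id": 4, "author_id": 123, "forum_id": 7, "post_date": datetime(2010, 4, 10)},
]


def set_filter(docs, attr, values):
    """Keep documents whose attribute value is one of `values`."""
    allowed = set(values)
    return [d for d in docs if d[attr] in allowed]


def sort_attr_desc(docs, attr):
    """Order documents by an attribute, largest value first."""
    return sorted(docs, key=lambda d: d[attr], reverse=True)


matches = set_filter(posts, "author_id", [123])       # posts by author 123
matches = set_filter(matches, "forum_id", [1, 3, 7])  # and in sub-forums 1, 3 or 7
matches = sort_attr_desc(matches, "post_date")        # newest first
print([d["id"] for d in matches])
```

Because attributes are stored alongside the index rather than full-text indexed, this kind of filtering and sorting does not depend on the query text at all.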
16. Searching – match modes
Match modes
o SPH_MATCH_ALL
o SPH_MATCH_ANY
o SPH_MATCH_PHRASE
o SPH_MATCH_BOOLEAN
o SPH_MATCH_EXTENDED2
SPH_MATCH_EXTENDED2 is the most flexible mode
hello | world
hello | -world
@name hello @intro world
"hello world"
aaa << bbb << ccc
"hello world foo"~10
"the world is a wonderful place"/3
"hello world" @title "example program"~5 @body python -(php|perl) @* code
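Of the operators above, the quorum operator ("..."/N) is perhaps the least familiar: it matches a document when at least N of the listed words occur in it. The sketch below is a simplified model of that rule, not Sphinx's implementation. tokenize and quorum_match are hypothetical helpers, and stemming, stopwords and fields are ignored.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def quorum_match(query_words, document, threshold):
    """Return True if at least `threshold` distinct query words occur in the document."""
    doc_words = set(tokenize(document))
    wanted = set(tokenize(query_words))
    hits = sum(1 for word in wanted if word in doc_words)
    return hits >= threshold


QUERY = "the world is a wonderful place"

# Simplified model of: "the world is a wonderful place"/3
print(quorum_match(QUERY, "What a wonderful world", 3))
print(quorum_match(QUERY, "Hello world", 3))
```

The first document contains three of the six query words ("a", "wonderful", "world") and so satisfies the quorum; the second contains only one.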
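19. Searching – SphinxQL: searching with SQL syntax
• searchd implements the MySQL network protocol
• searchd can be used as if it were a MySQL server, with connections made through a MySQL client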
SELECT *, @weight*10+docboost AS skey FROM example ORDER BY ske
SELECT * FROM test1 WHERE MATCH('"test doc"/3')
SELECT * FROM test WHERE MATCH('@title hello @body world') OPTION ranker=bm25, max_matches=3000