
FAST ESP Search System. Technology Center, September 2009

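Contents: system overview, content acquisition, document processing, Index Profile, ranking, linguistic features, multi-node architecture, our FAST ESP deployment, discussion.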

System overview: system architecture, terminology.

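System architecture (diagram): content from web, DB and file sources is aggregated, goes through document processing into the indexing system, and is served by the search system through query and result processing to users. Administration Services are shown as a separate block.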

Terminology
- Content: the raw data, for example HTML pages, DB tables, Word documents.
- Document and Document Element.
- Index profile.
- Collection.
- Search Profile: made up of multiple collections.
- Row, Column.

Content acquisition: overview; Collections; Document Model; web content (Crawler); file and XML content (File Traverser); DB content (JDBC Connector).

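Feeding content
Ways to get content into the system:
- Enterprise Web Crawler
- File Traverser
- JDBC database connectors
- Third-party connectors
- Content API
Diagram: web, DB and file sources are read by the crawler, the File Traverser and the JDBC connectors (FAST packages bundled with FAST ESP). These, like other connectors, pass content through the Content API to the document processors.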

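Collections
- All acquired content goes into collections.
- A collection is a logical group of searchable documents.
- Collection properties: a name; membership in a single search cluster; a single document processing pipeline.
- One collection can have several data sources, but a given kind of data source can map to only one pipeline.
Diagram: web, DB and file sources go through the crawler, the JDBC connector and the File Traverser into a collection of documents.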

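Document Model
Content is processed into documents. Important document elements:
- url
- docid: initially the url
- contentid = docid
- internalid = md5(docid) + "_" + collectionName
- data: default element
- xml: default element
contentid and internalid stay unchanged while the document passes through other processors.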
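The internalid rule on the Document Model slide can be sketched in a few lines of Python. This is only an illustration: the slide does not say how docid is encoded or how the digest is rendered, so UTF-8 and a lowercase hex digest are assumptions here, and `internal_id` is a hypothetical helper name.

```python
import hashlib


def internal_id(docid: str, collection_name: str) -> str:
    """Build md5(docid) + "_" + collectionName as described on the slide.

    Assumptions (not stated in the slides): docid is UTF-8 encoded and the
    MD5 digest is written as lowercase hex.
    """
    digest = hashlib.md5(docid.encode("utf-8")).hexdigest()
    return f"{digest}_{collection_name}"


example = internal_id("abc", "videos")
```

Because the value depends only on docid and the collection name, it stays stable across pipeline stages, which matches the note that internalid does not change during processing.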

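Web content: Crawler
- Supports distributed crawling.
- Supports incremental crawling and updates.
- Supports re-fetching content from the crawler store.
- Request rate: 60s for off-site requests, 30s for on-site requests.
- Refresh interval: 1440s.
- Designed for crawling conventional web pages.
- Can crawl RSS.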

File Traverser (diagram): PDF files, XML files and other files are read by the File Traverser and sent to Document Processing, which fills the collections. A DB is shown next to the collections.

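File Traverser
- Can convert files that live on remote file servers.
- Supports conversion of PDF, ppt, Word, txt and similar files.
- Sends files to the Content API in batches.
- Can run standalone.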

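JDBC Connector (diagram): an SQL statement runs against the DB; the JDBC Connector turns the Result Set into documents, one document per row, and hands them to the content distributor; Document Processing then fills the collections.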
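The "one document per row" idea can be illustrated with a small Python sketch. It is not the FAST JDBC Connector: `sqlite3` stands in for the real database, and `rows_to_documents`, the `videos` table and the choice of `vid` as the id column are all hypothetical.

```python
import sqlite3


def rows_to_documents(connection, sql, id_column):
    """Run a query and turn every result row into one document (a dict)."""
    cursor = connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        document = dict(zip(columns, row))
        # Assumption: one column is chosen to provide the document id.
        document["docid"] = str(document[id_column])
        documents.append(document)
    return documents


# Stand-in database with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (vid INTEGER PRIMARY KEY, vtitle TEXT)")
conn.executemany("INSERT INTO videos VALUES (?, ?)", [(1, "first"), (2, "second")])
documents = rows_to_documents(conn, "SELECT vid, vtitle FROM videos ORDER BY vid", "vid")
```

Each dict could then be submitted as a single document; the column names become candidate document elements.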

Document processing: document processing system; content flow; Document, Collection, Pipeline, Stage; document processing stages; Entity Extraction.

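Document processing system (diagram): the Content API feeds the content distributor, which feeds the Document Processing Engine and its collections; processed documents go to the index, which is queried through Search, QR and SFE.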

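The document processing system consists of:
- Content Distributors: the Content API submits documents to a content distributor, and the content distributor dispatches them to the document processors.
- Document Processing Engine: processes the submitted documents and sends them to the indexer.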

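Content flow (diagram): Content API, then content distributor, then Document Processing Engine (Collection 1, Collection 2, ... Collection n, with customer-extended processors plugged in through an API), then index, then search API and SFE.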

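Document, Collection, Pipeline, Stage: Document
- A document is the entity that can be searched.
- It is a set of attributes. Attribute values are either set when the content is acquired or computed during document processing.
- Attributes can be mapped to fields in the index.
Diagram: document elements (Content API) become attributes (Document Processing), which become document fields in the index as defined by the index profile.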

Document, Collection, Pipeline, Stage: Collection (diagram showing Collection A and Collection B).

Document, Collection, Pipeline, Stage: Pipeline
- A pipeline is made up of stages.
- Each collection has exactly one pipeline.
- One pipeline can serve several collections.
Diagram: Content, Doc init, Doc Retrieval, ..., Gen fixml, Send To Indexer, Indexing.

Document, Collection, Pipeline, Stage: Stage
- Reads attribute values.
- Analyzes and computes.
- Modifies or sets attribute values.
- Can be extended.

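Entity Extraction
- Extracts entities of interest from a source attribute.
- Stores the result in a destination attribute.
- Methods: dictionary-based, regular expressions, Python code.
Example dictionary: 电影 (film), 电视剧 (TV series), 动漫 (animation), 动作 (action), 喜剧 (comedy), 剧情 (drama), 刘德华, 周星
- Title: 天若有情 - 主演：刘德华 gives Tag: 刘德华
- Title: 电视剧 - 李小龙传奇 gives Tag: 电视剧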
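The dictionary-based method from the Entity Extraction slide can be sketched as a tiny processing step. This is not a real FAST ESP stage: `extract_entities`, the attribute names `title` and `tag`, and the plain-dict document are hypothetical, and the dictionary only contains the complete terms visible on the slide.

```python
# Hypothetical dictionary, built from the example terms on the slide.
ENTITY_DICTIONARY = ["电影", "电视剧", "动漫", "动作", "喜剧", "剧情", "刘德华"]


def extract_entities(document, source="title", target="tag",
                     dictionary=ENTITY_DICTIONARY):
    """Return a copy of the document with dictionary hits from `source`
    stored in `target` (simple substring lookup, in dictionary order)."""
    text = document.get(source, "")
    result = dict(document)
    result[target] = [term for term in dictionary if term in text]
    return result


doc = extract_entities({"title": "天若有情 - 主演：刘德华"})
```

A regular-expression or custom Python variant would follow the same shape: read the source attribute, compute the entities, write the destination attribute.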

Index Profile: overview; document processing related settings; search processing system; query and result processing related settings.

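Overview
- The index profile is the "schema" of the index defined for a search cluster.
- It contains the fields.
- It configures document processing properties (tokenization, lemmatization).
- It configures search properties (which fields can be searched, and how).
- It configures result processing properties (navigation, result view, sort, ranking).
- Each cluster has exactly one configuration file, in XML format.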

Overview: skeleton of an index profile
<index-profile name="default">
  <field-list></field-list>
  <geo-specification></geo-specification>
  <scope-field-list></scope-field-list>
  <composite-field></composite-field>
  <rank-profile></rank-profile>
  <result-specification></result-specification>
</index-profile>

Document processing related: Fields
Field properties cover linguistics, searching, sorting and result display.
Type: string / int32 / uint32 / geo / float / double / datetime
Example:
<field name="title" fullsort="yes" tokenize="auto|delimiters" lemmatize="yes">
  <vectorize default="10:0"/>
</field>
<field name="size" type="int32" fullsort="yes"/>

Document processing related: Scope field
A scope field corresponds to a structured XML document with a hierarchical layout. Type: text
Example input:
<book>
  <chapter>
    <id type="int32">1</id>
    <heading>act 1</heading>
    <sentence>sen1</sentence>
    <sentence>sen2</sentence>
  </chapter>
</book>
The document processor maps it to the xml field:
xml:
  book: (string)
    chapter: (string)
      id: 1 (type=int32)
      heading: "act 1" (string)
      sentence: "sen1" (string)
      sentence: "sen2" (string)

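Search processing system (diagram): an HTTP client sends a query plus parameters to the SFE and receives text/xml results; an API client sends a query plus parameters to the Search API and receives enhanced results. Both go through the Query and Result Server, which runs the query processing pipeline, sends the query to the search engine and its index, and runs the result processing pipeline on the results.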

Query and result processing related: field properties
- Index
- Max-index-size
- Boundary-match
- Separator
- Wildcard: prefix | full
  - Prefix: fas*
  - Full: fas* / *ash / f?st
- Substring, for example substring=6: query summer matches document term midsummer
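The wildcard examples on this slide can be checked with ordinary glob matching. The sketch below only illustrates what the patterns mean; it uses Python's `fnmatch` module (which also accepts `[seq]` patterns that the slide does not mention) and says nothing about how the search engine evaluates wildcards internally. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, term):
    """True if `term` matches a pattern using * (any run) and ? (one char)."""
    return fnmatchcase(term, pattern)


def substring_match(query, term):
    """True if the query occurs anywhere inside the indexed term."""
    return query in term


prefix_hit = wildcard_match("fas*", "fast")             # prefix wildcard
suffix_hit = wildcard_match("*ash", "flash")            # needs "full" wildcard support
single_hit = wildcard_match("f?st", "fast")             # ? stands for exactly one character
substring_hit = substring_match("summer", "midsummer")  # substring search
```

The prefix setting covers only the first pattern; the second and third need the full setting.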

Query and result processing related: Composite Field
<composite-field name="content" default="yes" query-tokenize="auto" lemmas="yes">
  <field-ref name="title" level="4" field-separation-length="256"/>
  <field-ref name="body" level="1" field-separation-length="256"/>
  <field-ref name="description" level="2" field-separation-length="256"/>
  <field-ref name="urlkeywords" level="3" field-separation-length="256"/>
  <field-ref name="keywords" level="3" field-separation-length="256"/>
  <field-ref name="anchortext" type="external" level="5" field-separation-length="256"/>
</composite-field>

Query and result processing related: sort properties
- Sort: yes | no. Uses the first 4 characters.
- Fullsort: yes | no | latent
- Ranking
- Result view
- Result filters: remove duplicate records.

Query and result processing related: summary
- Composite Field (the same "content" definition as on the earlier Composite Field slide)
- Sort and ranking
- Result view
- Navigator: String Navigator, Numeric Navigator

Ranking: Rank Profile; relevancy terminology; relevancy formula.

Rank Profile
<rank-profile name="default" rank-model="default" default="no" stop-word-threshold="2E6" position-stop-word-threshold="2E7">
  <quality weight="500" field-ref="uuseedocquality"/>
  <authority weight="80" field-ref="anchortext"/>
  <freshness weight="0" field-ref="uuseeupdatedate" auto="yes"/>
  <composite-rank composite-field-ref="content">
    <proximity weight="50"/>
    <context weight="50">
      <field-weight field-ref="body" value="5"/>
      <field-weight field-ref="description" value="30"/>
      <field-weight field-ref="urlkeywords" value="40"/>
      <field-weight field-ref="keywords" value="50"/>
      <field-weight field-ref="title" value="60"/>
    </context>
  </composite-rank>
</rank-profile>

Relevancy Terminology (term: description)
- Proximity: for multi-term queries, the shorter the distance between query terms in a document, the higher the document's rank value.
- Context: importance of matching a query in a given document field.
- Geo: importance of geographical distance between a document's associated latitude/longitude and a target location specified in a query.
- Quality: assigned importance of a document, independent of the query.
- Authority: importance of a document determined by the links to it from other documents.
- Freshness: age of a document compared to the time when the query is issued.

Relevancy Terminology: additional statistics used when computing context and proximity (term: description)
- Completeness: the greater the number of query terms present in the same field of a matching document, the higher the document's rank value.
- Frequency: the more frequently a query term occurs in the document (term frequency, TF) relative to the term's frequency in the index (inverse document frequency, IDF), the higher the document's rank value.
- Position: the earlier a query term occurs in a field, the higher the document's rank value.

Relevancy Formula
R(d,q) = S(d) + F(d,T) + D(d,q)
- R: rank value of document d for query q
- S: static rank value of document d, independent of the query
- F: freshness of document d at time T
- D: dynamic rank

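Relevancy Formula: static rank S(d) in R(d,q) = S(d) + F(d,T) + D(d,q)
S(d) = (boost_coefficient * w_quality / 100) * static_rank_field(d)
- boost_coefficient: 2 by default
- w_quality: the weight of quality
- static_rank_field: has a default value; it is set when the document is indexed and can be changed through SBC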
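As a quick arithmetic check of the static rank term, here is a minimal Python sketch. The function name and the sample value 7 for static_rank_field are made up for illustration; the quality weight 500 is the one from the rank profile shown earlier, and 2 is the default boost coefficient stated on the slide.

```python
def static_rank(static_rank_field, w_quality, boost_coefficient=2):
    """S(d) = (boost_coefficient * w_quality / 100) * static_rank_field(d)."""
    return (boost_coefficient * w_quality / 100) * static_rank_field


# With w_quality = 500 and the default boost of 2, the field value is
# multiplied by 10.
s = static_rank(static_rank_field=7, w_quality=500)
```

A quality weight of 0 removes the static component entirely, so the rank then depends only on freshness and the dynamic part.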

Relevancy Formula: freshness F(d,T) in R(d,q) = S(d) + F(d,T) + D(d,q)
F(d,T) = (w_freshness / 100) * fn(time scale, document age)
- w_freshness: the weight of freshness
- document age = current time - document time, in minutes

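Relevancy Formula: dynamic rank D(d,q) for a single-term query, in R(d,q) = S(d) + F(d,T) + D(d,q)
D(d,q) = (fn(FO) + fn(NO) + W_authority/100 * fn(ExtNO) + single_boost * W_context/100 * sum(W_fieldN/100)) / fn(num_matching_docs)
- fn(FO): based on the position of the first occurrence of the query term in the document
- fn(NO): based on the number of occurrences of the query term in the document
- W_authority: the weight of authority
- fn(ExtNO): based on the number of occurrences in the authority-related field
- single_boost: boost coefficient for single-term queries
- W_context: the weight of context
- W_fieldN: the weight of field N within the context
- fn(num_matching_docs): based on the number of documents matching the term on a single search node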
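The single-term formula can be written out as code to make the grouping of terms explicit. The slides do not define the fn(...) functions, the value of single_boost, or which fields enter the sum, so this sketch takes already-evaluated fn values and the list of field weights as inputs. All numbers in the example are invented, except the weights 80, 50 and 60, which come from the rank profile shown earlier.

```python
def single_term_dynamic_rank(fn_fo, fn_no, fn_ext_no, fn_num_matching_docs,
                             w_authority, w_context, field_weights,
                             single_boost):
    """Evaluate the single-term D(d,q) from pre-computed fn(...) values."""
    context_part = single_boost * w_context / 100 * sum(w / 100 for w in field_weights)
    numerator = fn_fo + fn_no + w_authority / 100 * fn_ext_no + context_part
    return numerator / fn_num_matching_docs


# Hypothetical inputs: the term is counted in the title field only (weight 60).
d = single_term_dynamic_rank(
    fn_fo=10, fn_no=5, fn_ext_no=10, fn_num_matching_docs=2,
    w_authority=80, w_context=50, field_weights=[60], single_boost=1,
)
```

With these inputs the numerator is 10 + 5 + 8 + 0.3 = 23.3, and dividing by 2 gives roughly 11.65.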

Relevancy Formula: dynamic rank D(d,q) for a multi-term query, in R(d,q) = S(d) + F(d,T) + D(d,q)
D(d,q) = D(d,q1) + D(d,q2) + … + W_context/100 * fn(common context) + fn(operator) + W_proximity/100 * fn(term proximity)
- fn(common context): how many terms are found in the same context
- fn(operator): used with the OR / ANY / NEAR / ONEAR operators
- W_proximity: the weight of proximity
- fn(term proximity)

Linguistic features: FAST ESP linguistic features; CJK languages.

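FAST ESP linguistic features
- Automatic detection of 79 languages.
- Advanced features for 30 languages, including Chinese.
- Tokenization: word segmentation.
- Character normalization: normalizing symbols.
- Anti-phrasing and stopwords: removing stop words.
- Phonetic search: searching by sound.
- Entity extraction: e-mail addresses, person names, place names and so on.
- Proper name or phrase recognition: French Open, John Lervik.
- Synonyms: car, automobile.
- Lemmatization: go, goes, going, went, gone.
- Spellchecking: sarsh -> search.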

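FAST ESP linguistic features: trade-offs
- Benefit: a better user experience.
- Costs: longer indexing time, longer content processing time, more disk usage.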

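FAST ESP linguistic features: where each feature is applied
- At content processing time: lemmatization, synonyms, tokenization, entity extraction. This enlarges the index and saves query time, but documents have to be reprocessed.
- At query time: lemmatization, synonyms, spell checking, anti-phrasing and stopwords, proper name and phrase recognition. This does not affect the index and needs no reprocessing, but adds to the query (QPS) load.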

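CJK languages: definition
- CJK: Chinese, Japanese, Korean.
- English uses spaces or punctuation to separate words; Chinese, Japanese and Korean have no explicit word delimiters.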

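CJK languages: character normalization
- Full-width to half-width: Ｆａｓｔ　ＥＳＰ　５．０ -> Fast ESP 5.0
- Half-width to full-width: ｶｷﾞ -> カギ
- Chinese numerals to Arabic numerals: 一万五千 -> 15000
- Variant characters to standard characters: 羣 -> 群, 歎 -> 嘆
- Simplified and traditional interchange: 中國對外經濟貿易 <-> 中国对外经济贸易
Query keywords by stage:
- Original query: 軟體 金山詞霸
- After qt_synonym: 軟體 软件 金山詞霸
- After character normalization: 軟體 软件 金山詞霸 词霸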
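The two width conversions on the character normalization slide happen to coincide with Unicode compatibility normalization, which Python exposes through `unicodedata`. The sketch below is only an analogy: the slides do not say that FAST ESP uses NFKC, and NFKC does not cover the other rules on the slide (Chinese numerals, variant characters, simplified and traditional conversion). `normalize_width` is a hypothetical helper name.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin/digits to ASCII and half-width kana to
    standard kana using Unicode NFKC (an assumption, see lead-in)."""
    return unicodedata.normalize("NFKC", text)


# "\uff26\uff41\uff53\uff54..." is "Fast ESP 5.0" written in full-width characters.
latin = normalize_width("\uff26\uff41\uff53\uff54\u3000\uff25\uff33\uff30\u3000\uff15\uff0e\uff10")
# "\uff76\uff77\uff9e" is half-width katakana ka, ki plus a voicing mark.
kana = normalize_width("\uff76\uff77\uff9e")
```

Applying the same normalization to both documents and queries is what makes a full-width query match half-width content.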

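CJK languages: Chinese word segmentation rules
- Person names: the family name plus the first character of the given name are split off as one word. Document: 李小龙. Query 李小 finds 李小龙; query 小龙 is a miss.
- Particles are treated as separators, for example 的, 们, 地, 了, 过. Documents: 老師的, 同学们, 簡略地, 生產了, 取消過. Query 同学 finds 同学们.
- Complements are treated as separators, for example 表達出來, 使用下去, 確認得了, 漂亮極.
These grammatical features are supported by default and do not need the lemmatization stage.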

CJK languages: substring
Tokens produced for 中华人民共和国:
1. substring=1: 中 华 人 民 共 和 国
2. substring=2: 中华 华人 人民 民共 共和 和国
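The substring example corresponds to overlapping character n-grams. A minimal Python sketch, assuming the substring value is simply the n-gram length (the helper name is hypothetical):

```python
def substring_tokens(text, n):
    """Return all overlapping character n-grams of `text`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


unigrams = substring_tokens("中华人民共和国", 1)
bigrams = substring_tokens("中华人民共和国", 2)
```

Larger values produce fewer, longer tokens, so the setting trades index size against how short a query fragment can still match.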

Multi-node architecture: definitions; multi-node architecture; Index Partitions mechanism.

Definitions: Fault Tolerance
- The ability to keep providing service when software or hardware fails.
- Modes: fail safe, fail soft, fail stop.

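Definitions: sizing factors
- Content volume: the amount of content the system can handle.
- Query rate (QPS): the number of queries the system must handle per second.
- Content dynamics: the number of documents added and deleted per day.
- Search latency: the time from submitting a query to the QR server until the results come back.
- Index latency: the time from submitting a document until it can be searched.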

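Multi-node architecture (diagram): content acquisition, document processing subsystem, indexing subsystem, search subsystem, query and result processing subsystem, search users; Admin components are shown separately.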

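Multi-node architecture: document processing subsystem
- Running several components increases reliability and document processing capacity.
- A content component can be connected to only one content distributor at a time.
- Each document processor talks to only one content distributor at a time.
- If a content distributor goes down, the setup is reconfigured dynamically.
Diagram: two content distributors, each with several document processors.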

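Multi-node architecture: indexing subsystem
- Index dispatcher: distributes documents to the different columns. Running several dispatchers increases reliability and throughput.
- Indexers are arranged as a matrix. More columns increase the number of documents that can be held and improve indexing performance; more rows provide reliability.
Diagram: content distributor, index dispatcher, several indexers.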

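Multi-node architecture: indexing subsystem, master and backup indexers
- Master indexer: each column has exactly one master indexer. Search communicates only with the master indexer.
- Backup indexer: only stores FiXML and the indexing operations.
- Failover: the backup is asked to build the index (fail soft).
Diagram: content distributor, index dispatcher, master indexers and backup indexers.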

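Multi-node architecture: indexing subsystem, index dispatcher
- The index dispatcher submits document operations to the master indexer of the column.
- If more indexers have failed than a configurable limit allows, the operation is not carried out.
- The master schedules indexing and activates a new index only when all rows report ready.
Diagram: content distributor, index dispatcher, one master indexer and two backup indexers.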

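Multi-node architecture: search and QR subsystem
- Search nodes form a matrix. Rows give reliability and query performance; columns give document capacity, and there are as many search columns as index columns.
- The top-level dispatcher (QR) load-balances across rows, sends each query to every column of one row, and merges the results.
Diagram: top-level dispatcher (QR) above search nodes R0C0, R0C1, R0C2, R1C0, R1C1, R1C2.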
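The behaviour of the top-level dispatcher (pick one row, query every column in it, merge) can be sketched as follows. This is a toy model under stated assumptions: round-robin is used as the load-balancing policy, search nodes are plain Python callables returning `(score, doc_id)` pairs, and merging means keeping the highest-scoring hits. None of these details come from the slides.

```python
import heapq
from itertools import cycle


class TopLevelDispatcher:
    """Toy dispatcher over a rows x columns grid of search nodes."""

    def __init__(self, grid):
        self.grid = grid                       # grid[row][column](query) -> [(score, doc_id)]
        self._rows = cycle(range(len(grid)))   # assumed policy: round-robin over rows

    def search(self, query, hits=10):
        row = self.grid[next(self._rows)]
        # Every column holds a different slice of the documents, so all
        # columns of the chosen row must be asked.
        partial = [hit for node in row for hit in node(query)]
        return heapq.nlargest(hits, partial)


calls = []


def make_node(name, results):
    def node(query):
        calls.append(name)
        return results
    return node


grid = [
    [make_node(f"R{r}C0", [(0.9, "a")]),
     make_node(f"R{r}C1", [(0.5, "b")]),
     make_node(f"R{r}C2", [(0.7, "c")])]
    for r in range(2)
]
dispatcher = TopLevelDispatcher(grid)
first = dispatcher.search("fast", hits=2)
second = dispatcher.search("fast", hits=2)
```

Both rows return the same answer because they mirror each other; the second query simply lands on the other row, which is where the extra query capacity and the redundancy come from.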

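Multi-node architecture: search and QR subsystem, multiple QR servers
- Several QR servers give reliability and load balancing.
Diagram: search users reach the top-level dispatchers (QR) through a load balancer; below them are search nodes R0C0, R0C1, R0C2, R1C0, R1C1, R1C2.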

Multi-node architecture: index deployment
- Indexer and search on the same node.
- Indexer and search on different nodes.
Diagram: master and backup indexers with their search nodes in both layouts.

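Multi-node architecture: admin subsystem
- The Name Service and the License Manager can be made fault tolerant.
- The other components do not support fault tolerance, so single points of failure exist.
Components: CORBA Name Service, License Manager, Resource Service, Log Transformer, Log Server, Config Server, Cache Manager, Admin Server, Relbench, Storage Service, Web Server.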

Index Partitions mechanism
- Each indexer or search node uses multiple index partitions.
- Each partition holds indexed documents.
- Partitions are mirrored to permit incremental indexing and continuous search.
- Content is indexed into the smallest partition.
- Content is merged into a higher partition when a trigger condition is met.
Example with partitions 0, 1 and 2: docsDistributionPst: 100, 100, 100; trigger conditions: 10000, 1000000; 2 : 6
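The flow "index into the smallest partition, merge upwards when a trigger fires" can be modelled with a few lines of Python. This is a simplified model, not the actual indexer logic: it assumes the trigger values are document-count thresholds, that a partition is emptied completely when it reaches its threshold, and it ignores mirroring. The thresholds 10000 and 1000000 are the trigger values from the slide.

```python
def index_documents(partition_sizes, triggers, new_docs):
    """Add new_docs to partition 0, then cascade merges upwards.

    partition_sizes: document counts per partition, smallest first.
    triggers: threshold per partition except the last (assumed semantics).
    """
    sizes = list(partition_sizes)
    sizes[0] += new_docs
    for level, trigger in enumerate(triggers):
        if sizes[level] >= trigger:
            sizes[level + 1] += sizes[level]   # the higher partition is rebuilt
            sizes[level] = 0
    return sizes


TRIGGERS = [10000, 1000000]
step1 = index_documents([0, 0, 0], TRIGGERS, 9999)
step2 = index_documents(step1, TRIGGERS, 1)
step3 = index_documents([0, 995000, 0], TRIGGERS, 10000)
```

In the last step a single batch causes two merges in a row, which illustrates why merges into the large partitions are the expensive events.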

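Index Partitions: optimization
- Keep the smallest partition as empty as possible, because indexing new content means re-indexing the smallest partition.
- Give the smallest partition a low trigger condition so that it moves its content to the higher partition quickly.
- But: moving content to a higher partition means the higher partition is rebuilt.
- Build the smallest partition on a RAM disk.
Diagram: partitions 0, 1, 2.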

Index blacklisting
- Blacklisting is used to speed up document removal.
- The blacklisting index contains only the ids of deleted documents.
- Search: the query is automatically rewritten as "term AND NOT in blacklist".
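The rewritten query "term AND NOT in blacklist" amounts to a set difference between the matching ids and the blacklisted ids. A minimal sketch, with a plain dict as the inverted index and made-up entry ids:

```python
def search(postings, blacklist, term):
    """Return ids that match the term and are not blacklisted."""
    return postings.get(term, set()) - blacklist


# Hypothetical index content: "A-1" is an old entry that has been deleted
# logically but not yet removed from its partition.
postings = {"fast": {"A-1", "A-2", "B"}}
blacklist = {"A-1"}
visible = search(postings, blacklist, "fast")
```

The deleted entry disappears from results immediately, while the physical removal waits until the partition that holds it is rebuilt.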

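Index blacklisting: example
- Document A (version A-1) has been indexed and sits in partition 2.
- Document A is modified and resubmitted (version A-2). The new version lands in partition 0, so the index now holds two copies of Document A.
- Until the blacklist index is rebuilt, search results contain both copies of Document A.
- After the blacklist index is rebuilt, search results no longer contain A-1, although it is still in the index.
- When A-2 moves to partition 2 and partition 2 is rebuilt, A-1 is removed from the index. FiXML still holds both versions of Document A.
Diagram: partitions 0, 1, 2 with Document A-2 and Document A-1.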

Our FAST ESP: deployment architecture; data sources; index fields.

Deployment architecture (diagram): an admin node plus nodes search1, search2, search3, search4, search5, search9 and search10; the boxes are labelled RI, RS, Ind, QR, CP, DP and RT.

Deployment architecture (diagram, second layout): boxes labelled Ind, DP, QR, RS, RI and CP.

Data sources
- Youku: 20 million
- Tudou: 14 million
- ku6: 30 million
- sina: 5 million
- sohu: 2 million
- cctv: a few hundred thousand
- 56: 4 million

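Index fields: UGC, existing fields (index field: name; remarks)
- vid: unique identifier
- vvid: original ID
- vevid: original encoded ID
- vtitle: video title; sortable
- vlogo: video image
- vcategorys: category; can be used for navigation
- vplayurl: URL of the video playback page
- vlength: video length; sortable
- vtags: tags; can be used for navigation
- vsourcesite: source site; can be used for navigation
- uuseeupdate: refresh time
- vuploadusername: uploading user; used when building albums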

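Index fields: UGC, planned additional fields (index field: name; remarks)
- vchannel: channel; can be used for navigation
- Vfav: number of favorites; used to compute weight, sortable
- vpageview: number of views; used to compute weight, sortable
- vcomment: number of comments; used to compute weight, sortable
- vlink: number of references; used to compute weight, sortable
- vuploaduserid: ID of the uploading user; used when building albums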

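Index fields: albums, existing fields (index field: name; remarks)
- plid: unique identifier
- plbaikeid: encyclopedia (baike) ID
- pltitle: title; sortable
- pllogo: image
- plcategorys: type; can be used for navigation
- plplayurl: playback URL of the first video; for VOD this is the file GUID
- pllength: total length; sortable
- pltags: tags; can be used for navigation
- plsourcesite: source site; can be used for navigation
- plvideoinfo: video information
- plvideotitles: video titles
- pltype: category; VOD or UGC
- plchannel: channel; can be used for navigation
- plvideocount: number of videos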
