In Search Of...
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
integrating site search
Friday, 29 October 2010
How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions
Documents -> Analyser -> Index
Queries -> Query Parser -> Index -> Results
With AT&T's help, the F.B.I. Miami-Dade office had recovered
$1.1 million from O'Healy's Ponzi scheme, 10-15% more than
expected.
Tokenisation

PHP Tokenisation
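function tokenise($string) {
    // Lower-case first so that matching is case-insensitive.
    $string = strtolower($string);
    // PREG_OFFSET_CAPTURE makes each match an array(token, byte offset).
    preg_match_all('/\w+/', $string,
        $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}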
Document Term Pairs
Document ID Term
1 the
1 best
1 of
1 the
... ...
204 and
204 what
204 would
Inverted Index
Term     Documents
best     1 (4, 16), 4 (422), 129 (344) ...
what     24 (50, 98), 75 (33, 208) ...
would    99 (32, 599), 201 (344) ...
...      ...
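A minimal sketch of how such an index could be built in PHP. The function name buildIndex and the two sample documents are assumptions made for illustration, and positions here are token positions rather than byte offsets.

```php
<?php
// Hypothetical helper: builds term => [docID => [token positions]].
function buildIndex(array $documents): array {
    $index = array();
    foreach ($documents as $docID => $text) {
        // Tokenise on word characters after lower-casing.
        preg_match_all('/\w+/', strtolower($text), $matches);
        foreach ($matches[0] as $position => $term) {
            // Append this position to the term's posting for this document.
            $index[$term][$docID][] = $position;
        }
    }
    ksort($index); // keep the terms in dictionary order
    return $index;
}

// Sample documents, invented for this example.
$index = buildIndex(array(
    1 => 'The best of the best',
    2 => 'What would be best',
));
print_r($index['best']);
```

Each term maps to the documents that contain it, and each document entry lists where the term occurs, which is the same shape as the table above.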
Boolean Query Merge
Query: Best Western Hotel
Result: Document 298
best 1 4 129 298 305 338
western 4 95 194 204 298 305
hotel 2 40 200 298 355 402
working 4 298 305
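The merge can be sketched as a pairwise intersection of sorted posting lists. The function name intersect is an assumption; the posting lists are the ones shown on the slide.

```php
<?php
// Intersect two posting lists that are already sorted by document ID.
function intersect(array $a, array $b): array {
    $result = array();
    $i = 0;
    $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            // Present in both lists: keep it and advance both pointers.
            $result[] = $a[$i];
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++; // advance whichever list is behind
        } else {
            $j++;
        }
    }
    return $result;
}

$postings = array(
    'best'    => array(1, 4, 129, 298, 305, 338),
    'western' => array(4, 95, 194, 204, 298, 305),
    'hotel'   => array(2, 40, 200, 298, 355, 402),
);

// Working set after the first two terms, then the final result.
$working = intersect($postings['best'], $postings['western']);
$result  = intersect($working, $postings['hotel']);
print_r($result);
```

Because both lists are sorted, each merge is a single pass over the two lists, and the working set can only shrink as further terms are merged in.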
TF-IDF
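function getWeight($docID, $term, $total) {
    // $term is assumed to be one term's postings: docID => positions.
    // Term frequency: occurrences of the term in this document.
    $tf = count($term[$docID]);
    // Inverse document frequency: rarer terms weigh more.
    $idf = log($total / count($term), 2);
    return $tf * $idf;
}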
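A small worked example of the same weighting, as a sketch. The function name tfidf, the collection size of 12 documents, and the guard for a document that lacks the term are assumptions; the postings for "best" follow the inverted index slide.

```php
<?php
// Hypothetical helper: $postings maps docID => positions for a single term.
function tfidf(array $postings, int $docID, int $totalDocs): float {
    if (!isset($postings[$docID])) {
        return 0.0; // the term does not occur in this document
    }
    $tf  = count($postings[$docID]);                  // occurrences in this document
    $idf = log($totalDocs / count($postings), 2);     // log2(total docs / docs with term)
    return $tf * $idf;
}

// "best" occurs twice in document 1 and once each in documents 4 and 129.
$best = array(1 => array(4, 16), 4 => array(422), 129 => array(344));

$weightDoc1 = tfidf($best, 1, 12); // tf = 2, idf = log2(12 / 3) = 2
$weightDoc4 = tfidf($best, 4, 12); // tf = 1, same idf
$weightDoc2 = tfidf($best, 2, 12); // document 2 does not contain the term

echo round($weightDoc1, 3), " ", round($weightDoc4, 3), " ", $weightDoc2, "\n";
```

A term that appears in every document gets an idf of log2(1) = 0, so it contributes nothing to the ranking however often it occurs.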
Document Vector
socket what heavy steel ...
Doc 1 0.02 0.3 0.001 0 ...
Doc 2 0 0 0 0 ...
Doc 3 0.001 0.2 0 0 ...
Doc 4 0 0 0.002 0.003 ...
best 23 42 179 246 333 703
weight 0.008 0.002 0.023 0.039 0.014 0.001
western 42 88 120 179 246 798
weight 0.003 0.004 0.023 0.001 0.034 0.004
1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
Ranked Query Merge
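The ranked merge above can be reproduced with a short sketch: add up the weights each document receives from the query terms, then sort by the total. The function name rankedMerge is an assumption; the weights are the ones on the slide.

```php
<?php
// Sum the per-term weights for every document, then sort best first.
function rankedMerge(array $termWeights): array {
    $scores = array();
    foreach ($termWeights as $weights) {
        foreach ($weights as $docID => $weight) {
            // Documents matching any query term get a score (OR semantics).
            $scores[$docID] = ($scores[$docID] ?? 0) + $weight;
        }
    }
    arsort($scores); // sorts in place, keeping the document IDs as keys
    return $scores;
}

$scores = rankedMerge(array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
));

// Keep the three best documents, preserving their IDs.
$top = array_slice($scores, 0, 3, true);
print_r($top);
```

Unlike the boolean merge, a document does not need to contain every term: document 120 ranks third on the strength of "western" alone.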
PHP Similarity
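function score($queryString, $index) {
    $query = tokenise($queryString);
    $matches = array();
    foreach($query as $qterm) {
        // tokenise() returns array(token, offset) pairs.
        $term = $qterm[0];
        if(!isset($index[$term])) {
            continue; // term is not in the index
        }
        foreach($index[$term] as $id => $posting) {
            if(!isset($matches[$id])) {
                $matches[$id] = 0;
            }
            $matches[$id] += $posting['score'];
        }
    }
    // arsort() sorts in place and returns a boolean, so return the array.
    arsort($matches);
    return $matches;
}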
Integrating Search
CREATE TABLE example (
id INT(11) NOT NULL auto_increment,
title VARCHAR(255),
content TEXT,
PRIMARY KEY(id),
FULLTEXT(title,content)
) Engine=MyISAM;
INSERT INTO example (title, content) VALUES
('Mikko & Bacon','Mikko loves bacon'),
('Marcello & Bacon','Marcello hates bacon'),
('Jo & Sausages','Johanna loves sausages'),
('Hollywood & Garlic','Lorenzo hates garlic'),
('James & Cheddar','James is keen on cheeses');
MySQL Full Text Search
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');
+----+------------------+------------------------+
| id | title | content |
+----+------------------+------------------------+
| 1 | Mikko & Bacon | Mikko loves bacon |
| 2 | Marcello & Bacon | Marcello hates bacon |
| 3 | Jo & Sausages | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)
Sphinx
http://www.sphinxsearch.com
Sphinx Configuration
source posts
{
type = mysql
sql_host = localhost
sql_user = user
sql_pass = password
sql_db = search
sql_query = \
    SELECT id, title, content FROM example;
sql_attr_multi = uint tag from query; \
    SELECT example_id, tag_id FROM tags;
}
index posts
{
source = posts
path = /var/data/sphinx/example
morphology = stem_en
min_word_len = 3
min_prefix_len = 3
min_infix_len = 0
enable_star = 1
}
Stemming
happening -> happen
happened -> happen
happens -> happen
http://tartarus.org/~martin/PorterStemmer
Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon
displaying matches:
1. document=1, weight=3, tag=(1,2)
! id=1
! title=Mikko & Bacon
! content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits
searchd --config /etc/sphinx.conf
Sphinx From PHP
$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);
$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);
Swish-E
http://swish-e.org
pecl install swish-beta
Filesystem Index With Swish-E
IndexDir /var/data/documents
IndexFile fs-swish-e.index
IndexOnly .doc .docx .pdf
FuzzyIndexingMode Stemming_en1
FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl
fs-swish-e.conf
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf
Crawling Content
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile www-swish-e.index
SwishProgParameters default http://phpir.com/
FuzzyIndexingMode Stemming_en1
DefaultContents HTML
www-swish-e.conf
/usr/local/bin/swish-e -S prog -c www-swish-e.conf
Swish-E With Multiple Indices
$swish = new Swish(
'www-swish-e.index fs-swish-e.index'
);
$search = $swish->prepare();
$queryStr = 'search string goes here';
$result = $search->execute($queryStr);
$total = $result->hits;
while($r = $result->nextResult()) {
echo $r->swishdocpath; // url
}
Lucene
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
Zend_Search_Lucene_Field::Text(
'title', $title));
$doc->addField(
Zend_Search_Lucene_Field::UnStored(
'content', $content));
$index->addDocument($doc);
}
Build Index
$results = $index->find('loves bacon');
foreach($results as $result) {
echo $result->score, " ";
echo $result->title, "\n";
}
Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon
Query Zend Search Lucene
$file = file_get_contents($url);
$doc = Zend_Search_Lucene_Document_Html::
    loadHTML($file);
$doc->addField(
    Zend_Search_Lucene_Field::Text(
        'url', $url
    ));
$index->addDocument($doc);
Index HTML
Solr
http://lucene.apache.org/solr/
Solr Search Index
$options = array( 'hostname' => 'localhost',
'port' => 8983 );
$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();
Solr Search Client
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();
foreach($r['response']['docs'] as $d) {
echo $d->title[0] . "\n";
}
Xapian
http://xapian.org
Xapian In PHP
$db = new XapianWritableDatabase(
'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));
$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);
$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);
Xapian Search In PHP
$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);
$enquire->set_query($query);
$matches = $enquire->get_mset(0, 10);
$i = $matches->begin();
while(!$i->equals($matches->end())) {
$n = $i->get_rank() + 1;
$data = $i->get_document()->get_data();
$title = $i->get_document()->get_value(1);
$score = $i->get_percent();
$i->next();
}
Improving Results
Anchor Text
$p = file_get_contents('http://phpir.com');
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p);
$links = $dom->getElementsByTagName('a');
foreach($links as $link) {
$href = $link->getAttribute('href');
$text = $link->nodeValue;
}
Parse Anchor Text
Zone Weighting
ZSL Zone Weighting
$doc = new Zend_Search_Lucene_Document();
$tfield = Zend_Search_Lucene_Field::Text
('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);
$doc->addField(
Zend_Search_Lucene_Field::UnStored
('content', $content));
$index->addDocument($doc);
Document Authority
Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
Zend_Search_Lucene_Field::Text
('title', $title));
$doc->addField(
Zend_Search_Lucene_Field::UnStored
('content', $content));
$doc->boost = 1 + ($numComments / 100);
$index->addDocument($doc);
Using Search
Summaries & Highlighting
Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query($q);
foreach ($r["matches"] as $doc => $info) {
$text[$doc] = getTextFromDB($doc);
}
$extracts = $cl->BuildExcerpts($text, 'posts', $q);
foreach($extracts as $extract) {
echo $extract;
}
Xapian Spelling Correction
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
XapianTermGenerator::FLAG_SPELLING);
Indexer
$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
$q->get_corrected_query_string() . "\n";
Searcher
Spelling Correction Output
php xapsearch.php
Did you mean: str_replace or strcmp
4644 results found for strreplace or str_cmp:
1: 2% docid=572
[phpdocs/html/cc.license.html]
2: 2% docid=7169
[phpdocs/html/imagick.constants.html]
3: 2% docid=10086
[phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132
[phpdocs/html/function.swf-posround.html]
Results Sorting
Sorting in ZSL
$q = Zend_Search_Lucene_Search_QueryParser::
parse('search string');
$results = $index->find($q, 'title');
foreach($results as $result) {
echo '<h3>', $result->title, "</h3>\n";
$doc = getDocumentFromDB($result->did);
echo
$q->htmlFragmentHighlightMatches($doc);
}
Faceted Search
Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
// Enable faceting before the query is sent.
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach($f['cat'] as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);
$t = $e->begin();
for($t; !$t->equals($e->end()); $t->next()){
$qs[] = new XapianQuery($t->get_term(),
intval($t->get_weight()));
}
$query = new XapianQuery(
XapianQuery::OP_OR, $qs);
More Like This Example
php xapsim.php
1656 results found:
1: 100% docid=5959
[phpdocs/html/function.str-replace.html]
2: 47% docid=5956
[phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328
[phpdocs/html/function.preg-replace.html]
4: 18% docid=5958
[phpdocs/html/function.str-repeat.html]
Search Performance
Index Updates
Docs -> Main
New docs -> Delta
Query -> Delta + Main
Delta + Main -> Main
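A rough sketch of the main and delta scheme using plain PHP arrays as stand-ins for real indices. The function names searchBoth and mergeDelta and the term => [docID => score] layout are assumptions made for illustration. A real engine also has to handle documents that were deleted or no longer contain a term, which this sketch ignores.

```php
<?php
// Query both indices; a document re-indexed into the delta
// replaces its older entry from the main index.
function searchBoth(string $term, array $main, array $delta): array {
    // Array union keeps the left-hand (delta) value when a docID is in both.
    $hits = ($delta[$term] ?? array()) + ($main[$term] ?? array());
    arsort($hits);
    return $hits;
}

// Fold the delta into the main index, after which the delta can be emptied.
function mergeDelta(array $main, array $delta): array {
    foreach ($delta as $term => $postings) {
        $main[$term] = $postings + ($main[$term] ?? array());
    }
    return $main;
}

// Invented sample data: document 2 was updated, document 3 is new.
$main  = array('bacon' => array(1 => 0.5, 2 => 0.2));
$delta = array('bacon' => array(2 => 0.9, 3 => 0.1));

$before = searchBoth('bacon', $main, $delta);
$main   = mergeDelta($main, $delta);
$after  = searchBoth('bacon', $main, array());
print_r($after);
```

New documents only touch the small delta, so they become searchable quickly, and the expensive rebuild of the main index happens on its own schedule.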
Search Speed
Zend Search Lucene
$index = Zend_Search_Lucene::open('index');
$index->optimize();
Sphinx
indexer --merge main delta --rotate
Solr
$client = new SolrClient($options);
$client->optimize();
Xapian
xapian-compact xapindex xapindex2
Distributing Search
Application -> Index, Index, Index
Documents split across the indices
Large Scale Search
http://www.nutch.org
http://hadoop.apache.org
Image Credits
Title http://www.flickr.com/photos/generated/2084287794/
What Do You Want http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx http://www.flickr.com/photos/generated/2084287794/
Lucene http://www.flickr.com/photos/mypanda/7731447/
Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/
Solr http://www.flickr.com/photos/m-j-s/2724756177/
Xapian http://www.flickr.com/photos/olibac/3522056495/
Using Search http://www.flickr.com/photos/eneas/175027945/
Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/
Questions?
Thank You!
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
  • 11. TF-IDF
    function getWeight($docID, $term, $total) {
        $tf = count($term[$docID]);
        $idf = log($total / count($term), 2);
        return $tf * $idf;
    }
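To make the TF-IDF weighting concrete, here is a minimal sketch that applies the same formula to a toy index. The index contents, the collection size of 4, and the function name `tfidf` are assumptions made for illustration; the postings layout (term => docID => positions) follows the inverted index shown earlier in the talk.

```php
<?php
// Hypothetical toy index: term => array(docID => array(positions)).
$index = array(
    'bacon' => array(1 => array(2, 9), 2 => array(3)),
    'loves' => array(1 => array(1)),
);
$total = 4; // assumed number of documents in the collection

// Same calculation as getWeight() on the slide, under an illustrative name.
function tfidf($docID, array $postings, $total) {
    $tf = count($postings[$docID]);            // occurrences in this document
    $idf = log($total / count($postings), 2);  // log2(N / document frequency)
    return $tf * $idf;
}

echo tfidf(1, $index['bacon'], $total), "\n"; // 2 * log2(4 / 2)
echo tfidf(2, $index['bacon'], $total), "\n"; // 1 * log2(4 / 2)
echo tfidf(1, $index['loves'], $total), "\n"; // 1 * log2(4 / 1)
```

A term that occurs often in one document but in few documents overall gets the highest weight, which is why 'loves' scores as high in document 1 as the more frequent 'bacon'.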
  • 12. Document Vector
             socket   what   heavy   steel   ...
    Doc 1    0.02     0.3    0.001   0       ...
    Doc 2    0        0      0       0       ...
    Doc 3    0.001    0.2    0       0       ...
    Doc 4    0        0      0.002   0.003   ...
  • 13. Ranked Query Merge
    best      23      42      179     246     333     703
    weight    0.008   0.002   0.023   0.039   0.014   0.001
    western   42      88      120     179     246     798
    weight    0.003   0.004   0.023   0.001   0.034   0.004
    1 - 246: 0.073
    2 - 179: 0.024
    3 - 120: 0.023
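The ranked merge above can be reproduced with a short sketch: sum the per-term weights for every document that appears in any posting list, then sort by the total. The posting weights are copied from the slide; the function name `rankedMerge` and the array layout (term => docID => weight) are assumptions for illustration, not the author's implementation.

```php
<?php
// Postings copied from the slide: docID => precomputed weight.
$index = array(
    'best' => array(
        23 => 0.008, 42 => 0.002, 179 => 0.023,
        246 => 0.039, 333 => 0.014, 703 => 0.001,
    ),
    'western' => array(
        42 => 0.003, 88 => 0.004, 120 => 0.023,
        179 => 0.001, 246 => 0.034, 798 => 0.004,
    ),
);

// Hypothetical helper: accumulate weights per document, highest total first.
function rankedMerge(array $terms, array $index) {
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($index[$term])) {
            continue; // term does not occur in the index
        }
        foreach ($index[$term] as $docID => $weight) {
            if (!isset($scores[$docID])) {
                $scores[$docID] = 0.0;
            }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores); // sorts in place and keeps the docID keys
    return $scores;
}

// Keep the three best documents, preserving their docID keys.
$top = array_slice(rankedMerge(array('best', 'western'), $index), 0, 3, true);
foreach ($top as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

Unlike the boolean merge, a document does not need to contain every query term to be returned: document 120 only matches 'western' but still ranks third.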
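  • 14. PHP Similarity
    function score($queryString, $index) {
        $query = tokenise($queryString);
        $matches = array();
        foreach ($query as $qterm) {
            $term = $qterm[0]; // tokenise() returns (token, offset) pairs
            if (!isset($index[$term])) {
                continue; // term is not in the index
            }
            foreach ($index[$term] as $id => $posting) {
                if (!isset($matches[$id])) {
                    $matches[$id] = 0;
                }
                $matches[$id] += $posting['score'];
            }
        }
        arsort($matches); // sorts in place, highest score first
        return $matches;
    }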
  • 16. MySQL Full Text Search
    CREATE TABLE example (
        id INT(11) NOT NULL auto_increment,
        title VARCHAR(255),
        content TEXT,
        PRIMARY KEY(id),
        FULLTEXT(title,content)
    ) Engine=MyISAM;
    INSERT INTO example (title, content) VALUES
        ('Mikko & Bacon','Mikko loves bacon'),
        ('Marcello & Bacon','Marcello hates bacon'),
        ('Jo & Sausages','Johanna loves sausages'),
        ('Hollywood & Garlic','Lorenzo hates garlic'),
        ('James & Cheddar','James is keen on cheeses');
  • 17. MySQL FTI Query
    SELECT * FROM example
    WHERE MATCH(title,content) AGAINST('loves bacon');
    +----+------------------+------------------------+
    | id | title            | content                |
    +----+------------------+------------------------+
    |  1 | Mikko & Bacon    | Mikko loves bacon      |
    |  2 | Marcello & Bacon | Marcello hates bacon   |
    |  3 | Jo & Sausages    | Johanna loves sausages |
    +----+------------------+------------------------+
    3 rows in set (0.00 sec)
  • 19. Sphinx Configuration
    source posts {
        type = mysql
        sql_host = localhost
        sql_user = user
        sql_pass = password
        sql_db = search
        sql_query = SELECT id, title, content FROM example;
        sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags;
    }
  • 20. index posts {
        source = posts
        path = /var/data/sphinx/example
        morphology = stem_en
        min_word_len = 3
        min_prefix_len = 3
        min_infix_len = 0
        enable_star = 1
    }
  • 22. Command Line Searching
    indexer --config /etc/sphinx.conf --all
    search --config /etc/sphinx.conf love bacon
    displaying matches:
    1. document=1, weight=3, tag=(1,2)
        id=1
        title=Mikko & Bacon
        content=Mikko loves bacon
    words:
    1. 'love': 2 documents, 2 hits
    2. 'bacon': 2 documents, 4 hits
    searchd --config /etc/sphinx.conf
  • 23. Sphinx From PHP
    $cl = new SphinxClient();
    $cl->SetServer('localhost', 3312);
    $cl->SetMatchMode(SPH_MATCH_ANY);
    $result = $cl->Query('bac*');
    $docIDs = array_keys($result["matches"]);
    $cl->SetFilter('tag', array(1));
    $result = $cl->Query('bac*');
    $docIDs = array_keys($result["matches"]);
  • 25. Filesystem Index With Swish-E
    fs-swish-e.conf:
        IndexDir /var/data/documents
        IndexFile fs-swish-e.index
        IndexOnly .doc .docx .pdf
        FuzzyIndexingMode Stemming_en1
        FileFilter .pdf /usr/local/bin/swish_filter.pl
        FileFilter .doc /usr/local/bin/swish_filter.pl
    /usr/local/bin/swish-e -S fs -c fs-swish-e.conf
  • 26. Crawling Content
    www-swish-e.conf:
        IndexDir /usr/local/lib/swish-e/spider.pl
        IndexFile www-swish-e.index
        SwishProgParameters default http://phpir.com/
        FuzzyIndexingMode Stemming_en1
        DefaultContents HTML
    /usr/local/bin/swish-e -S prog -c www-swish-e.conf
  • 27. Swish-E With Multiple Indices
    $swish = new Swish('www-swish-e.index fs-swish-e.index');
    $search = $swish->prepare();
    $queryStr = 'search string goes here';
    $result = $search->execute($queryStr);
    $total = $result->hits;
    while ($r = $result->nextResult()) {
        echo $r->swishdocpath; // url
    }
  • 29. Build Index
    $index = Zend_Search_Lucene::create('idx');
    foreach ($documents as $title => $content) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
        $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $content));
        $index->addDocument($doc);
    }
  • 30. Query Zend Search Lucene
    $results = $index->find('loves bacon');
    foreach ($results as $result) {
        echo $result->score, " ";
        echo $result->title, "\n";
    }
    Output:
    0.81656279309067 Mikko and Bacon
    0.24800278854758 Marcello & Bacon
  • 31. Index HTML
    $file = file_get_contents($url);
    $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $url));
    $index->addDocument($doc);
  • 33. Solr Search Index
    $options = array(
        'hostname' => 'localhost',
        'port' => 8983
    );
    $client = new SolrClient($options);
    $doc = new SolrInputDocument();
    $doc->addField('id', $id);
    $doc->addField('cat', $category);
    $doc->addField('title', $title);
    $doc->addField('text', $text);
    $response = $client->addDocument($doc);
    $client->commit();
  • 34. Solr Search Client
    $client = new SolrClient($options);
    $query = new SolrQuery('bacon');
    $response = $client->query($query);
    $r = $response->getResponse();
    foreach ($r['response']['docs'] as $d) {
        echo $d->title[0] . "\n";
    }
  • 36. Xapian In PHP
    $db = new XapianWritableDatabase('idx', Xapian::DB_CREATE_OR_OPEN);
    $i = new XapianTermGenerator();
    $i->set_stemmer(new XapianStem("english"));
    $doc = new XapianDocument();
    $doc->set_data($content);
    $doc->add_value(1, $title);
    $i->set_document($doc);
    $i->index_text($content);
    $db->add_document($doc);
  • 37. Xapian Search In PHP
    $database = new XapianDatabase('idx');
    $enquire = new XapianEnquire($database);
    $qp = new XapianQueryParser();
    $qp->set_stemmer(new XapianStem("english"));
    $qp->set_database($database);
    $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);
    $query = $qp->parse_query($queryString);
    $enquire->set_query($query);
  • 38. $matches = $enquire->get_mset(0, 10);
    $i = $matches->begin();
    while (!$i->equals($matches->end())) {
        $n = $i->get_rank() + 1;
        $data = $i->get_document()->get_data();
        $title = $i->get_document()->get_value(1);
        $score = $i->get_percent();
        $i->next();
    }
  • 41. Parse Anchor Text
    $p = file_get_contents('http://phpir.com');
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($p); // loadHTML() is an instance method
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        $text = $link->nodeValue;
    }
  • 43. ZSL Zone Weighting
    $doc = new Zend_Search_Lucene_Document();
    $tfield = Zend_Search_Lucene_Field::Text('title', $title);
    $tfield->boost = 1.3;
    $doc->addField($tfield);
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $content));
    $index->addDocument($doc);
  • 45. Document Weights in ZSL
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $content));
    $doc->boost = 1 + ($numComments / 100);
    $index->addDocument($doc);
  • 48. Sphinx Extract & Highlight
    $cl = new SphinxClient();
    $cl->SetServer("localhost", 3312);
    $q = 'bacon';
    $r = $cl->Query($q);
    foreach ($r["matches"] as $doc => $info) {
        $text[$doc] = getTextFromDB($doc);
    }
    $extracts = $cl->BuildExcerpts($text, 'posts', $q);
    foreach ($extracts as $extract) {
        echo $extract;
    }
  • 50. Xapian Spelling Correction
    Indexer:
        $indexer = new XapianTermGenerator();
        $indexer->set_database($database);
        $indexer->set_flags(XapianTermGenerator::FLAG_SPELLING);
    Searcher:
        $queryString = "strreplace or str_cmp";
        $q = new XapianQueryParser();
        $q->set_database($database);
        $query = $q->parse_query($queryString,
            XapianQueryParser::FLAG_SPELLING_CORRECTION);
        echo "Did you mean: " . $q->get_corrected_query_string() . "\n";
  • 51. Spelling Correction Output
    php xapsearch.php
    Did you mean: str_replace or strcmp
    4644 results found for strreplace or str_cmp:
    1: 2% docid=572 [phpdocs/html/cc.license.html]
    2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
    3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
    4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
  • 53. Sorting in ZSL
    $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');
    $results = $index->find($q, 'title');
    foreach ($results as $result) {
        echo '<h3>', $result->title, "</h3>\n";
        $doc = getDocumentFromDB($result->did);
        echo $q->htmlFragmentHighlightMatches($doc);
    }
  • 55. Faceted Search In Solr
    $client = new SolrClient($options);
    $query = new SolrQuery('bacon');
    // Facet options must be set before the query is sent
    $query->setFacet(true);
    $query->addFacetField('cat');
    $response = $client->query($query);
    $r = $response->getResponse();
    $f = $r['facet_counts']['facet_fields'];
    foreach ($f['cat'] as $facet => $count) {
        echo $facet . " " . $count . "\n";
    }
  • 56. More Like This
  • 57. More Like This
    $rset = new XapianRset();
    $rset->add_document(5959); // str_replace
    $e = $enquire->get_eset(40, $rset);
    $t = $e->begin();
    for ($t; !$t->equals($e->end()); $t->next()) {
        $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight()));
    }
    $query = new XapianQuery(XapianQuery::OP_OR, $qs);
  • 58. More Like This Example
    php xapsim.php
    1656 results found:
    1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
    2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
    3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
    4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  • 60. Index Updates
    [Diagram; labels: Docs, Main, New, Delta, Query]
  • 61. Search Speed
    Zend Search Lucene:
        $index = Zend_Search_Lucene::open('index');
        $index->optimize();
    Sphinx:
        indexer --merge main delta --rotate
    Solr:
        $client = new SolrClient($options);
        $client->optimize();
    Xapian:
        xapian-compact xapindex xapindex2
  • 64. Image Credits
    Title: http://www.flickr.com/photos/generated/2084287794/
    What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
    You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
    Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
    Sphinx: http://www.flickr.com/photos/generated/2084287794/
    Lucene: http://www.flickr.com/photos/mypanda/7731447/
    Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
    Solr: http://www.flickr.com/photos/m-j-s/2724756177/
    Xapian: http://www.flickr.com/photos/olibac/3522056495/
    Using Search: http://www.flickr.com/photos/eneas/175027945/
    Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
    Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
    Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/