This document provides an overview of tools and techniques for working with the Examine search engine in Umbraco, including:
- Tools like Luke and the Examine Dashboard for debugging indexes.
- Using the GatheringNodeData event to merge fields, add fields like node type aliases, and handle errors during indexing.
- Indexing different media types like PDFs using Tika.
- Techniques for search highlighting, boosting documents, and deploying index changes across environments.
- Faceted search capabilities and using the index as an object database.
The presenter encourages exploring the full capabilities of Examine and provides examples of how to optimize indexing and searching.
3. What this talk is not
- How to install
- How to configure
4. What we will cover
- Tools to help you
- Hints and tips regarding indexing
- The GatheringNodeData event is your friend!
- Indexing media (PDF, Word, etc.)
- Deep in the bowels with the DocumentWriting event
- Search highlighting
- Deployment to staging / production environments
- Faceting (not exactly Examine, but still useful)
- Food for thought
- Questions and answers
6. Tools to help you
“Use the source Luke!”
http://code.google.com/p/luke/
7. Tools to help you
- http://luke.codeplex.com/ (.NET port)
- Only a subset of the common features is present
- Some features, such as scripting with Rhino, are missing
8. Using Luke
- Write out the generated queries so you can test them in Luke
var criteria = searcher.CreateSearchCriteria(IndexTypes.Content);
IBooleanOperation query = criteria.NodeTypeAlias("NewsItem");
query = query.Not().Field("umbracoNaviHide", 1.ToString());
var results = searcher.Search(query.Compile());
// ToString() on the criteria returns the generated Lucene query
var generatedQuery = criteria.ToString();
This generates the following query:
SearchIndexType: content, LuceneQuery: +(+__NodeTypeAlias:newsitem -umbracoNaviHide:1) +__IndexType:content
9. Tools to help you
http://our.umbraco.org/projects/developer-tools/examine-dashboard
10. GatheringNodeData
- Examine has a rich event system
- In all my implementations I have used GatheringNodeData for:
– Merging fields into one contents field
– Searching on path
– Adding a nodeTypeAlias field to the PDF index
11. GatheringNodeData
Merge into contents field
- Example query
var query = searchCriteria.Field("nodeName", "hello").Or().Field("metaTitle", "hello").Or().Field("metaDescription", "hello").Compile();
12. GatheringNodeData
Merge to contents field
using System.Text;
using Examine;
using umbraco.BusinessLogic;

// Created at application start-up (ApplicationBase); hooks the indexer's GatheringNodeData event
public class ExamineEvents : ApplicationBase {
    public ExamineEvents() {
        ExamineManager.Instance.IndexProviderCollection[Constants.ATGMainIndexerName].GatheringNodeData += ATGMainExamineEvents_GatheringNodeData;
    }

    void ATGMainExamineEvents_GatheringNodeData(object sender, IndexingNodeDataEventArgs e) {
        AddToContentsField(e);
    }

    // Concatenates every field value into a single "contents" field
    private void AddToContentsField(IndexingNodeDataEventArgs e) {
        var fields = e.Fields;
        var combinedFields = new StringBuilder();
        foreach (var keyValuePair in fields) {
            combinedFields.AppendLine(keyValuePair.Value);
        }
        e.Fields.Add("contents", combinedFields.ToString());
    }
}
13. GatheringNodeData
Merge to contents field
- The query now looks like
query.Field("contents", "hello")
- Adding new fields is just a case of rebuilding the index
14. GatheringNodeData
Creating a searchable path
- The path is stored in the index as 1,1056,1078 and is not tokenised
- Add a new field with each comma replaced by a space
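The comma-to-space replacement can be sketched as below. This is a minimal illustration, assuming the fields are available as a string dictionary (as e.Fields is in the contents-field example); the field name searchablePath and the helper AddSearchablePath are hypothetical names, not part of Examine.

```csharp
using System;
using System.Collections.Generic;

public static class PathFieldSketch
{
    // Hypothetical field name; pick whatever suits your index.
    public const string SearchablePathField = "searchablePath";

    // Copies the comma-separated "path" value into a new field with the
    // commas replaced by spaces, so each node id becomes its own token.
    public static void AddSearchablePath(IDictionary<string, string> fields)
    {
        string path;
        if (fields.TryGetValue("path", out path) && !string.IsNullOrEmpty(path))
        {
            fields[SearchablePathField] = path.Replace(",", " ");
        }
    }

    public static void Main()
    {
        var fields = new Dictionary<string, string> { { "path", "1,1056,1078" } };
        AddSearchablePath(fields);
        Console.WriteLine(fields[SearchablePathField]); // prints "1 1056 1078"
    }
}
```

In a GatheringNodeData handler you would pass e.Fields to the helper; a query on the new field for a parent id should then match everything beneath that parent.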
15. GatheringNodeData
- How do you query for nodes where a field has no value, e.g. a SQL query like select where value=''?
- Or how do you select all nodes?
- You cannot write queries like this in Examine / Lucene
- However, you can use the GatheringNodeData event to inject an arbitrary value, then query on that
- E.g. a field noData_Title with the value 1
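A rough sketch of the marker-field idea follows, again assuming the fields arrive as a string dictionary. The helper name AddNoDataMarker is hypothetical; only the noData_Title naming and the value 1 come from the slide.

```csharp
using System;
using System.Collections.Generic;

public static class NoDataMarkerSketch
{
    // Adds a "noData_<fieldName>" field with the value "1" when the
    // named field is missing or blank, so "no value" becomes queryable.
    public static void AddNoDataMarker(IDictionary<string, string> fields, string fieldName)
    {
        string value;
        bool hasValue = fields.TryGetValue(fieldName, out value)
                        && !string.IsNullOrWhiteSpace(value);
        if (!hasValue)
        {
            fields["noData_" + fieldName] = "1";
        }
    }

    public static void Main()
    {
        var fields = new Dictionary<string, string> { { "Title", "" } };
        AddNoDataMarker(fields, "Title");
        Console.WriteLine(fields.ContainsKey("noData_Title")); // prints "True"
    }
}
```

The "no title" search then becomes an ordinary field query on noData_Title with the value 1.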
16. GatheringNodeData
- Errors during re-indexing
- E.g. an MNTP field referencing a node that no longer exists
- Use try/catch and log the offending node
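One way to structure the try/catch is sketched below. It is deliberately independent of Examine: TryEnrich, the enrich delegate and the log delegate are all hypothetical stand-ins for your own lookup code and logging.

```csharp
using System;
using System.Collections.Generic;

public static class SafeIndexingSketch
{
    // Runs one enrichment step for a node. On failure it logs the node id
    // and returns false instead of letting the exception abort the rebuild.
    public static bool TryEnrich(
        int nodeId,
        IDictionary<string, string> fields,
        Action<IDictionary<string, string>> enrich,
        Action<string> log)
    {
        try
        {
            enrich(fields);
            return true;
        }
        catch (Exception ex)
        {
            log(string.Format("Indexing failed for node {0}: {1}", nodeId, ex.Message));
            return false;
        }
    }

    public static void Main()
    {
        var fields = new Dictionary<string, string>();
        // Simulates a picker value that points at a deleted node.
        bool ok = TryEnrich(
            1078,
            fields,
            f => { throw new InvalidOperationException("picked node no longer exists"); },
            Console.WriteLine);
        Console.WriteLine(ok); // prints "False"
    }
}
```

Logging the node id is the important part: it tells you which node to open and fix rather than leaving you with a half-built index and no clue.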
17. DocumentWriting event
- Use it when you need lower-level Lucene access
- E.g. boosting a field
- What is boosting? Not all documents are equal, so sometimes you need to artificially give certain documents a higher ranking, when sorting alone is just not enough, e.g.
– Person doc type: people with an important title need to appear at the top of the person search list
– Boost documents by age: penalise older documents; useful for news and business documents
– Boost based on unique views (you would need to know these up front, and also base them on time slots, e.g. last month, last week)
– Documents with more likes (custom like functionality)
– Tagging using XFS Term Selector with weighting http://our.umbraco.org/projects/website-utilities/xfs-term-selector
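As an illustration of boosting by age, the sketch below computes a boost factor that decays linearly from 1.0 for a new document to a floor for old ones. The formula, the one-year window and the 0.5 floor are assumptions chosen for the example, not something from the talk; applying the factor to the Lucene document in the DocumentWriting handler depends on your Lucene.Net version and is left out.

```csharp
using System;

public static class AgeBoostSketch
{
    // Returns 1.0 for a document published "now", falling linearly to
    // minBoost once the document is maxAgeDays old or older.
    public static float AgeBoost(
        DateTime published,
        DateTime now,
        double maxAgeDays = 365,
        float minBoost = 0.5f)
    {
        double ageDays = Math.Max(0, (now - published).TotalDays);
        double fraction = Math.Min(1.0, ageDays / maxAgeDays);
        return (float)(1.0 - fraction * (1.0 - minBoost));
    }

    public static void Main()
    {
        var now = new DateTime(2012, 6, 1);
        Console.WriteLine(AgeBoost(now, now));              // prints "1"
        Console.WriteLine(AgeBoost(now.AddYears(-3), now)); // prints "0.5"
    }
}
```

A linear decay is only one option; a stepped or exponential curve may suit news content better, so treat the numbers as something to tune against real searches.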
19. Indexing media
- PDF indexer: only indexes PDF content
- CogUmbracoExamineMediaIndexer (available as a package on our.umbraco.org)
– Uses Apache Tika. Indexes content and any associated metadata
– XML and derived formats
– Microsoft Office document formats
– OpenDocument Format
– Portable Document Format
– Electronic Publication Format
– Rich Text Format
– Compression and packaging formats
– Text formats
– Audio formats (MP3 etc)
– Image formats
– Video formats
– Java class files and archives
– The mbox format
20. Search highlighting
- Lucene contrib package Highlighter.Net
- Highlights occurrences of your search term in the summary fragment of each search result
- Wiki on our.umbraco.org: http://our.umbraco.org/wiki/how-tos/how-to-highlight-text-in-examine-search-results
21. Deployment to staging / production environments
- Cannot copy the index between environments
- Can check the index in to source control, but it could become corrupted
- Use Selenium with an .ashx handler to rebuild the index
22. Deployment to staging / production environments
using System;
using System.Collections.Generic;
using System.Web;
using Examine;

// Rebuilds every index listed below, or only the one named in ?index=
public class RebuildIndexes : IHttpHandler
{
    readonly List<string> indexes = new List<string> { "ATGIndexer", "InternalIndexer", "directoryIndexer" };

    public void ProcessRequest(HttpContext context)
    {
        context.Response.ContentType = "text/plain";
        try
        {
            if (string.IsNullOrEmpty(context.Request.QueryString["index"]))
            {
                foreach (var index in indexes)
                {
                    ExamineManager.Instance.IndexProviderCollection[index].RebuildIndex();
                }
            }
            else
            {
                ExamineManager.Instance.IndexProviderCollection[context.Request.QueryString["index"]].RebuildIndex();
            }
            context.Response.Write("done");
        }
        catch (Exception ex)
        {
            // Write the full exception so a failed rebuild is visible to the caller
            context.Response.Write(ex.ToString());
        }
    }

    public bool IsReusable
    {
        get { return false; }
    }
}
23. Deployment to staging / production environments
using System.Text;
using NUnit.Framework;
using Selenium;

[TestFixture]
public class RebuildIndexTests
{
    private ISelenium selenium;
    private StringBuilder _verificationErrors;

    [SetUp]
    public void SetupTest()
    {
        selenium = new DefaultSelenium("localhost", 4444, "*chrome", "http://mydevsite");
        selenium.Start();
        _verificationErrors = new StringBuilder();
    }

    [Test]
    public void RebuildIndex()
    {
        // Not a proper test, but a hack to get indexes rebuilt after a deployment
        try
        {
            selenium.Open("/umbraco/webservices/RebuildIndexes.ashx");
        }
        catch (SeleniumException se)
        {
            // A rebuild can outlast the page timeout; ignore that and rethrow anything else
            if (!se.Message.StartsWith("Timed out"))
            {
                throw;
            }
        }
        catch (AssertionException e)
        {
            _verificationErrors.Append(e.Message);
        }
    }
}
24. Faceting
- Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters
- Examples: Amazon, LinkedIn http://www.linkedin.com/search/fpsearch?type=people&keywords=umbraco&pplSearchOrigin=GLHD&pageKey=member-home&search=Search
- LinkedIn uses Bobo-Browse. Written in Java, it has been ported to .NET: http://bobo.codeplex.com/
- The demo uses SimpleFacetHandler; others are available, e.g. RangeFacet, PathFacet, GetFacet
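To make the idea of facets concrete, here is a small in-memory sketch that counts results per category and then narrows by one of them. It uses plain LINQ rather than Bobo or Examine, and the SearchHit type and its properties are hypothetical; a facet library does the same counting against the index itself, far more efficiently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a search result with one facetable field.
public class SearchHit
{
    public string Name { get; set; }
    public string Category { get; set; }
}

public static class FacetCountSketch
{
    // Facet counts: how many hits fall under each category value.
    public static Dictionary<string, int> CountByCategory(IEnumerable<SearchHit> hits)
    {
        return hits.GroupBy(h => h.Category)
                   .ToDictionary(g => g.Key, g => g.Count());
    }

    // Applying a facet: keep only the hits in the chosen category.
    public static List<SearchHit> FilterByCategory(IEnumerable<SearchHit> hits, string category)
    {
        return hits.Where(h => h.Category == category).ToList();
    }

    public static void Main()
    {
        var hits = new List<SearchHit>
        {
            new SearchHit { Name = "The Fellowship of the Ring", Category = "Books" },
            new SearchHit { Name = "The Two Towers", Category = "Books" },
            new SearchHit { Name = "Soundtrack", Category = "Music" }
        };
        foreach (var facet in CountByCategory(hits))
        {
            Console.WriteLine(facet.Key + " (" + facet.Value + ")");
        }
        Console.WriteLine(FilterByCategory(hits, "Music").Count); // prints "1"
    }
}
```

The counts are what you render beside the result list (for example "Books (2)"), and clicking one applies the filter.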
25. Food for thought
- Using the index as an object DB, à la RavenDB
- Scenario: you have nodes with a large number of Multi-Node Tree Pickers (MNTP) used as lookups
29. Food for thought
- In the index, node ids are stored as a CSV list if the MNTP is set to CSV
- Use the GatheringNodeData event to do the lookups, create a POCO with the lookup data, serialise the POCO to JSON and store that in the index
- Advantage: instant lookup, all data ready to use
- Disadvantage: you need to keep up with lookup changes. E.g. if a country code changes, you would need to find the documents already using that code and update them
- A nice approach if the lookup data is fairly static
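The serialise-and-store step might look something like this. The sketch assumes a string dictionary of fields and uses System.Text.Json purely for convenience (any JSON serialiser would do); CountryLookup, the field name countryJson and both helpers are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical POCO holding data resolved from a picked lookup node.
public class CountryLookup
{
    public string Code { get; set; }
    public string Name { get; set; }
}

public static class LookupJsonSketch
{
    // At index time: serialise the resolved lookup and store it as a field.
    public static void StoreLookup(IDictionary<string, string> fields, string fieldName, CountryLookup lookup)
    {
        fields[fieldName] = JsonSerializer.Serialize(lookup);
    }

    // At search time: turn the stored field value back into the POCO.
    public static CountryLookup ReadLookup(string json)
    {
        return JsonSerializer.Deserialize<CountryLookup>(json);
    }

    public static void Main()
    {
        var fields = new Dictionary<string, string>();
        StoreLookup(fields, "countryJson", new CountryLookup { Code = "GB", Name = "United Kingdom" });

        var country = ReadLookup(fields["countryJson"]);
        Console.WriteLine(country.Code + " " + country.Name); // prints "GB United Kingdom"
    }
}
```

Reading a result then needs no further node lookups, at the cost that a change to the lookup node means re-indexing every document that embeds it.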
30. Food for thought
- POCO hydration using activelucenenet, à la uSiteBuilder
- Create POCOs and decorate them with attributes
public class Product
{
    [LuceneField("sku")]
    public string Sku { get; set; }

    [LuceneField("productName")]
    public string ProductName { get; set; }
}
31. Food for thought
var luceneProductDoc = GetItFromLucene(1234);
var product = LuceneMediator<Product>.ToRecord(luceneProductDoc);
You would need to use Lucene directly, as there appears to be no way of getting the Lucene document from the Examine search result wrapper.
#3: Rationale behind the talk. Ask how many people are using it. Examine / Lucene is awesome. Very, very fast! "Examiness" is not a real word, I don't think, but Shannon used it when he presented at CG to describe the nuances of Examine!
#4: See the umbraco.tv videos, and also the Codegarden videos done by Tim Geyssens.
#5: This will be a more interactive session rather than me just going on for 20 minutes.
#6: If you do not have this book you are doing it wrong. It's a deep grok into Lucene. Examine is just a wrapper. It covers the mechanics of analysers, the indexing and searching process, how a document is scored, etc.
#8: If you don't want to stick Java on your local machine or server.
#9: Write the query out to a hidden field or the trace, then use Luke with the ATG index. Useful for testing grouped Or / And queries, date ranges, analysers, etc. Fire up Luke with the ATG index. It has helped to fix some strange errors, not all of them Examine related.
#10: If you want to rebuild your indexes, use this. I had written a simple one; this one is far superior. The latest version, I think, breaks. Update usercontrol.
#12: The list of fields to query on can get pretty big. You can pass in an array of fields, but you need to set those up front and know what they are. You will also need to add new fields to the query after you add them to your doc type.
#13: Fields is a dictionary of all the fields defined in the ExamineIndex node IndexUserFields. I usually leave mine blank so all fields are in the index.
#15: To do a query that gets all items under a given parent. Show in the ATG index.
#16: Examine already has a field like that, e.g. indexType, so you can use that just to get all content nodes. Can also use it for the ATG directory search to get all items.
#18: It is not as common to use this event. Examine abstracts away Lucene. E.g. company sentiment analysis?? For more details on boosting, see Lucene in Action. Custom like: users get to like a document, and this boosts its relevancy in a search because it is more popular.
#19: Have a class that inherits from ApplicationBase. Document is the Lucene document object. Field is the Lucene field.
#20: Uses IKVM, so it spins up a Java virtual machine. Image EXIF metadata enables image location search type functionality (Lucene.Net spatial). NB: image and audio content is not indexed, only metadata.
#21: Mention how this is relevant to the munged contents field.
#22: Who uses Selenium? TeamCity runs the Selenium test. Cheating, as it's not really a test!
#24: This is old Selenium code, not WebDriver code, which is better!
#25: Amazon LOTR search: as well as your list of results you get categories on the left-hand side, e.g. Books, Music, Games. Demo ATG facets on the directory.
#26: Lookups for price, codes, etc. Show the advanced donate tool.
#31: Ask how many people use uSiteBuilder. Hackathon??
#32: As far as I am aware, you cannot get the Lucene document when searching using Examine.