1. The document summarizes new features in Oracle Text 11g and the roadmap for Oracle's search products, including Oracle Text and Secure Enterprise Search.
2. Key new features in Oracle Text 11g include composite domain indexes, automatic language recognition with context-sensitive stemming, and offline index creation. Oracle Text 11.2.0.2 introduces entity extraction, name search, and a result set interface that returns XML results.
3. The roadmap discusses merging Oracle Text and Secure Enterprise Search and bringing additional natural language processing, partitioning, faceted navigation, and performance improvements to Oracle's search products.
2. Oracle Database 11g New Search Features and Roadmap
Roger Ford
Senior Principal Product Manager
3. Contents
Oracle's Search Products
Oracle Text 11g New Features
Oracle Text 11.2.0.2 New Features
Entity Extraction
Name Search
Result Set Interface
Search Product Roadmap
Oracle Text
Secure Enterprise Search
4. Oracle's Search Products
Oracle Text
A SQL and PL/SQL based toolkit for creating full-text search
applications
Free with all database versions
Previously known as Context Option, interMedia Text
Secure Enterprise Search
A complete search solution based on Oracle Text capabilities
Crawlers for datasources such as web, email, document
repositories, databases
End-user query application and APIs for embedding
5. Oracle Text 11g New Features
Composite Domain Indexes and SDATA sections
Allows storage of structured info (e.g. numbers, dates) within the
text index
Makes for much faster mixed queries
Auto Lexer
Automatic Language Recognition
Segmentation and Stemming for 32 languages
Context-sensitive stemming for 23 of these languages
Off-line and time-limited index creation
Enables rebuild of indexes offline in quiet periods for true
24x7 operation
7. 11.2.0.2 New Features - Summary
1. Entity Extraction
Find entities such as people, countries, cities, states, zip codes,
phone numbers etc from the text
Use default dictionary and rules or define your own dictionary and
rules based on regular expressions
2. Name Search (NDATA sections)
Inexact searches, copes with mis-spellings, segmentation errors,
contractions and word reversal
Useful for many searches, but particularly good for names
3. ResultSet Interface
Query request in XML and results returned as XML
Avoids SQL layer and requirement to work within SELECT
semantics
8. Entity Extraction
Identify names, places, dates, times, etc.
Tag each occurrence with type and subtype
Entities are defined by DICTIONARY and RULES
Implemented by CTX_ENTITY package
create_extract_policy: create a policy to which you can add extract rules
Choose to use/not use built-in rules and dictionary
add_extract_rule: create an XML-based rule to define an entity
add_stop_entity: prevent defined entities from being used
compile: build the policy with its rules
extract: get an XML-based list of entities for a doc
Can also use ctxload to load a user dictionary
12. Entity Extraction
Example 2: User rule
ctx_entity.create_extract_policy('mypolicy');
ctx_entity.add_extract_rule('mypolicy', 5,
'<rule>
<expression>((North|South)? America)</expression>
<type refid="1">xContinent</type>
</rule>');
ctx_entity.compile('mypolicy');
ctx_entity.extract('mypolicy', mydoc, mylang, myresults);
Note the parentheses around the expression. refid="1" refers to the first parenthesized
group, so the extracted text is "North America", "South America", or just "America".
User-defined types must be prefixed with an "x", hence "xContinent"
<entities>
<entity id="0" offset="75" length="13" source="UserRule">
<text>North America</text>
<type>xContinent</type>
</entity>
</entities>
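The role of refid="1" can be tried out locally. The sketch below is only an illustration: it uses Python's `re` module as a stand-in for the rule engine (whose regex dialect may differ), and the helper `find_entities` and the sample sentence are hypothetical, not part of Oracle Text.

```python
import re

# The rule expression from the slide. Python's `re` is a stand-in here;
# the regex dialect used by CTX_ENTITY rules may differ (assumption).
RULE = re.compile(r"((North|South)? America)")
REFID = 1  # refid="1" selects the first parenthesized group


def find_entities(text, pattern=RULE, refid=REFID, entity_type="xContinent"):
    """Return (offset, length, text, type) for every match of the rule."""
    hits = []
    for m in pattern.finditer(text):
        matched = m.group(refid)
        hits.append((m.start(refid), len(matched), matched, entity_type))
    return hits


# Hypothetical sample document.
sample = "Exports to North America rose, while South America was flat."
for hit in find_entities(sample):
    print(hit)
```

In Python's dialect the bare form is captured with its leading space (" America"), because the space sits inside group 1. Whether the Oracle rule engine behaves the same way is worth checking before relying on offsets.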
13. Entity Extraction: Adding a user dictionary
Create file
ud.xml:
<dictionary> <entities>
<entity> <value>Dow Jones Industrial Average</value> <type>xIndex</type> </entity>
<entity> <value>S&amp;P 500</value> <type>xIndex</type> </entity>
</entities> </dictionary>
Create the policy with CTXLOAD (can add rules later)
ctxload -user scott/tiger -extract -name pol1 -file ud.xml
Compile the policy
ctx_entity.compile('pol1');
Results
<entity id="69" offset="1010" length="7" source="UserDictionary">
<text>S&amp;P 500</text>
<type>xIndex</type>
</entity>
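A dictionary file is XML, so values such as "S&P 500" need the ampersand escaped. One way to avoid hand-editing mistakes is to generate the file. The following is a minimal sketch, assuming the dictionary/entities/entity/value/type layout shown on this slide is all that is required; `build_dictionary` is a hypothetical helper, not an Oracle utility.

```python
import xml.etree.ElementTree as ET


def build_dictionary(entries):
    """Build a <dictionary> document from (value, type) pairs."""
    root = ET.Element("dictionary")
    entities = ET.SubElement(root, "entities")
    for value, entity_type in entries:
        entity = ET.SubElement(entities, "entity")
        # ElementTree escapes special characters such as '&' on output.
        ET.SubElement(entity, "value").text = value
        ET.SubElement(entity, "type").text = entity_type
    return ET.tostring(root, encoding="unicode")


xml_text = build_dictionary([
    ("Dow Jones Industrial Average", "xIndex"),
    ("S&P 500", "xIndex"),
])
print(xml_text)

# To produce the file that ctxload reads (file name taken from the slide):
# with open("ud.xml", "w", encoding="utf-8") as f:
#     f.write(xml_text)
```

The generated text can then be loaded with the ctxload command shown on this slide.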
14. Entity Extraction: other stuff
Extracting only certain entity types:
ctx_entity.extract('p1', mydoc, null, myresults,
'city,company,xContinent');
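With or without a type list, the result is an `<entities>` document like the one in the user-rule example. A minimal sketch of reading it on the client side follows. The helper `parse_entities` is hypothetical, and its optional type filter only mimics the comma-separated list passed to extract; filtering on the server is likely the better choice when it is available.

```python
import xml.etree.ElementTree as ET

# Result copied from the user-rule example earlier in the deck.
RESULT = """<entities>
  <entity id="0" offset="75" length="13" source="UserRule">
    <text>North America</text>
    <type>xContinent</type>
  </entity>
</entities>"""


def parse_entities(xml_text, wanted_types=None):
    """Return entity dicts, optionally keeping only the listed types."""
    wanted = None
    if wanted_types:
        wanted = {t.strip() for t in wanted_types.split(",")}
    out = []
    for e in ET.fromstring(xml_text).iter("entity"):
        etype = e.findtext("type")
        if wanted is not None and etype not in wanted:
            continue
        out.append({
            "id": int(e.get("id")),
            "offset": int(e.get("offset")),
            "length": int(e.get("length")),
            "source": e.get("source"),
            "text": e.findtext("text"),
            "type": etype,
        })
    return out


print(parse_entities(RESULT, "city,company,xContinent"))
```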
15. Name Search
Searching names has many difficulties
Spelling (steven = stephen)
Alternate Names (fred = alfred, chuck = charles)
Transcription (copying from spoken to written form)
Transliteration (copying from one writing system to another)
Segmentation (Mary Jane, Maryjane)
First, Middle, and Last Name Classification
Name search does intelligent matching across all
these issues
17. NDATA section type
Basic implementation for name search
Limitations
511 characters
255 whitespace-delimited terms
No offset information, therefore no:
Highlighting / Markup
NEAR or phrase search with NDATA
Uses WORDLIST preference attributes:
NDATA_ALTERNATE_SPELLING
NDATA_BASE_LETTER
NDATA_THESAURUS (for alternate names; a default thesaurus is provided)
NDATA_JOIN_PARTICLES (list such as 'de:du:mc:mac')
Query Syntax
NDATA(fieldname, search terms [, order [, proximity ] ] )
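The bracket nesting in the syntax line means that proximity can only be given when order is also given. A small sketch of assembling such a query string is shown below. The helper `ndata_query`, the section name `fullname`, and the literal values passed for order and proximity are assumptions for illustration; the accepted keyword values should be checked against the NDATA operator reference.

```python
def ndata_query(field, terms, order=None, proximity=None):
    """Format NDATA(fieldname, search terms [, order [, proximity]])."""
    if proximity is not None and order is None:
        # The slide's syntax nests proximity inside the optional order part.
        raise ValueError("proximity can only be given together with order")
    if "," in terms or "(" in terms or ")" in terms:
        # Keep the sketch simple: reject characters that would break the call.
        raise ValueError("search terms must not contain commas or parentheses")
    parts = [field, terms]
    if order is not None:
        parts.append(order)
    if proximity is not None:
        parts.append(proximity)
    return "NDATA(" + ", ".join(parts) + ")"


# 'fullname' is a hypothetical NDATA section name.
print(ndata_query("fullname", "john smith"))
print(ndata_query("fullname", "john smith", "order", "proximity"))
```

The resulting string would be used as the text query of a CONTAINS call against a column whose index defines the NDATA section.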
18. Result Set Interface
Some queries are difficult to express in SQL:
e.g. "Give me the top 5 hits in each category"
Result set interface uses a simple text query and an
XML result set descriptor
Hitlist is returned in XML according to result set
descriptor
Uses SDATA sections for
Grouping
Counting
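To make the idea of a result set descriptor concrete, here is a minimal sketch that builds one asking for a total count, the top hits by score, and a count per value of an SDATA section. The element and attribute names are written from memory of the CTX_QUERY.RESULT_SET documentation and should be treated as assumptions to verify; `build_descriptor` and the section name `category` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_descriptor(group_section, top_n=5):
    """Build a result set descriptor: count, top hits, per-group counts."""
    # Element/attribute names are assumptions to verify against the
    # CTX_QUERY.RESULT_SET documentation for the release in use.
    root = ET.Element("ctx_result_set_descriptor")
    ET.SubElement(root, "count")  # total number of hits
    hitlist = ET.SubElement(root, "hitlist", {
        "start_hit_num": "1",
        "end_hit_num": str(top_n),
        "order": "SCORE DESC",
    })
    ET.SubElement(hitlist, "score")
    ET.SubElement(hitlist, "rowid")
    ET.SubElement(hitlist, "sdata", {"name": group_section})
    # Grouping and counting are driven by an SDATA section.
    group = ET.SubElement(root, "group", {"sdata": group_section})
    ET.SubElement(group, "count")
    return ET.tostring(root, encoding="unicode")


descriptor = build_descriptor("category")
print(descriptor)
```

A descriptor of this shape covers the hitlist and the per-category counts. Whether a top-N hitlist inside each group can be requested depends on the release, so that part is left out of the sketch.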
23. Roadmap: merging Text and SES
Oracle Text
Full Control
Fine-grained Index Options
Data Storage Options
Lexer Options
Stoplists
Use existing database
RAC, Exadata
Secure Enterprise Search
Full Featured
Built in database and mid-tier
Crawlers for many sources
Simple Query Interface
End user GUI / API
Embedded security
24. Coming Search Features
Natural Language Processing enhancements
Ontology based classification
Question answering
Automatic Partitioning
Query load balancing
Full support for faceted navigation (MVDATA sections)
Functional completeness for Result Set Interface
Result Iterator streaming support
Parallel Query
Replication Support
GoldenGate / Logical Standby / Streams
Operator improvements
NEAR2: best query in one operator
MNOT: mild not, e.g. YORK MNOT NEW YORK
Nested near
Substring index and query performance improvements
25. Coming Search Features - Continued
Multiple enhancements to query performance
BIGIO: leverages SecureFiles CLOBs
Automatic optimization of indexes with stage index
Two-level index: keeps common search terms in memory
Partition maintenance without reindexing
Off-load filtering from database server
Section specific index options
Choose different options, e.g. language, stopwords, PRINTJOINS, for
each section
Regular expression based stopwords
Forward Index
Hugely improved performance for highlighting, snippets
PDF Native Highlighting
Unlimited SDATA, MDATA and Field Sections
26. The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle's
products remains at the sole discretion of Oracle.