Oracle Database 11g New Search Features and Roadmap
Roger Ford
Senior Principal Product Manager
Contents
• Oracle's Search Products
• Oracle Text 11g New Features
• Oracle Text 11.2.0.2 New Features
  • Entity Extraction
  • Name Search
  • Result Set Interface
• Search Product Roadmap
  • Oracle Text
  • Secure Enterprise Search

Oracle's Search Products
• Oracle Text
  • A SQL and PL/SQL based toolkit for creating full-text search applications
  • Free with all database versions
  • Previously known as Context Option, interMedia Text
• Secure Enterprise Search
  • A complete search solution based on Oracle Text capabilities
  • Crawlers for data sources such as web, email, document repositories, databases
  • End-user query application and APIs for embedding

Oracle Text 11g New Features
• Composite Domain Indexes and SDATA sections
  • Allow storage of structured information (e.g. numbers, dates) within the text index
  • Make for much faster mixed queries
• Auto Lexer
  • Automatic language recognition
  • Segmentation and stemming for 32 languages
  • Context-sensitive stemming for 23 of these languages
• Off-line and time-limited index creation
  • Enables rebuilding indexes offline in quiet periods for true 24x7 operation
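
The composite domain index and Auto Lexer features listed above might be set up roughly as follows. This is a minimal sketch, not taken from the slides: the table docs, its columns, the preference name my_auto_lexer and the index name docs_idx are hypothetical, and it assumes a user with the CTXAPP role.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE docs (
  id      NUMBER PRIMARY KEY,
  author  VARCHAR2(40),
  pubdate DATE,
  text    CLOB
);

-- Lexer preference based on AUTO_LEXER (language recognition,
-- segmentation and stemming). The preference name is an assumption.
BEGIN
  ctx_ddl.create_preference('my_auto_lexer', 'AUTO_LEXER');
END;
/

-- Composite domain index: FILTER BY stores the structured columns
-- inside the text index as SDATA sections.
CREATE INDEX docs_idx ON docs (text)
  INDEXTYPE IS ctxsys.context
  FILTER BY author, pubdate
  PARAMETERS ('LEXER my_auto_lexer');

-- Mixed query: a text predicate combined with structured predicates
-- on the columns named in FILTER BY.
SELECT id, author, pubdate
FROM   docs
WHERE  CONTAINS(text, 'oracle') > 0
AND    author = 'John'
AND    pubdate >= DATE '2001-01-01';
```

Because author and pubdate are held in the text index, the structured predicates can be evaluated there along with the text predicate, which is where the faster mixed queries come from.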

Demo: Auto Lexer

11.2.0.2 New Features - Summary
1. Entity Extraction
  • Find entities such as people, countries, cities, states, zip codes, phone numbers, etc. in the text
  • Use the default dictionary and rules, or define your own dictionary and rules based on regular expressions
2. Name Search (NDATA sections)
  • Inexact searches that cope with misspellings, segmentation errors, contractions and word reversal
  • Useful for many searches, but particularly good for names
3. Result Set Interface
  • Query request in XML, results returned as XML
  • Avoids the SQL layer and the requirement to work within SELECT semantics

Entity Extraction
• Identify names, places, dates, times, etc.
• Tag each occurrence with type and subtype
• Entities are defined by DICTIONARY and RULES
• Implemented by the CTX_ENTITY package
  • create_extract_policy: create a policy to which you can add extract rules
    • Choose whether or not to use the built-in rules and dictionary
  • add_extract_rule: create an XML-based rule to define an entity
  • add_stop_entity: prevent defined entities from being used
  • compile: build the policy with its rules
  • extract: get an XML-based list of entities for a document
• ctxload can also be used to load a user dictionary

Demo: Entity Extraction

Entities: built-in types
• building
• city
• company
• country
• currency
• date
• day
• email_address
• geo_political
• holiday
• location_other
• month
• non_profit
• organization_other
• percent
• person_jobtitle
• person_name
• person_other
• phone_number
• postal_address
• product
• region
• ssn
• state
• time_duration
• tod
• url
• zip_code

Entity Extraction: Example 1 (Defaults)
-- Create a policy that uses the supplied rules and dictionary
ctx_entity.create_extract_policy('mypolicy');
ctx_entity.compile('mypolicy');
ctx_entity.extract('mypolicy', mydoc, mylang, myresults);

• Output in "myresults":
<entities>
  <entity id="0" offset="75" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="55" length="16" source="SuppliedRule">
    <text>Hupplewhite Inc.</text>
    <type>company</type>
  </entity>
</entities>

Entity Extraction: Example 2 (User rule)
ctx_entity.create_extract_policy('mypolicy');
-- Add a user rule with rule id 5
ctx_entity.add_extract_rule('mypolicy', 5,
  '<rule>
     <expression>((North|South)? America)</expression>
     <type refid="1">xContinent</type>
   </rule>');
ctx_entity.compile('mypolicy');
ctx_entity.extract('mypolicy', mydoc, mylang, myresults);

• Note the parentheses around the expression. refid="1" means take the first parenthesized group, here the whole expression, so "North America", "South America" or just "America".
• User-defined types must be prefixed with "x", hence "xContinent".

<entities>
  <entity id="0" offset="75" length="13" source="UserRule">
    <text>North America</text>
    <type>xContinent</type>
  </entity>
</entities>
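
The fragments above leave out the variable declarations. A complete anonymous block might look like the following sketch, which assumes a user with the CTXAPP role and SQL*Plus-style output. The sample document text is made up for illustration, and the policy is dropped at the end so the block can be re-run.

```sql
SET SERVEROUTPUT ON

DECLARE
  -- Hypothetical sample document; any CLOB would do.
  mydoc     CLOB := 'Sales in North America rose sharply last quarter.';
  myresults CLOB;
BEGIN
  ctx_entity.create_extract_policy('mypolicy');

  -- Same user rule as in the example above (rule id 5).
  ctx_entity.add_extract_rule('mypolicy', 5,
    '<rule>
       <expression>((North|South)? America)</expression>
       <type refid="1">xContinent</type>
     </rule>');

  ctx_entity.compile('mypolicy');

  -- The language argument is left NULL; the last argument restricts
  -- the output to the user-defined type.
  ctx_entity.extract('mypolicy', mydoc, NULL, myresults, 'xContinent');

  -- Print the start of the XML result.
  dbms_output.put_line(dbms_lob.substr(myresults, 4000, 1));

  ctx_entity.drop_extract_policy('mypolicy');
END;
/
```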

Entity Extraction: Adding a user dictionary
• Create file ud.xml:
<dictionary>
  <entities>
    <entity> <value>Dow Jones Industrial Average</value> <type>xIndex</type> </entity>
    <entity> <value>S&amp;P 500</value> <type>xIndex</type> </entity>
  </entities>
</dictionary>

• Create the policy with CTXLOAD (rules can be added later):
ctxload -user scott/tiger -extract -name pol1 -file ud.xml
• Compile the policy:
ctx_entity.compile('pol1');
• Results:
<entity id="69" offset="1010" length="7" source="UserDictionary">
  <text>S&amp;P 500</text>
  <type>xIndex</type>
</entity>

Entity Extraction: other features
• Extracting only certain entity types:
ctx_entity.extract('p1', mydoc, null, myresults, 'city,company,xContinent');

Name Search
• Searching names has many difficulties
  • Spelling (steven = stephen)
  • Alternate Names (fred = alfred, chuck = charles)
  • Transcription (copying from spoken to written form)
  • Transliteration (copying from one writing system to another)
  • Segmentation (Mary Jane, Maryjane)
  • First, Middle, and Last Name Classification
• Name search does intelligent matching across all these issues

Demo: Name Search

NDATA section type
• Basic implementation for name search
• Limitations
  • 511 characters
  • 255 whitespace-delimited terms
  • No offset information, therefore no:
    • Highlighting / Markup
    • NEAR or phrase search with NDATA
• Uses WORDLIST preference attributes:
  • NDATA_ALTERNATE_SPELLING
  • NDATA_BASE_LETTER
  • NDATA_THESAURUS (for alternate names; a default thesaurus is provided)
  • NDATA_JOIN_PARTICLES (list such as 'de:du:mc:mac')
• Query Syntax
  • NDATA(fieldname, search terms [, order [, proximity ] ] )
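
Putting the NDATA pieces together, a name-search index could be built along these lines. This is a sketch under stated assumptions rather than the demo from the talk: the table people, the preference and section group names, and the sample data are hypothetical, and the attribute values are illustrative choices.

```sql
-- Hypothetical table: one name per row.
CREATE TABLE people (full_name VARCHAR2(2000));
INSERT INTO people VALUES ('John Black Smith');
COMMIT;

BEGIN
  -- MULTI_COLUMN_DATASTORE wraps the column value in <full_name> tags,
  -- giving the section group a tag to map to the NDATA section.
  ctx_ddl.create_preference('name_ds', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('name_ds', 'COLUMNS', 'full_name');

  -- Declare full_name as an NDATA section.
  ctx_ddl.create_section_group('name_sg', 'BASIC_SECTION_GROUP');
  ctx_ddl.add_ndata_section('name_sg', 'full_name', 'full_name');

  -- WORDLIST attributes from the slide (values chosen for illustration).
  ctx_ddl.create_preference('name_wl', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('name_wl', 'NDATA_ALTERNATE_SPELLING', 'FALSE');
  ctx_ddl.set_attribute('name_wl', 'NDATA_BASE_LETTER', 'TRUE');
  ctx_ddl.set_attribute('name_wl', 'NDATA_JOIN_PARTICLES', 'de:du:mc:mac');
END;
/

CREATE INDEX people_idx ON people (full_name)
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('DATASTORE name_ds SECTION GROUP name_sg WORDLIST name_wl');

-- Inexact name query: different spelling and segmentation from the stored name.
SELECT full_name, SCORE(1) AS relevance
FROM   people
WHERE  CONTAINS(full_name, 'NDATA(full_name, Jon Blacksmith)', 1) > 0
ORDER  BY SCORE(1) DESC;
```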

Result Set Interface
• Some queries are difficult to express in SQL:
  • e.g. "Give me the top 5 hits in each category"
• The result set interface uses a simple text query and an XML result set descriptor
• The hitlist is returned as XML according to the result set descriptor
• Uses SDATA sections for
  • Grouping
  • Counting

Result Set Example Query
ctx_query.result_set('docidx', 'oracle',
  '<ctx_result_set_descriptor>
     <count/>
     <hitlist start_hit_num="1" end_hit_num="2" order="pubDate desc, score desc">
       <score/> <rowid/>
       <sdata name="author"/>
       <sdata name="pubDate"/>
     </hitlist>
     <group sdata="pubDate">
       <count/>
     </group>
     <group sdata="author">
       <count/>
     </group>
   </ctx_result_set_descriptor>', rs);

Result Set Output
<ctx_result_set>
  <hitlist>
    <hit>
      <score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
    <hit>
      <score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
  </hitlist>
  <count>100</count>

Result Set Output - Continued
  <groups sdata="PUBDATE">
    <group value="2001-01-01 00:00:00"><count>25</count></group>
    <group value="2001-01-02 00:00:00"><count>50</count></group>
    <group value="2001-01-03 00:00:00"><count>25</count></group>
  </groups>
  <groups sdata="AUTHOR">
    <group value="John"><count>50</count></group>
    <group value="Mike"><count>25</count></group>
    <group value="Steve"><count>25</count></group>
  </groups>
</ctx_result_set>
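
To call the result set interface from PL/SQL, the rs variable has to be a CLOB. The sketch below is one way to run a reduced version of the query and read the per-author counts back out of the returned XML. It assumes an index named docidx with an SDATA section called author, as in the example, and that the XML has the shape shown in the output above. The XMLTABLE parsing step is an addition for illustration, not part of the original slides.

```sql
SET SERVEROUTPUT ON

DECLARE
  rs CLOB;
BEGIN
  -- The result set is written into a temporary CLOB.
  dbms_lob.createtemporary(rs, TRUE, dbms_lob.session);

  ctx_query.result_set('docidx', 'oracle',
    '<ctx_result_set_descriptor>
       <count/>
       <group sdata="author"><count/></group>
     </ctx_result_set_descriptor>', rs);

  -- Walk the <group> elements for AUTHOR and print value and count.
  FOR g IN (
    SELECT x.author, x.hits
    FROM   XMLTABLE('/ctx_result_set/groups[@sdata="AUTHOR"]/group'
             PASSING XMLTYPE(rs)
             COLUMNS author VARCHAR2(40) PATH '@value',
                     hits   NUMBER       PATH 'count') x
  ) LOOP
    dbms_output.put_line(g.author || ': ' || g.hits);
  END LOOP;

  dbms_lob.freetemporary(rs);
END;
/
```

A single call returns both the total count and the grouped counts, which would otherwise need separate SELECT statements.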

Preview

Roadmap: merging Text and SES

Oracle Text: Full Control
• Fine-grained Index Options
• Data Storage Options
• Lexer Options
• Stoplists
• Use existing database
• RAC, Exadata

Secure Enterprise Search: Full Featured
• Built in database and mid-tier
• Crawlers for many sources
• Simple Query Interface
• End user GUI / API
• Embedded security

Coming Search Features
• Natural Language Processing enhancements
  • Ontology based classification
  • Question answering

• Automatic Partitioning
• Query load balancing

• Full support for faceted navigation (MVDATA sections)
• Functional completeness for Result Set Interface
• Result Iterator: streaming support
• Parallel Query

• Replication Support
  • Golden Gate / Logical Standby / Streams

• Operator improvements
  • NEAR2: best query in one operator
  • MNOT: mild not, e.g. YORK mnot NEW YORK
  • Nested near

• Substring index and query performance improvements

Coming Search Features - Continued
• Multiple enhancements to query performance
  • BIGIO leverages SecureFiles CLOBs
  • Automatic optimization of indexes with stage index
  • Two-level index: keep common search terms in memory

• Partition maintenance without reindexing
• Off-load filtering from database server
• Section-specific index options
  • Choose different options, e.g. language, stopwords, PRINTJOINS for each section

• Regular expression based stopwords
• Forward Index
  • Hugely improved performance for highlighting, snippets

• PDF Native Highlighting
• Unlimited SDATA, MDATA and Field Sections

The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
