Improving VIVO Search Results through Semantic Ranking. Anup Sawant, Deepak Konidena
VIVO Search through Release 1.2.1. Lucene keyword-based search. Score based on textual relevance. The importance of a node was not taken into consideration. Additional data describing a relationship was not searched.
Adding knowledge from semantic relationships. The VIVO 1.2 search index contained only restricted information about an individual. This led people to ask questions like: "I work for USDA, and when I search for 'USDA' my profile doesn't show up in the search results, and vice versa." "Information about my educational background, awards, the roles I assumed, etc. appears on my profile but doesn't show up in the search results when I search for those items individually or when I search for my name."
What does the semantic graph look like in the presence of context nodes?
Intermediate nodes were overlooked. Traditionally, the semantic relationships of an Individual, such as Roles, Educational Training, Awards, and Authorship, were not stored in the index. Individuals are connected to these properties through intermediate nodes called "Context Nodes", and the information lying beyond these context nodes was not captured.
Lucene field for an Individual.  And here's why
VIVO Search in 1.3. Transition from Lucene to SOLR, which provides a base for distributed search capabilities. Individuals are enriched with descriptions of their semantic relationships. Scores are enhanced by Individual connectivity. Improved precision and recall of search results.
Influence of PageRank. Introduced by Larry Page and Sergey Brin. Every node relies on every other node for its ranking. Intuition: a node's importance is calculated from its incoming connections and from the contribution of highly ranked nodes.
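
The intuition above can be illustrated with a minimal power-iteration sketch of PageRank. This is the textbook algorithm, not the ranking code used in VIVO; the damping factor of 0.85, the iteration count, and the tiny example graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each node to the list of nodes it points to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its rank on, split evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without out-links spreads its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three nodes point at "c", and "c" points back at "a".
example_links = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
ranks = pagerank(example_links)
```

In this example "c" ends up with the highest rank because of its incoming connections, and "a" outranks "b" and "d" because its single incoming link comes from a highly ranked node.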
Some parameters based on PageRank. β: the number of nodes connected to a particular node. Intuition: a node probably deserves a high rank because it is connected to a lot of individuals. Φ: the average of the β values of all the nodes to which a node is connected. Intuition: a node probably deserves a high rank because it is connected to some important individuals. Third parameter: the average strength of uniqueness of the properties through which a node is connected. Intuition: a node probably deserves a high rank based on the strength of its connections to other nodes.
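
The first two parameters can be sketched on a small undirected graph. The function names, the adjacency-set representation, and the example nodes are hypothetical. The third parameter is left out because the slides do not define how "strength of uniqueness" is measured.

```python
def connection_count(graph, node):
    """First parameter: number of nodes connected to `node`."""
    return len(graph[node])


def mean_neighbour_connection_count(graph, node):
    """Second parameter: average connection count over the neighbours of `node`."""
    neighbours = graph[node]
    if not neighbours:
        return 0.0
    return sum(connection_count(graph, n) for n in neighbours) / len(neighbours)


# Hypothetical graph: each node maps to the set of nodes it is connected to.
graph = {
    "alice": {"usda", "cornell"},
    "bob": {"cornell"},
    "usda": {"alice"},
    "cornell": {"alice", "bob"},
}

alice_count = connection_count(graph, "alice")                 # two neighbours
alice_mean = mean_neighbour_connection_count(graph, "alice")   # mean of 1 and 2
```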
Search Index Architecture. Indexing phase (multithreaded): SPARQL; enriching with semantic relations; overall connectivity of an Individual. Apache Solr. Searching phase: Dismax query handler; proper boosts; relevant documents.
Real-time Indexing. ADD/EDIT/DELETE of an Individual or its properties: the changes occur in real time and propagate beyond intermediate nodes. Indexing phase (multithreaded): SPARQL; enriching with semantic relations; overall connectivity of an Individual. Apache Solr. Searching phase: Dismax query handler; proper boosts; relevant documents.
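
As a rough illustration of the searching phase, the sketch below builds a request URL for Solr's DisMax query parser. The `defType`, `qf`, and `bf` parameters are standard DisMax parameters, but the base URL and the field names (`name`, `alltext`, `connectivity`) are hypothetical. The slides do not say whether the boosts were applied at index time or at query time, so this shows only one possible arrangement.

```python
from urllib.parse import urlencode, parse_qs


def build_dismax_query(user_query, base_url="http://localhost:8983/solr/select"):
    """Build a Solr DisMax request URL (field names are hypothetical)."""
    params = {
        "defType": "dismax",        # use the DisMax query parser
        "q": user_query,
        "qf": "name^2.0 alltext",   # query fields with per-field boosts
        "bf": "connectivity",       # boost function: a numeric field added to the score
        "rows": 10,
    }
    return base_url + "?" + urlencode(params)


url = build_dismax_query("scripps")
decoded = parse_qs(url.split("?", 1)[1])
```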
Cluster Analysis of Search Results.
Intuition: assume the search results from Release 1.2.1 and Release 1.3 are two different clusters.
Expectation: results from Release 1.3 should have their mean vector close to the query vector.
Results: text-to-vector conversion using the bag-of-words technique; Tanimoto distance measure used. Code at: https://github.com/anupsavvy/Cluster_Analysis
Query               Distance from mean vector of Release 1.2.1    Distance from mean vector of Release 1.3
Scripps             0.27286328362357193                           0.004277746256068157
Paulson James       0.009907336493786136                          0.004650133621323327
Genome Sequencing   9.185463752863598E-4                          8.154498815206635E-4
Kenny Paul          0.007610235640599918                          0.003984303949283425
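
The comparison above can be sketched as follows: compute the mean vector of each cluster and measure its Tanimoto distance from the query vector. This uses the common vector form of the Tanimoto distance, 1 - a·b / (|a|² + |b|² - a·b). It is an illustrative sketch, not the code from the linked repository, and the vectors are made up.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denominator == 0:
        return 0.0  # both vectors are all zeros
    return 1.0 - dot / denominator


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Made-up term-count vectors for a query and two result clusters.
query = [1, 0, 0]
old_cluster = [[0, 3, 1], [1, 2, 2]]
new_cluster = [[2, 0, 0], [1, 1, 0]]

old_distance = tanimoto_distance(query, mean_vector(old_cluster))
new_distance = tanimoto_distance(query, mean_vector(new_cluster))
```

A smaller distance means the cluster's mean vector lies closer to the query vector, which is the property the table reports for Release 1.3.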
Understanding how it happens: [Figure: search results R1, R2, R3, R4, R5, ..., each a document made up of fields such as name, location, description, research, and articles.]
Understanding how it happens: [Figure: term-count matrix with one row for the query Q and one for each result R1, R2, R3, ..., and one column per term: scripps, loring, jeanne, institute, cornell, florida, ...]
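
A bag-of-words conversion of this kind can be sketched in a few lines. The tokenizer (lowercase alphanumeric runs) and the example texts are assumptions. The slides do not describe the preprocessing that was actually used.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(texts):
    """Return (vocabulary, vectors): one term-count vector per input text."""
    vocabulary = sorted({token for text in texts for token in tokenize(text)})
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        # Counter returns 0 for terms that do not occur in this text.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Made-up query and result texts.
vocabulary, vectors = bag_of_words(["scripps institute", "Scripps, Scripps Florida"])
```

Each text becomes a row of counts over a shared vocabulary, so the query and every search result can be compared in the same vector space.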
Understanding how it happens: [Figure: vectors V1 and V2 on the axes institute, cornell, and loring, with the angle θ between them; the Euclidean distance and the cosine distance are marked.]
Understanding how it happens: [Figure: vectors V1 and V2 on the axes institute, cornell, and loring, with the angle θ between them.] The Euclidean distance increases while the cosine distance remains the same.
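
The point of these two slides can be checked numerically. The sketch below assumes the second figure shows one vector growing in length without changing direction, for example a longer document with the same mix of terms. The vectors themselves are made up.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between the tips of two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


v1 = [1, 2, 0]
v2 = [2, 1, 1]
v2_longer = [3 * x for x in v2]  # same direction as v2, three times the length

euclid_before = euclidean_distance(v1, v2)
euclid_after = euclidean_distance(v1, v2_longer)
cosine_before = cosine_distance(v1, v2)
cosine_after = cosine_distance(v1, v2_longer)
```

Scaling a vector changes how far apart the tips are but not the angle, so only the Euclidean distance reacts to document length.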
Query vector distance from Cluster Mean vectors
User testing for Relevance
Precision and Recall. Let X be the number of results that are both relevant and retrieved. Precision = X / (Total Retrieved). Recall = X / (Total Relevant).
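
A worked example of the two formulas, with X computed as the size of the overlap between the retrieved set and the relevant set. The document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)  # X: relevant documents that were retrieved
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here precision is 2/4 and recall is 2/3.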
Precision-Recall graphs based on User Analysis.
Cluster Analysis for Relevance
Precision-Recall graphs based on Cluster Analysis
Query vector distance from individual search result vectors
Experiments: SOLR. Search query expansion can be done using the SOLR synonym analyzer. Princeton WordNet (http://wordnet.princeton.edu/) is frequently used with the SOLR synonym analyzer. A gist by Bradford on GitHub (https://gist.github.com/562776) was used to convert the WordNet flat file into a SOLR-compatible synonyms file. Pros: high recall. Documents can be matched to well-known acronyms and to words not present in the SOLR index. For instance, a query that has "fl" as one of its terms would retrieve documents related to "Florida" as well. Cons: documents matching just the synonym part of the query could be ranked higher.
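
For reference, a Solr synonyms file lists equivalent terms as comma-separated groups, one group per line. The sketch below writes such lines from in-memory synonym groups. It is not the linked gist, and the input format (a list of term lists) is an assumption made for illustration.

```python
def to_solr_synonym_lines(groups):
    """Turn groups of equivalent terms into Solr synonyms-file lines."""
    lines = []
    for group in groups:
        # Normalise case, drop blanks and duplicates, and keep a stable order.
        terms = sorted({term.strip().lower() for term in group if term.strip()})
        if len(terms) > 1:  # a single term has nothing to expand to
            lines.append(", ".join(terms))
    return lines


# Made-up synonym groups.
synonym_lines = to_solr_synonym_lines([["FL", "Florida"], ["solo"]])
```

With a line such as `fl, florida` in the synonyms file, a query term "fl" can be expanded to match documents containing "florida".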
Experiments: SOLR (cont.). A degree of spelling-correction-like behavior could be achieved through the SOLR phonetic analyzer, which uses Apache Commons Codec for its phonetic implementations. Pros: high recall. It helps in handling spelling mistakes in the search query. For instance, a query like "scrips" would be matched to the similar-sounding word "scripps", which is actually present in the index. A misspelled name like "Polex Frank" in the query could be matched to the correct name "Polleux Franck". Cons: the number of results matched purely on phonetics could decrease the precision of the engine.
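
To show why phonetic matching tolerates these misspellings, here is a small sketch of American Soundex, one of the encoders that Apache Commons Codec provides. The slides do not say which encoder was configured, so choosing Soundex here is an assumption. This is a plain Python reimplementation for illustration, not the Commons Codec code.

```python
# Letter groups that share a Soundex digit; vowels, h, w, and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same digit
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated digit counts again
    return (encoded + "000")[:4]


same_code = soundex("scrips") == soundex("scripps")
```

Both spellings from the slide reduce to the same code, as do "Polex" and "Polleux", and "Frank" and "Franck". The same property is what lowers precision: unrelated words that happen to sound alike also collapse to one code.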
Experiments: Ontology provides a good base for factoid questioning. Properties of Individuals give a direct reference to the information. Natural language techniques and machine learning algorithms could help us understand the search query better. A query like "What is Brian Lowe's email ID?" should probably return just the email ID on top, and a query like "Who are the co-authors of Brian Lowe?" should return just the list of co-authors of Brian Lowe. We can train an algorithm to recognize the type of question or search query that has been issued. The Cognitive Computation Group at the University of Illinois at Urbana-Champaign provides a corpus of tagged questions that can be used as a training set: http://cogcomp.cs.illinois.edu/page/resources/data
Experiments: Ontology provides a good base for factoid questioning (cont.). Once the question type is determined, we could grammatically parse the question using the Stanford Lexparser: http://nlp.stanford.edu/software/lex-parser.shtml The question type helps us know whether we should look for a datatype property or an object property. Lexparser will help us form a SPARQL query. [Figure: the search query goes to Kmeans/SVM, trained on the corpora, which yields the question type, and to the Stanford Lexparser, which yields the terms; the question type and the terms together form the SPARQL query.]
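
The last step of this pipeline, turning a question type and the extracted terms into a SPARQL query, might look like the sketch below. It assumes the classifier and the parser have already produced the question type and the person's name. The property URIs under `example.org` are hypothetical placeholders, not properties of the VIVO ontology.

```python
# Hypothetical mapping from question type to the property that holds the answer.
PROPERTY_FOR_QUESTION_TYPE = {
    "email": "http://example.org/ontology#email",            # datatype property
    "coauthor": "http://example.org/ontology#coAuthorWith",  # object property
}


def build_sparql(question_type, person_name):
    """Build a SPARQL query asking for one property of a named person."""
    property_uri = PROPERTY_FOR_QUESTION_TYPE[question_type]
    # Escape backslashes and quotes so the name is a valid SPARQL string literal.
    escaped = person_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?answer WHERE {\n"
        f'  ?person rdfs:label "{escaped}" .\n'
        f"  ?person <{property_uri}> ?answer .\n"
        "}"
    )


email_query = build_sparql("email", "Brian Lowe")
```

The question type selects the property, and the parsed terms fill in the subject, so a factoid question can be answered with a single value instead of a ranked list of documents.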
Summary. Transition from Lucene to SOLR. Additional information about semantic relationships and interconnectivity in the index. More relevant results and better ranking compared to VIVO 1.2.1. Improvements in indexing time due to multithreading.
Team Work

