SET EXPANSION ON NAMED ENTITIES
GROUP# 38
PROJECT# 3
ANKIT CHOUDHARY(201206570)
LOVLEAN ARORA(201305590)
SAKSHAM MAHESHWARI(201130184)
AMAN JAIN(201101132)
What is Set Expansion?
 In simple terms, set expansion means determining the set to
which the given named entities belong
 For example, Sachin Tendulkar and Rahul Dravid belong to the set of
Indian cricket team players
 Definition:
Set expansion refers to expanding a given partial set of objects
into a more complete set. In set expansion, the user issues a
query consisting of a small number of seeds x1, x2, ..., xk
(we assume at least three valid seeds are given), where
each xi is a member of some target set Si. The answer to the query
is a listing of other probable elements of Si.
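The definition above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory table `KNOWN_SETS` of candidate sets (illustrative data only, not part of the system described in these slides) and returns the other members of every set that contains all the seeds. A real system has to infer the target set from data instead of looking it up.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge base: set name -> members (illustrative only).
KNOWN_SETS: Dict[str, Set[str]] = {
    "Indian cricketers": {
        "Sachin Tendulkar", "Rahul Dravid", "Sourav Ganguly", "VVS Laxman",
    },
    "Diseases": {"Malaria", "Cholera", "Typhoid"},
}


def expand(seeds: List[str], known_sets: Dict[str, Set[str]]) -> List[str]:
    """Return the other members of every known set that contains all seeds."""
    seed_set = set(seeds)
    expansion: Set[str] = set()
    for members in known_sets.values():
        if seed_set <= members:  # every seed belongs to this candidate set
            expansion |= members - seed_set
    return sorted(expansion)


# Three seeds, matching the assumption stated in the definition.
print(expand(["Sachin Tendulkar", "Rahul Dravid", "Sourav Ganguly"], KNOWN_SETS))
# prints ['VVS Laxman']
```

Seeds that do not share any known set produce an empty expansion in this sketch.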
Why Set Expansion ?
 With the rapid growth of data and service providers, users' needs
have shifted from detailed queries to simple ones
 Users now want the desired results quickly, using only a few words
as the query
 A user who wants a list of Indian cricket players can simply
pass Sachin Tendulkar and Rahul Dravid as input and get a list of
cricketers through set expansion
 Ex:
 Input : {Sachin Tendulkar, Rahul Dravid}
 Output : {Sourav Ganguly, VVS Laxman, Sachin Tendulkar,
Rahul Dravid}
Related Work

 Our system works on Wikipedia data; Wikipedia currently has no such
feature
 Wikipedia provides only title-based search
 Google Sets:
 Google Sets has been used for a number of purposes in the research
community, including deriving features for named-entity recognition and
evaluating question answering systems.
 Shortcomings: Google Sets is a proprietary system that may change at
any time
 SEAL (Set Expander for Any Language): Exploits the semi-structured
nature of web pages to find the seeds and the wrappers around them. The
wrappers are then used to find other related entities
 Other systems, such as Boo!Wa!, which uses Web wrapper technologies to
extract and rank entities iteratively, are also in this space
Approach
 Our work is divided into two parts:
1. Indexing
2. Searching
We also apply external tools, such as a POS (Part of Speech)
tagger, to the final retrieved document names to refine
our results and restrict them to named entities
Indexing
 To prepare the index, we apply tokenization, stop word
removal, stemming, and diacritics normalization
 We focus on the following fields provided by the Wiki data to get
our results
 Titles, Categories, Infobox, Body Text, External
References (in decreasing order of their
weights)
and build primary and secondary indexes on
them
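The indexing steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the numeric values in `FIELD_WEIGHTS` are invented (the slides give only the ordering of the fields), `naive_stem` is a crude stand-in for a real stemmer such as Porter's, the stop word list is a tiny sample, and `DOCS` is hypothetical data.

```python
import re
import unicodedata
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the"}  # tiny sample list

# Assumed weights: only the decreasing order comes from the slides.
FIELD_WEIGHTS = {
    "title": 5.0, "category": 4.0, "infobox": 3.0, "body": 2.0, "external": 1.0,
}


def strip_diacritics(text):
    """Diacritics normalization: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", strip_diacritics(text).lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Inverted index: term -> {doc_id: accumulated field weight}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in analyze(text):
                index[term][doc_id] += FIELD_WEIGHTS[field]
    return index


# Hypothetical documents, for illustration only.
DOCS = {
    "Lagaan": {"title": "Lagaan", "category": "Indian films"},
    "Talaash": {"title": "Talaash", "category": "Indian films", "body": "Film"},
}

index = build_index(DOCS)
print(analyze("The Pokémon Films"))
print(dict(index["film"]))
```

A term that occurs in a heavier field contributes more to a document's score, which is one simple way to honor the field ordering at query time.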
Searching
 Document Fetcher
 Retrieves the top 10 relevant documents for each seed
 Attribute Classifier
 Crawls each document for its Category, Infobox/Taxobox, and
introductory text
 Ranker
 Ranks the set of documents corresponding to the attributes, giving the
highest weight to Category, followed by Infobox and then Text
 POS Tagger
 Keeps only the named entities that belong to the set thus obtained
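The ranking idea above can be sketched in Python. This is only one possible reading of the pipeline, under stated assumptions: the values in `ATTRIBUTE_WEIGHTS` are invented (the slides give only the order Category, Infobox, Text), `ARTICLES` is a hypothetical hand-built corpus in which each seed maps directly to one article, and the document fetching and POS tagging steps are left out.

```python
from collections import Counter

# Assumed weights: only the order Category > Infobox > Text is from the slides.
ATTRIBUTE_WEIGHTS = {"category": 3.0, "infobox": 2.0, "text": 1.0}

# Hypothetical corpus: article title -> attribute values per attribute type.
ARTICLES = {
    "Lagaan": {
        "category": {"Indian films", "Cricket films"},
        "infobox": {"starring: Aamir Khan"},
        "text": {"film", "cricket"},
    },
    "Talaash": {
        "category": {"Indian films"},
        "infobox": {"starring: Aamir Khan"},
        "text": {"film", "thriller"},
    },
    "3 Idiots": {
        "category": {"Indian films"},
        "infobox": {"starring: Aamir Khan"},
        "text": {"film", "comedy"},
    },
    "Luck by Chance": {
        "category": {"Indian films"},
        "infobox": {"starring: Farhan Akhtar"},
        "text": {"film", "drama"},
    },
    "Malaria": {"category": {"Diseases"}, "infobox": set(), "text": {"disease"}},
}


def seed_attributes(seeds, articles):
    """Count, per attribute type, how many seed articles carry each value."""
    counts = {kind: Counter() for kind in ATTRIBUTE_WEIGHTS}
    for seed in seeds:
        for kind in ATTRIBUTE_WEIGHTS:
            counts[kind].update(articles[seed][kind])
    return counts


def rank_candidates(seeds, articles):
    """Score non-seed articles by weighted attribute overlap with the seeds."""
    counts = seed_attributes(seeds, articles)
    scores = {}
    for title, attrs in articles.items():
        if title in seeds:
            continue
        score = sum(
            ATTRIBUTE_WEIGHTS[kind] * counts[kind][value]
            for kind in ATTRIBUTE_WEIGHTS
            for value in attrs[kind]
        )
        if score > 0:
            scores[title] = score
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))


print(rank_candidates(["Lagaan", "Talaash"], ARTICLES))
# prints ['3 Idiots', 'Luck by Chance']
```

In this toy data, "3 Idiots" outranks "Luck by Chance" because it shares an infobox value with both seeds in addition to the category, and "Malaria" is dropped because it shares nothing.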
Complete Architecture
Tests
 Input: Lagaan talaash
 Results:
 3 idiots, sarfarosh champion (2003 film), lagaan, p.k.
(film), afsana pyaar ka, delhi belly (film), dhoom 3, jo
jeeta wohi sikandar, nation awakes, ghajini (2008 film),
ready (2011 film), welcome (2007 film), luck by chance,
elements trilogy
Applications
 Set expansion on Wiki data itself
 General knowledge
 For example, if you want a list of diseases but
know only a few, such as malaria and cholera, you
can give them as input and get a variety of
diseases in the results
 Comparisons between named entities
 Search result suggestions on Wiki, like Google's
and many more to come...
Conclusion
 In this project, our goal was to expand sets of
named entities, and we think we have been quite successful
 Our results may not be up to the mark
in some cases, but the system covers the most general results as
expected
 There is a lot of scope for future work on this project, and we
plan to get our hands dirty
 We can try to handle the whole Wiki data more efficiently and,
God knows, maybe our tool will be used by billions :)
References
 http://en.wikipedia.org/wiki/Collaborative_filtering
 https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm2007.pdf
 http://aclweb.org/anthology//P/P09/P091050.pdf
 http://su2010-projekt.googlecode.com/svn-history/r115/trunk/literatura/melville2002content.pdf
 http://www.cs.uic.edu/~lzhang3/paper/set_expansion.pdf
Thank You
