
SET EXPANSION ON NAMED ENTITIES
GROUP# 38
PROJECT# 3
ANKIT CHOUDHARY(201206570)
LOVLEAN ARORA(201305590)
SAKSHAM MAHESHWARI(201130184)
AMAN JAIN(201101132)

What is Set Expansion?
● In simple terms, set expansion means determining the set to which
the given named entities belong
● For example: Sachin Tendulkar and Rahul Dravid both belong to the set of
Indian Cricket Team players
● Definition:
Set expansion refers to expanding a given partial set of objects
into a more complete set. In set expansion, the user issues a
query consisting of a small number of seeds x1, x2, ..., xk
(we assume at least three valid seeds are given), where
each xi is a member of some target set S. The answer to the query
is a listing of other probable elements of S.
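
The definition above can be illustrated with a minimal Python sketch. It is a toy and not the system described in these slides: `expand_set` and `known_sets` are hypothetical names, and the sketch assumes the candidate sets are already available as plain Python sets.

```python
from collections import Counter

def expand_set(seeds, known_sets, top_k=10):
    """Return up to top_k non-seed items from the sets that contain every seed."""
    seed_set = set(seeds)
    votes = Counter()
    for members in known_sets.values():
        # Only a set that contains all seeds is a plausible target set.
        if seed_set <= members:
            votes.update(members - seed_set)
    # Highest vote count first; ties are broken alphabetically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked][:top_k]

# Toy data reusing the entity names from the slides.
known_sets = {
    "indian_cricketers": {"Sachin Tendulkar", "Rahul Dravid",
                          "Sourav Ganguly", "VVS Laxman"},
    "diseases": {"Malaria", "Cholera"},
}
print(expand_set(["Sachin Tendulkar", "Rahul Dravid"], known_sets))
```

A real system does not have the target sets up front; it has to infer them from the data, which is what the indexing and searching stages are for.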

Why Set Expansion?
● With the rapid growth of data and service providers, users' needs
have shifted from detailed queries to simple ones
● Users now want the desired results quickly from a query of just a few
words
● If a user wants a list of Indian cricket players, they can simply
pass Sachin Tendulkar and Rahul Dravid as input and get a list of
cricketers back from the set expansion technique
● Example:
– Input: {Sachin Tendulkar, Rahul Dravid}
– Output: {Sourav Ganguly, VVS Laxman, Sachin Tendulkar,
Rahul Dravid}

Related Work
● Our system works on Wikipedia data; Wikipedia currently has no such
feature
● Wikipedia only provides title-based search
● Google Sets:
– Google Sets has been used for a number of purposes in the research
community, including deriving features for named-entity recognition and
evaluating question answering systems.
– Shortcoming: Google Sets is a proprietary method that may be changed at any
time
● SEAL (Set Expander for Any Language): exploits the semi-structured nature of
web pages to find the seeds and the wrappers around them. The wrappers are then
used to find other related entities
● Other systems, such as Boo!Wa!, are based on Web wrapper technologies that
extract and rank entities iteratively

Approach
● Our work is divided into two parts:
1. Indexing
2. Searching
We also apply external tools, such as a POS (Part of Speech) tagger,
to the final retrieved document names to refine
our results and restrict them to named entities

Indexing
● To prepare the index, we applied standard preprocessing steps:
tokenization, stop word removal, stemming, and
diacritics normalization
● We focus on the following fields provided by the Wiki data to get
our results:
– Titles, Categories, Infobox, Body Text, External
References (in decreasing order of their
weights)
● We build primary and secondary indexes on
these fields
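
The preprocessing and field weighting described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list, the numeric values in `FIELD_WEIGHTS`, and the `naive_stem` suffix stripper (a stand-in for a real stemmer) are all hypothetical; the slides give only the relative order of the field weights.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical stop words and weights; only the field ORDER comes from the slides.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}
FIELD_WEIGHTS = {"title": 5.0, "category": 4.0, "infobox": 3.0,
                 "body": 2.0, "external": 1.0}

def strip_diacritics(text):
    """Diacritics normalization: decompose characters, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(token):
    """Placeholder stemmer: strips a few suffixes (not a real stemming algorithm)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def analyze(text):
    """Tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", strip_diacritics(text).lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(pages):
    """Map term -> {doc_id: accumulated field weight} for a dict of pages."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in pages.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 0.0)
            for term in analyze(text):
                index[term][doc_id] += weight
    return index

pages = {"Lagaan": {"title": "Lagaan", "category": "Indian films"}}
index = build_index(pages)
print(dict(index["film"]))
```

Accumulating the field weight into each posting means a term in a title counts for more than the same term in body text, which matches the ordering given above.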

Searching
● Document Fetcher
– Retrieves the top 10 relevant documents for each seed
● Attribute Classifier
– Crawls each document by Category, Infobox/Taxobox and
Introductory Text
● Ranker
– Ranks the set of documents corresponding to the attributes, giving the highest
weight to Category, followed by Infobox and then Text.
● POS Tagger
– Retains only the named entities that belong to the set thus obtained
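
The ranking step above can be sketched as follows, under stated assumptions: each document is represented as a dict of attribute sets, a candidate is scored by the attributes it shares with the seed documents, and the numeric values in `ATTRIBUTE_WEIGHTS` are hypothetical (the slides give only the order Category > Infobox > Text). The document fetching and POS tagging stages are left out.

```python
from collections import defaultdict

# Hypothetical weights; only their relative order comes from the slides.
ATTRIBUTE_WEIGHTS = {"category": 3.0, "infobox": 2.0, "text": 1.0}

def rank_candidates(seed_docs, corpus, top_k=10):
    """Rank non-seed documents by weighted attribute overlap with the seeds."""
    # Count how many seed documents carry each (attribute kind, value) pair.
    seed_attrs = defaultdict(int)
    for doc in seed_docs:
        for kind, values in corpus[doc].items():
            for value in values:
                seed_attrs[(kind, value)] += 1

    scores = defaultdict(float)
    for title, attrs in corpus.items():
        if title in seed_docs:
            continue
        for kind, values in attrs.items():
            for value in values:
                hits = seed_attrs.get((kind, value), 0)
                if hits:
                    scores[title] += ATTRIBUTE_WEIGHTS.get(kind, 0.0) * hits

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_k]]

# Toy corpus with made-up attribute values, for illustration only.
corpus = {
    "Seed A": {"category": {"cat-1"}, "infobox": {"box-1"}},
    "Seed B": {"category": {"cat-1"}, "infobox": {"box-1"}},
    "Doc C": {"category": {"cat-1"}, "infobox": {"box-1"}},
    "Doc D": {"infobox": {"box-1"}},
    "Doc E": {"category": {"cat-2"}},
}
print(rank_candidates(["Seed A", "Seed B"], corpus))
```

Documents that share no attribute with any seed never receive a score, so they are dropped from the result instead of being ranked last.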

Tests
● Input: Lagaan talaash
● Results:
● 3 idiots, sarfarosh champion (2003 film), lagaan, p.k.
(film), afsana pyaar ka, delhi belly (film), dhoom 3, jo
jeeta wohi sikandar, nation awakes, ghajini (2008 film),
ready (2011 film), welcome (2007 film), luck by chance,
elements trilogy

Applications
● Set expansion on the Wiki data itself
● General knowledge
– For example: if you want a list of diseases but
know only a few, such as malaria and cholera, you
can give them as input and get a variety of
diseases in the results
● Comparisons between named entities
● Search result suggestions on Wiki, similar to Google's
and many more to come...

Conclusion
● In this project, we set out to expand sets of
named entities, and we think we were quite successful
● Yes, it is possible that our results are not up to the mark
in some cases, but the system covers the most general results as
expected
● There is a lot of scope for future work on this project, and we are
planning to get our hands dirty
● We can try to handle the whole Wiki data more efficiently and,
God knows, maybe our tool will be used by billions :)

References
● http://en.wikipedia.org/wiki/Collaborative_filtering
● https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm2007.pdf
● http://aclweb.org/anthology//P/P09/P091050.pdf
● http://su2010-projekt.googlecode.com/svn-history/r115/trunk/literatura/melville2002content.pdf
● http://www.cs.uic.edu/~lzhang3/paper/set_expansion.pdf

