This document describes a project on set expansion using Wikipedia data. The project aims to take a small set of named entities as input and output a more complete set that the inputs belong to. The approach involves indexing Wikipedia data by fields like titles, categories, infoboxes and body text, then searching relevant documents for the input seeds and ranking and filtering the results to extract the expanded set. Tests show the system can accurately expand sets for movie titles and diseases. Potential applications include general knowledge queries and search suggestion.
SetExpansion on Named Entities
1. SET EXPANSION ON NAMED ENTITIES
GROUP# 38
PROJECT# 3
ANKIT CHOUDHARY(201206570)
LOVLEAN ARORA(201305590)
SAKSHAM MAHESHWARI(201130184)
AMAN JAIN(201101132)
2. What is Set Expansion?
In simple terms, set expansion means determining the set to which
the given named entities belong
For ex: Sachin Tendulkar and Rahul Dravid belong to the set of
Indian Cricket Team players
Definition:
Set expansion refers to expanding a given partial set of objects
into a more complete set. In set expansion, the user issues a
query consisting of a small number of seeds x1,x2,...xk
(we assume we will be given at least three valid seeds) where
each xi is a member of some target set Si. The answer to the query
is a listing of other probable elements of Si.
3. Why Set Expansion?
With the huge growth in data/service providers, the needs
of users have shifted from detailed queries to simple queries
Users now want the desired results quickly, with only a few words as
the query
If a user wants a list of Indian cricket players, he can just
pass Sachin Tendulkar, Rahul Dravid as input and will get a list of
cricketers from the set expansion technique
Ex:
Input : {Sachin Tendulkar, Rahul Dravid}
Output : {Sourav Ganguly, VVS Laxman, Sachin Tendulkar,
Rahul Dravid}
4. Related Work
Our system works on Wikipedia data, and Wikipedia currently has no such
feature
Wikipedia provides only title-based search
Google Sets:
Google Sets has been used for a number of purposes in research
community, including deriving features for named-entity recognition and
evaluation of question answering systems.
Shortcomings: Google Sets is a proprietary service that may be changed at any
time
SEAL (Set Expander for Any Language): Exploits the semi-structured nature of
web pages to find the seeds and the wrappers around them. The wrappers are then
used to search for other related entities
Other systems exist as well, such as Boo!Wa!, which is based on Web wrapper
technologies and extracts and ranks entities iteratively
5. Approach
Our entire work is divided into two parts:
1. Indexing
2. Searching
We also apply external tools such as a POS (Part of Speech) Tagger
to the final retrieved document names to refine
our results and constrain them to named entities
6. Indexing
For index preparation, we went through steps such as
tokenization, stop word removal, stemming, and
diacritics normalization
We focus on the following fields provided by the Wiki data to get
our results
Titles, Categories, Infobox, Body Text, External
References (in decreasing order of their
weights)
and build primary and secondary indexes on
them
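
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop word list, the crude suffix stripper (standing in for a real stemmer), the function names and the numeric field weights are all assumptions. The slides give only the ordering of the field weights, not their values.

```python
import unicodedata
from collections import defaultdict

# Assumed values: the slides state only the ordering of the fields.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is"}
FIELD_WEIGHTS = {"title": 5.0, "category": 4.0, "infobox": 3.0,
                 "body": 2.0, "external": 1.0}

def strip_diacritics(text):
    """Diacritics normalization: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def naive_stem(token):
    """Crude suffix stripper; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words, stem."""
    cleaned = strip_diacritics(text).lower()
    tokens = "".join(ch if ch.isalnum() else " " for ch in cleaned).split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """docs: {doc_id: {field: text}} -> {term: {doc_id: weighted score}}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 0.0)
            for term in tokenize(text):
                index[term][doc_id] += weight
    return index
```

A term that appears in a title therefore contributes more to a document's score than the same term in the body text, which mirrors the field ordering listed above.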
7. Searching
Document Fetcher
Retrieves the top 10 relevant documents for each seed
Attribute Classifier
Crawls each document based on Category, Infobox/Taxobox and
Introductory Text
Ranker
Ranks the set of documents corresponding to the attributes, with the highest
weightage given to Category, followed by Infobox and then Text.
POS Tagger
Retrieves only the named entities that belong to the set thus obtained
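
A rough sketch of the ranking idea, assuming each document has already been reduced to its category, infobox and introductory-text attributes. The corpus layout, the function names and the numeric weights are hypothetical; the slides state only that Category outweighs Infobox, which outweighs Text. The POS tagging step is left out, and this sketch returns only new candidates, not the seeds themselves.

```python
from collections import Counter

# Assumed weights: only the ordering Category > Infobox > Text is given.
ATTRIBUTE_WEIGHTS = {"category": 3.0, "infobox": 2.0, "text": 1.0}

def shared_attributes(seed_docs):
    """Return the (kind, value) attributes present in every seed document."""
    sets = [{(kind, v) for kind, values in doc.items() for v in values}
            for doc in seed_docs]
    return set.intersection(*sets) if sets else set()

def expand(seeds, corpus, top_k=10):
    """Rank non-seed documents by the weighted attributes they share with all seeds."""
    seed_docs = [corpus[s] for s in seeds if s in corpus]
    common = shared_attributes(seed_docs)
    scores = Counter()
    for title, doc in corpus.items():
        if title in seeds:
            continue  # seeds are already known members
        for kind, values in doc.items():
            for v in values:
                if (kind, v) in common:
                    scores[title] += ATTRIBUTE_WEIGHTS.get(kind, 0.0)
    return [title for title, score in scores.most_common(top_k) if score > 0]

# Toy corpus, invented for illustration only.
corpus = {
    "Sachin Tendulkar": {"category": ["Indian cricketers"], "infobox": ["cricketer"], "text": ["cricket"]},
    "Rahul Dravid": {"category": ["Indian cricketers"], "infobox": ["cricketer"], "text": ["cricket"]},
    "Sourav Ganguly": {"category": ["Indian cricketers"], "infobox": ["cricketer"], "text": ["cricket"]},
    "Ricky Ponting": {"category": ["Australian cricketers"], "infobox": ["cricketer"], "text": ["cricket"]},
    "Malaria": {"category": ["Diseases"], "infobox": ["disease"], "text": ["parasite"]},
}
print(expand(["Sachin Tendulkar", "Rahul Dravid"], corpus))
```

In this toy corpus, a document sharing the seeds' category ranks above one that shares only the infobox type and text, and a document sharing nothing is dropped.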
10. Applications
Set Expansion on Wiki data itself
General Knowledge
For ex: if you want a list of diseases and you
know only a few diseases like malaria and cholera, you
just give them as inputs and you will get a variety of
diseases in the results
Comparisons between named entities
Search result suggestions on Wiki, like Google's
and many more to come...
11. Conclusion
In this project, we were supposed to expand a set of
named entities and we think we were quite successful at it
Yes, it's possible that our results may not be up to the mark
in some cases, but they cover the most general results as
expected
There is a lot of scope for this project in the future and we are
planning to get our hands dirty
We can try to handle the whole Wiki data more efficiently and
God knows, maybe our tool will be used by billions :)