
Geographic Information Retrieval (GIR)
Behrooz Rasuli

Iranian Research Institute for Information Science and Technology
rasuli9@gmail.com



Address information is essential to people's
daily lives. People often need to look up
addresses of unfamiliar locations on the Web
and then use map services to mark the
location and get directions. Although both
address information and map services are
available online, they are not well integrated.







General search engines are widely used to
retrieve Web pages.
Specialized search engines are dedicated to
finding either particular types of resources or
Web pages based on specific criteria, e.g.
language or geographic location.
People use search engines to find Web pages
of local services and events around them or
in a particular area.





Geographic information is the data pertaining to the location of
geographical entities together with their spatial
dimensions.
Location could be defined as “a place on the
Internet where an Internet resource, such as a
Web page, is stored”.



Source Geography

◦ physical location of hosts
◦ signal processing and network-based techniques



Target Geography

◦ uses elements contained in the page to deduce
locations (place names, postal addresses, and
phone numbers)
◦ Challenge: involves evidence extraction, semantic
analysis, and interpretation, in order to link Web
pages to geographic locations





Geographic Information Retrieval (GIR) is an
applied research field that involves indexing,
searching, retrieving, and browsing
georeferenced information sources, and
designing systems to execute these tasks
effectively and efficiently.
Like IR, GIR includes indexing, storage and
ranking.





Pattern extraction from raw text is an established
technique. For example, M. Hearst (1990s)
developed an approach for discovering
lexico-syntactic patterns for hypernyms.
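
To make the idea of a lexico-syntactic pattern concrete, here is a minimal sketch of a single Hearst-style pattern ("X such as A, B and C"). The regular expression, the function name `hearst_such_as`, and the sample sentence are assumptions made for illustration; a real system would match over noun-phrase chunks rather than raw capitalized words.

```python
import re

# Simplifying assumptions: the hypernym is one word, and each hyponym is a
# run of capitalized words (e.g. "Brisbane", "New York").
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
SEPARATOR = r"(?:, and |, or |, | and | or )"
PATTERN = re.compile(rf"(\w+) such as ({NAME}(?:{SEPARATOR}{NAME})*)")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(SEPARATOR, match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs

pairs = hearst_such_as("He visited cities such as Brisbane, Sydney and Melbourne.")
print(pairs)
```

The same shape of pattern reappears later in the deck, where keyword patterns such as “city+of+X” are used to classify place names.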



Pattern-Based Methods;
◦ Named Entity Recognition (NER)
◦ Gazetteer approach (Web-a-Where);
◦ Pattern-based method



Ontology-Based Methods;
◦ OnLocus



Machine Learning Methods;






Few geographic search engines have been
developed commercially; Google Maps and
Yahoo Local are notable among them.
Challenges: the ambiguous and dynamic nature of
location names, various addressing styles, lack of
geographic information, and multiple
locations related to a Web resource.







extract proper names from texts and
documents
an algorithm that distinguishes six classes
of location names: CITY, REGION,
COUNTRY, ISLAND, RIVER, and MOUNTAIN
the method is time-consuming and is not useful
for real-time search

tagging individual place names (geotagger);
◦ finds and disambiguates geographic names
(assigning a canonical taxonomy node to
each phrase in the text)
1. Spotting;
2. Disambiguation;
3. Focus determination;
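
The three geotagger steps listed above can be sketched as follows. This is a toy illustration under stated assumptions, not the actual algorithm of any system named in this deck: the gazetteer contents, the taxonomy path format, and the "prefer the country supported by unambiguous names" heuristic are all invented for the example.

```python
from collections import Counter

# Toy gazetteer (assumption): place name -> candidate taxonomy nodes.
GAZETTEER = {
    "Brisbane": ["Australia/Queensland/Brisbane"],
    "Toowong": ["Australia/Queensland/Brisbane/Toowong"],
    "Victoria": ["Canada/British Columbia/Victoria", "Australia/Victoria"],
}

def spot(text):
    """Spotting: return gazetteer names that occur in the text (naive substring test)."""
    return [name for name in GAZETTEER if name in text]

def disambiguate(names):
    """Disambiguation: pick one taxonomy node per name, preferring the country
    supported by unambiguous names on the same page; ties keep the first candidate."""
    countries = Counter(
        GAZETTEER[n][0].split("/")[0] for n in names if len(GAZETTEER[n]) == 1
    )
    return {
        name: max(GAZETTEER[name], key=lambda path: countries[path.split("/")[0]])
        for name in names
    }

def focus(resolved):
    """Focus determination: the most frequent country among the resolved nodes."""
    counts = Counter(path.split("/")[0] for path in resolved.values())
    return counts.most_common(1)[0][0] if counts else None

resolved = disambiguate(spot("Trains run from Victoria to Brisbane and Toowong."))
print(resolved["Victoria"], "| focus:", focus(resolved))
```

Here "Victoria" is resolved to the Australian node only because the other names on the page are unambiguously Australian; with no such context the first candidate would be kept.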
crawling the Web, storing the resulting pages
and indexing their contents









A geographic search engine must be
able to find related addresses and location
names and assign them to Web pages.
Current address extraction techniques
require large gazetteers, which are
expensive and unavailable for many countries.
Web pages use different markup styles, e.g. HTML, XML and
DOM.
Natural language processing models are not
able to extract all addresses and location
names from Web page contents.

divide an address into its semantic components
automatic

pattern-based model which uses HTML
and visual segmentations to improve
address extraction on Web pages
new location names
much human effort

large scale gazetteers










The proposed address extraction system
consists of five components:
HTML Pre-Processor,
Parser,
Knowledge Searcher,
Decision Maker,
and Knowledge Accumulator










HTML Pre-Processor:
analyzes HTML tags and code;
converts HTML files to XML (by employing the
VIPS Demo software);
analyzes and traverses the XML in depth
to obtain content information;
sorts the content nodes in a linear sequence together
with their node numbers;
builds a node index






Parser:
tries to find all candidate phrases (potential
addresses) in a node;
divides a potential address into its
components;
each segment obtained in this step is
used as the default search unit of the Database
Searcher




Knowledge Searcher:
itemizes the elements of a potential address;
finds all possible readings of a potential address
and forms them into a list of possible patterns
in three steps:
◦ Standardizing Word Formats (different spellings, abbreviations,
synonyms)
◦ Knowledge-Base Place Name Matching (separates elements at a
finer-grained level)
◦ Ambiguity Eliminating (tries to match place names)
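
The first of the three steps above, standardizing word formats, might look like the following sketch. The variant table `STANDARD_FORMS` and the function `standardize` are hypothetical; a real table would be much larger and context-aware (for instance, "St" can also mean "Saint").

```python
import re

# Hypothetical variant table (assumption): surface form -> standard form.
STANDARD_FORMS = {
    "st": "street",
    "str": "street",
    "rd": "road",
    "ave": "avenue",
    "no": "number",
    "qld": "queensland",
}

def standardize(segment):
    """Lowercase a segment, split it into word and number tokens, and map
    known abbreviations and spelling variants to one standard form."""
    tokens = re.findall(r"[a-z]+|\d+", segment.lower())
    return [STANDARD_FORMS.get(token, token) for token in tokens]

print(standardize("No. 10, William St."))
```

After this step, "William St." and "William Street" produce the same token sequence, so a single knowledge-base lookup covers both spellings.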




Decision Maker:
decides whether a candidate phrase is an address or not
by matching it against address patterns already stored in
a database;
◦ Eliminating ambiguities and conflicts of place names (syntactic and semantic:
geo/non-geo and geo/geo);
◦ Itemizing each potential address into its elements;
◦ Adding the missing parts of an address based on a location tree
wherever possible

For example, the address “No. 10, William Street, Toowong,
Queensland” is expanded to “No. 10, William
Street, Toowong, Brisbane, Queensland, Australia”.
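
The address completion illustrated by the Toowong example can be sketched with a small location tree. The `PARENT` mapping and the function `complete_address` are assumptions for illustration; the sketch ignores place names with several possible parents, which the ambiguity-elimination step would have to resolve first.

```python
# Hypothetical location tree (assumption): place -> parent place.
PARENT = {
    "Toowong": "Brisbane",
    "Brisbane": "Queensland",
    "Queensland": "Australia",
}

def complete_address(address):
    """Insert missing ancestors from the location tree after each known place."""
    parts = [part.strip() for part in address.split(",")]
    completed = []
    for i, part in enumerate(parts):
        completed.append(part)
        next_part = parts[i + 1] if i + 1 < len(parts) else None
        parent = PARENT.get(part)
        # Walk up the tree until we reach the element the address already states.
        while parent is not None and parent != next_part:
            completed.append(parent)
            parent = PARENT.get(parent)
    return ", ".join(completed)

print(complete_address("No. 10, William Street, Toowong, Queensland"))
```

Elements that are not in the tree (the house number and street here) pass through unchanged.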



Knowledge Accumulator: the last component of the system; it works in
two ways:
◦ Location Accumulation;
◦ Address Pattern Accumulation





Example: there are 9 lemmas in the KB; 3 of them have
multiple identities (Victoria, Churchill, Howard
Avenue).
The following algorithm shows how place
names are detected in phrases.

Inputs:
◦ PW: a candidate phrase
◦ W_i: the ith word in PW
◦ f: any syntactic format of W_i
◦ KB: Knowledge-Base
◦ C_i: Result Collection

PW(pre_word, W_i) {
    if ((pre_word + f) = a place name found in KB)
        add (pre_word + f) to C_i;
    if ((pre_word + f) = part of a name in KB) {
        pre_word = pre_word + f;
        PW(pre_word, W_i+1); // try next word in PW
    }
}

SyntacticAE(Potential) {
    current_word = first word in Potential;
    C = NULL; // initialize C
    while (current_word != EOF) {
        C = SAE(C, current_word); // add longest result in C
        current_word = next new word in Potential;
    }
}
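
One possible reading of the detection procedure above, sketched in Python: starting at each word, the phrase is extended word by word while the prefix is still part of some KB name, every full match is collected, and the longest one is kept. The toy `KB` (in which "victoria park" is an invented entry), the lowercase normalization standing in for the syntactic format f, and both function names are assumptions.

```python
# Toy knowledge base (assumption); "victoria park" is invented for the example.
KB = {"victoria", "victoria park", "churchill", "howard avenue"}

def match_from(words, start, kb):
    """Collect every KB name that begins at words[start], shortest first."""
    found = []
    prefix = ""
    for word in words[start:]:
        prefix = (prefix + " " + word).strip()
        if prefix in kb:
            found.append(prefix)
        # Stop extending once the prefix is not part of any longer KB name.
        if not any(name.startswith(prefix + " ") for name in kb):
            break
    return found

def detect_place_names(phrase, kb=KB):
    """Scan the phrase left to right, keeping the longest match at each position."""
    words = phrase.lower().replace(",", " ").split()
    results, i = [], 0
    while i < len(words):
        found = match_from(words, i, kb)
        if found:
            longest = found[-1]
            results.append(longest)
            i += len(longest.split())  # continue after the matched name
        else:
            i += 1
    return results

print(detect_place_names("10 Howard Avenue, Victoria Park"))
```

Keeping the longest match is what prevents "Victoria Park" from being reported as the shorter name "Victoria".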



Inconsistencies between the knowledge
accumulated in the KB and the information
extracted from the Web:
◦ misspelling and synonymy
◦ incompleteness of the KB
• Keeping the Conflict
• Removing Meaningless Conflict Element
• Finding Synonymous Sub-tree
• Merging Synonymous Sub-Tree



Direct references
◦ place names, complete postal addresses



Indirect references
◦ postal codes and telephone area codes, or
expressions that indicate relationships to other
places that are directly referenced (for instance,
“The hotel is two blocks from Times Square”)



A three-phase process is proposed for
recognizing geographic evidence in Web
pages:
◦ Extraction (selecting relevant Web content),
◦ Recognition (corresponds to isolating references to
places embedded in text and includes dealing with
ambiguity),
◦ Location (obtains locations from the place
descriptions previously recognized, using
positioning data from gazetteers or from spatial
databases)





an extraction ontology is able to identify
objects and relationships;
ontology must describe rules for identifying
elements within its domain that are present in
Web pages



recognition of terms and expressions as place
names;
◦ candidates are compared against a gazetteer, e.g. Alexandria or GeoNames



try to determine an actual location,
either from a gazetteer or by performing a process known
as geocoding



Location of direct references



◦ matching and locating



Location of indirect references
◦ Formal: establish a correspondence between a code and the area it
serves (supported by spatial databases)
◦ Informal: natural language interpretation is required
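
Locating formal indirect references amounts to a table lookup from a code to the area it serves. The sketch below assumes two tiny in-memory tables in place of a spatial database; the table entries, the regular expressions (four-digit postal codes, two-digit area codes in parentheses), and the function name are illustrative assumptions, not verified reference data.

```python
import re

# Illustrative lookup tables (assumption); a real system would query a spatial database.
POSTCODE_AREAS = {"4066": "Toowong, Queensland"}
AREA_CODE_REGIONS = {"07": "Queensland"}

def locate_formal_references(text):
    """Return (kind, code, area) for every code in the text that the tables know."""
    locations = []
    for code in re.findall(r"\b\d{4}\b", text):
        if code in POSTCODE_AREAS:
            locations.append(("postal code", code, POSTCODE_AREAS[code]))
    for code in re.findall(r"\((\d{2})\)", text):
        if code in AREA_CODE_REGIONS:
            locations.append(("area code", code, AREA_CODE_REGIONS[code]))
    return locations

print(locate_formal_references("Call (07) 5550 1234 or write to PO Box 12, QLD 4066"))
```

Four-digit runs that are not in the table (such as parts of the phone number) are ignored, which is why the lookup, not the pattern, decides what counts as a location. Informal references such as “two blocks from Times Square” cannot be handled this way.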






apply text mining procedures to the Internet
in order to classify places into different
location types (e.g., Maebashi is a CITY,
Honshu is an ISLAND) and to determine, for a
given place name, where the place is (e.g.,
Maebashi is in Japan, Honshu is in the Pacific
Ocean);
acquire exhaustive fine-grained gazetteers
automatically and thus avoid hand-coding;
distinguish 6 location types (CITY, REGION,
COUNTRY, ISLAND, RIVER, MOUNTAIN)




The dataset consists of 1260 location names.
For each class, a set of patterns was constructed:
◦ patterns have the form “KEYWORD+of+X” and
“X+KEYWORD” (AltaVista counts)





Each class has from 3 (ISLAND) up to 10
(MOUNTAIN) different keywords
Keywords and patterns were selected
manually



For example, the class CITY uses 4
keywords (“city”, “town”, “mayor”, “streets”)
and 7 corresponding patterns (“city+of+X”,
“X+city”, “town+of+X”, “mayor+of+X”,
“X+mayor”, “streets+of+X”, and “X+streets”).
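
A minimal sketch of how such keyword patterns could be turned into a class decision. The CITY patterns are the seven listed above; the ISLAND patterns, the hit counts in `FAKE_COUNTS`, and the rule "pick the class with the highest summed count" are assumptions made for illustration (the original work may feed the counts to a learned classifier instead). The `hit_count` argument stands in for a search-engine count query.

```python
CLASS_PATTERNS = {
    "CITY": ["city of X", "X city", "town of X", "mayor of X",
             "X mayor", "streets of X", "X streets"],
    "ISLAND": ["island of X", "X island"],  # hypothetical patterns
}

def classify(name, hit_count):
    """Sum the hit counts of each class's instantiated patterns and
    return the best-scoring class together with all scores."""
    scores = {
        label: sum(hit_count(pattern.replace("X", name)) for pattern in patterns)
        for label, patterns in CLASS_PATTERNS.items()
    }
    return max(scores, key=scores.get), scores

# Invented counts standing in for search-engine results (assumption).
FAKE_COUNTS = {"city of Maebashi": 120, "Maebashi city": 300, "island of Maebashi": 2}

label, scores = classify("Maebashi", lambda query: FAKE_COUNTS.get(query, 0))
print(label, scores)
```

Passing the count function as a parameter keeps the sketch independent of any particular search engine.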

Thank You!
Presented in Information Retrieval Course, under supervision of Dr.
Saeid Asadi

