GIR
Behrooz Rasuli

Iranian Research Inst. For Information Science & Technol.
rasuli9@gmail.com


Address information is essential to people's
daily life. People often need to look up the
addresses of unfamiliar locations on the Web
and then use map services to mark the
location and get directions. Although both
address information and map services are
available online, they are not well integrated.






General search engines are widely used to
retrieve Web pages.
Specialized search engines are dedicated to
finding either particular types of resources or
Web pages that match specific criteria, e.g.
language or geographic location.
People use search engines to find Web pages
for local services and events around them or
in a particular area.




Geographic information is the data pertaining to the
location of geographical entities together with their
spatial dimensions.
Location could be defined as a place on the
Internet where an Internet resource, such as a
Web page, is stored.


Source Geography

 physical location of hosts
 signal processing and network-based techniques



Target Geography

 uses elements contained in the page to deduce
locations (place names, postal addresses, and
phone numbers)
 Challenge: linking Web pages to geographic
locations involves evidence extraction, semantic
analysis, and interpretation
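
As a rough illustration of target-geography evidence extraction, the sketch below pulls phone numbers and postal codes out of page text with regular expressions. The two formats (a US-style five-digit postal code and a "(NNN) NNN-NNNN" phone number), the function name and the sample sentence are assumptions made for the example, not part of the original slides.

```python
import re

# Assumed, deliberately narrow formats; real pages need locale-specific rules.
POSTAL_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

def extract_evidence(text):
    """Return candidate geographic evidence found in page text."""
    phones = PHONE_RE.findall(text)
    # Blank out phone numbers first so their digits are not misread as postal codes.
    remainder = PHONE_RE.sub(" ", text)
    postal_codes = POSTAL_RE.findall(remainder)
    return {"phones": phones, "postal_codes": postal_codes}

# Made-up sample text (fictional address and number).
sample = "Visit us at 12 Example Street, Springfield, 62704. Call (217) 555-0123."
evidence = extract_evidence(sample)
```

Extraction is only the first step; the evidence still has to be interpreted and linked to a location.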




Geographic Information Retrieval (GIR) is an
applied research field that involves indexing,
searching, retrieving, and browsing
georeferenced information sources, and
designing systems to execute these tasks
effectively and efficiently.
Like IR, GIR includes indexing, storage and
ranking.




Pattern extraction from raw text is an
established technique. For example, M. Hearst
(1990s) developed an approach that uses
lexico-syntactic patterns to discover hypernyms.
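
A minimal sketch of one such lexico-syntactic pattern, "HYPERNYM such as HYPONYM, HYPONYM and HYPONYM", is shown below. The regular expression, the function name and the sample sentence are illustrative assumptions, and the sketch only handles single-word terms.

```python
import re

# One Hearst-style pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Simplification: every term is a single word.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in re.split(r", | and | or ", match.group(2)):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs

pairs = hearst_such_as("He visited cities such as Brisbane, Sydney and Melbourne.")
```

For geographic text, pairs like these suggest that a name belongs to a location type (here, that the three names are cities).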


Pattern-Based Methods;
 Named Entity Recognition (NER)
 Gazetteer approach (Web-a-Where);
 Pattern-based method



Ontology-Based Methods;
 OnLocus



Machine Learning Methods;
Geographic Information Retrieval (GIR)




Few geographic search engines have been
developed commercially; among them, Google
Maps and Yahoo Local are notable.
Challenges: the ambiguous and dynamic nature
of location names, various addressing styles,
lack of geographic information, and multiple
locations related to a single Web resource






extracts proper names from texts and
documents
an algorithm that distinguishes six classes
of location names: CITY, REGION,
COUNTRY, ISLAND, RIVER, and MOUNTAIN
the method is time-consuming and is not useful
for real-time search
tagging individual place names (geotagger);
 finds and disambiguates geographic names
(assigning a canonical taxonomy node to
each phrase in the text)
1. Spotting;
2. Disambiguation;
3. Focus determination;
crawling the Web, storing the resulting pages
and indexing their contents
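
The three steps named above (spotting, disambiguation, focus determination) can be sketched on a toy scale as follows. This is not the Web-a-Where implementation: the gazetteer contents, the "most supported country wins" disambiguation rule and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy gazetteer: name -> candidate taxonomy nodes (place/.../country).
GAZETTEER = {
    "brisbane": ["Brisbane/Queensland/Australia"],
    "toowong": ["Toowong/Queensland/Australia"],
    "victoria": ["Victoria/Australia", "Victoria/British Columbia/Canada"],
}

def country(node):
    return node.split("/")[-1]

def spot(text):
    """Spotting: return gazetteer names that occur as words in the text."""
    words = [w.strip(".,;").lower() for w in text.split()]
    return [w for w in words if w in GAZETTEER]

def disambiguate(names):
    """Pick, per name, the candidate whose country is best supported by unambiguous names."""
    support = Counter(country(GAZETTEER[n][0]) for n in names if len(GAZETTEER[n]) == 1)
    return [max(GAZETTEER[n], key=lambda node: support[country(node)]) for n in names]

def focus(nodes):
    """Focus determination: the most frequent country among the resolved nodes."""
    return Counter(country(node) for node in nodes).most_common(1)[0][0]

text = "Our Victoria office moved from Toowong to Brisbane."
nodes = disambiguate(spot(text))
page_focus = focus(nodes)
```

In this example the ambiguous name "Victoria" is resolved to the Australian state because the two unambiguous names on the page are both in Australia.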







A geographic search engine must be
able to find related addresses and location
names and assign them to Web pages.
Current address extraction techniques
require large gazetteers, which are
expensive and unavailable for many countries.
Web pages use different markup styles, e.g.
HTML, XML and DOM.
Natural language processing models are not
able to extract all addresses and location
names from Web page contents.
divides an address into its semantic components
automatic

a pattern-based model that uses HTML
and visual segmentation to improve
address extraction on Web pages
new location names
much human effort

large scale gazetteers









The proposed address extraction system
consists of five components:
HTML Pre-Processor,
Parser,
Knowledge Searcher,
Decision Maker,
and Knowledge Accumulator









analyzes HTML tags and code;
converts HTML files to XML (by employing the
VIPS Demo software);
traverses the XML in depth
to obtain content information;
sorts the content nodes into a linear sequence
together with their node numbers;
builds a node index
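
A rough sketch of the traversal and node-index steps is given below. It does not use the VIPS Demo software; instead it swaps in Python's standard `html.parser` to walk the markup directly, and the class name, the tuple layout and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser

class NodeIndexer(HTMLParser):
    """Collect text nodes in document order, each with a running node number."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.nodes = []   # (node_number, tag_path, text)

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags such as <br>.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((len(self.nodes), "/".join(self.path), text))

html = "<html><body><h1>Contact</h1><p>No. 10, William Street, Toowong</p></body></html>"
indexer = NodeIndexer()
indexer.feed(html)
indexer.close()
# The node index maps each node number to its text content.
node_index = {number: text for number, _path, text in indexer.nodes}
```

Each entry keeps its tag path, so later components can tell a heading node from a paragraph node when looking for candidate addresses.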






It tries to find all candidate phrases (potential
addresses) in a node;
divides a potential address into its
components;
each segment obtained in this step is
used as the default search unit of the Database
Searcher;



itemizes the elements of a potential address;
It finds all possible readings of a potential address
and forms them into a list of possible patterns
in three steps:
 Standardizing Word Formats (different spellings, abbreviations,
synonyms)
 Knowledge-Base Place Name Matching (separates elements into
finer-grained parts)
 Ambiguity Eliminating (tries to match place names)
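
The first of these steps, standardizing word formats, might look like the sketch below. The abbreviation table and the function name are hypothetical; a real system would draw such mappings from its knowledge base.

```python
# Hypothetical normalization table (abbreviation or variant -> canonical form).
CANONICAL = {
    "st": "street",
    "st.": "street",
    "rd": "road",
    "ave": "avenue",
    "qld": "queensland",
}

def standardize(phrase):
    """Lower-case a candidate phrase, split it into words and expand known variants."""
    words = phrase.lower().replace(",", " ").split()
    return [CANONICAL.get(word, word) for word in words]

tokens = standardize("10 William St, Toowong, QLD")
```

After this step, "St" and "Street" compare equal, so the later place-name matching only has to deal with one form per word.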



decides whether a candidate phrase is an address or not
by matching it with address patterns already stored in
a database;
 Resolving ambiguities and conflicts of place names (syntactic and semantic:

geo/non-geo and geo/geo);

 Itemizing each potential address into its elements;
 Adding the missing parts to an address based on a location tree
wherever possible

For example, the address No. 10, William Street, Toowong,

Queensland will be modified as No. 10, William
Street, Toowong, Brisbane, Queensland, Australia
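
The completion step in this example can be sketched with a location tree stored as child-to-parent links. The tree contents follow the example above; the data structure and the function name are assumptions for illustration.

```python
# Hypothetical location tree stored as child -> parent links.
PARENT = {"Toowong": "Brisbane", "Brisbane": "Queensland", "Queensland": "Australia"}

def complete_address(parts):
    """Replace the tail of an address with the full ancestor chain of its
    most specific known place name. Assumes parts run from specific to general."""
    for i, part in enumerate(parts):
        if part in PARENT:
            chain = [part]
            while chain[-1] in PARENT:
                chain.append(PARENT[chain[-1]])
            return parts[:i] + chain
    return parts  # no known place name: leave the address unchanged

completed = complete_address(["No. 10", "William Street", "Toowong", "Queensland"])
```

Joining `completed` with ", " gives the modified address from the example.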


the last component of the system; it accumulates
knowledge in two respects:
 Location Accumulation;
 Address Pattern Accumulation




there are 9 lemmas in KB; 3 lemmas have
multiple identities (Victoria, Churchill, Howard
Avenue);
The following algorithm shows how place
names are detected in phrases






Inputs
PW - a candidate phrase
W_i - the i-th word in PW
f_i - any syntactic format of W_i
KB - Knowledge-Base
C_i - Result Collection
PW(pre_word, W_i) {
    if ((pre_word + f_i) = a place name found in KB)
        add (pre_word + f_i) to C_i;
    if ((pre_word + f_i) = part of a name in KB) {
        pre_word = pre_word + f_i;
        PW(pre_word, W_i+1); // try the next word in PW
    }
}
SyntacticAE(Potential) {
    current_word = first word in Potential;
    C = NULL; // initialize C
    while (current_word != EOF) {
        C = SAE(C, current_word); // add the longest result to C
        current_word = next new word in Potential;
    }
}
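
One possible reading of the two routines above, as runnable Python, is sketched below. It keeps the longest knowledge-base name starting at each word and then continues after it. The knowledge-base entries, the prefix set and the function names are assumptions for illustration, and syntactic formats are reduced to lower-casing.

```python
# Hypothetical knowledge base of lower-cased place names.
KB = {"william street", "toowong", "howard avenue", "howard", "victoria"}
# Every leading word sequence of a KB name, used for the "part of a name" test.
PREFIXES = {
    " ".join(name.split()[:k])
    for name in KB
    for k in range(1, len(name.split()) + 1)
}

def match_from(words, start):
    """Return the longest KB name beginning at words[start], or None."""
    longest, phrase = None, ""
    for word in words[start:]:
        phrase = (phrase + " " + word).strip()
        if phrase in KB:
            longest = phrase      # a complete place name; keep looking for a longer one
        if phrase not in PREFIXES:
            break                 # no KB name continues this way
    return longest

def detect_place_names(potential):
    """Scan a potential address left to right, collecting the longest matches."""
    words = potential.lower().replace(",", " ").split()
    found, i = [], 0
    while i < len(words):
        name = match_from(words, i)
        if name:
            found.append(name)
            i += len(name.split())   # continue after the matched name
        else:
            i += 1
    return found

names = detect_place_names("No. 10, Howard Avenue, Toowong")
```

Here "howard" alone is a valid name, but the scan prefers the longer "howard avenue", which mirrors the "add longest result" comment in the pseudocode.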







inconsistencies between accumulated
knowledge in KB and extracted information
from the Web:
 misspelling and synonymy
 incompleteness of KB
Keeping the Conflict
Removing Meaningless Conflict Element
Finding Synonymous Sub-tree
Merging Synonymous Sub-Tree


Direct references
 place names, complete postal addresses



Indirect references
 postal codes and telephone area codes, or
expressions that indicate relationships to other
places that are directly referenced (for instance,
"The hotel is two blocks from Times Square")


propose a three-phase process for
recognizing geographic evidence in Web
pages:
 Extraction (selecting relevant Web content),
 Recognition (corresponds to isolating references to
places embedded in text and includes dealing with
ambiguity),
 Location (obtains locations from the place
descriptions previously recognized, using
positioning data from gazetteers or from spatial
databases)




an extraction ontology is able to identify
objects and relationships;
ontology must describe rules for identifying
elements within its domain that are present in
Web pages


recognition of terms and expressions as place
names;
 by comparison against a gazetteer, e.g. Alexandria and GeoNames


tries to determine an actual location
from a gazetteer or by performing a process known
as geocoding



Location of direct references



 matching and locating



Location of indirect references
 Formal

 establish a correspondence between a code and the area it
serves (supported by spatial databases)

 Informal

 natural language interpretation is required
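
The formal case can be sketched as a plain lookup from a code to the area it serves. In the sketch below a small dictionary stands in for the spatial database; its two entries, the regular expression and the function name are illustrative assumptions.

```python
import re

# Stand-in for a spatial database: telephone area code -> served area.
AREA_CODES = {"212": "New York City", "617": "Boston"}
AREA_CODE_RE = re.compile(r"\((\d{3})\)")

def locate_by_area_code(text):
    """Return the areas served by the known area codes that appear in the text."""
    return [AREA_CODES[code] for code in AREA_CODE_RE.findall(text) if code in AREA_CODES]

areas = locate_by_area_code("Call (212) 555-0100 or (999) 555-0101")
```

The informal case has no such table: a phrase like "two blocks from Times Square" has to be interpreted before any location can be assigned.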





apply text mining procedures to the Internet
in order to classify places into different
location types (e.g., Maebashi is a CITY,
Honshu is an ISLAND) and to determine, for a
given place name, where the place is (e.g.
Maebashi is in Japan, Honshu is in the Pacific
Ocean);
acquire exhaustive fine-grained gazetteers
automatically and thus avoid hand-coding;
distinguish 6 location types (CITY, REGION,
COUNTRY, ISLAND, RIVER, MOUNTAIN)



The dataset consists of 1260 location names.
For each class, a set of patterns was constructed:
 patterns have the form KEYWORD+of+X and
X+KEYWORD (AltaVista counts)





Each class has from 3 (ISLAND) up to 10
(MOUNTAIN) different keywords
Keywords and patterns were selected
manually


For example, the class CITY uses 4
keywords (city, town, mayor, streets)
and 7 corresponding patterns (city+of+X,
X+city, town+of+X, mayor+of+X,
X+mayor, streets+of+X, and X+streets)
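
A small sketch of how the CITY patterns could be instantiated for a candidate name and scored with search-engine hit counts follows. The slides do not say how the counts are combined, so the simple sum used here is an assumption, as are the function names and the made-up counts in the usage lines.

```python
# The 7 CITY patterns listed above; other classes would be added the same way.
PATTERNS = {
    "CITY": [
        "city of X", "X city", "town of X", "mayor of X",
        "X mayor", "streets of X", "X streets",
    ],
}

def instantiate(name, location_class):
    """Fill the placeholder X with a candidate name to get query phrases."""
    return [pattern.replace("X", name) for pattern in PATTERNS[location_class]]

def class_score(name, location_class, hit_count):
    """Sum the hit counts of all instantiated patterns (assumed scoring rule).
    hit_count is a caller-supplied function: query phrase -> number of hits."""
    return sum(hit_count(query) for query in instantiate(name, location_class))

queries = instantiate("Maebashi", "CITY")
# Made-up counts standing in for real search-engine results.
fake_counts = {"city of Maebashi": 120, "Maebashi city": 300}
score = class_score("Maebashi", "CITY", lambda query: fake_counts.get(query, 0))
```

Comparing such scores across the six classes would give one way to pick the most likely location type for a name.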
Thank You!
Presented in Information Retrieval Course, under supervision of Dr.
Saeid Asadi
