In today's web
Information Extraction
from the Web
Benjamin Habegger
University of Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205
Seminar on Information Extraction from the Web
ENSIAS, Rabat, Morocco - June 19, 2013
About Me
@b_habegger
http://www.linkedin.com/in/benjaminhabegger
benjamin.habegger@insa-lyon.fr
Where is the web today?
Web of humans
 Interlinked documents
 Social Web
 Web 2.0
 Crowd-sourcing
Web of machines
 REST / API
 Service Interaction
 Open Data
 Semantic Web
Somehow we're creating 2 webs
Web of humans
 HTML
 JavaScript
 CSS
Web of Data
 RDF
 REST
 SPARQL
There are some interactions
Open data still has some way to go
Data thrown on the web in its original format
 Not many standardized formats
 Not many standardized semantics
 Can be
 An Excel or CSV file
 A REST service
Still, Linked Open Data and
the Semantic Web are emerging
 Vocabularies
 FOAF
 Dublin Core
 Datasets
 DBpedia
 ...
But still, can't we dream a little?
Having (a little) smarter machines...
Shared web
Learning capabilities
Making our web robots smarter
could even help improve our web...
What does the following query give you today?
lyon informatique emploi
Do you see any jobs there?
Nope, just a list of pages which
contain lists of jobs...
There's still a long way to go...
but information extraction from the web
is a little step in making machines smarter
And there are many people
interested out there...
Freelancer.com search for web scraping
So where does information
extraction from the web fit in?
 Open Data
 Linked Data
 Semantic Web
 Information Extraction
 Machine Learning
 Pattern Mining
 Data Integration
 Standardized Vocabularies
 Web Scraping
And what is it about?
...
Data for humans
Data for machines
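A minimal sketch of the "data for humans" to "data for machines" step, assuming a tiny hand-written HTML fragment. The markup, the class names (job, title, city), the job entries, and the names JobExtractor and extract_jobs are all invented for illustration; they are not taken from the talk or from any real site. It only uses Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: markup, class names, and job entries are
# invented for illustration only.
PAGE = """
<ul>
  <li class="job"><span class="title">Python developer</span> <span class="city">Lyon</span></li>
  <li class="job"><span class="title">Data engineer</span> <span class="city">Villeurbanne</span></li>
</ul>
"""


class JobExtractor(HTMLParser):
    """Collect one dict per <li class="job">, keyed by the class of each <span>."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # class of the <span> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "job":
            # A new job entry starts: open a fresh record.
            self.records.append({})
        elif tag == "span" and self.records:
            # Remember which field the upcoming text belongs to.
            self._field = attrs.get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        # Keep only text that sits inside a labelled <span>.
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()


def extract_jobs(html):
    """Turn the human-oriented HTML into a list of machine-readable records."""
    parser = JobExtractor()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    for job in extract_jobs(PAGE):
        print(job)
```

The hand-written rules here are tied to one page layout; a different site would need different rules.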
How do we do that?
We'll see that after the break :)
http://www.slideshare.net/BenjaminHabegger/2013-06ensiasrabatiealg