Don't Scrape, Glean! Tom Morris
Scraping sucks.
def lastlogin
  # Find the 193px-wide table cell, take the tenth <br />-separated chunk,
  # and keep its last ten characters (a dd/mm/yyyy date).
  date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  # Reorder to yyyy-mm-dd.
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end
Hpricot for Last login date on MySpace.
try :   lastlogin = self.soup.findAll( True , { &quot;width&quot; :  &quot;193&quot; })[ 0 ].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string   loginregex  =  re.compile(  r &quot; [0-9] / [0-9] +/ [0-9]* &quot;)   loginregex_inst  =  loginregex.search(lastlogin)   if  loginregex_inst  is   not None :   self.lastlogin  =  loginregex_inst.group()   except :   pass pass pass pass pass pass pass pass
Taken from a Python/BeautifulSoup library.
(The Ruby is prettier, but who's counting?)
getElementsByClassName(foo)[0].children
It's an edge case. MySpace's HTML is worse than average.
But it is an ugly recipe for mental turmoil.
The alternative?
flickr.getPhotos()
And you get back nice XML or JSON (or even SOAP!)
But D.R.Y.! APIs break that principle.
This is the data equivalent of the accessible version.
Enter GRDDL.
GRDDL defines a transformation process for XHTML → RDF.
XHTML? That's what the spec says.
HTML 4 works too. Tidy!
RDF? Yes. Trust me. It's not evil.
GRDDL can work like a data stylesheet on top of your HTML.
You simply use HTML (or XML) in the normal way...
...and define how the data is transformed.
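A rough sketch of that "define" step, seen from the consuming side. In GRDDL, an XHTML page usually opts in with the data-view profile on its head element and then names its transformations with link rel="transformation". The page content, the stylesheet URL and the function name `find_transformations` below are invented for illustration. The sketch only discovers the stylesheets; it does not run them.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Profile URI that marks an XHTML page as carrying GRDDL transformation links.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

# Hypothetical page: the stylesheet URL is made up for this example.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/glean-nsfw.xsl" />
  </head>
  <body><p>Hello</p></body>
</html>"""


def find_transformations(xhtml_text):
    """Return the hrefs of all rel="transformation" links in a GRDDL-enabled page."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The profile attribute is a space-separated list of URIs.
    profiles = (head.get("profile") or "").split()
    if GRDDL_PROFILE not in profiles:
        return []
    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # rel is also space-separated, so check membership rather than equality.
        rels = (link.get("rel") or "").split()
        if "transformation" in rels and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


if __name__ == "__main__":
    print(find_transformations(PAGE))
```

A real agent would then fetch each stylesheet and apply it to the page to get RDF out. That step needs an XSLT processor, so it is left out here.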
You can even use it as a bridge for existing APIs and services.
Could even be used for other formats than RDF. Atom?
Simple example: Not Safe For Work
<a href="http://tubgirl.com" class="nsfw">
I can write that. I can't write xFolk by hand.
Is nsfw a good class name? No.
Do I care? No.
The data layer becomes separated like CSS is from HTML.
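To make the separation concrete, here is a minimal sketch of the gleaning step for the nsfw class, written in Python instead of the XSLT a GRDDL transformation would normally use. The vocabulary URI `http://example.org/vocab#NotSafeForWork`, the sample markup and the names `NsfwGleaner`, `glean` and `to_ntriples` are assumptions made up for this example, not part of any published vocabulary or of the demo.

```python
from html.parser import HTMLParser

# rdf:type is the standard RDF typing predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical class URI, invented for this sketch.
NSFW_CLASS = "http://example.org/vocab#NotSafeForWork"

# Hypothetical markup: one link marked nsfw, one plain link.
SAMPLE = (
    '<p><a href="http://example.com/a" class="nsfw">a</a> '
    '<a href="http://example.com/b">b</a></p>'
)


class NsfwGleaner(HTMLParser):
    """Collect one (subject, predicate, object) triple per nsfw-classed link."""

    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # class is a space-separated list, so "nsfw external" also matches.
        classes = (attr.get("class") or "").split()
        href = attr.get("href")
        if href and "nsfw" in classes:
            self.triples.append((href.strip(), RDF_TYPE, NSFW_CLASS))


def glean(html_text):
    """Return the triples gleaned from an HTML string."""
    gleaner = NsfwGleaner()
    gleaner.feed(html_text)
    gleaner.close()
    return gleaner.triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(glean(SAMPLE)))
```

The author of the page only writes the class attribute. Everything about what that class means as data lives in the gleaner, much as presentation lives in a stylesheet.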
That's the theory. Now for the demo.
irc.freenode.net #swig #swhack
getsemantic.com [email_address]
[email_address] http://tommorris.org