Screen Scraping with Ruby
Jeremy Raines [email_address]
Pre-Reqs
- XPath: used for addressing elements in an XML doc (http://www.w3schools.com/XPath/default.asp)
- Ruby regular expressions: www.rubular.com, the fastest way to build rope-swinging regex skills in Ruby
Rubular: input a string and try out regexes on it
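A regex worked out in Rubular can be pasted straight into irb. The sketch below is only an illustration: the input string and the pattern are made up, not taken from the talk.

```ruby
# Hypothetical scraped string, the kind of thing you might paste into Rubular.
raw = "Final: Cubs 5, Mets 3"

# =~ returns the index of the first match, or nil when nothing matches.
has_score = !(raw =~ /\d+/).nil?

# scan collects every match; each match is an array of the two capture groups.
pairs = raw.scan(/([A-Z][a-z]+) (\d+)/)

# Turn the captured digit strings into integers, keyed by team name.
scores = pairs.map { |team, runs| [team, runs.to_i] }.to_h

puts scores.inspect
```

"Final:" is skipped because the pattern requires a space and a digit right after the capitalized word.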
XPath basics
- //book selects all book nodes, no matter where they are in the document
- //bookstore/book selects all books that are direct children of bookstore
- //bookstore//book selects all books that are descendants of bookstore, no matter how deep in the tree
- //book[1] selects the first book element under each parent
- //book[@category='fiction'] uses @ to select an attribute, in this case specifically fiction books
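These expressions can be tried out in Ruby with REXML. This is a minimal sketch; the sample document, its titles, and the `titles` helper are invented for illustration.

```ruby
require "rexml/document"

# Hypothetical document: two books directly under bookstore, one nested
# deeper on a shelf, and one outside the bookstore entirely.
xml = <<~XML
  <library>
    <bookstore>
      <book category="fiction"><title>Dune</title></book>
      <book category="cooking"><title>Salt</title></book>
      <shelf>
        <book category="fiction"><title>Emma</title></book>
      </shelf>
    </bookstore>
    <book category="travel"><title>Atlas</title></book>
  </library>
XML

doc = REXML::Document.new(xml)

# Helper (hypothetical): run an XPath and return the title of each matched book.
titles = lambda do |xpath|
  REXML::XPath.match(doc, xpath).map { |book| book.elements["title"].text }
end

all_books   = titles.call("//book")                      # every book anywhere
direct      = titles.call("//bookstore/book")            # direct children only
descendants = titles.call("//bookstore//book")           # any depth under bookstore
first_child = titles.call("//bookstore/book[1]")         # first book child of bookstore
fiction     = titles.call("//book[@category='fiction']") # attribute filter

puts all_books.inspect, direct.inspect, descendants.inspect,
     first_child.inspect, fiction.inspect
```

The nested "Emma" entry shows the difference between `/` and `//`: it appears in `descendants` but not in `direct`.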
Not-so-secret weapon: Firebug
- Lets you find the XPath of any element in a web page.
- Use Inspect Element on a representative item.
- Focus on classes and think about loops; this will produce easier-to-read XPaths.
Example
Example (cont'd)
- //sbScoreboxScores//sbScoreboxTeamAway
- //sbScoreboxScores//sbScoreboxTotal
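The names in these paths look like CSS class names rather than element names. Assuming that is the case, a class-based version might be written with attribute predicates and a loop, roughly as below. The markup, the team names, and the `sbScoreboxTeamHome` class are all hypothetical; the original page is not shown in the slides.

```ruby
require "rexml/document"

# Hypothetical scoreboard markup; only the first three class names below
# appear in the slides, everything else is assumed.
markup = <<~HTML
  <div class="sbScoreboxScores">
    <table>
      <tr class="sbScoreboxTeamAway">
        <td>Cubs</td><td class="sbScoreboxTotal">5</td>
      </tr>
      <tr class="sbScoreboxTeamHome">
        <td>Mets</td><td class="sbScoreboxTotal">3</td>
      </tr>
    </table>
  </div>
HTML

doc = REXML::Document.new(markup)

# Address the away-team row by class, then take its first cell.
away_row = REXML::XPath.first(
  doc, "//div[@class='sbScoreboxScores']//tr[@class='sbScoreboxTeamAway']"
)
away_team = away_row.elements["td"].text

# Loop over every total cell inside the scorebox and collect it as an integer.
totals = []
REXML::XPath.each(
  doc, "//div[@class='sbScoreboxScores']//td[@class='sbScoreboxTotal']"
) do |cell|
  totals << cell.text.to_i
end

puts away_team
puts totals.inspect
```

Note that `@class='...'` only matches when the attribute holds exactly that one class; elements with several classes would need a `contains()` test instead.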
How do we get at the data in these elements using Ruby?
- open-uri: built-in library for opening pages
- REXML: built-in XPath parser; fast, good for straightforward tasks
- Hpricot: popular gem; more powerful. I usually use it in conjunction with:
- WWW::Mechanize: for getting at data behind forms; requires Hpricot
- ScRUBYt!: powerful, high-level abstraction, magic
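The two built-in pieces can be combined in a few lines. This is a sketch under assumptions: the `fetch` and `texts_at` helpers, the example URL, and the sample feed are hypothetical, and `URI.open` needs Ruby 2.5 or later (older code called `open(url)` directly). REXML only accepts well-formed XML, so it suits feeds and clean XHTML; messy real-world HTML usually needs a more forgiving parser.

```ruby
require "open-uri"
require "rexml/document"

# Fetch a page body as a string. This makes a network call, so it is
# defined here but not run in the example.
def fetch(url)
  URI.open(url, &:read)
end

# Return the text of every node matched by xpath in a well-formed XML string.
def texts_at(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).map(&:text)
end

# Stand-in for fetch("http://example.com/feed.xml") (hypothetical URL).
sample = "<feed><entry><title>First</title></entry>" \
         "<entry><title>Second</title></entry></feed>"

entry_titles = texts_at(sample, "//entry/title")
puts entry_titles
```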
My Process
1. Find a good source.
2. Use Firebug to determine the XPaths of the elements you want to scrape.
3. Play around in irb to see what kind of output you can get from addressing these elements with the tools above.
4. Refine the XPaths.
5. Play around in Rubular to find regexes that will clean up your output (remove whitespace, etc.). Ruby's string methods help with this last part.
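The clean-up in the last step often comes down to a short chain of `gsub` and `strip` calls. A minimal sketch, using a made-up string of the kind an XPath query might return in irb:

```ruby
# Hypothetical raw text with stray newlines, tabs, and an HTML entity.
raw = "\n\t  Screen scraping   is very\n iterative.&nbsp; \n"

cleaned = raw
  .gsub("&nbsp;", " ")   # replace a leftover HTML entity with a plain space
  .gsub(/\s+/, " ")      # collapse any run of whitespace into one space
  .strip                 # drop leading and trailing whitespace

puts cleaned
```

Each step carries a comment saying what it removes, which keeps the chain readable as more regexes pile up.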
Example:  Scraping Quotes from your Tumblr
Warning
- Screen scraping is very iterative and involves a lot of trial and error.
- Make sure you comment a lot as you go along.
- Even with clear XPaths, it's best to describe what your code is doing with all the regexes and string functions that inevitably build up.
References
- http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/ (inspiration for this presentation; describes the Firebug + Hpricot method)
- http://code.whytheluckystiff.net/hpricot/wiki
- http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/
- http://www.xml.com/pub/a/2005/11/09/rexml-processing-xml-in-ruby.html
- http://mechanize.rubyforge.org/mechanize/files/EXAMPLES_txt.html (WWW::Mechanize examples)
- For an example that requires Mechanize for logging in, check out my code for logging into LinkedIn and scraping all your 2nd-degree contacts: jeremyraines.com/linkedinscraper

