際際滷

際際滷Share a Scribd company logo
Web Scraping with Python
Virendra Rajput,
Hacker @Markitty
Agenda
 What is scraping
 Why we scrape
 My experiments with web scraping
 How do we do it
 Tools to use
 Online demo
 Some more tools
 Ethics for scraping
converting unstructured documents
into structured information
scraping:
What is Web Scraping?
 Web scraping (web harvesting) is a software
technique of extracting information from
websites
 It focuses on transformation of unstructured
data on the web (typically HTML), into
structured data that can be stored and
analyzed
RSS is meta data and not
HTML replacement
Why we scrape?
 Web pages contain wealth of information (in
text form), designed mostly for human
consumption
 Static websites (legacy systems)
 Interfacing with 3rd party with no API access
 Websites are more important than APIs
 The data is already available (in the form of
web pages)
 No rate limiting
 Anonymous access
How search engines use it
My Experiments with Scraping
and more..!
IMDb API
Did you mean!
Facebook Bot for Brahma
Kumaris
Getting started!
Fetching the data
 Involves finding the endpoint - URL or URLs
 Sending HTTP requests to the server
 Using requests library:
import requests
data = requests.get(http://google.com/)
html = data.content
Processing (say no to Reg-ex)
 use reg-ex
 Avoid using reg-ex
 Reasons why not to use it:
1. Its fragile
2. Really hard to maintain
3. Improper HTML & Encoding handling
Use BeautifulSoup for parsing
 Provides simple methods to-
 search
 navigate
 select
 Deals with broken web-pages really well
 Auto-detects encoding
Philosophy-
You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.
Export the data
 Database (relational or non-relational)
 CSV
 JSON
 File (XML, YAML, etc.)
 API
Live example demo
Challenges
 External sites can change without warning
 Figuring out the frequency is difficult (TEST, and
test)
 Changes can break scrapers easily
 Bad HTTP status codes
 example: using 200 OK to signal an error
 cannot always trust your HTTP libraries default
behaviour
 Messy HTML markup
Mechanize
 Stateful web-browsing with
mechanize
 Fill up forms
 Follow links
 Handle cookies
 Browse history
 After Andy Lesters WWW:
Mechanize
Filling forms with Mechanize
Scrapy - a framework for web
scraping
 Uses XPath to select elements
 Interactive shell scripting
 Using Scrapy:
 define a model to store items
 create your spider to extract items
 write a Pipeline to store them
Conclusion
 Scrape wisely
 Do not steal
 Use cloud
 Share your scrapers scraperwiki.com
The End!
Virendra Rajput
http://virendra.me/
http://twitter.com/bkvirendra

More Related Content

What's hot (20)

Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
PromptCloud
WEB Scraping.pptx
WEB Scraping.pptxWEB Scraping.pptx
WEB Scraping.pptx
Shubham Jaybhaye
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
Kyle Banerjee
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
NishantKumar1179
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
Tommy Tavenner
CS8651 Internet Programming - Basics of HTML, HTML5, CSS
CS8651   Internet Programming - Basics of HTML, HTML5, CSSCS8651   Internet Programming - Basics of HTML, HTML5, CSS
CS8651 Internet Programming - Basics of HTML, HTML5, CSS
Vigneshkumar Ponnusamy
Machine Learning
Machine LearningMachine Learning
Machine Learning
Darshan Ambhaikar
Web content mining
Web content miningWeb content mining
Web content mining
Daminda Herath
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
MachinePulse
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
The Full Stack Web Development
The Full Stack Web DevelopmentThe Full Stack Web Development
The Full Stack Web Development
Sam Dias
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
butest
Lecture-1: Introduction to web engineering - course overview and grading scheme
Lecture-1: Introduction to web engineering - course overview and grading schemeLecture-1: Introduction to web engineering - course overview and grading scheme
Lecture-1: Introduction to web engineering - course overview and grading scheme
Mubashir Ali
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
Data science life cycle
Data science life cycleData science life cycle
Data science life cycle
Manoj Mishra
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
butest
Java Servlets
Java ServletsJava Servlets
Java Servlets
BG Java EE Course
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
Nowell Strite
Ajax ppt
Ajax pptAjax ppt
Ajax ppt
OECLIB Odisha Electronics Control Library
Artificial Intelligence with Python | Edureka
Artificial Intelligence with Python | EdurekaArtificial Intelligence with Python | Edureka
Artificial Intelligence with Python | Edureka
Edureka!
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
PromptCloud
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
Kyle Banerjee
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
NishantKumar1179
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
Tommy Tavenner
CS8651 Internet Programming - Basics of HTML, HTML5, CSS
CS8651   Internet Programming - Basics of HTML, HTML5, CSSCS8651   Internet Programming - Basics of HTML, HTML5, CSS
CS8651 Internet Programming - Basics of HTML, HTML5, CSS
Vigneshkumar Ponnusamy
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
MachinePulse
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
The Full Stack Web Development
The Full Stack Web DevelopmentThe Full Stack Web Development
The Full Stack Web Development
Sam Dias
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
butest
Lecture-1: Introduction to web engineering - course overview and grading scheme
Lecture-1: Introduction to web engineering - course overview and grading schemeLecture-1: Introduction to web engineering - course overview and grading scheme
Lecture-1: Introduction to web engineering - course overview and grading scheme
Mubashir Ali
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
nimmyjans4
Data science life cycle
Data science life cycleData science life cycle
Data science life cycle
Manoj Mishra
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
butest
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
Nowell Strite
Artificial Intelligence with Python | Edureka
Artificial Intelligence with Python | EdurekaArtificial Intelligence with Python | Edureka
Artificial Intelligence with Python | Edureka
Edureka!

Similar to Web scraping in python (20)

Getting started with Scrapy in Python
Getting started with Scrapy in PythonGetting started with Scrapy in Python
Getting started with Scrapy in Python
Viren Rajput
Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...
Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...
Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...
ThinkODC
Scrappy
ScrappyScrappy
Scrappy
Vishwas N
Web scraping
Web scrapingWeb scraping
Web scraping
Ashley Davis
Scraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHPScraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHP
Paul Redmond
Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018)
Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018) Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018)
Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018)
Melanie Phung
Python in Industry
Python in IndustryPython in Industry
Python in Industry
Dharmit Shah
Web stats
Web statsWeb stats
Web stats
Andrew Stevens
Web mining
Web miningWeb mining
Web mining
Renusoni8
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
vinay arora
An EyeWitness View into your Network
An EyeWitness View into your NetworkAn EyeWitness View into your Network
An EyeWitness View into your Network
CTruncer
Integrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptx
Integrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptxIntegrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptx
Integrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptx
Begum Kaya
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
Robert Dempsey
What You Need to Know About Technical SEO
What You Need to Know About Technical SEOWhat You Need to Know About Technical SEO
What You Need to Know About Technical SEO
Niki Mosier
Presentation 10all
Presentation 10allPresentation 10all
Presentation 10all
guestaa4c059
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyData Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
LazarinaStoyanova
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
Dana Brophy
Destination Documentation: How Not to Get Lost in Your Org
Destination Documentation: How Not to Get Lost in Your OrgDestination Documentation: How Not to Get Lost in Your Org
Destination Documentation: How Not to Get Lost in Your Org
csupilowski
SEO for Large/Enterprise Websites - Data & Tech Side
SEO for Large/Enterprise Websites - Data & Tech SideSEO for Large/Enterprise Websites - Data & Tech Side
SEO for Large/Enterprise Websites - Data & Tech Side
Dominic Woodman
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
kurtgessler
Getting started with Scrapy in Python
Getting started with Scrapy in PythonGetting started with Scrapy in Python
Getting started with Scrapy in Python
Viren Rajput
Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...
Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...
Guide for web scraping with Python libraries_ Beautiful Soup, Scrapy, and mor...
ThinkODC
Scraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHPScraping the web with Laravel, Dusk, Docker, and PHP
Scraping the web with Laravel, Dusk, Docker, and PHP
Paul Redmond
Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018)
Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018) Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018)
Common SEO Mistakes During Site Relaunches, Redesigns, Migrations (2018)
Melanie Phung
Python in Industry
Python in IndustryPython in Industry
Python in Industry
Dharmit Shah
Web mining
Web miningWeb mining
Web mining
Renusoni8
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
vinay arora
An EyeWitness View into your Network
An EyeWitness View into your NetworkAn EyeWitness View into your Network
An EyeWitness View into your Network
CTruncer
Integrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptx
Integrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptxIntegrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptx
Integrating Structured Data (to an SEO Plan) for the Win _ WTSWorkshop '23.pptx
Begum Kaya
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
Robert Dempsey
What You Need to Know About Technical SEO
What You Need to Know About Technical SEOWhat You Need to Know About Technical SEO
What You Need to Know About Technical SEO
Niki Mosier
Presentation 10all
Presentation 10allPresentation 10all
Presentation 10all
guestaa4c059
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina StoyData Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
Data Studio for SEOs: Reporting Automation Tips - Weekly SEO with Lazarina Stoy
LazarinaStoyanova
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
Dana Brophy
Destination Documentation: How Not to Get Lost in Your Org
Destination Documentation: How Not to Get Lost in Your OrgDestination Documentation: How Not to Get Lost in Your Org
Destination Documentation: How Not to Get Lost in Your Org
csupilowski
SEO for Large/Enterprise Websites - Data & Tech Side
SEO for Large/Enterprise Websites - Data & Tech SideSEO for Large/Enterprise Websites - Data & Tech Side
SEO for Large/Enterprise Websites - Data & Tech Side
Dominic Woodman
Week 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and ProducingWeek 1 - Interactive News Editing and Producing
Week 1 - Interactive News Editing and Producing
kurtgessler

Recently uploaded (20)

Field Device Management Market Report 2030 - TechSci Research
Field Device Management Market Report 2030 - TechSci ResearchField Device Management Market Report 2030 - TechSci Research
Field Device Management Market Report 2030 - TechSci Research
Vipin Mishra
Build with AI on Google Cloud Session #4
Build with AI on Google Cloud Session #4Build with AI on Google Cloud Session #4
Build with AI on Google Cloud Session #4
Margaret Maynard-Reid
L01 Introduction to Nanoindentation - What is hardness
L01 Introduction to Nanoindentation - What is hardnessL01 Introduction to Nanoindentation - What is hardness
L01 Introduction to Nanoindentation - What is hardness
RostislavDaniel
How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...
How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...
How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...
ScyllaDB
Fl studio crack version 12.9 Free Download
Fl studio crack version 12.9 Free DownloadFl studio crack version 12.9 Free Download
Fl studio crack version 12.9 Free Download
kherorpacca127
Brave Browser Crack 1.45.133 Activated 2025
Brave Browser Crack 1.45.133 Activated 2025Brave Browser Crack 1.45.133 Activated 2025
Brave Browser Crack 1.45.133 Activated 2025
kherorpacca00126
Unlocking DevOps Secuirty :Vault & Keylock
Unlocking DevOps Secuirty :Vault & KeylockUnlocking DevOps Secuirty :Vault & Keylock
Unlocking DevOps Secuirty :Vault & Keylock
HusseinMalikMammadli
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog Gavra
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog GavraReplacing RocksDB with ScyllaDB in Kafka Streams by Almog Gavra
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog Gavra
ScyllaDB
Computational Photography: How Technology is Changing Way We Capture the World
Computational Photography: How Technology is Changing Way We Capture the WorldComputational Photography: How Technology is Changing Way We Capture the World
Computational Photography: How Technology is Changing Way We Capture the World
HusseinMalikMammadli
What Makes "Deep Research"? A Dive into AI Agents
What Makes "Deep Research"? A Dive into AI AgentsWhat Makes "Deep Research"? A Dive into AI Agents
What Makes "Deep Research"? A Dive into AI Agents
Zilliz
UiPath Agentic Automation Capabilities and Opportunities
UiPath Agentic Automation Capabilities and OpportunitiesUiPath Agentic Automation Capabilities and Opportunities
UiPath Agentic Automation Capabilities and Opportunities
DianaGray10
Technology use over time and its impact on consumers and businesses.pptx
Technology use over time and its impact on consumers and businesses.pptxTechnology use over time and its impact on consumers and businesses.pptx
Technology use over time and its impact on consumers and businesses.pptx
kaylagaze
Revolutionizing-Government-Communication-The-OSWAN-Success-Story
Revolutionizing-Government-Communication-The-OSWAN-Success-StoryRevolutionizing-Government-Communication-The-OSWAN-Success-Story
Revolutionizing-Government-Communication-The-OSWAN-Success-Story
ssuser52ad5e
DevNexus - Building 10x Development Organizations.pdf
DevNexus - Building 10x Development Organizations.pdfDevNexus - Building 10x Development Organizations.pdf
DevNexus - Building 10x Development Organizations.pdf
Justin Reock
Understanding Traditional AI with Custom Vision & MuleSoft.pptx
Understanding Traditional AI with Custom Vision & MuleSoft.pptxUnderstanding Traditional AI with Custom Vision & MuleSoft.pptx
Understanding Traditional AI with Custom Vision & MuleSoft.pptx
shyamraj55
Unlock AI Creativity: Image Generation with DALL揃E
Unlock AI Creativity: Image Generation with DALL揃EUnlock AI Creativity: Image Generation with DALL揃E
Unlock AI Creativity: Image Generation with DALL揃E
Expeed Software
TrustArc Webinar - Building your DPIA/PIA Program: Best Practices & Tips
TrustArc Webinar - Building your DPIA/PIA Program: Best Practices & TipsTrustArc Webinar - Building your DPIA/PIA Program: Best Practices & Tips
TrustArc Webinar - Building your DPIA/PIA Program: Best Practices & Tips
TrustArc
Future-Proof Your Career with AI Options
Future-Proof Your  Career with AI OptionsFuture-Proof Your  Career with AI Options
Future-Proof Your Career with AI Options
DianaGray10
THE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIA
THE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIATHE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIA
THE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIA
Srivaanchi Nathan
A Framework for Model-Driven Digital Twin Engineering
A Framework for Model-Driven Digital Twin EngineeringA Framework for Model-Driven Digital Twin Engineering
A Framework for Model-Driven Digital Twin Engineering
Daniel Lehner
Field Device Management Market Report 2030 - TechSci Research
Field Device Management Market Report 2030 - TechSci ResearchField Device Management Market Report 2030 - TechSci Research
Field Device Management Market Report 2030 - TechSci Research
Vipin Mishra
Build with AI on Google Cloud Session #4
Build with AI on Google Cloud Session #4Build with AI on Google Cloud Session #4
Build with AI on Google Cloud Session #4
Margaret Maynard-Reid
L01 Introduction to Nanoindentation - What is hardness
L01 Introduction to Nanoindentation - What is hardnessL01 Introduction to Nanoindentation - What is hardness
L01 Introduction to Nanoindentation - What is hardness
RostislavDaniel
How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...
How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...
How Discord Indexes Trillions of Messages: Scaling Search Infrastructure by V...
ScyllaDB
Fl studio crack version 12.9 Free Download
Fl studio crack version 12.9 Free DownloadFl studio crack version 12.9 Free Download
Fl studio crack version 12.9 Free Download
kherorpacca127
Brave Browser Crack 1.45.133 Activated 2025
Brave Browser Crack 1.45.133 Activated 2025Brave Browser Crack 1.45.133 Activated 2025
Brave Browser Crack 1.45.133 Activated 2025
kherorpacca00126
Unlocking DevOps Secuirty :Vault & Keylock
Unlocking DevOps Secuirty :Vault & KeylockUnlocking DevOps Secuirty :Vault & Keylock
Unlocking DevOps Secuirty :Vault & Keylock
HusseinMalikMammadli
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog Gavra
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog GavraReplacing RocksDB with ScyllaDB in Kafka Streams by Almog Gavra
Replacing RocksDB with ScyllaDB in Kafka Streams by Almog Gavra
ScyllaDB
Computational Photography: How Technology is Changing Way We Capture the World
Computational Photography: How Technology is Changing Way We Capture the WorldComputational Photography: How Technology is Changing Way We Capture the World
Computational Photography: How Technology is Changing Way We Capture the World
HusseinMalikMammadli
What Makes "Deep Research"? A Dive into AI Agents
What Makes "Deep Research"? A Dive into AI AgentsWhat Makes "Deep Research"? A Dive into AI Agents
What Makes "Deep Research"? A Dive into AI Agents
Zilliz
UiPath Agentic Automation Capabilities and Opportunities
UiPath Agentic Automation Capabilities and OpportunitiesUiPath Agentic Automation Capabilities and Opportunities
UiPath Agentic Automation Capabilities and Opportunities
DianaGray10
Technology use over time and its impact on consumers and businesses.pptx
Technology use over time and its impact on consumers and businesses.pptxTechnology use over time and its impact on consumers and businesses.pptx
Technology use over time and its impact on consumers and businesses.pptx
kaylagaze
Revolutionizing-Government-Communication-The-OSWAN-Success-Story
Revolutionizing-Government-Communication-The-OSWAN-Success-StoryRevolutionizing-Government-Communication-The-OSWAN-Success-Story
Revolutionizing-Government-Communication-The-OSWAN-Success-Story
ssuser52ad5e
DevNexus - Building 10x Development Organizations.pdf
DevNexus - Building 10x Development Organizations.pdfDevNexus - Building 10x Development Organizations.pdf
DevNexus - Building 10x Development Organizations.pdf
Justin Reock
Understanding Traditional AI with Custom Vision & MuleSoft.pptx
Understanding Traditional AI with Custom Vision & MuleSoft.pptxUnderstanding Traditional AI with Custom Vision & MuleSoft.pptx
Understanding Traditional AI with Custom Vision & MuleSoft.pptx
shyamraj55
Unlock AI Creativity: Image Generation with DALL揃E
Unlock AI Creativity: Image Generation with DALL揃EUnlock AI Creativity: Image Generation with DALL揃E
Unlock AI Creativity: Image Generation with DALL揃E
Expeed Software
TrustArc Webinar - Building your DPIA/PIA Program: Best Practices & Tips
TrustArc Webinar - Building your DPIA/PIA Program: Best Practices & TipsTrustArc Webinar - Building your DPIA/PIA Program: Best Practices & Tips
TrustArc Webinar - Building your DPIA/PIA Program: Best Practices & Tips
TrustArc
Future-Proof Your Career with AI Options
Future-Proof Your  Career with AI OptionsFuture-Proof Your  Career with AI Options
Future-Proof Your Career with AI Options
DianaGray10
THE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIA
THE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIATHE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIA
THE BIG TEN BIOPHARMACEUTICAL MNCs: GLOBAL CAPABILITY CENTERS IN INDIA
Srivaanchi Nathan
A Framework for Model-Driven Digital Twin Engineering
A Framework for Model-Driven Digital Twin EngineeringA Framework for Model-Driven Digital Twin Engineering
A Framework for Model-Driven Digital Twin Engineering
Daniel Lehner

Web scraping in python

  • 1. Web Scraping with Python Virendra Rajput, Hacker @Markitty
  • 2. Agenda What is scraping Why we scrape My experiments with web scraping How do we do it Tools to use Online demo Some more tools Ethics for scraping
  • 3. converting unstructured documents into structured information scraping:
  • 4. What is Web Scraping? Web scraping (web harvesting) is a software technique of extracting information from websites It focuses on transformation of unstructured data on the web (typically HTML), into structured data that can be stored and analyzed
  • 5. RSS is meta data and not HTML replacement
  • 6. Why we scrape? Web pages contain wealth of information (in text form), designed mostly for human consumption Static websites (legacy systems) Interfacing with 3rd party with no API access Websites are more important than APIs The data is already available (in the form of web pages) No rate limiting Anonymous access
  • 9. and more..! IMDb API Did you mean! Facebook Bot for Brahma Kumaris
  • 11. Fetching the data Involves finding the endpoint - URL or URLs Sending HTTP requests to the server Using requests library: import requests data = requests.get(http://google.com/) html = data.content
  • 12. Processing (say no to Reg-ex) use reg-ex Avoid using reg-ex Reasons why not to use it: 1. Its fragile 2. Really hard to maintain 3. Improper HTML & Encoding handling
  • 13. Use BeautifulSoup for parsing Provides simple methods to- search navigate select Deals with broken web-pages really well Auto-detects encoding Philosophy- You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help.
  • 14. Export the data Database (relational or non-relational) CSV JSON File (XML, YAML, etc.) API
  • 16. Challenges External sites can change without warning Figuring out the frequency is difficult (TEST, and test) Changes can break scrapers easily Bad HTTP status codes example: using 200 OK to signal an error cannot always trust your HTTP libraries default behaviour Messy HTML markup
  • 17. Mechanize Stateful web-browsing with mechanize Fill up forms Follow links Handle cookies Browse history After Andy Lesters WWW: Mechanize
  • 18. Filling forms with Mechanize
  • 19. Scrapy - a framework for web scraping Uses XPath to select elements Interactive shell scripting Using Scrapy: define a model to store items create your spider to extract items write a Pipeline to store them
  • 20. Conclusion Scrape wisely Do not steal Use cloud Share your scrapers scraperwiki.com