Your First Sitemap.xml
& Robots.txt Implementation
J辿r担me Verstrynge
For Ligatures.net
December, 2014
License: CC BY-ND 4.0
Table Of Contents

Introduction

Sitemap: XML vs HTML

Location:
 Sitemap.xml
 Robots.txt

Sitemap
 Content I & II
 Generators
 Recommendations

Robots.txt
 Content I & II
 Basic example
 Recommendations &
Warnings

Additional
References
 Further readings
Introduction

Web Crawler
 A search engine
program searching the
Internet for content to
index later
 They read the
robots.txt and
sitemap.xml files they
find on websites

Robots.txt
 A text file containing
instructions for web
crawlers

Sitemap.xml
 A text file listing page
URLs to help web
crawlers find content on
a website
Sitemaps: XML vs HTML (confusion)

HTML Sitemap:
 A web page containing
links facilitating user
navigation on a website

XML Sitemap:
 A structured text file
containing the URLs of
pages of a website for
web crawlers
Displayed in
web browsers
Visited by users
and web crawlers
Never displayed
to users
Read by web
crawlers only
That's what we are
interested in!
Sitemap.xml Locations
By default, most web crawlers
search for a sitemap.xml file in
the root
But sitemaps can
be located
anywhere...
...although the
recommended
practice is to put
them all in the root!
sitemap.xml
...
/mydir
/
...
sitemap2.xml
website 'root'
A website can have
more than one
sitemap!
Robots.txt Location
By default, all web
crawlers search for a
robots.txt file in the root
sitemap.xml
...
/mydir
/
robots.txt
website 'root'
A website may not
have a robots.txt
file...
...
...but it is recommended
to always have a
robots.txt file
(even if minimal)
Sitemap.xml Content - I

A structured document defining a <urlset>

One <url>...</url> section per web page URL

Only <loc> is required; the other
elements are optional
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://mysite.com/page.html</loc>
<lastmod>2014-10-04T13:27:58+03:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.7</priority>
</url>
...
</urlset>
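As a quick local sanity check of this structure, a short standard-library Python sketch can confirm that every <url> entry carries the required <loc> element (the sample sitemap string below is a hypothetical placeholder):

```python
# Sketch: verify every <url> entry in a sitemap carries the required <loc>.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/page.html</loc>
    <lastmod>2014-10-04T13:27:58+03:00</lastmod>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE)
urls = root.findall("sm:url", NS)                               # all <url> entries
missing = [u for u in urls if u.find("sm:loc", NS) is None]     # entries lacking <loc>
print(f"{len(urls)} URL(s) checked, {len(missing)} missing <loc>")
```

Note that the namespace declared on <urlset> must be used when querying child elements, which is a common stumbling block with sitemap files.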
Sitemap.xml Content - II

<loc>: the URL of a page on the website

<lastmod>: when it has been last modified

<changefreq>: how often it is modified

<priority>: your opportunity to tell web crawlers
which pages you think they should spend their
time on first (it has no impact on rankings)
<loc>http://mysite.com/page.html</loc>
<lastmod>2014-10-04T13:27:58+03:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.7</priority>
Sitemap.xml Generators

Creating a sitemap.xml manually can be very
time-consuming

Many platforms can generate one
automatically for their websites

...but not everyone is technical!

Solution?
 Use free online sitemap generators
 Some plugins are available for blog platforms
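As a minimal sketch of what such a generator does (standard-library Python only; the page URLs are hypothetical placeholders):

```python
# Sketch: build a minimal sitemap.xml from a list of page URLs.
import xml.etree.ElementTree as ET

def build_sitemap(page_urls):
    # <urlset> with the standard sitemap namespace
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in page_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page  # <loc> is the only required child
    return ET.tostring(urlset, encoding="unicode")

sitemap = build_sitemap(["http://mysite.com/page.html",
                         "http://mysite.com/other.html"])
print(sitemap)
```

A real generator would also crawl the site to discover URLs and fill in optional elements such as <lastmod>; this sketch only shows the required structure.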
Sitemap.xml Recommendations

Create at least one sitemap.xml in the root

Be as exhaustive as possible

Leave out <lastmod> and <changefreq> if you
can't set reliable values

Don't try to fool search engines with <lastmod>,
<changefreq> and <priority>, it does not work
and can bite back at you

You may submit your sitemaps to search engines
(but it is not mandatory)
Robots.txt Content - I

Rules are grouped per crawler and read top-down;
the most specific matching rule generally prevails

User-agent: tells which web crawler (a.k.a. robot)
the rules apply to; * means all

Disallow = forbid access; if left empty, it forbids
access to nothing (in other words, allows everything)

Allow = authorize access
User-agent: *
Disallow:
User-agent: Googlebot
Disallow: /mydir/
Allow: /mydir/myfile.html
Robots.txt Content - II

This robots.txt says:

All web crawlers (except Google's) can access
everything on the website

Google's web crawler cannot access the content
of the /mydir directory, except myfile.html in this
directory
User-agent: *
Disallow:
User-agent: Googlebot
Disallow: /mydir/
Allow: /mydir/myfile.html
Robots.txt  Basic Example

Use the example below as a starting point

Allow all web crawlers access to all your
website content

Register all your sitemaps in robots.txt, otherwise web
crawlers likely won't find them

Locations are case-sensitive

Directory locations should end with a '/'
User-agent: *
Disallow:
Sitemap: http://www.mysite.com/sitemap.xml
Sitemap: http://www.mysite.com/sitemap2.xml
...
Robots.txt Recommendations & Warnings

Always create (at least) a minimal robots.txt
where all sitemaps are declared

Never block access to CSS and JavaScript
content

Disallow instructions can be bypassed by
malicious web crawlers; they are not a means
of protecting access to content

Debug your robots.txt with online checkers
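Rules can also be checked locally. The sketch below uses Python's standard urllib.robotparser; note that this parser applies rules in order of appearance (first match wins), unlike Google's longest-match behavior, which is why the Allow line is placed before the Disallow line here. The site and paths are hypothetical.

```python
# Sketch: sanity-check robots.txt rules locally with the standard library.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow:

User-agent: Googlebot
Allow: /mydir/myfile.html
Disallow: /mydir/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse the rules directly, no network fetch needed

print(rp.can_fetch("Googlebot", "http://mysite.com/mydir/secret.html"))   # False
print(rp.can_fetch("Googlebot", "http://mysite.com/mydir/myfile.html"))   # True
print(rp.can_fetch("OtherBot", "http://mysite.com/mydir/secret.html"))    # True
```

This is a convenient complement to online checkers, since the same file can be tested against many user-agent and URL combinations in one run.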
Additional References

Further readings:
 Troubleshooting web site indexation issues
 Troubleshooting web pages indexation issues
 Getting started with SEO
 SEO guidelines & checklists