Introduction to XML sitemap and robots.txt files for SEO beginners. Covers the basics of implementing them for your first website.
1. Your First Sitemap.xml
& Robots.txt Implementation
Jérôme Verstrynge
For Ligatures.net
December 2014
License: CC BY-ND 4.0
2. Table Of Contents
Introduction
Sitemaps: XML vs HTML
Locations: Sitemap.xml, Robots.txt
Sitemap: Content I & II, Generators, Recommendations
Robots.txt: Content I & II, Basic example, Recommendations & Warnings
Additional References: Further readings
3. Introduction
Web Crawler: a search engine computer searching for content on the Internet for later indexation. Crawlers read the robots.txt and sitemap.xml files found on websites.
Robots.txt: a text file containing instructions for web crawlers.
Sitemap.xml: a text file listing page URLs to help web crawlers find content on a website.
4. Sitemaps: XML vs HTML (confusion)
HTML Sitemap: a web page containing links facilitating user navigation on a website. Displayed in web browsers; visited by users and web crawlers.
XML Sitemap: a structured text file containing the URLs of the pages of a website, for web crawlers. Never displayed to users; read by web crawlers only.
That's what we are interested in!!!
5. Sitemap.xml Locations
By default, most web crawlers search for a sitemap.xml file in the root.
But sitemaps can be located anywhere... although the recommended practice is to put them all in the root!
A website can have more than one sitemap!
[Diagram: website 'root' ('/') containing sitemap.xml, sitemap2.xml and a /mydir directory]
6. Robots.txt Location
By default, all web crawlers search for a robots.txt file in the root.
A website may not have a robots.txt file... but it is recommended to always have one (even if minimal).
[Diagram: website 'root' ('/') containing robots.txt, sitemap.xml and a /mydir directory]
7. Sitemap.xml Content - I
A structured document defining a <urlset>
One <url>...</url> section per web page URL
Only <loc> is required; the other elements are optional

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/page.html</loc>
    <lastmod>2014-10-04T13:27:58+03:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  ...
</urlset>
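The <urlset> above can also be built programmatically. A minimal sketch using only the Python standard library (the page URL and values are the hypothetical ones from the slide):

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # serialize without a namespace prefix

# One (loc, lastmod, changefreq, priority) tuple per page; only loc is required.
pages = [("http://mysite.com/page.html", "2014-10-04T13:27:58+03:00", "daily", "0.7")]

urlset = ET.Element(f"{{{NS}}}urlset")
for loc, lastmod, changefreq, priority in pages:
    url = ET.SubElement(urlset, f"{{{NS}}}url")
    ET.SubElement(url, f"{{{NS}}}loc").text = loc                  # required
    ET.SubElement(url, f"{{{NS}}}lastmod").text = lastmod          # optional
    ET.SubElement(url, f"{{{NS}}}changefreq").text = changefreq    # optional
    ET.SubElement(url, f"{{{NS}}}priority").text = priority        # optional

xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```

This emits the same structure as the example above; a real generator would feed `pages` from the site's actual URL list.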
8. Sitemap.xml Content - II
<loc>: the URL of a page on the website
<lastmod>: when the page was last modified
<changefreq>: how often it is modified
<priority>: your opportunity to tell web crawlers on which pages you think they should spend their time first (it has no impact on rankings)

<loc>http://mysite.com/page.html</loc>
<lastmod>2014-10-04T13:27:58+03:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.7</priority>
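To see how a crawler consumes these elements, here is a minimal Python sketch (standard library only) that parses the example fragment above and extracts the required and optional values:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/page.html</loc>
    <lastmod>2014-10-04T13:27:58+03:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>
"""

root = ET.fromstring(sitemap)
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)          # required
    lastmod = url.findtext("sm:lastmod", namespaces=NS)  # optional, may be None
    print(loc, lastmod)
```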
9. Sitemap.xml Generators
Creating a sitemap.xml manually can be very time-consuming
Developers can generate one automatically for their websites... but not everyone is technical!
Solution?
Use free online sitemap generators
Some plugins are also available for blog platforms
10. Sitemap.xml Recommendations
Create at least one sitemap.xml in the root
Be as exhaustive as possible
Leave out <lastmod> and <changefreq> if you can't set reliable values
Don't try to fool search engines with <lastmod>, <changefreq> and <priority>; it does not work and can bite back at you
You may submit your sitemaps to search engines (but it is not mandatory)
11. Robots.txt Content - I
Rules are grouped per crawler; a crawler obeys the most specific group that matches it
User-agent: tells to which web crawler (a.k.a. robot) the group applies; * means all
Disallow: forbids access, but if empty, this means forbid access to nothing (in other words, allow all)
Allow: authorizes access

User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /mydir/
Allow: /mydir/myfile.html
12. Robots.txt Content - II
The robots.txt below (from the previous slide) says:
All web crawlers (except Google's) can access everything on the website
Google's web crawler cannot access the content of the /mydir directory, except myfile.html in that directory

User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /mydir/
Allow: /mydir/myfile.html
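These rules can be verified programmatically with Python's standard urllib.robotparser. Two adjustments in this sketch: the Allow line targets /mydir/myfile.html (the exception described in the prose), and it is placed before the Disallow line because Python's parser applies the first matching rule. The hostname is hypothetical:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow:

User-agent: Googlebot
Allow: /mydir/myfile.html
Disallow: /mydir/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Any other crawler falls back to the '*' group and may fetch everything.
print(rp.can_fetch("OtherBot", "http://mysite.com/mydir/page.html"))   # True
# Googlebot is blocked from /mydir/ ...
print(rp.can_fetch("Googlebot", "http://mysite.com/mydir/page.html"))  # False
# ... except the explicitly allowed file.
print(rp.can_fetch("Googlebot", "http://mysite.com/mydir/myfile.html"))  # True
```

Note that real crawlers differ here: Google, for instance, applies the most specific (longest) matching path regardless of order, while Python's parser is strictly first-match.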
13. Robots.txt Basic Example
Use the example below for a start: it allows all web crawlers access to all your website content
Register all your sitemaps in robots.txt, otherwise web crawlers likely won't find them
Locations are case-sensitive
Directory locations should end with a '/'

user-agent: *
disallow:

sitemap: http://www.mysite.com/sitemap.xml
sitemap: http://www.mysite.com/sitemap2.xml
...
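Sitemap declarations in robots.txt can also be read back programmatically; Python's urllib.robotparser exposes them via site_maps() (available since Python 3.8). A minimal sketch using the hypothetical URLs from the example:

```python
import urllib.robotparser

robots_txt = """\
user-agent: *
disallow:

sitemap: http://www.mysite.com/sitemap.xml
sitemap: http://www.mysite.com/sitemap2.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(rp.site_maps())
```

This is essentially what a crawler does after fetching robots.txt: collect the sitemap URLs, then fetch and parse each one.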
14. Robots.txt Recommendations & Warnings
Always create (at least) a minimal robots.txt
where all sitemaps are declared
Never block access to CSS and JavaScript content
Disallow instructions can be bypassed by malicious web crawlers; they are no means to protect access to content
Debug your robots.txt with online checkers
15. Additional References
Further readings:
Troubleshooting web site indexation issues
Troubleshooting web pages indexation issues
Getting started with SEO
SEO guidelines & checklists