Normally, you would use the robots.txt file to tell search engines which pages, files, folders, and subdomains you want crawled. It is a useful tool for sites of any size, and crucial for larger websites.
The issue with the robots.txt file is that it only contains crawler directives. There are two kinds of directives: crawler directives and indexer directives.
Crawler directives tell Googlebot where it can go. They can also be used to point Googlebot to your sitemap. The most common crawler directives are User-agent, Allow, Disallow, and Sitemap. These tell search engines what and where they should crawl.
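To make crawler directives concrete, here is a minimal sketch that parses a small robots.txt with Python's standard urllib.robotparser module and asks what a crawler may fetch. The robots.txt content, the paths, and the sitemap URL are all made up for illustration, and Python's parser is only an approximation of how Googlebot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and sitemap URL are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
Sitemap: http://example.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Crawling is blocked under /private/ and allowed everywhere else.
print(parser.can_fetch("Googlebot", "http://example.com/private/report.pdf"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/blog/post"))           # True

# The Sitemap directive points crawlers at the sitemap (Python 3.8+).
print(parser.site_maps())
```

Note that these rules only control crawling. Nothing in this file says whether a URL may appear in the index.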
Indexer directives tell Googlebot what it should index. Unlike crawler directives, which are usually placed in the robots.txt file, indexer directives are placed on each page or element. They can go in the HTML head of a page, e.g. <meta name="robots" content="noindex, follow">, or inside a link, e.g. <a href="http://example.com/page" rel="nofollow">example page</a>.
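As a rough illustration of where indexer directives live, the sketch below uses Python's standard html.parser module to pull the robots meta tag and any rel="nofollow" links out of a page. The sample page, the class name IndexerDirectiveParser, and the helper extract_indexer_directives are hypothetical, and a real crawler is far more thorough than this.

```python
from html.parser import HTMLParser

# Hypothetical page using the two indexer directives shown above.
PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
</head><body>
<a href="http://example.com/page" rel="nofollow">example page</a>
<a href="http://example.com/other">other page</a>
</body></html>"""


class IndexerDirectiveParser(HTMLParser):
    """Collects robots meta directives and nofollow link targets."""

    def __init__(self):
        super().__init__()
        self.meta_robots = []
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            # The content attribute is a comma-separated directive list.
            content = attr.get("content") or ""
            self.meta_robots.extend(
                d.strip().lower() for d in content.split(",") if d.strip()
            )
        elif tag == "a" and "nofollow" in (attr.get("rel") or "").lower().split():
            self.nofollow_links.append(attr.get("href"))


def extract_indexer_directives(html: str):
    parser = IndexerDirectiveParser()
    parser.feed(html)
    return parser.meta_robots, parser.nofollow_links


meta_robots, nofollow_links = extract_indexer_directives(PAGE)
print(meta_robots)      # ['noindex', 'follow']
print(nofollow_links)   # ['http://example.com/page']
```

Both directives depend on HTML markup, which is exactly what a PDF or an image cannot carry.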
This is all well and good, but as previously mentioned, images and PDF files don't have HTML heads. Yes, you can nofollow, noindex all links on your site pointing to an image or PDF, but that does not stop other people from linking to it.
The solution is to use our .htaccess file to set a custom header. The header is the X-robots-tag. We can use it with any indexer directive.
Example: X-robots-tag HTTP header
Setting X-robots-tag works the same way as setting any other custom HTTP header. The Header directive is provided by mod_headers, so that module must be enabled.
For a single file we can use:
<Files white-paper.pdf>
    Header add X-robots-tag "noindex, noarchive, nosnippet"
</Files>
To set the header for all .docx and .pdf files, we would use the following:
<FilesMatch "\.(docx|pdf)$">
    Header add X-robots-tag "noindex, noarchive, nosnippet"
</FilesMatch>
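The argument to FilesMatch is a regular expression, so it is worth checking which filenames a pattern catches before deploying it. The sketch below uses Python's re module as a stand-in for Apache's regex engine, which is an assumption that holds for a pattern this simple but not for every pattern. The filenames are invented. It compares a pattern with an escaped dot against one with a bare dot, which matches any character.

```python
import re

# Escaped dot: matches a literal "." before the extension.
ESCAPED = re.compile(r"\.(docx|pdf)$")
# Bare dot: matches ANY single character before "docx" or "pdf".
UNESCAPED = re.compile(r".(docx|pdf)$")


def matches(pattern: re.Pattern, filename: str) -> bool:
    """Return True if the pattern is found anywhere in the filename."""
    return pattern.search(filename) is not None


# Hypothetical filenames; "mypdf" has no extension at all.
for name in ["white-paper.pdf", "report.docx", "index.html", "mypdf"]:
    print(f"{name:16} escaped={matches(ESCAPED, name)} "
          f"unescaped={matches(UNESCAPED, name)}")
```

Only the bare-dot pattern matches "mypdf", so an unescaped dot can silently apply noindex to files you never meant to hide.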
X-robots-tag is an invaluable tool in the SEO toolbox, as is the rel="canonical" header, though it should be used judiciously. Being careless with any part of the .htaccess file can cause serious problems.
Which brings us to our next topic: How to Put rel="canonical" on Non-HTML Resources.