How to Keep Googlebot Focused with .htaccess

Normally, you would use the robots.txt file to provide directives to search engines on what pages, files, folders, and subdomains you want to be crawled. This is an indispensable tool for sites of any size, but crucial for larger websites.

The issue with the robots.txt file is that it only contains crawler directives. We should note that there are two kinds of directives: crawler directives, and indexer directives.

Crawler directives tell Googlebot where it can go. They can also be used to point Googlebot to your sitemap. The most common crawler directives are Allow, Disallow, Sitemap, and User-agent. These are used to tell search engines what and where they should crawl.

Indexer directives tell Googlebot what it should index. Unlike crawler directives, which are usually placed in the robots.txt file, indexer directives are placed on each page or element. They go in the HTML head of a page, e.g. <meta name="robots" content="noindex, follow">. They can also be placed inside a link, e.g. <a href="http://example.com/page" rel="nofollow">example page</a>.

This is all well and good, but as previously mentioned, images and PDF files don't have HTML heads. Yes, you can add rel="nofollow" to every link on your site that points to an image or PDF, but that does not stop other people from linking to it.

The solution is to use our .htaccess file to set a custom header. The header is X-Robots-Tag. We can use it with any indexer directive.

Example: X-Robots-Tag HTTP header

Setting the X-Robots-Tag header is the same as setting any other custom HTTP header.

For a single file we can use:

<Files white-paper.pdf>
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</Files>

To set the header on all .docx and .pdf files we would use the following:

# The dot is escaped so it matches a literal "." before the extension.
<FilesMatch "\.(docx|pdf)$">
    Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
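
Images are the other file type mentioned above that has no HTML head. Here is a minimal sketch for them. The extension list is an assumption for illustration, not something from the original setup. The block is wrapped in an <IfModule> guard because a Header directive on a server without mod_headers enabled would otherwise produce a 500 error for the whole directory.

```apache
# Sketch only: the extension list below is an assumption; adjust it to your site.
<IfModule mod_headers.c>
    # (?i) makes the match case-insensitive, so photo.JPG is covered as well.
    <FilesMatch "(?i)\.(png|jpe?g|gif|webp)$">
        # "set" replaces any existing X-Robots-Tag value instead of adding a second one.
        Header set X-Robots-Tag "noindex"
    </FilesMatch>
</IfModule>
```

The trade-off with the guard is that if mod_headers is missing, the header is silently not sent, so it is worth checking the response headers of one affected file after deploying.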

The X-Robots-Tag header is an invaluable tool in the SEO toolbox.

Like the rel="canonical" header, though, it should be used judiciously. Being careless with any part of the .htaccess file can cause serious problems.

Which brings us to our next topic: How to Put rel="canonical" on Non-HTML Resources.

Daniel Morell

I am a fullstack web developer with a passion for clean code, efficient systems, tests, and most importantly making a difference for good. I am a perfectionist. That means I love all the nitty-gritty details.

I live in Wisconsin's Fox Valley with my beautiful wife Emily.

© 2018 Daniel Morell.