How to put rel="canonical" on non-HTML resources

When the rel="canonical" tag was introduced in 2009, it was quickly adopted by SEOs. Unfortunately, because the canonical tag lives in the HTML head, you cannot insert it into non-HTML resources such as images or PDF documents.

Why is this a problem? If images or PDF documents play an important role on your website, they can outrank the HTML pages on your site. A redirect is not an option, because then no one could read the document or see the image.

The solution is to set a rel="canonical" for the image or document. Since you cannot place the canonical tag in an HTML head on non-HTML documents, search engines accept it as an HTTP header instead.

Instead of just showing you how to serve the rel="canonical" as an HTTP header, I am going to show you how to create any custom HTTP header.

How to create a custom HTTP header

The syntax for creating a custom HTTP header in an Apache .htaccess file is simple. You begin by identifying the file the HTTP header will be served for. This can be done with <Files> or <FilesMatch>.

Inside the opening <Files> or <FilesMatch> tag, place either the file name or a regular expression in quotation marks to identify the file. If you use a regular expression in <Files>, you will need to insert ~ before your regular expression.

As a rule, it is recommended that you use <Files> for file names and <FilesMatch> for matching with regular expressions.
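
As a rough sketch of the difference, the three blocks below show the same idea written each way. X-Example is a made-up header name used only for illustration, and the sketch assumes mod_headers is enabled and that your host allows these directives in .htaccess files. In practice you would pick one form rather than use all three.

```apache
# Exact file name: use <Files>.
<Files "white-paper.pdf">
    Header set X-Example "matched by file name"
</Files>

# Regular expression inside <Files>: note the ~ before the pattern.
<Files ~ "\.(pdf|docx)$">
    Header set X-Example "matched by Files with a regex"
</Files>

# The same regular expression, written the recommended way with <FilesMatch>.
<FilesMatch "\.(pdf|docx)$">
    Header set X-Example "matched by FilesMatch"
</FilesMatch>
```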

Once you have matched your file(s), you will need to create your HTTP header. To do this, use the following syntax: Header add NAME "VALUE".

In the HTTP header Content-Type: application/pdf, Content-Type is the name and application/pdf is the value. You do not need to place a colon after the NAME in your .htaccess file.
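
To make the NAME and VALUE parts concrete, here is a minimal sketch using a made-up header called X-Example-Header. It also shows the difference between the set and add actions, which matters later when more than one rule matches the same file. It assumes mod_headers is enabled.

```apache
<Files "white-paper.pdf">
    # "set" replaces any existing header that has this name.
    Header set X-Example-Header "some value"
    # "add" appends another header with this name, even if one already exists.
    Header add X-Example-Header "another value"
</Files>
```

With both lines in place, a request for white-paper.pdf should come back with two X-Example-Header lines, one for each value.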

Example: Custom HTTP Header

The following code adds a canonical link header to the file white-paper.pdf, pointing to the desired HTML page.

<Files white-paper.pdf>
    Header add Link '<http://www.example.com/white-paper-download.html>; rel="canonical"'
</Files>

Creating canonical link headers one file at a time is tedious on a large site. A global rule lets you add them programmatically. The best way to do this is to capture the file name in an environment variable using the RewriteRule E flag. Once we have the file name, we can build an HTTP header for each matching file.

Example: Dynamic HTTP Header

The following code adds a canonical link header to each PDF, pointing to an HTML page with the same name.

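# Turn on mod_rewrite; skip this line if it is already enabled earlier in the file.
RewriteEngine On
# Store the PDF's file name (without the extension) in the FILENAME environment variable.
RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]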
<FilesMatch "\.pdf$">
    Header add Link '<http://www.example.com/download/%{FILENAME}e>; rel="canonical"'
</FilesMatch>

This rule adds a canonical link header to every PDF on the website. The header tells search engines that the canonical version of the PDF is the HTML page with the same file name in the /download directory.

For example, a PDF named epic-white-paper.pdf will have a canonical link pointing to http://www.example.com/download/epic-white-paper.

You can include a file extension after the e and before the closing > if you use file extensions on your website.
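
For instance, if the HTML pages end in .html, the rule might look like the sketch below. The .html extension and the RewriteEngine On line are assumptions about your setup; leave the latter out if rewriting is already switched on earlier in the same file.

```apache
RewriteEngine On
# Capture the PDF's file name (without the extension) into FILENAME.
RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
<FilesMatch "\.pdf$">
    # The extension goes right after the e and before the closing >.
    Header add Link '<http://www.example.com/download/%{FILENAME}e.html>; rel="canonical"'
</FilesMatch>
```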

Be careful when using this method. It will cause all PDF files on your server to have canonical links to HTML pages with the same file name. To ensure it works properly, I recommend following these rules.

  • Do not use this method in your root .htaccess file. Place it in an additional .htaccess file in a child directory such as /downloads.
  • Use all lowercase characters and hyphens between words in the name of your PDF.
  • Create the canonical HTML page prior to uploading your PDF document.
  • Enter the canonical link from each PDF into your web browser and ensure it works properly.

Advanced Dynamic Canonical Link Headers

Sometimes following the rules listed above is not the best way to add canonical link headers.

I was recently asked by a forward-thinking reader for some help with just such a problem. Her question was, "Is it bad to put more than one dynamic HTTP header in the same .htaccess file?"

Here is an example she sent me.

RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
<FilesMatch "\.pdf$">
    Header add Link '<http://www.example.com/about/press/%{FILENAME}e>; rel="canonical"'
</FilesMatch>

RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
<FilesMatch "\.pdf$">
    Header add Link '<http://www.example.com/resource/%{FILENAME}e>; rel="canonical"'
</FilesMatch>

Unfortunately, this won't work. Both <FilesMatch> blocks match every request for a .pdf, so the first Header directive is executed and then the second one is executed as well. This means every PDF response gets two Link headers, one for each path.

To correct this there are two options you can use.

First, you can create a separate .htaccess file and place it in the directory that your PDF files are in. This ensures it is only executed for requests to files in that directory.
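
Here is a sketch of this first option, assuming the press PDFs live under /about/ and the resource PDFs under /resource/. The directory locations are guesses based on the reader's example, so adjust them to your own layout. Each block below would be saved as its own .htaccess file.

```apache
# --- File 1: /about/.htaccess (assumed location) ---
RewriteEngine On
# Capture the PDF's file name (without the extension) into FILENAME.
RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
<FilesMatch "\.pdf$">
    Header add Link '<http://www.example.com/about/press/%{FILENAME}e>; rel="canonical"'
</FilesMatch>

# --- File 2: /resource/.htaccess (assumed location) ---
RewriteEngine On
RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
<FilesMatch "\.pdf$">
    Header add Link '<http://www.example.com/resource/%{FILENAME}e>; rel="canonical"'
</FilesMatch>
```

Because each file only applies to its own directory and the directories below it, a PDF under /about/ should only ever receive the first header.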

Second, you can use an <If> statement (available in Apache 2.4 and later) to selectively set the header. I personally prefer this option. The regex matching on the right side of the comparison operator in the <If> statement is a little more complex, but I will explain it.

Our .htaccess directives would look something like this...

<FilesMatch "\.pdf$">
    RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
    <If "%{REQUEST_URI} =~ m#^/about/.*#">
        Header add Link '<http://www.example.com/about/press/%{FILENAME}e>; rel="canonical"'
    </If>
    <If "%{REQUEST_URI} =~ m#^/resource/.*#">
        Header add Link '<http://www.example.com/resource/%{FILENAME}e>; rel="canonical"'
    </If>
</FilesMatch>

How this .htaccess code works

The <FilesMatch> directive will limit the enclosed directives to files ending in .pdf.

The RewriteRule directive is moved inside the <FilesMatch> so that it is not checked on every request. Be aware that Apache's documentation describes rewrite rules inside <Files> sections as syntactically permitted but unsupported, so test this on your own server. If the variable is not set, move the RewriteRule back above the <FilesMatch>.

The <If> statement is used to determine which header to use. The regex uses the m#...# delimiters instead of the standard /.../ delimiters, because slashes as delimiters would conflict with the slashes at the start and end of the directory name.

The pattern between the two delimiters restricts each <If> statement so that the enclosed Header add directive only runs when the URL path begins with /about/ or /resource/ respectively.

Here is a list of URLs that will match either the first, second or none of the <If> statements.

  • example.com/about/file.pdf == First
  • example.com/about.pdf == None
  • example.com/about/press/file.pdf == First
  • example.com/about == None
  • example.com/about-us/file.pdf == None
  • example.com/resource/some-folder/file.pdf == Second
  • example.com/something == None

This method is not limited to cases with multiple Header directives. You can also use it to apply a single directive only to resources in a specific directory.

For example, the following code would only add the header for PDFs in the /downloads directory. It then uses /papers as the canonical form.

<FilesMatch "\.pdf$">
    RewriteRule ([^/]+)\.pdf$ - [E=FILENAME:$1]
    <If "%{REQUEST_URI} =~ m#^/downloads/.*#">
        Header add Link '<http://www.example.com/papers/%{FILENAME}e>; rel="canonical"'
    </If>
</FilesMatch>

This would result in the following...

PDF: example.com/downloads/my-cool-paper-you-should-read.pdf

Header: Link: <http://www.example.com/papers/my-cool-paper-you-should-read>; rel="canonical"

Warning:

If you are using Nginx as a reverse-proxy in front of an Apache web server and you are using Google PageSpeed (mod_pagespeed), you may have trouble with image requests not processing your .htaccess file. If this is the case, it will result in your canonical links not being placed on images. To correct this, you may need to change some of your mod_pagespeed settings.

Daniel Morell

I am a fullstack web developer with a passion for clean code, efficient systems, tests, and most importantly making a difference for good. I am a perfectionist. That means I love all the nitty-gritty details.

I live in Wisconsin's Fox Valley with my beautiful wife Emily.

© 2018 Daniel Morell.