How to get around Duplicate Content issues with Cloudfront CDN

Tom, 10th April 2013

Cloudfront is Amazon's Content Delivery Network (CDN) offering. It gives web developers an easy-to-use, cost-effective way of serving content from a variety of regions around the world, with the aim of speeding up website delivery.

This isn't a guide to using Cloudfront - if that's what you're after, I'd suggest taking a look at this article.

In this article I'll offer a solution to one of the few issues I've run into when setting up a Cloudfront distribution: duplicate content.

When you create a distribution in Cloudfront it's possible to use your website as the 'origin' for the 'distribution'. This basically results in having a complete static copy of your website available at another location.

So an image which is available at:
http://18a.co/img/logo.png

Would now also be available at:
http://d3uy7bfospl44.cloudfront.net/img/logo.png
(For illustrative purposes only - these URLs don't work)

While this makes using Cloudfront really easy (all you need to do is update your website to reference the files via Cloudfront), unfortunately this isn't so great if Google gets its greedy mitts on it.

There are 3 ways I've found for dealing with this issue:

  1. Theoretically, if you never link to the Cloudfront URLs, Google will never index them - risky.
  2. Adding canonical URLs to your pages tells Google which version of a page is the original, and it 'should' take note of this hint. However, according to Matt Cutts it is only a hint and not 100% to be relied upon.
  3. The final technique I've come across (or come up with, I'm not sure) is viewed by some as a little extreme given the two options above. However, I don't like to take any chances when it comes to Google, so I think it's worth it to rule out any possible duplicate content issues.
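For reference, the canonical hint in option 2 is just a link element in the head of each page, pointing back at the page's URL on the main domain (the path here is illustrative):

```html
<!-- In the <head> of every page, on both the main site and the CDN copy -->
<link rel="canonical" href="http://18a.co/some-page/" />
```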

The objective here is to use a robots.txt file to tell Google not to index the version of your site on the CDN. The main problem is that if you set your website document root to be the origin of the CDN distribution, then everything will be mirrored exactly as it is on your site. This includes your robots.txt file. Unfortunately at the time of writing, Cloudfront doesn't allow you to edit the robots.txt file available on your distribution (this would get around the problem), so you have to be a little bit creative.

Create an alternative version of your site available via a subdomain, for example http://static.18a.co. The idea being that all your content is available at both http://18a.co and http://static.18a.co. This might seem counter-intuitive as you now have 3 versions of your site, but bear with me.
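If your site runs on Apache, one way to set this up (a sketch - the paths and port are illustrative) is to add the subdomain as a ServerAlias on your existing virtual host, so both hostnames serve exactly the same files:

```apache
<VirtualHost *:80>
    ServerName 18a.co
    # The CDN origin subdomain serves the same document root
    ServerAlias static.18a.co
    DocumentRoot /var/www/18a
</VirtualHost>
```

You'll also need a DNS record pointing static.18a.co at the same server.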

Now edit your .htaccess file and add a directive that serves up a different robots.txt file depending on the hostname the request came in on. Something like this should do the trick:

# This attempts to serve a custom robots.txt to the CDN subdomain
RewriteEngine On
RewriteCond %{HTTP_HOST} ^static\.18a\.co$ [NC]
RewriteRule ^robots\.txt$ robots_cdn.txt [L,NC]
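The rewrite above boils down to a simple host check, which can be sketched in shell for clarity (ignoring the case-insensitivity flag for brevity; hostnames and file names as defined earlier):

```shell
#!/bin/sh
# Mirrors the RewriteCond logic: the CDN origin subdomain gets the
# blocking robots_cdn.txt, every other hostname gets the normal file.
robots_for_host() {
  case "$1" in
    static.18a.co) echo "robots_cdn.txt" ;;
    *)             echo "robots.txt" ;;
  esac
}

robots_for_host static.18a.co   # robots_cdn.txt
robots_for_host 18a.co          # robots.txt
```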

Create a new robots_cdn.txt file which contains the following:

User-agent: *
Disallow: /

Now if you visit http://18a.co/robots.txt you should see something like this (or whatever is in your original robots.txt file):

User-agent: *
Disallow: /cgi-bin/
Disallow: /a/
Disallow: /min/

However if you visit http://static.18a.co/robots.txt you should see something like this:

User-agent: *
Disallow: /

With that done, create a distribution using static.18a.co as the origin and it should mirror everything you want, including the special robots.txt directive asking Google to kindly ignore everything on that domain.

If you have any feedback, please leave it in the comments.
