How to get around Duplicate Content issues with Cloudfront CDN
10th April 2013
Cloudfront is Amazon's Content Delivery Network (CDN) offering. It gives web developers an easy-to-use, cost-effective way of serving content from a variety of regions around the world, with the aim of speeding up website delivery.
This isn't a guide to using Cloudfront - if that's what you're after I'd suggest taking a look at this article.
In this article I will attempt to provide a solution to one of the only issues I've had when setting up a Cloudfront distribution, and that's the issue of duplicate content.
When you create a distribution in Cloudfront it's possible to use your website as the 'origin' for the 'distribution'. This basically results in having a complete static copy of your website available at another location.
So an image which is available at:
http://18a.co/img/logo.png
Would now also be available at:
http://d3uy7bfospl44.cloudfront.net/img/logo.png
(For illustrative purposes only - these URLs don't work)
While this makes using Cloudfront really easy (all you need to do is update your website to reference the files via Cloudfront), it's unfortunately not so great if Google gets its greedy mitts on it.
The fix I've settled on takes three steps:
The objective here is to use a robots.txt file to tell Google not to index the version of your site on the CDN. The main problem is that if you set your website document root to be the origin of the CDN distribution, then everything will be mirrored exactly as it is on your site. This includes your robots.txt file. Unfortunately at the time of writing, Cloudfront doesn't allow you to edit the robots.txt file available on your distribution (this would get around the problem), so you have to be a little bit creative.
Create an alternative version of your site available via a subdomain, for example http://static.18a.co. The idea being that all your content is available at both http://18a.co and http://static.18a.co. This might seem counter-intuitive as you now have 3 versions of your site, but bear with me.
Now edit your .htaccess file and add a directive that serves up a different robots.txt file depending on the hostname used to reach the site. Something like this should do the trick:
# This attempts to serve a custom robots.txt to the CDN subdomain
RewriteCond %{HTTP_HOST} ^static\.18a\.co$ [NC]
RewriteRule ^robots\.txt$ robots_cdn.txt [L,NC]
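The host-based switching that the rewrite rules perform can be sketched in plain Python too, which may help clarify the logic. This is an illustrative sketch, not part of the Apache setup; the hostnames and robots.txt bodies mirror the examples in this article:

```python
# Sketch: pick which robots.txt body to serve based on the Host header,
# mirroring what the .htaccess rewrite above does. Hostnames are illustrative.

ROBOTS_MAIN = (
    "User-agent: *\n"
    "Disallow: /cgi-bin/\n"
    "Disallow: /a/\n"
    "Disallow: /min/\n"
)
ROBOTS_CDN = "User-agent: *\nDisallow: /\n"  # block everything on the CDN host


def robots_for_host(host):
    """Return the robots.txt body to serve for a given Host header."""
    if host.lower() == "static.18a.co":  # the CDN origin subdomain
        return ROBOTS_CDN
    return ROBOTS_MAIN


print(robots_for_host("static.18a.co"))  # prints the disallow-everything file
```

The key point, as with the rewrite rule, is that the decision hinges only on the hostname; the path requested (`/robots.txt`) is the same on both sites.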
Create a new robots_cdn.txt file which contains the following:
User-agent: *
Disallow: /
Now if you visit http://18a.co/robots.txt you should see something like this (or whatever is in your original robots.txt file):
User-agent: *
Disallow: /cgi-bin/
Disallow: /a/
Disallow: /min/
However if you visit http://static.18a.co/robots.txt you should see something like this:
User-agent: *
Disallow: /
With that done, create a distribution using static.18a.co as the origin and it should mirror everything you want, including the special robots.txt directive asking Google to kindly ignore everything on that domain.
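If you want to sanity-check that the disallow-everything file really does block crawlers, Python's standard `urllib.robotparser` module can parse it and answer "can this agent fetch this URL?". The URL below is illustrative, matching the example subdomain used in this article:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt served on the CDN subdomain: disallow everything.
cdn_rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(cdn_rules)

# Googlebot (or any other agent) should be refused for every path on the CDN host.
print(rp.can_fetch("Googlebot", "http://static.18a.co/img/logo.png"))  # False
```

Any path you try should come back `False`, confirming that well-behaved crawlers will skip the mirrored copy entirely.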
If you have any feedback, please leave it in the comments.