Managing duplicates
I’ve been catching up on my SEO reading over the holidays, in particular SEOMoz posts. One post I spent time on was benjarriola’s on managing duplicate content at http://www.seomoz.org/blog/duplicate-content-block-redirect-or-canonical.
Previously I thought you only used the rel=canonical tag on unusual pages when you wanted to concentrate or redirect links. But after reading the above and other articles, I now see it’s a great tag that should always be in use as a preventative measure, rather than in response to a problem.
Rel=canonical
A canonical page is the preferred version of a page when you have multiple versions with identical content. It’s great for handling sessionid parameters, tracking links, and multiple URLs created by tagging and categories, as well as slight variations such as when the sort order of a page is changed.
For example, the page /content.php may also appear as /tag1/content.php, /tag2/content.php, /category1/content.php, /content.php?ssid=blah, and /content.php?utm=trackinginfo.
But we really only want /content.php indexed, and we want all the SEO juice concentrated on that page. So you would use the rel=canonical tag to specify the preferred URL as http://www.yourdomain.com/content.php for all of the above URLs.
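The tag itself just sits in the head of each of those pages and points at the preferred URL, something like:
<link rel="canonical" href="http://www.yourdomain.com/content.php" />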
Most good CMSs, blogs and shopping carts will have rel=canonical already included in the more recent versions of their software: WordPress, x-cart, Drupal all include it.
If your CMS doesn’t already include rel=canonical, you can build it yourself using server-side code to strip out the parameters for use in the canonical statement. For example, in PHP:
// Take the requested URI and drop everything from ?utm onwards,
// so tracking parameters don't end up in the canonical URL
$RequestURI = $_SERVER['REQUEST_URI'];
$RequestURI = str_replace(strstr($RequestURI, '?utm'), '', $RequestURI);
echo '<link rel="canonical" href="http://www.yourdomain.com' . $RequestURI . '" />';
It may be easier if your CMS has a separate user-maintained clean-url field that can be used in the rel statement.
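For instance, if your CMS exposes a clean-url field, the tag can be built straight from that. A minimal sketch, assuming a hypothetical $page['clean_url'] value (your CMS will name it differently):
// Hypothetical clean-url field maintained by the content editor
echo '<link rel="canonical" href="http://www.yourdomain.com' . $page['clean_url'] . '" />';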
301 Redirects
301 redirects actually send the visitor from an old URL to a new URL. They pass on and consolidate link juice, and are best used when the old page no longer exists. Using them to manage duplicate content can destroy the navigational path of the visitor, if they were systematically browsing through a section of the site and expect to remain there.
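If you do need to issue one from PHP rather than the server config, a minimal sketch looks like this (the target URL is just a placeholder):
// Send a permanent 301 redirect to the preferred URL, then stop processing
header('Location: http://www.yourdomain.com/content.php', true, 301);
exit;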
Robots.txt
You can list pages and folders you wish to be excluded from search engines in your robots.txt file.
The robots disallow statement, for either the whole site, a folder or a single URL, is supposed to tell search engine robots not to visit, and therefore not to crawl, the page.
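For example, a robots.txt blocking the tag folders from earlier might look like this (the paths are illustrative only):
User-agent: *
Disallow: /tag1/
Disallow: /tag2/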
However, if people are linking to a page that you have tried to exclude, you are wasting link juice, because you are blocking rather than redirecting; this is why the other techniques are frequently used instead.
And Google actually says they may index URLs if other pages are linking to them (“While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web”). So you’re at the mercy of others – this isn’t a foolproof way of excluding pages from the index.
Meta Robots: NoIndex/Follow tag
Noindex tells search engines not to index the page, thus eliminating duplicate content. Follow tells search engines to still follow the links found on this page, thus still passing around link juice.
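The tag goes in the head of the duplicate page, for example:
<meta name="robots" content="noindex, follow">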
The Meta Robots tag is generally considered more reliable than the robots.txt file, for excluding a page from being indexed.
Alternate link tag
Similar to the canonical link tag, the alternate link tag (rel=alternate with an hreflang attribute) is used mainly for international or multilingual SEO purposes.
All pages will still be indexed, but it helps Google choose the best result for each individual country version of Google, and avoids the problems Google may run into treating the pages as duplicate content.
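As a sketch, a page with an Australian and a US version might carry a pair of annotations like these (the /au/ and /us/ paths are just examples):
<link rel="alternate" hreflang="en-au" href="http://www.yourdomain.com/au/content.php" />
<link rel="alternate" hreflang="en-us" href="http://www.yourdomain.com/us/content.php" />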
How you handle multilingual sites depends on whether you have localized the content, or just the navigation/framework. But that’s a different post, and a future experiment.