Tips to avoid duplicate content using canonical tags

What is a Canonical Tag?

A canonical tag, written as rel="canonical", is a way of telling search engines that a specific URL is the main copy of a page. It prevents the problems caused by identical or near-identical content appearing at multiple URLs. The basic idea is that if you have several similar versions of the same content, you pick one “canonical” version and point the search engines at that URL. This solves the duplicate content problem, where search engines don’t know which version of the content to show in their results.

Duplicate content is discouraged and adds no value in the eyes of search engines. Having pages of identical or very similar content on your website is seen as a negative, and Google may use it to devalue your website when determining rankings. If you use HTTPS on your site, use a content management system like WordPress or Drupal, or run an eCommerce website, the number of different URLs people can use to reach the same content exposes you to a major SEO vulnerability if it is not properly addressed. By properly applying canonical tags to the pages on your site, you can avoid this red flag and get the full benefit of both a strong site and streamlined Search Engine Optimisation practices.

SEO Benefits

Choosing a proper canonical URL for every set of similar URLs improves the SEO of your site. This is because the search engine knows which version is canonical, so it can count all the links pointing at all the different versions as links to the canonical version. Setting a canonical is similar in concept to a 301 redirect, only without actually redirecting. There are a number of reasons why you would want to explicitly choose a canonical page in a set of duplicate pages:

It specifies which URL you want people to see in search results. It consolidates link signals for similar or duplicate pages. It helps search engines consolidate the information they have for the individual URLs into a single, preferred URL. It prevents crawlers from spending time on duplicate pages.

How to set a canonical URL?

Choose one of your two pages as the canonical version. This should be the version you think is the most important. If you don’t care, pick the one with the most links or visitors. Then add a rel=canonical link from the non-canonical page to the canonical one. This merges the two pages into one from a search engine’s perspective. It acts as a soft redirect, without actually redirecting the user. Links to both URLs now count toward the single, canonical version of the URL.
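As a rough illustration of the step above, the following Python sketch builds the link element that would go in the <head> of the non-canonical page. The function name canonical_link_tag and the example.com URLs are assumptions made for this example only.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Return a <link rel="canonical"> element pointing at the chosen URL.

    The href is HTML-escaped so that characters such as & in a query
    string do not break the attribute.
    """
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# Hypothetical example: the print version of a page points at the main version.
print(canonical_link_tag("https://example.com/blue-shoes/"))
```

The printed element is what you would paste into the <head> of the duplicate page. The canonical page itself can carry the same tag, pointing at its own URL.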

The canonical tag implementation

Since Google may treat uppercase and lowercase URLs as two different URLs, you want to first make sure to force lowercase URLs on your server and then use lowercase URLs for your canonical tags. If you switched over to SSL, make sure that you don’t declare any non-SSL (i.e., HTTP) URLs in your canonical tags. Doing so can theoretically lead to confusion and unexpected results. 
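A minimal sketch of these two rules, assuming your server really does serve every page at a lowercase HTTPS URL. The helper name normalize_canonical is hypothetical, and the query string is deliberately left untouched because parameter values can be case-sensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_canonical(url: str) -> str:
    """Force the https scheme and lowercase the host and path of a URL."""
    parts = urlsplit(url)
    return urlunsplit((
        "https",                # never declare a non-SSL canonical
        parts.netloc.lower(),   # host names are case-insensitive anyway
        parts.path.lower(),     # assumes the server forces lowercase paths
        parts.query,            # left as-is: values may be case-sensitive
        parts.fragment,
    ))


print(normalize_canonical("HTTP://Example.com/Blue-Shoes/"))
```

Running every URL through one function like this before writing it into a canonical tag keeps the tags consistent with each other.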

Setting canonicals in sitemaps

Google states that non-canonical pages shouldn’t be included in sitemaps. Only canonical URLs should be listed, because Google sees the pages listed in a sitemap as suggested canonicals. However, it won’t always select the URLs in a sitemap as canonicals.
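One way to keep non-canonical URLs out of a sitemap is to filter them when the file is generated. The sketch below assumes you already have a list of (page URL, canonical URL) pairs from your own site data; the function name build_sitemap and the sample URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, canonical_url) pairs.

    A page is listed only when it is its own canonical.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, canonical in pages:
        if url != canonical:
            continue  # skip duplicates that point at another URL
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")


pages = [
    ("https://example.com/shoes", "https://example.com/shoes"),
    ("https://example.com/shoes?ref=mail", "https://example.com/shoes"),
]
print(build_sitemap(pages))
```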

Internal links

How you link from one page to another throughout your site is a canonicalization signal. The more consistent you are with all of these signals, the easier it will be for search engines to determine your preferred canonical URL. As John Mueller has noted, Google also has a preference for HTTPS over HTTP URLs, and for prettier URLs.

Common canonicalisation mistakes to avoid

Canonicalization is a somewhat complex topic. As such, there are a lot of misunderstandings and misconceptions about how to canonicalize properly.

Blocking the canonicalised URL via robots.txt

Blocking a URL in robots.txt prevents Google from crawling it, meaning that they’re unable to see any canonical tags on that page. That, in turn, prevents them from transferring any “link equity” from the non-canonical to the canonical.
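A quick way to catch this mistake is to test your URLs against your robots.txt rules. This sketch uses Python's standard urllib.robotparser; the rules and URLs are invented for the example, and the standard-library matcher is simpler than Google's own (it does not handle wildcards), so treat the result as a first check only.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: the print versions of pages are blocked.
robots = """User-agent: *
Disallow: /print/
"""

# A blocked duplicate cannot pass its canonical tag on to the crawler.
print(is_crawlable(robots, "https://example.com/print/blue-shoes"))
print(is_crawlable(robots, "https://example.com/blue-shoes"))
```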

Setting the canonicalised URL to ‘noindex’

Never mix noindex and rel=canonical. They’re contradictory instructions. According to John Mueller, Google will usually prioritize the canonical tag over the ‘noindex’ tag, but it’s still bad practice. If you want to noindex and canonicalize a URL, use a 301 redirect. Otherwise, use rel=canonical.
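To spot pages that send both instructions, you can scan the HTML for a robots meta tag and a canonical link together. The following is only a sketch using Python's standard html.parser; the class and function names are hypothetical, and it does not look at noindex sent through HTTP headers.

```python
from html.parser import HTMLParser


class IndexSignals(HTMLParser):
    """Record whether a page declares noindex and/or a canonical URL."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")
        if tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "noindex" in (a.get("content") or "").lower():
                self.noindex = True


def mixes_noindex_and_canonical(html: str) -> bool:
    """Return True when the page carries both noindex and rel=canonical."""
    signals = IndexSignals()
    signals.feed(html)
    return signals.noindex and signals.canonical is not None


page = """<head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/blue-shoes/">
</head>"""
print(mixes_noindex_and_canonical(page))
```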

Setting a 4XX HTTP status code for the canonicalised URL

Setting a 4XX HTTP status code for a canonicalized URL has the same effect as using the ‘noindex’ tag: Google will be unable to see the canonical tag and transfer “link equity” to the canonical version.

Canonicalising all paginated pages to the root page

Paginated pages should not be canonicalised to the first paginated page in the series. Instead, self-referencing canonicals should be used on all paginated pages.
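As a small sketch of self-referencing canonicals on a paginated series, the function below gives each page a canonical that points at itself. It assumes pages are addressed with a ?page=N parameter and that page 1 lives at the bare URL; your own URL scheme may differ, and the function name is made up for this example.

```python
def paginated_canonical(base_url: str, page: int) -> str:
    """Return a self-referencing canonical tag for one page of a series."""
    # Assumption: page 1 is the bare URL, later pages use ?page=N.
    url = base_url if page <= 1 else f"{base_url}?page={page}"
    return f'<link rel="canonical" href="{url}">'


for n in (1, 2, 3):
    print(paginated_canonical("https://example.com/blog/", n))
```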

Not using canonical tags with hreflang

Hreflang tags are used to specify the language and geographical targeting of a webpage. Google states that when using hreflang, you should “specify a canonical page in the same language, or the best possible substitute language if a canonical doesn’t exist for the same language.”
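A rough sketch of how the two kinds of tag can sit together: each language version declares a canonical in its own language and lists every version as an hreflang alternate. The function name, language codes, and URLs below are assumptions for illustration.

```python
def head_tags(versions: dict, current_lang: str) -> list:
    """Build canonical and hreflang tags for one language version of a page.

    versions maps a language code to the URL of that language version.
    """
    # The canonical stays within the current page's own language.
    tags = [f'<link rel="canonical" href="{versions[current_lang]}">']
    # Every version, including the current one, is listed as an alternate.
    for lang, url in versions.items():
        tags.append(f'<link rel="alternate" hreflang="{lang}" href="{url}">')
    return tags


versions = {
    "en": "https://example.com/en/shoes/",
    "de": "https://example.com/de/schuhe/",
}
print("\n".join(head_tags(versions, "de")))
```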

Having multiple rel=canonical tags

If a page has multiple rel=canonical tags, Google will likely ignore all of them. In many cases this happens because tags are inserted into a system at different points, such as by the CMS, the theme, and one or more plugins. This is why many plugins have an overwrite option meant to make sure that they are the only source of canonical tags.

Another area where this might be a problem is with canonicals added with JavaScript. If you have no canonical URL specified in the HTML response and then add a rel=canonical tag with JavaScript then it should be respected when Google renders the page. However, if you have a canonical specified in HTML and swap the preferred version with JavaScript, you are sending mixed signals to Google.
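To check the HTML response for duplicates, you can simply collect every canonical link it contains. This sketch uses Python's standard html.parser with hypothetical names; it only inspects the HTML you pass in, so canonicals added or changed later by JavaScript will not show up.

```python
from html.parser import HTMLParser


class CanonicalCollector(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.hrefs.append(a.get("href"))


def canonical_hrefs(html: str) -> list:
    """Return all canonical URLs declared in the given HTML."""
    collector = CanonicalCollector()
    collector.feed(html)
    return collector.hrefs


page = """<head>
<link rel="canonical" href="https://example.com/shoes/">
<link rel="canonical" href="https://example.com/shoes/?theme=1">
</head>"""
found = canonical_hrefs(page)
if len(found) > 1:
    print("Multiple canonical tags found:", found)
```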

Rel=canonical in the <body>

Rel=canonical should only appear in the <head> of a document. A canonical tag in the <body> section of a page will be ignored. Where this can become a problem is with the parsing of a document. While the source code of a page may have the rel=canonical tag in the correct location, when the page is actually constructed in a browser or rendered by a search engine, many different things such as unclosed tags, injected JavaScript, or <iframe> tags in the <head> section can cause the <head> to end prematurely while rendering. In these cases a canonical tag may accidentally end up in the <body> of a rendered page, where it will not be respected.
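A simple first check is to look for canonical tags that sit outside the <head> in the page source. The sketch below, with hypothetical names, only follows the <head> and <body> tags as written; it does not reproduce how a browser or search engine closes the <head> early during rendering, so a page can pass this check and still have the problem described above.

```python
from html.parser import HTMLParser


class CanonicalPlacement(HTMLParser):
    """Find canonical tags that appear outside the <head> in the source."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.outside_head = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "body":
            self.in_head = False
        elif tag == "link":
            a = dict(attrs)
            is_canonical = "canonical" in (a.get("rel") or "").lower().split()
            if is_canonical and not self.in_head:
                self.outside_head.append(a.get("href"))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def canonicals_outside_head(html: str) -> list:
    """Return canonical URLs declared anywhere other than the <head>."""
    checker = CanonicalPlacement()
    checker.feed(html)
    return checker.outside_head


page = """<html><head><title>Shoes</title></head>
<body><link rel="canonical" href="https://example.com/shoes/"></body></html>"""
print(canonicals_outside_head(page))
```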
