Imagine you want to develop a search engine. Your users, of course, are people looking up things they are interested in. They will not use your search engine if you give them –
– irrelevant search results
– repetition of the same search result
The second one is what search engines classify as “duplicate content”. Every time a search is made, the search engine goes through its index of the web and gets a list of web pages it thinks are relevant to the query (point one above). The relevancy is determined by a complex equation with hundreds of parameters.
Once this list is ready, the search engine algorithm goes through another step. It determines how many of the results have identical or similar (90% or more) content. These “duplicate results” are all clubbed together and only one result is shown.
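Real search engines use proprietary, large-scale methods for near-duplicate detection, but the idea of an “identical or similar (90% or more)” check can be illustrated with Python’s standard difflib. This is a toy sketch only – the sample texts and the 0.9 threshold are made up for illustration:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two text snippets."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two near-identical snippets, as might appear on two different pages
page_a = "Duplicate content is content that appears on more than one page."
page_b = "Duplicate content is content that appears on more than one web page."

score = similarity(page_a, page_b)
print(f"similarity: {score:.2f}")

# Pages above a chosen threshold (the 90% mentioned above) get clubbed together
print("duplicate" if score >= 0.9 else "unique")
```

A real engine would compare hashed “shingles” of text across billions of pages rather than whole strings, but the clubbing decision rests on the same kind of similarity score.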
Based on pre-determined parameters, the search engine decides which one to show the user. Basically, if you were looking something up, would you want ten different options/opinions or ten links with the same content?
So, from a search engine’s point of view, it is not really a penalty, because they are not down-ranking your site. They are just clubbing results under one search result. Interestingly enough, one of the parameters considered is the rank assigned to the page/site. The higher the rank, the better the chance of that page being chosen over other pages with the same content.
What’s to be noted is that though your page rank doesn’t get affected, your relevant page won’t be shown as a result. In effect, you aren’t getting traffic from that search. So, for all practical purposes, it is a penalty for you.
How can you make sure your pages are not considered duplicate content?
As a logical flow from the intention of search engines, these are the things you ought to take care of –
Carry original content
As obvious as it gets: don’t have duplicate content and you won’t get penalised for carrying duplicate content.
Use canonical tag
A great way to get more people to read your content is to have more sites publish your content. However, when you do that, make sure you let the search engine crawler know what it should consider the original content. Google recognizes the “canonical” tag, which is used like so –
<link rel="canonical" href="URL to be considered original" />
This becomes even more important if the content is carried by sites ranked higher than yours.
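As a sketch of where the tag sits, a syndicating site’s copy of your article might carry it in its <head> like this (the URLs and title here are hypothetical):

```html
<!-- On the syndicating site's copy of the article -->
<head>
  <title>Avoiding duplicate content</title>
  <!-- Tells crawlers the original lives on your site (hypothetical URL) -->
  <link rel="canonical" href="https://www.yoursite.example/avoiding-duplicate-content" />
</head>
```

The tag goes on the duplicate page and points at the page you want treated as the original.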
On the other hand, a different way to think about syndicating your content is to use it for branding rather than for driving traffic to your site. In that case, don’t worry about your page being considered duplicate content.
Ask that you post first
Over and above carrying the canonical tag, you can request the other sites to wait a couple of days before they post the content. That way the search engine knows you posted first.
Don’t re-post entire content
Whenever possible, and if it makes sense, don’t allow other sites to use your content in its entirety. Have them carry a paragraph or 100 words or so, and have them link back to your site to continue reading. This way your page has unique content that is not on the other site.
Take care of internal duplication
If you have a mobile site and/or printable versions of your pages, you will have two pages on your site carrying the same content, only in different formats. This is likely to be considered duplicate content. You can use canonical tags on the pages you don’t want the search engine to consider.
Avoid repetitive content on each page of your site
Usually, search engines are smart enough to spot repetition when it is, say, in the text of a comments section/form or in the legal pages. But it might be a good idea to move large pieces of content that repeat on every page to a page of their own. You can then link to that page and avoid the problem.
As a follow-up to the earlier point, don’t have blank pages. More likely than not, they will carry automatically generated, repetitive content.
Check for plagiarism
Keep an eye out for sites using your content without permission. You can use a tool like Copyscape, or simply copy sample text from your page and paste it into a search engine inside quotation marks. If there are other sites using your content and passing it off as theirs, you can –
– request them to take it off, if you don’t want them to carry it
– request them to use the canonical tag, if you don’t mind it
– if they refuse, report them. (This link is for reporting to Google, all search engines will have a similar process)
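The manual quoted-phrase check described above can be scripted. This is a minimal sketch: the Google endpoint is used purely as an example, and the sample sentence is hypothetical – substitute a distinctive line from your own page.

```python
from urllib.parse import quote_plus

def phrase_search_url(sample_text: str) -> str:
    """Build a search URL that looks for an exact sentence from your page.

    The query is wrapped in quotation marks so the engine matches the
    phrase verbatim; google.com/search is just an example endpoint.
    """
    query = f'"{sample_text}"'
    return "https://www.google.com/search?q=" + quote_plus(query)

# Hypothetical sample sentence from one of your pages
url = phrase_search_url("a distinctive sentence copied from your page")
print(url)
```

Opening the printed URL shows every indexed page carrying that exact sentence; any result other than your own page is a candidate for the steps listed above.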
Do not be malicious
No one wants to be manipulated, not even search engines. So, make sure your site doesn’t have –
– plagiarised content
– content that tries to manipulate the search engine into believing that it is relevant. All popular search engines are smart; they will figure it out.
The crux of the discussion is that you don’t have much to worry about if your content is unique and you keep track of the sites that are using it.
Are there any other techniques you use to make sure your pages don’t get marked as duplicate? Please do share and enrich this list.