How to Avoid Duplicate Content and Its Penalties
In this article, we’ll cover what duplicate content is, why you don’t want too much of it on your site, and how it affects your SEO. Let’s get started.
What Is Duplicate Content?
Duplicate content is content that appears at more than one URL, either within a single domain or across several domains. When this happens, it can be tough for search engines to decide which version to show in the results.
There are two types of duplicate content: internal and external.
Internal Duplicate Content
So, what is internal duplicate content?
Internal duplicate content occurs when many URLs on your own site point to the same material, which can cause problems with search engines.
Duplicate content can have many causes, and not all of it is editorially created. In most cases, website owners don’t deliberately create duplicate content. Even so, an estimated 30% of the web is duplicate content.
The main causes of internal duplicate content
- Session IDs: A session ID is the unique identifier of a session. A session is a history of what the visitor did on your site, and this needs to be stored somewhere. The most common solution is to store it in cookies, yet search engines don’t usually store cookies. At that point, some systems fall back to putting the session ID in the URL. This means that every internal link on the website gets that session ID added to its URL, and because the ID is unique to that session, each link becomes a new URL and therefore duplicate content.
- URL parameters: Parameters such as click tracking and some analytics codes can cause duplicate content issues. Where possible, avoid adding URL parameters or alternate versions of URLs.
- www and non-www versions: If your site is reachable at both “www.mysite.com” and “mysite.com” (with and without the “www” prefix), the same content lives at both versions. The same applies to sites that maintain both http:// and https:// versions. If both versions of a page are live, you can run into a duplicate content issue.
- Parameters and faceted navigation
- Trailing Slashes
- Index pages
- Alternate page versions such as m. or AMP pages or print
- Dev/hosting environments
- Pagination
- Country/language versions
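To make several of these causes concrete, here is a minimal Python sketch that collapses a few URL variants into one preferred form. The parameter names in IGNORED_PARAMS, the choice of https and the bare domain as the preferred version, and the normalize_url helper itself are all assumptions made for illustration, so adjust them to match what your own site actually produces.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names; replace with the ones your site emits.
IGNORED_PARAMS = {"sessionid", "sid", "phpsessid",
                  "utm_source", "utm_medium", "utm_campaign"}
INDEX_PAGES = ("index.html", "index.htm", "index.php")


def normalize_url(url: str) -> str:
    """Collapse common duplicate-URL variants into one preferred form."""
    parts = urlsplit(url)

    # Prefer the bare domain over "www." (an arbitrary choice for this sketch).
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Drop index-page file names and trailing slashes.
    path = parts.path
    for index_page in INDEX_PAGES:
        if path.endswith("/" + index_page):
            path = path[: -len(index_page)]
            break
    path = path.rstrip("/") or "/"

    # Remove session and tracking parameters, and sort the rest so that
    # parameter order alone does not produce distinct URLs.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )

    # Always emit https and discard any fragment.
    return urlunsplit(("https", host, path, urlencode(query), ""))


variants = [
    "http://www.mysite.com/duplicate-content/",
    "https://mysite.com/duplicate-content?sessionid=abc123",
    "https://mysite.com/duplicate-content/index.html?utm_source=newsletter",
]
unique = {normalize_url(v) for v in variants}
print(unique)
```

All three variants reduce to the same URL, which is the kind of consolidation you want search engines to see. A script like this only helps you spot the variants; it does not fix them on the live site.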
External Duplicate Content
If you have been blogging for a while, you have probably heard of content scrapers.
Content scrapers are often the cause of external duplicate content. These sites steal your content without permission, either manually or via automated programs.
How you can stop content scrapers
Create Google Alerts: Create a Google Alert for your post’s title by putting the title in quotation marks. This way, you receive regular emails with the results.
Set Up Trackbacks: If you use WordPress, you can receive trackbacks from sites that copy your content with its links intact. Trackbacks are WordPress’s way of letting you know that another website has linked to a post on your blog.
Use Webmaster Tools: If you use Google Webmaster Tools, look under “Traffic” for the page called “Links to your site”. Your scrapers will probably be listed there.
Report Them: File a DMCA (Digital Millennium Copyright Act) takedown notice with their host.
Do Nothing: You won’t be able to keep up with the number of scrapers on the web. The fact of the matter is, it will take too long to fight them all. Just chill, have fun, and focus on creating quality content.
Non-Malicious Duplicate Content
Some duplicate content will not hurt or jeopardize your rankings whatsoever. According to Google, this includes:
- Discussion forums that can generate both regular and stripped-down versions of the pages targeting mobile devices
- Store items that are shown or linked via multiple distinct URLs
- Printer-only versions of web pages
How to Avoid Internal Duplicate Content
Use a Tool to Assess Your Site
I recommend using Siteliner (www.siteliner.com).
Use this tool to receive a site report containing information on duplicate content, broken links, and more.
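If you want a rough idea of what an exact-duplicate check looks like, here is a small Python sketch. It is not how Siteliner works internally. It simply groups pages whose text is identical once case and whitespace are normalized. The fingerprint and find_duplicates helpers and the sample pages are hypothetical, and the sketch assumes you have already extracted the text of each page.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after collapsing case and whitespace differences."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose text is identical after normalization."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical page texts, already extracted from the HTML.
pages = {
    "https://mysite.com/duplicate-content": "What is duplicate content?",
    "https://mysite.com/duplicate-content?print=1": "What is  duplicate content?\n",
    "https://mysite.com/about": "About this site.",
}
duplicates = find_duplicates(pages)
print(duplicates)
```

This only catches exact copies. Pages that are nearly identical, such as the same article with a different sidebar, need a fuzzier comparison, which is where a dedicated tool earns its keep.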
Use the Rel=Canonical Tag to Signify Original Content
A canonical tag (aka “rel canonical”) is a way of informing search engines that a particular URL represents the master copy of a page. Using the canonical tag obviates problems caused by identical or “duplicate” content appearing on various URLs.
So for example, say the canonical URL on your site is mysite.com/duplicate-content.
Then you have a duplicate of that URL for some reason. It could be there on purpose, or it could be a complication in the site structure. It could even be there for tracking or testing purposes. That URL is mysite.com/duplicated-contents. Google could pick any of these versions to rank the highest. If you want a specific page to rank the highest, you need to add this line to your code.
<link rel="canonical" href="https://mysite.com/duplicate-content" />
The href tells Google which page is the canonical one. Place this tag in the <head> section of each duplicate page. There are many different ways to canonicalize multiple URLs, but we’ll go over that another time.
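Once the tag is in place, it is worth checking that each page really declares the canonical URL you expect. Here is a minimal Python sketch using the standard library’s html.parser. The CanonicalFinder class, the find_canonical helper, and the sample page (including its https:// URL) are assumptions for illustration only.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonical(html: str) -> list[str]:
    """Return all canonical URLs declared in the given HTML."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Hypothetical page source for illustration.
page = """
<html>
  <head>
    <title>Duplicate content</title>
    <link rel="canonical" href="https://mysite.com/duplicate-content" />
  </head>
  <body>...</body>
</html>
"""
print(find_canonical(page))
```

The function returns a list so that you can also catch pages that declare no canonical URL at all, or more than one, both of which are worth fixing.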
Create Original, High-Quality Content
Want to know one of the most effective solutions to avoid internal duplicate content?
It’s quite simple. Website owners need to focus on publishing high-quality, expert content. Create original content, and you will be fine. The more original content your pages contain, the better Google tends to position them.
Now that you’re a duplicate content wiz, check out the tool mentioned above to see how much of your site is duplicated, and try to find sites that have scraped your content. Otherwise, just focus on churning out more quality content and you will be fine. Google will know who the original author of a post is in the end.