Duplicate content for SEO and SEO for duplicate content

by Michael Martinez on February 26, 2009

2010 SEMMY Nominee

Duplicate content came up at SMX West in more than one panel. Of course, Google, Microsoft, and Yahoo! dropped a bombshell announcement in the “Meet The Search Engines” panel on Thursday, where they launched their new Canonical URL meta tag (technically a <link rel="canonical"> element that goes in the page head).

There are times when people in the SEO community act like duplicate content is a death sentence for their Web sites. I’ve never understood that attitude, as there are reams of SEO articles that discuss what duplicate content is, how to deal with it, and what you can expect from having duplicate content. Despite the fact that duplicate content is one of the most well-documented issues in search engine optimization, it continues to receive a bad wrap– er, rap.

People just don’t get duplicate content. Let me say that again: “People just don’t get duplicate content”.

Duplicate content occurs naturally for many perfectly acceptable reasons. I cannot list them all here but we can categorize duplicate content in the following ways:

  1. Natural Duplication resulting from CMS functionality.
  2. Natural Duplication resulting from keyword-injection into boilerplate templates.
  3. Natural Duplication resulting from accessibility and usability functionality.
  4. Natural Duplication resulting from mirroring, mismanaged domain replication, etc.
  5. Natural Duplication resulting from syndication, redistribution, and self-promotion.
  6. Unnatural Duplication resulting from sneaky framing and hijacking.
  7. Unnatural Duplication resulting from unauthorized republication.
  8. Unnatural Duplication resulting from scraping.

Scraping is a very specialized form of unauthorized republication, and I break it out here to differentiate it from the kind of unauthorized republication that occurs because of naivete rather than intentional replication.

Natural duplication of content causes more issues for the average SEO than unnatural duplication, but some scrapers are so aggressive (and so good at optimization) they manage to outrank the source sites.

Not all duplicate content is bad. Also, there are situations where duplicate content can help you. The SEO community is really struggling to craft a common message about how to manage duplicate content efficiently, and not so much about eliminating duplicate content. Unfortunately, some SEOs focus on the less important issues so the signals from our industry remain mixed.

Let’s take a look at some of the cons and pros.

Why It’s Bad To Have Duplicate Content

It screws up site search – This is where duplicate content inflicts the most harm, particularly in Google. In fact, Google will treat your pages as duplicates even if ONLY your page titles and meta descriptions are the same. You get that infamous “Omitted Results” marker because all the listings look like the same thing.

Solution To Duplicate Titles/Meta Descriptions – Insert unique page titles and descriptions into every page. Many CMS applications make this difficult, I know, but unique page titles DO also help improve your relevance for targeted searches — the long-tail queries, especially, where there is relatively little competition, benefit immensely from optimized page titles.
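Before you can write unique titles and descriptions you have to know which pages share them. Here is a minimal sketch of that audit, assuming you already have the HTML of each page in hand. The names HeadExtractor and find_duplicate_heads are made up for illustration, and this says nothing about how any search engine actually detects duplicates.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collects the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicate_heads(pages):
    """Group URLs that share the same (title, description) pair.

    `pages` maps URL -> HTML source. Only groups with 2+ URLs are returned.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        parser = HeadExtractor()
        parser.feed(html)
        # Compare case-insensitively so trivial variations still match.
        key = (parser.title.strip().lower(), parser.description.lower())
        groups[key].append(url)
    return {key: sorted(urls) for key, urls in groups.items() if len(urls) > 1}
```

Every group this returns is a set of pages that need their own title and description.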

Splitting PageRank – SEO Consensus says this is a Bad Consequence of Duplicate Content. Bah! Humbug! All PageRank can be channeled through internal linkage, so the fact your link profile is pointing to multiple versions of the same content is not the kiss of death. Good optimization can deal with this in several ways. (Aside: Matt Cutts once made a comment that led me to wonder if pages actually don’t pass on all their PageRank in Google’s algorithm ….)

Old Solution To Split PageRank Problem – The most efficient way to deal with this would have been to use the robots meta tag to say “noindex,follow,noarchive” (you need to include a link on the page to the canonical URL). Okay, not everyone was in a position to do that. Some of the bad solutions I’ve seen recommended included: Using the robots meta tag to “noindex,nofollow,noarchive” (wastes any internal on-page link power), using “rel=’nofollow’” on internal links (doesn’t help with external links), using 301-redirects (sort of defeats the purpose of having useful duplicate content), using robots.txt to block indexing (wastes link power).

New Solution To Split PageRank Problem – Use the new canonical URL meta tag (but also use the old solution for search engines that don’t honor the tag). The canonical URL meta tag is a much cleaner, more “elegant” solution. However, until the canonical tag becomes universally supported I recommend using both solutions for at least a couple of years.
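Using both solutions on a duplicate page might look something like the sketch below. The function names are hypothetical and example.com stands in for your own site. The canonical hint is emitted as a link element, which is how the search engines specified it, and the robots meta tag carries the “noindex,follow,noarchive” directives from the old solution.

```python
from html import escape


def canonical_link(canonical_url):
    """Return the canonical hint as a <link> element for the page <head>."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url, quote=True)


def robots_meta(directives=("noindex", "follow", "noarchive")):
    """Return the robots meta tag used by the older approach."""
    return '<meta name="robots" content="%s">' % ",".join(directives)


def duplicate_page_head(canonical_url):
    """Both hints together, for engines that honor only one of them."""
    return "\n".join([canonical_link(canonical_url), robots_meta()])


def visible_canonical_anchor(canonical_url, text="View the original page"):
    """The on-page link to the canonical URL that the old solution calls for."""
    return '<a href="%s">%s</a>' % (escape(canonical_url, quote=True), escape(text))


if __name__ == "__main__":
    print(duplicate_page_head("http://www.example.com/widgets"))
    print(visible_canonical_anchor("http://www.example.com/widgets"))
```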

Might Incur A Penalty – I think most practicing SEOs now realize that duplicate content is more likely to be filtered than penalized. Some sites that use duplicate content excessively get penalized or banned. However, simply having a lot of duplicate content doesn’t mean you’re being excessive. Of course, some search engines are more tolerant than others.

Correct response(s) to “Might Incur Penalty” – Nothing has really changed. We still want to reassure people that penalties are not likely to happen, but they should still optimize their content to achieve the maximum possible visibility and relevance for appropriate expressions. That usually means eliminating most if not all duplicate content from a site.

On-site Navigation Is Overcomplicated – You rarely hear people discuss this issue, but the biggest problem with most duplicate content is that there are internal links pointing to it. Those links usually creep in as one-off additions that don’t really fit the intended navigational scheme. It’s important to recognize bad navigation for what it is. If it’s causing problems with your SEO, it’s almost certainly causing problems for your visitors.

Old Solutions To Bad Navigation – Option 1) Fix your navigation. This is often painful, especially for large, well-established sites that have developed overly complex navigation through accretion. It can be extremely expensive to fix navigation. Option 2) Redesign your site. This can be almost as painful as Option 1.

New Solutions To Bad Navigation – Fix your navigation if it’s not too painful. Otherwise, consider using the new canonical URL meta tag. Redesigning the site may still be the best option, but at least now we have a third alternative.

Unnecessary Maintenance – Some companies publish multiple copies of important content on their sites because they have set up “marketing sections” or otherwise compartmentalized their sites. They decide it’s easier to just add an informational page to each section rather than to devise a complicated, integrated navigation hierarchy. Leaving that argument aside, more than one large corporate site has been burdened with updating multiple copies of the same page in many different sections.

Solutions For Unnecessary Maintenance – Consider updating or replacing your CMS so that you can manage “classes” of documents in addition to document locations. That way, if a class extends to 15 locations, you can change all 15 at once. If you define the page class functionality correctly, you should be able to manage the meta tags efficiently too. I would embed the canonical URL meta tag in the class definition.

Or, you can redesign your Web site so you don’t have all that duplicate content.
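If you go the CMS route, the document “class” idea can be sketched as follows. This is a toy model, not any real CMS: DocumentClass and its fields are assumptions made for illustration, and HTML escaping is left out to keep it short. The point is that the body and the canonical URL are stored once and every location renders from them.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentClass:
    """One piece of content published at several locations (hypothetical model)."""

    name: str
    canonical_url: str
    title: str
    body: str
    locations: list = field(default_factory=list)

    def update_body(self, new_body):
        # A single edit; every location renders from the same stored body.
        self.body = new_body

    def render(self, location):
        if location not in self.locations:
            raise ValueError(f"{location} is not a location of {self.name}")
        # The canonical hint lives in the class definition, so every copy
        # carries the same one. Escaping is omitted for brevity.
        return (
            "<html><head>"
            f"<title>{self.title}</title>"
            f'<link rel="canonical" href="{self.canonical_url}">'
            "</head>"
            f"<body>{self.body}</body></html>"
        )

    def render_all(self):
        """Render every location of this class in one pass."""
        return {location: self.render(location) for location in self.locations}
```

Change the body once with update_body and render_all gives you all 15 (or however many) copies, each carrying the same canonical hint.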

Why It’s Good To Have Duplicate Content

Targeting Functionality – Yes, Virginia, it’s okay to have duplicate content URLs for printing, accessibility, different templates, search functions, etc. Especially in database-driven, CMS-styled sites, it’s common to offer functionality that changes a page’s physical URL with parameters.

Methods For Managing Targeted Functionality – Some people use mod_rewrite to streamline page URLs by hiding parameters. Some people use robots.txt to block crawling for the duplicate pages. But you have to manage external URL references as well as those on your own site. Frankly, I think the canonical URL meta tag makes a lot of sense for managing targeted functionality that produces duplicate content.
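To put a canonical URL on a parameter-driven page you first have to compute it. Here is a rough sketch that strips the parameters which only change presentation. The parameter names in PRESENTATION_PARAMS are invented, so substitute whatever your own CMS uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that change presentation, not content.
PRESENTATION_PARAMS = {"print", "template", "fontsize", "sessionid"}


def canonical_url(url, ignored=PRESENTATION_PARAMS):
    """Drop presentation-only parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    # Sort so that parameter order alone never produces a second URL.
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    print(canonical_url("http://www.example.com/article?id=42&print=1"))
```

The printable, large-type, and session-tagged versions of a page all reduce to one URL, and that is the URL to put in the canonical tag on each of them.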

Syndicating Content – Many a merchant has lived to rue the day he distributed content to his affiliates because often the affiliate farmers do a better job of optimizing than the source merchants. I don’t think the canonical URL meta tag is going to help here unless you modify your affiliate program’s terms of service and enforce the requirement strictly. That’s expensive.

That said, syndicated content increases your visibility and (depending on how well the content is organized) extends your brand value. Those interminably long sales pitch pages with all the little special information boxes and calls to action do nothing to extend or build your brand value. They may bring in sales but they’re not making you look good.

Solutions For Syndicated Content – Although it creates more work, I strongly believe in keeping source site content separate from affiliate syndication content. If you’re going to push content to your affiliates, push content that isn’t found on your site.

Furthermore, push content that affiliates MUST integrate into their own copy. Don’t use cookie-cutter templates for your affiliate program. They make your affiliates look cheap and smarmy and therefore they make you look cheap and smarmy. More importantly, when your affiliates integrate your syndicated content into their own site designs and content they break the duplicate content syndrome.

A lot of syndicators now provide RSS feeds for their affiliates. I think this is probably the best way to go.

Enhancing Visitor Experience – Yes, it’s good to offer accessible and printable versions of your otherwise horribly designed– I mean, your graphically enhanced Web pages. Cascading style sheets offer sites ample opportunity to handle these issues, but sometimes you just don’t have the option of relying on multiple style sheets.

Solutions For Enhanced Visitor Experience – Use the canonical URL meta tag and the robots meta tag with “noindex,follow,noarchive” for your duplicate pages (include a link to the canonical URL on the page for the robots meta tag). If you have a LOT of these extra pages, you can exclude them with robots.txt but people may still link to them (in fact, I have often linked to printable versions of pages for people’s convenience). I would prefer NOT to block these pages with robots.txt.
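One way to see the trade-off with robots.txt: a crawler that obeys it never fetches a blocked page, so it never reads the tags in that page’s head either. The sketch below checks this with Python’s standard robots.txt parser. The /print/ rule and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: printable copies live under /print/ and are blocked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def load_rules(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def head_tags_readable(rules, url, user_agent="*"):
    """A compliant crawler can only read a page's tags if it may fetch the page."""
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    for url in (
        "http://www.example.com/print/article-42",
        "http://www.example.com/article-42",
    ):
        print(url, head_tags_readable(rules, url))
```

If the printable page is blocked, the canonical tag and the robots meta tag on it go unread, and any links people point at it lead to a page the crawler won’t open.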

Keyword Injection Into Boilerplate – This is a great time-saving means of creating lots of “relevant” content, isn’t it? How many thousands of “national” companies and information services have done this already? We see it in news, travel, real estate, medical and dental care, auto services, big retailer store industries, and more. Virtually any company or association with a physical presence in a local market has been pitched or tried to use keyword injection. A lot of online retailers, advertising affiliate farms, and other information leveraging sites have also used this method.

It’s not really good for search engine optimization as you may sacrifice a lot of long-tail traffic for the sake of populating 10,000 pages of content with city names, business names, etc. Time-savers let you focus on the hyperoptimized keywords but if you want to go after the majority of the traffic that is actually available to you, keyword injecting is the least efficient way to do it.

I’ve worked with people who have optimized keyword injection by supplementing the repurposed boilerplate with additional content. That takes a lot of work but I think it’s worthwhile. In essence, you’re creating a mashup (a site which combines text from multiple sources).

Mashups have been abused but they can help you get past the duplicate content hurdle.

Methods for Managing the Mashup Mix – On large sites which cover many topics, use multiple templates and vary the mashup feeds. Yes, it’s more work for you but it improves the user experience by letting you target their specific interests. It also helps with your search engine optimization by expanding your approach to the long tail keywords you need to harvest traffic from.

Even small- to mid-size sites benefit from varying the content.

You should also vary the boilerplate text. Yes, that’s more work for you and it may limit the number of CMS options you have but, frankly, if you think one boilerplate template fits every city in America you have a lot to learn about marketing. Your demographics change by region, age, dialect, and other factors. You’re shooting yourself in both feet and through the eyes by relying on one template for an entire large geographic region.

Temper your time-saving with efficient marketing. Otherwise you’re just wasting resources and avoiding traffic that could boost your bottom line. If you haven’t done the demographic research to begin with you’re in no position to be telling yourself that using a single boilerplate template is the best solution for your business. Ignorance never guided anyone to the right choices.
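To make the mechanics concrete, here is a small sketch of keyword injection with more than one boilerplate template and some supplemental copy per page. The template wording, region names, and record fields are all invented for illustration.

```python
from string import Template

# Hypothetical boilerplate variants; the wording and region names are invented.
TEMPLATES = {
    "default": Template("$business offers $service in $city."),
    "northeast": Template(
        "Need $service in $city? $business serves the whole metro area."
    ),
}


def render_listing(record, templates=TEMPLATES):
    """Inject keywords into a region-specific template, then add unique copy."""
    template = templates.get(record.get("region"), templates["default"])
    boilerplate = template.substitute(
        business=record["business"],
        service=record["service"],
        city=record["city"],
    )
    # Supplemental copy is what separates one page from 10,000 near-twins.
    local_copy = record.get("local_copy", "").strip()
    return f"{boilerplate} {local_copy}".strip()


if __name__ == "__main__":
    print(
        render_listing(
            {
                "business": "Acme Dental",
                "service": "dental care",
                "city": "Boston",
                "region": "northeast",
                "local_copy": "Open late on Thursdays.",
            }
        )
    )
```

A record with no region and no local copy falls back to the default template, which is exactly the one-size-fits-all page you are trying to avoid.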

Making sure people see the right info – This is a tough one. Let’s say you have a 2,000,000-page site and you’ve divided the content into categories, sections, sub-sections, etc. You have products, corporate information, press releases, blogs, maybe support documents, perhaps a forum, etc. Each section is so large maybe you’ve even put it on its own sub-domain. The navigation structure for a site this large and complex is enough to drive most SEOs insane.

Many companies replicate “important” content across their sections. It’s basically the same news, the same biographical profiles, the same company information, just inserted into the navigational structure to make it easy for people to get to it. You have three options to consider, each with pros and cons.

Use the Canonical URL tag – This helps reduce duplicate content in your site search results but doesn’t help you in any other way. With a site this large you don’t need to fuss over PageRank (okay, PageRank sculpting is ALWAYS stupid). There is no reason for a site trying to improve access to information to use this tag, as it may exclude some pages in sectionalized site search.

Consolidate everything into one area – Some companies, after experiencing the downside of duplicate content bloat, go through massive restructurings to eliminate their duplicate content. One of the pros for this approach is that it streamlines your management of the information. If you have to update profiles and company info, doing it once is preferable to doing it fifteen times. On the other hand, consolidating information forces you to revise your navigation. You might consider splitting your navigation into two consistent themes: section-wide navigation and site-wide navigation. Make sure to provide your visitors with self-explanatory navigational markers.

Leave it alone – As I suggested above, you might be able to manage multiple copies of articles through a custom CMS interface that lets you designate “classes” of documents. However, implementing this kind of logic may require more work than it’s worth.

An alternative to grouping documents by class is to inject some section-relevant copy that helps people understand why you included the copy in that section in the first place.

Another alternative is to gradually retire sections. Rather than consolidate your duplicate content, replace the sections requiring duplicate content and update your Web site’s architecture.

These types of projects require a lot of thought and usually a fair amount of coordination. Even if you’re just one guy running a huge site or network, you have to make sure you move stuff around gracefully. I update my own network structures and designs on a section-by-section basis. I’ve been doing that for years. It’s the only way to ensure I don’t zap everything with one huge colossal mistake.

Conclusion: Duplicate Content Is Manageable

The Duplicate Content Bogeyman will probably never go away. Not everyone in the industry reads all the well-informed blog posts, and not everyone who reads well-informed blog posts believes they are relevant to a specific situation. Good advice cannot cover every variation and detail but that doesn’t mean duplicate content has to be treated like some monstrous bad thing.

In most cases the worst effect you’ll see from your duplicate content is an “omitted results” marker in site searches. The split PageRank issue won’t hurt you nearly as much if you include good navigation on all your pages. As far as search engine optimization is concerned, duplicate content is NOT that big of a deal — just the work it takes to clean it up, optimize it, and improve the customer experience. But you can take these jobs one step at a time, and you have options.

{ 4 comments }

jmillrod 02.27.09 at 8:13 am

People love to worry about duplicate content because they lack in content. I totally agree that duplicate content is easily avoidable and in some cases not really a problem, but a blessing in disguise. Great post.

tobto 03.03.09 at 2:24 am

Thanks for smart reading on duplication! I believe Mark Twain will be Mark Twain after being million times copied. I believe into Google filters which will tell who is the author (starter) of the text. And I don’t believe into copypasters.

oakees 07.05.09 at 11:40 pm

Hi,
great post! Even though I didn’t find an answer for my original problem/question! I noted your reference to what Matt Cutts might have said on passing of pagerank and thought I’d give you a link to where he talks about the “decay” of pagerank before passing it on!

Infinite PageRank isn’t that helpful :) so Larry and Sergey introduced a decay factor–you could think of it as 10-15% of the PageRank on any given page disappearing before the PageRank flows along the outlinks.

Knowing this I think there is a good reason why duplicate content should be avoided as far as possible! Simply passing on the pagerank through good internal link structure (from duplicate pages) actually deducts 10-15% of link-power that doesn’t reach the target page. If there are duplicates on many levels there will be plenty of link-power lost on the way!

Michael Martinez 07.06.09 at 1:08 am

The decay factor has been noted in PageRank papers since the very first one was published by Larry and Sergey in 1998. It’s not that big an issue (as allowing the decay to pass on continually would have the same effect in reverse — all your PageRank would vanish).

There is no way for SEOs to know where the PageRank is, where it flows, and when it stops flowing (or decaying).

Generally speaking, the less attention SEOs (and everyone else) pay to PageRank, the better.