Duplicate content - I said duplicate content

Posted by Michael Martinez on April 1, 2008 in Content Theory

Duplicate content has been getting a lot of bad press lately. People in the SEO community have been advising Webmasters to avoid duplicate content for years, and now people are being told to block it off from spiders so they can sculpt their PageRank. (Note: I recently told someone I had said all I can say on nofollow for now, so this post is not about nofollow.) This post is not about nofollow.

This post is about duplicate content, so let’s put our duplicate content frames of reference on and see where we can share the road. Let’s share the road on duplicate content by staying in the same frames of reference.

To a search engine — wait, I cannot speak for search engines.

I think that search engines feel that duplicate content in general is not a bad thing. Duplicate content, like original content, can be abused. After all, you can use either duplicate content or original content as a smokescreen for redirection to a page concerned with a totally separate topic. That’s deceptive and it really doesn’t matter if you use duplicate content to be deceptive.

You can also slap AdSense or some other advertising on duplicate content or original content and promote the heck out of that content even though it really doesn’t offer anyone any value. Just because content is original doesn’t mean anyone is looking for it. So a search engine can be just as skeptical about the value of original content as about the value of duplicate content.

Historically there were a number of issues with duplicate content that have either been resolved or else have been put on a backburner by SEOs who have become engrossed in bad search engine optimization theory. For example, there used to be scraper and hijack sites that replaced original sites in search results. I think most people would agree that the search engines are much better at figuring out who the bad guys are and not showing them than in years past.

Another type of duplicate content that caused problems (and which may still be an issue to a search engine) is autogenerated muck that really serves no purpose unless someone comes along and makes it useful. A calendar application is a perfect example of this. Suppose you set up a Web site with some events and you use a general purpose calendar application. You may end up with 20,000 empty calendar pages on your site.

“But look at all the PageRank I can derive from those 20,000 pages!” you might conclude.

Hm. Well, if you can persuade a search engine to keep 20,000 empty calendar pages in their index, maybe you’ll get some benefit but there is a cost to you that should make you think twice. For example, every time a search engine crawls those 20,000 pages your server will be hammered. Some Webmasters have reported losing their Web sites when the servers crashed because of deep crawling activity. In fact, I’ve come close to losing servers and accounts because of deep crawling. I don’t like deep crawling, as I believe I have said before (so I am repeating myself when I say I don’t like deep crawling — although I really mean I don’t like overwhelming deep crawling that has adverse impacts on my server performance).

But what if YOU are running the search engine that has to index and search your 20,000 empty calendar pages? What if you offer your visitors a site search function? Do you have any idea of how much extra effort that site search will devote to indexing and searching empty calendar pages? Do you have any idea of how many useless empty calendar pages it will serve up to your visitors, forcing them to slog through a wasteland of non-content?

I’ve actually had to suffer through endless site search results on sites that have severe duplicate content issues. I use the calendar example because everyone can easily picture empty calendar pages (and because this was really a problem for many Web sites a few years ago). But there are many tech industry news sites whose site search functions suck big time because their tools cannot distinguish between remotely relevant and irrelevant content (usually the most relevant content is nowhere near the top of the query results).

That is, duplicate content may not look like duplicate content to you because you’re stuck in the rut of thinking that duplicate content is all about someone reusing the same page templates or someone scraping your site or something like that. I could take 1,000 words, randomly distribute them in 16 different patterns on as many pages, and I would have 16 duplicate content pages as far as most site search functions are concerned.

That is, to the average tech industry site search tool, “michael martinez is the world’s greatest seo” is the same thing as “the world’s greatest seo is michael martinez” which is the same thing as “the world’s michael martinez is the greatest seo” etc. Let me put it more delicately: MOST SITE SEARCH TOOLS CANNOT DISTINGUISH BETWEEN THOSE THREE EXPRESSIONS.

That is duplicate content. It doesn’t look like duplicate content because the words are in different orders. Okay, maybe because it’s one sentence saying the same thing three ways it looks like duplicate content to you. But when you’re jumbling up 1,000 words you end up with very different meanings quickly:

  • red cars jump over distant hills in the twilight
  • jump over the red cars in distant twilight hills
  • jump over the twilight hills in distant red cars
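To see why a simple term-matching tool would treat those three lines as the same page, here is a minimal Python sketch. It assumes the tool compares text purely by word counts (a "bag of words"), which is a simplification and not a description of any particular site search product. The helper name bag_of_words is made up for illustration.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Reduce a phrase to word counts, discarding word order entirely."""
    return Counter(text.lower().split())


phrases = [
    "red cars jump over distant hills in the twilight",
    "jump over the red cars in distant twilight hills",
    "jump over the twilight hills in distant red cars",
]

bags = [bag_of_words(p) for p in phrases]

# All three phrases collapse to the same bag, so a matcher that
# only looks at term counts has no way to tell them apart.
print(all(bag == bags[0] for bag in bags))
```

Any ranking built only on these counts would score the three phrases identically for every query, even though they mean different things to a reader.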

Does that make my point more clear? Now, Ask, Google, Live, and Yahoo! are all pretty sophisticated search engines and they generally do a better job than the average site search tool (all things considered). Still, even these great search tools can become a little confused if you throw enough content pages at them that can achieve similar algorithmic relevance scores for random queries.

If the majority of your search traffic is coming in through your specifically targeted queries, you’re either targeting a huge number of queries or else you’re terrible at search engine optimization. When you get away from the “seo”, “pizza hut”, and “britney spears” mentality in keyword research you find that most sites are relevant for an immense number of queries. And you’ll find — if you look closely at your statistics — that many people do actually use site search tools to find what they are looking for on a site with more than a few dozen pages.

If you can organize your duplicate content so that an average site search tool helps people find what they are looking for most of the time, you should not have any problems in the major search engines. I didn’t say anything about using “rel=’nofollow’”, did I? All I said was that organizing your duplicate content to work well with site search utilities is a good way to ensure that your major search visibility will be okay.

What that means is that you cannot think of duplicate content as a bad thing. Rather, duplicate content is an asset that needs to be managed. Maybe your search visibility challenges stem from the fact that you promote your content through duplicate URLs, but most people don’t actually do that. In fact, most people who ask if they should block out their duplicate content don’t stop to see whether it’s actually getting any links or search traffic.

If you have duplicate content pages on your Web site that draw no search referrals and no links, you’re wasting your time by trying to hide those duplicate content pages from the search engines. People don’t usually link to the monthly archive pages of blogs, but if someone did link to the monthly archive page of this blog I would assume they found value in that archive page and I would NOT tell the search engines to ignore the page.

You’re not likely to be confused by a site search for a specific expression on SEO Theory (although it’s quite possible you’ll find many similar posts on the same topic). For example, if I start talking about external links versus internal links in this post (and I did just start talking about external links versus internal links), then this post becomes relevant to all queries about “external links versus internal links”.

Now, maybe the people who search for external links versus internal links every month really want to find the SEO Theory post about external links versus internal links (actually, I called that post “Internal links versus external links” but you people cannot seem to make up your minds which way you want to search for it). In any event, about the best thing I can do is link to the correct post every time I use internal links or external links and things versus other things as examples to help the search engines find what people are looking for.

This post is really about duplicate content, not internal links versus external links, but how is a poor site search tool to know the difference? You, the searcher, have to figure out for yourself what the search tool doesn’t understand. That’s inconvenient.

You can fuss over whether you’re splitting your PageRank with a few duplicate URLs but if you’re not dealing with a technical canonical issue (where you use the “www” and don’t use the “www”) then all those duplicate content pages may only be helping get your site crawled. Or not. If they don’t have links pointing to them from other sites then maybe blocking them off will help crawlers find unique content more efficiently.
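For readers who want to check whether two URLs are the "www" versus non-"www" kind of duplicate, here is a rough Python sketch. The function name same_page_ignoring_www and the example.com URLs are hypothetical, and the comparison deliberately ignores scheme, port, and fragment. It only detects the pattern; it does not fix it.

```python
from urllib.parse import urlsplit


def same_page_ignoring_www(url_a: str, url_b: str) -> bool:
    """True if two URLs match once a leading 'www.' is dropped from the host."""

    def key(url: str):
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # An empty path and "/" refer to the same resource.
        return (host, parts.path or "/", parts.query)

    return key(url_a) == key(url_b)


# Hypothetical URLs for illustration only.
print(same_page_ignoring_www("http://www.example.com/2008/04/",
                             "http://example.com/2008/04/"))
print(same_page_ignoring_www("http://www.example.com/2008/04/",
                             "http://example.com/2008/03/"))
```

The first call reports a match and the second does not, since the paths differ. Pairs that match this way are the technical canonical case; archive pages and similar duplicates are a separate question.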

But you know what? I’ve found myself searching more than one blog for its monthly archives through the years. Monthly archive pages can be useful things indeed, so I don’t agonize over them.

I probably would block off a calendar application if I found it was dominating my site search results. I would also block off other autogenerated mush pages if they cluttered site search, but once you get rid of the low-hanging rotten fruit you may well find the tree is still overwhelmingly burdened with other rotten fruit. It’s not so easy to get rid of unique content pages that just happen to be more relevant for random expressions than the pages that you thought you had optimized for those expressions.
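One common way to block off a section like that from crawlers is a robots.txt rule. The Python sketch below uses the standard library's urllib.robotparser to check what such a rule would and would not cover. The /calendar/ path and the example.com URLs are assumptions made up for illustration, and a rule like this only affects crawlers that honor robots.txt; whether your own site search tool respects it depends on the tool.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: keep all crawlers out of /calendar/.
rules = """\
User-agent: *
Disallow: /calendar/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The calendar pages are blocked, while the monthly archives stay crawlable.
print(parser.can_fetch("*", "http://www.example.com/calendar/2008/04/"))
print(parser.can_fetch("*", "http://www.example.com/2008/04/"))
```

Checking a rule this way before publishing it helps avoid accidentally blocking pages, such as monthly archives, that people actually find useful.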

There are other aspects of duplicate content that have been discussed: article distribution, profile pages, and other popular ways of distributing your visibility across the Web have achieved near-controversial status because of the conflicting opinions about the value they provide. If you want to find every copy of an article you put out for distribution, do you really want to rely solely on Copyscape? I wouldn’t. I would want the search engines to help me find those copies so I could estimate what my true audience reach might be.

Copyscape has become a crutch for SEOs who don’t understand how complex duplicate content really can be. I see people referring to it in many forums now, where they help “solve” other people’s search visibility problems by finding “duplicate content”. Duplicate content is not the real problem here. Conventional SEO wisdom is the real problem.

And you can quote me on that: “Conventional SEO wisdom is the real problem.”

About the Author

Michael Martinez is the Director of Search Strategies for Visible Technologies, Inc. A former moderator at SEO forums such as JimWorld and Spider-food, Michael has been active in search engine optimization since 1998 and Web site design and promotion since 1996. Michael was a regular contributor to Suite101 (1998-2003) and SEOmoz (2006).
