ReBlog: Why some sites MUST block Archive.Org

by Michael Martinez on November 12, 2009

UPDATE: Matt Cutts clarified what he said/meant for me in a Tweet: “if I’m already investigating a site which is spammy-looking and appears off-topic/expired, then IA block is very noticeable.”

I think this is too important for just one blog post. I know how the SEO community manufactures myths quickly, and it could take up to two years to undo the harm Matt Cutts may have just (UNINTENTIONALLY) inflicted on the Web community with a casual remark he made in a PubCon session today.

First, let me cite what I just wrote on Best SEO Blog (this is only the first few paragraphs):
I was following the organic site review at PubCon on SE Roundtable this morning when Matt Cutts apparently let slip a jaw-dropping comment (as reported by Barry Schwartz):

Barry Schwartz: This is a huge red flag!!! Matt said, this is the best source of spam leads. You block archive.org in robots.txt file, you are caught in no time, Matt basically said

Okay, first let me remind people that Google owns its index and only Google sets its Webmaster guidelines. But if Matt really believes that blocking Archive.Org is a spam signal, he has MUCH to learn about Webspam.
I DO block Archive.Org on some sites, and many other people do as well. And there is sometimes a compelling reason to do so.
The intent behind Archive.Org is a good one. I’ve used it to save my sites many times when, after a hard drive failure, I’ve needed to retrieve older copies of live pages to do some work. I love Archive.Org because it’s a great resource for researching how Web sites have behaved through the years.
Unfortunately, Archive.Org violates intellectual property rights on a massive scale and when site owners become aware of that they take action by blocking Archive.Org. I see forums do this. I see blogs do this. I see article archives do this. I see news sites do this. And I do this (on some sites, not all).
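
For readers who have never seen what such a block looks like, here is a minimal sketch. It assumes the Internet Archive's crawler identifies itself as ia_archiver (the user-agent its documentation has historically named), and the robots.txt content, the page path, and the helper name blocked_agents are all made up for illustration. It uses Python's standard robots.txt parser to show which crawlers a given file turns away.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away the Internet Archive's crawler
# (assumed here to announce itself as "ia_archiver") and allow everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def blocked_agents(robots_txt, agents, url="/some/page.html"):
    """Return the user-agents from `agents` that may NOT fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Only the archive crawler is blocked; Googlebot is untouched.
    print(blocked_agents(ROBOTS_TXT, ["ia_archiver", "Googlebot"]))
```

Note that a file like this says nothing to Googlebot at all. It only tells one archive crawler to stay out.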

Now let me explain a couple of things.

ReBlog – I’m just experimenting with the idea of calling this a ReBlog (based on ReTweeting). Maybe it’s a waste of time, but a lot of blogs do pick up stuff from other blogs like this.

SEO Myths Spread Like Wildfire – There are so many things that search engineers have said or written through the years that the SEO community takes the wrong way and runs with. Look at Hilltop and LocalRank. To this day they are still being blamed for the October 2003 Google update (dubbed “Florida”). Google employees have specifically stated that the IP address grouping methodology proposed in both papers is not used in Websearch, but the SEO community refuses to let go of the myth.

Other myths that SEOs have clung to tenaciously include PageRank Sculpting. Tests supposedly showed it was working in late 2008 and early 2009. Yet Google engineer Matt Cutts announced in June 2009 that Google had changed the way it handled “rel=’nofollow’” in early 2008, possibly late 2007, because so many SEOs were screwing up large Web sites with PageRank Sculpting.

Despite this important statement from one of the most trusted (and dissected) voices in search engine technology, many people in the SEO community claim that PageRank Sculpting is still worth pursuing through other means. Never mind the fact that all the tests which supposedly showed it was working were clearly bogus, were based on false assumptions, and proved nothing of the sort.

Duplicate content has also been making the rounds in SEO mythologies. Everywhere I look these days, people in the SEO community are advising other folks to avoid creating duplicate content. That message got out of hand. Duplicate content cannot be avoided, but it can (and should) be managed. You can show search engines (and visitors) which duplicate content is the most important (out of the duplicates).

In fact, Adam Audette summarized Google employee Joachim Kupke’s duplicate content presentation at SMX East. Google essentially said “let us crawl your duplicate content”.

Google says “please do not use Disallow: directives in robots.txt to annotate duplicate content.” Content Google can’t get to, Google can’t know about, and they don’t like that. Their preference seems to be “put it all out there” and we can decide what’s best, and anytime content is excluded from search engines they lose that ability. My personal preference is to take more control, not less, but I understand the thinking behind this and why they’d want to say this.

Should people be concerned about splitting link value between duplicate pages? Sure. But be reasonable. Duplicate content occurs naturally and the search engines understand this. When 40 sites publish a press release and they all get indexed, telling a site owner NOT to include that press release (HIS press release) on his site because “it’s duplicate content” makes no sense.

Manage the duplicate content. Don’t fear it.
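
One way to do that managing (my example here, not something taken from the presentation quoted above) is the rel="canonical" link element, which the major search engines have said they treat as a hint about which copy you prefer. Here is a minimal sketch, assuming you already know the preferred URL for a page. The function name canonical_link and the example URLs are hypothetical.

```python
from html import escape


def canonical_link(preferred_url):
    """Build a <link rel="canonical"> tag pointing at the preferred copy.

    The tag belongs in the <head> of each duplicate page. The URL is
    HTML-escaped so that query strings containing "&" stay valid markup.
    """
    return '<link rel="canonical" href="{}">'.format(
        escape(preferred_url, quote=True)
    )


if __name__ == "__main__":
    # Hypothetical case: a press release reachable at several URLs on one site.
    preferred = "http://www.example.com/press/widget-launch"
    duplicates = [
        "http://www.example.com/press/widget-launch?ref=home",
        "http://www.example.com/print/widget-launch",
    ]
    for url in duplicates:
        print(url, "->", canonical_link(preferred))
```

The duplicates stay crawlable, which is what Google asked for. You are just telling the engines which copy you consider the important one.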

SEO mythologies will always exist. They are not always bad. We construct mythologies about everything. Even science constructs mythologies. You have to set boundaries on your frames of reference for the world you experience and mythologies are how the human psyche does that.

But when myths are drawn out of mythologies to the extent that the myths take on the weight of fact, those myths become destructive. You have to question everything and have the courage to raise your hand and ask, “Excuse me. Why does God need a starship?”

My point is: Just because Matt Cutts said that blocking Archive.Org in robots.txt is a big red flag does NOT mean that sites will automagically be nailed for spamming Google’s index. It’s ONE signal out of 200+ signals.

Please read the directions and use carefully. Do not attempt to drive heavy machinery while under the influence of SEO mythology.
