Yesterday’s SEO advice at today’s prices

by admin on February 13, 2007

This post outs a major content thief and spammer that has slipped under Google’s radar for a very long time. Happy Valentine’s Day, Googlers. Don’t get so caught up in being Googlers at SES London that you overlook the obvious, blatant content-theft spam that is hurting the quality of your search results.

Every day I see new evidence that the SEO community remains entrenched in pre-Bigdaddy Google analyses and strategies. Even many of the pre-Bigdaddy ideas that the SEO community has long embraced were wrong, but some of the leading names in the industry continue to hold to these unproductive ideas.

For example, if you drop into any SEO forum today and ask why your pages are showing as Supplemental in Google’s results, you’ll most likely see people talk about duplicate content.

Duplicate content continues to show up just fine in Google’s main index. The Supplemental Results index was never the Duplicate Content Zone. Just because a lot of duplicate content was placed there doesn’t mean that is all that was placed there. But now in the wake of both Bigdaddy and Thanksgiving 2006, Google’s Supplemental Index is home to many unique content pages.

Several high profile “thought leaders” in the SEO industry are also advising people to “reduce sitewide repetitive features”. Apparently, some people have mistaken Google’s clustering effect for a sign of Supplemental Results. That is, if you run a query that generates a truncated results list terminated by a message similar to:

In order to show you the most relevant results, we have omitted some entries very similar to the [insert number here] already displayed.

If you like, you can repeat the search with the omitted results included.

Some of our more illustrious SEO gurus take the position that you have tripped a duplicate content filter. And if it’s duplicate content that must mean the filtered content is in the Supplemental Results Index.

However, I can easily devise queries that display this message for clearly unique text that is in the primary index. I can also devise queries that display this message for uncrawled content (URL-only listings). With a little more effort, I can get this message to display for duplicate content and/or Supplemental Results pages.

As long as your SEO competitors remain confused about why pages hit the Supplemental Results index and why some relevant results are omitted, you have an advantage over them. You don’t have to know why a page is treated the way it is just as long as you don’t substitute “collective SEO wisdom” for what is actually going on. Given a choice between admitting you don’t know why something happens and trying to explain it with ideas that contradict the available facts, you should always go with “I don’t know why”.

For reasons I have never understood, SEOs like to contradict the available facts at every turn. Take Google’s trademark feature, PageRank, for example. It’s a very simple concept: one link equals one “vote” for value — value in the sense of “I think that page is worth linking to”. But PageRank is not mathematically defined in terms of voting and popularity. Mathematically, PageRank only represents an approximation of the chance that a person randomly clicking on links will arrive at any given page.
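To make the "random clicking" idea concrete, here is a minimal sketch of the random-surfer calculation on a toy four-page web. It is not Google's implementation. The page names, the `pagerank` function, the iteration count, and the handling of pages with no outgoing links are all assumptions made for illustration, and the 0.85 damping value is simply the commonly cited default. The ranks are normalized to sum to 1 so they read as probabilities.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the chance that a random link-clicker lands on each page.

    `links` maps each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the surfer equally likely to be anywhere.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the surfer clicks one outgoing link at random.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling: a page with no links sends the surfer anywhere.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Nothing in that loop asks whether a link is an honest "vote". The calculation only follows where clicks could go, which is the point of the paragraph above.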

Google long ago gave up trying to estimate actual PageRank. The fact that manipulative links have been put into place by search engine optimizers doesn’t mean no one would ever click on those links. In fact, many spammers count on random link clicking to drive traffic and commissions their way. The fundamental concept — that people will click on links about which they know virtually nothing — remains unchanged.

But Google’s founders, Sergey Brin and Larry Page, obviously didn’t understand what they were describing (they wrongly assumed that links could be trusted) and it’s obvious that Googlers today are dedicated to the proposition that you can make square pegs fit into round holes if you just eliminate enough inconvenient facts (by filtering all the links they think should not really count). And Google’s system is geared to eliminate as many inconvenient facts as it can.

The sad irony is that the industry which staunchly defends its rocket-science value acts like it doesn’t have a clue about how Google works. In the “Anatomy of a Large-Scale Hypertextual Web Search Engine” document, Brin and Page devoted two-thirds of one section to describing PageRank (section 2.1) and anchor text (section 2.2). When you read tutorials about PageRank and link building, see if you can find the following citation:

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words….

This information has been available to the SEO community for more than 8 years. How often have those words been cited? Oh, look. We have “omitted results”. Ah well. It’s just (non-Supplemental) duplicate content.

The actual search engine architecture was described in section 4 (section 4.1 to be exact). Most SEOs haven’t been able to get past the sentence that mentions anchor text, so naturally they have trouble assimilating the paragraph in section 4.2.5 which says: “A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information.” Worse, they don’t know what to do with the information about “fancy hits”: “Fancy hits include hits occurring in a URL, title, anchor text, or meta tag.”

Question: Are fancy hits the only relevance indicators explained in this document?

Answer: No.

So what should you do with “fancy hits”? Try not obsessing over them. There are other indicators of relevance and the document gives you a pretty firm idea of what to look for. Take a look at section 4.5.1: “Every hitlist includes position, font, and capitalization information.” That isn’t the first time they mention font and capitalization. They also say: “Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, …), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list.”

They are looking at a lot of on-page information. PageRank gets tossed into the equation at the end.
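A rough sketch of that arrangement, for a single-word query, might look like the following. The paper does not publish the type-weight values, so every number here is invented. The same section of the paper goes on to say that counts are converted into count-weights that taper off and are combined with the type-weights as a dot product. The particular taper (`min(count, cap)`), the additive way PageRank is mixed in, and the names `TYPE_WEIGHTS`, `count_weight`, `ir_score`, and `final_rank` are my assumptions, not Google's code.

```python
from collections import Counter

# Hypothetical type-weights: one weight per hit type. Values are invented.
TYPE_WEIGHTS = {
    "title": 5.0,
    "anchor": 4.0,
    "url": 3.0,
    "large_font": 2.0,
    "small_font": 1.0,
}


def count_weight(count, cap=8):
    # Assumed taper: grows linearly at first, then extra hits stop helping.
    return float(min(count, cap))


def ir_score(hit_types):
    # Count the hits of each type, then take the dot product of the
    # count-weights with the type-weights.
    counts = Counter(hit_types)
    return sum(
        TYPE_WEIGHTS.get(hit_type, 0.0) * count_weight(count)
        for hit_type, count in counts.items()
    )


def final_rank(hit_types, pagerank, pagerank_weight=1.0):
    # Assumed combination: the on-page score comes first and PageRank is
    # added at the end. The real combining function is not published.
    return ir_score(hit_types) + pagerank_weight * pagerank


# One title hit plus three small-font body hits for the query word.
page_hits = ["title", "small_font", "small_font", "small_font"]
score = final_rank(page_hits, pagerank=0.5)
```

Under these made-up numbers, repeating a word a hundred times in body text scores no better than repeating it eight times, while a single title hit carries more weight than several body hits. Real values would differ, but the shape of the calculation is what the paper describes.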

But if you browse the SEO forums, blogs, and tutorials, how much will you see people take these facts into consideration? Some forum gurus will tell you, “That information is eight years old. How reliable can it be?”

Hm. Tough question. Since we never knew for sure how they were weighting anything, we really don’t know. But testing has shown conclusively that Google is not currently incorporating meta data into its relevance algorithm. Sure, they use the description meta tag for search results, but the keywords meta tag is pretty much worthless.

This information is freely available on the Web, so it’s by no means rocket science, and yet even today most well respected SEO experts would absolutely fail the test if they had to answer a short list of basic questions about Google’s algorithm. Google has been documenting pieces of the puzzle for years without giving away the secret formula that puts them all together in the right order.

Of course, the fact that the essential A-list SEO guru could not possibly tell you as much about the Google algorithm as I just have doesn’t change the other fact: that you can still manipulate Google’s search results through link anchor text.

And make no mistake: it’s link anchor text, not PageRank, that is helping you most when you manipulate Google’s results. We know that PageRank helps most with crawling. Your page’s (internal, non-Toolbar) PageRank tells Google how often to crawl your page. It probably now also tells Google whether to include the page in the Main Index.


Clue: Spammer outing follows


But getting a page into the Main Index is only part of the equation. You still have to make that page relevant to searcher queries, and even asserting relevance through on-page content won’t guarantee you the best performance. Maybe PageRank still makes the difference. It’s hard to say, although a spam site like ErrorForum can scrape articles from the SEO Theory blog and place their entire contents on its own pages (without permission or authorization) and get them to rank higher in some queries.

Clue: Did you get the name of that spammer?


Is that really an issue of PageRank, or does the ErrorForum spammer just happen to have a leg up on Google? Hey, he’s probably copying your content without your permission, too. He’s got a LOT of internal pages and it doesn’t look like he wrote any of them from what I can see.

So duplicate content, and lack of links, don’t necessarily kill you. They certainly don’t guarantee that you’ll end up in the Supplemental Results Index. Nor do they guarantee that you’ll be clustered under the annoying “omitted results” (hey, Googlers, I wouldn’t mind being able to turn that off in general, unpersonalized search).

Spamming Google is easy. Cleaning up the spam is a constant chore and I respect the fact that not all spam gets caught right away. But Google would find that its search results would be vastly improved and much more reliable if it stopped trying to make the dead dog fly and just gave up on the whole concept of passing link anchor text and weighting search relevance by PageRank.

Until they get a clue, however, those of you who pay attention to the little details will have an advantage over the back-slapping good ol’ boys of SEO who continue to prattle on about “duplicate content”, “aging factors”, and other nonsense.

Oh, and if you’re a spammer: you do not have permission to replicate this article anywhere on any Web sites or in any other fashion.

{ 4 comments }

fabianoblog 02.13.07 at 8:56 am

Michael,

Thank you for yet another great article. One quick question, though: being new to SEO (6-8 months), where do Meta tags fit into the scheme of things? You mention in your article that key words are pretty much worthless and I am curious as to why and how? Aren’t they the core of your “relevant”, “fresh” content?

A newbie…a bit confused.

Thanks,

mourabiano@gmail.com
Fabiano Moura
CNS MARKETING INC.

Christopher 02.13.07 at 11:40 am

Great blog. Your writing is excellent, clear, educational and to the point. Thanks for this great resource.

Michael Martinez 02.13.07 at 9:23 pm

“You mention in your article that key words are pretty much worthless and I am curious as to why and how?”

The keywords meta tag has been so easily abused so often that most search engines no longer pay attention to it. Only Ask and Yahoo! still rely on the tag to any degree.

Big Bill 02.13.07 at 10:42 pm

A renaissance for on-page SEO? I’ve been waiting so long…:-)