The Links, the PageRanks, and the Domains

by Michael Martinez on March 6, 2007

By definition, PageRank is a quality ranking for each web page. You won’t find many people in the search engine optimization industry who get that exactly right. In fact, you’ll find a diverse selection of definitions if you search for the definition of ‘PageRank’.

The voting metaphor is cited very often, but PageRank is not an electoral system. It’s a map of collective citation weight. Citation-based value only shows that a document has earned recognition among its peers, not that the peers confer authority upon the document. That is, PageRank does not measure or confer authoritativeness.

And yet many SEOs use the term “authority site” to describe sites with high PageRank. This confusion most likely arises from the paper “Hilltop: A Search Engine based on Expert Documents”, which says PageRank “computes a query-independent authority score for every page on the Web and uses this score to rank the result set.”

However, the authors of Hilltop proposed a different use for authority in writing: “We believe a page is an authority on the query topic if and only if some of the best experts on the query topic point to it.” They defined experts as pages that are “about a certain topic and [that have] links to many non-affiliated pages on that topic. Two pages are non-affiliated conceptually if they are authored by authors from non-affiliated organizations.”

These definitions, however, are ambiguous. And to resolve the ambiguity many SEOs have turned to Jon Kleinberg’s Authoritative Sources In A Hyperlinked Environment to explain what an expert should be. Kleinberg wrote: “Our model is based on the relationship that exists between the authorities for a topic and those pages that link to many related authorities—we refer to pages of this latter type as hubs.”

So SEOs have looked at the similar language in these papers and concluded that Kleinberg’s hubs are essentially the same as Hilltop’s experts. But this identification would only be acceptable if both papers defined authority the same way, and they do not. Hilltop treats a document as an authority as long as some of the best experts point to it, whereas Kleinberg requires that many hubs point to one of his authorities.
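Kleinberg’s mutual reinforcement between hubs and authorities can be sketched in a few lines of Python. This is only the iterative scoring step, not his full procedure for assembling a topic-specific set of pages, and the page names and the tiny link graph are made up for illustration.

```python
import math


def hits(links, iterations=50):
    """Iterative hub/authority scoring over {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale both score sets to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two directory-like pages and a blog pointing at sites.
example_links = {
    "directory1": ["siteA", "siteB", "siteC"],
    "directory2": ["siteA", "siteB"],
    "blog": ["siteA"],
}
hub_scores, auth_scores = hits(example_links)
print("best hub:", max(hub_scores, key=hub_scores.get))
print("best authority:", max(auth_scores, key=auth_scores.get))
```

In this toy graph the page linked to by the most hubs becomes the top authority, and the page linking to the most authorities becomes the top hub, which is the “many hubs” requirement in miniature.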

Another distinction between the two models is that Kleinberg’s collection of hubs and authorities can be crawled from any authority. That is, the authorities have to recognize the hubs through links. But the Hilltop authors only require that experts be “pages that have been created with the specific purpose of directing people towards resources.” Directories immediately come to mind because they categorize information by topic. A directory category is a nominative expert in a topic because the category’s editor(s) prequalified all the listings as being relevant to the topic.

Spammers recognized the flaw in this approach immediately: you need only create your own directories to have your own experts. Hence, in a Hilltop search engine, the pages that would be selected as experts would be the pages with the most outbound links whose anchor text or destination pages were most relevant to the query. Classic spam crawl pages (hallways) come to mind. While Hilltop would not be vulnerable to multitopic link-farms and reciprocal link programs, it would be easily manipulated by any topic-specific collection of directory-like pages. All you have to do is create more outbound links than the real directories.

With Kleinberg, the spammer hits a dead end unless he can create a Web community large enough to look like a natural region of Web space with its own local hubs and authorities. To be indexed, the Web community has to be findable through a channel or gateway page that acts as a switching station between the (artificial) community and the rest of the Web. A natural Web community should have many such gateway points, like streets entering a semi-private sub-division that has more than one street leading into it.

When you are generating 10,000 pages of content per day, creating Web communities is not a problem. Building switching stations for those Web communities from the main Web is more difficult. The most popular means of building a switching station (that I know of) is to create what people now call “honey pot” sites: sites that legitimate content links to. The honey pots then link to the artificial Web communities. To make the artificial community look even more real, the member sites can all link out to a variety of well-established, link-rich sites.

At some point, the artificial community falls away from the rest of the Web. That is, if it doesn’t have enough honey pot switching stations, it looks fake. And that is where link spam enters the picture. What if you can bypass the switching station by getting thousands of deep-links to your content? Then your spam pages become members of other Web communities and you no longer have to create an artificial island in the Web space.

Of course, link spamming has been done to death and now we have to live with the foul taste of ‘rel=nofollow’, which doesn’t address the core issue. The problem for search engines is that the artificial islands still exist. Some of them were created years ago. Many are now springing up as people grow desperate to build their PageRank.

And it all comes back down to PageRank in the end because SEOs think it’s all about links. Now we have people like Rand Fishkin purveying the nonsense that PageRank flows from domain to domain. Rand is popular and influential. His idea will be passed around and propagated in SEO mythology for years to come, even though one need only look at all the Supplemental Results pages that appear for highly linked domains like mattcutts.com to see that PageRank is clearly not passing from domain to domain.

A well-linked domain will pass PageRank from one page to another. For some reason, many SEOs act as though their own pages are not good enough to confer PageRank. The spammers have known for a long time that if you want more PageRank all you have to do is create more pages. It’s that simple. Each new page can link to a most highly valued page that collects a lot of PageRank. That PageRank makes that page more important.
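The “create more pages” effect is easy to reproduce with the textbook PageRank iteration. The sketch below is not Google’s production system: it assumes the commonly cited damping factor of 0.85, a made-up four-page Web, and hypothetical “feeder” pages that each link only to one target page.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Normalized PageRank by power iteration.

    links maps each page to the pages it links to. Every page is
    assumed to have at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page splits its damped rank evenly among its outlinks.
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank


def build_web(feeder_count):
    """Two symmetric two-page sites, plus feeders pointing at 'target'."""
    links = {
        "home": ["target"],
        "target": ["home"],
        "rival": ["other"],
        "other": ["rival"],
    }
    for i in range(feeder_count):
        links[f"feeder{i}"] = ["target"]
    return links


before = pagerank(build_web(0))
after = pagerank(build_web(10))
print("target without feeders:", before["target"])
print("target with 10 feeders:", after["target"])
print("rival with 10 feeders: ", after["rival"])
```

Each feeder carries only its baseline share, but ten of them pointed at one page are enough to lift that page well above an otherwise identical rival in this toy model.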

Google has taken steps to halt or slow the passing of PageRank because — until recently — it has been so easy to pass it from one site to another. The more internal linkage a site has, the more easily the site can pass PageRank around its pages (achieving what I call a “PageRank Equilibrium”, since any collection of pages that all link to each other will normalize their PageRank).
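What I mean by equilibrium can be shown with the same textbook iteration. In this sketch (assumed damping factor of 0.85, invented page names), four pages all link to each other and a single outside page links to just one of them. The interlinking spreads that one external boost across the whole group.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Normalized PageRank by power iteration over {page: [outlinks]}.

    Every page is assumed to have at least one outbound link.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank


# Four fully interlinked pages (hypothetical names).
members = ["a", "b", "c", "d"]
links = {p: [q for q in members if q != p] for p in members}
# One outside page links only to "a".
links["outside"] = ["a"]

rank = pagerank(links)
spread = max(rank[p] for p in members) - min(rank[p] for p in members)
for page in members:
    print(page, round(rank[page], 4))
print("spread inside the group:", round(spread, 4))
```

Page “a” still comes out slightly ahead, but the gap between it and its siblings is small compared with the rank each of them holds.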

But what happens if you knock the majority of those pages out of the Main Index into the Supplemental Index? Supplemental pages don’t pass value and therefore they don’t pass PageRank. They can accrue it but they cannot pass it on. Hence, many artificial Web communities have been cut off from the main Web because their switching stations were often located in deeper content.

How did Google do this? We can speculate a hundred ways from Sunday. I doubt that artificial Web communities were such a horrendous problem that Google felt compelled to take action against them. Instead, I think Google took on several issues at once, one of them being that with all the link spam that had already been created, simply recrawling the Web would not solve the problem.

My personal estimate is that Google has moved about 80% of the Web into the Supplemental Results index. I can’t prove it, but I think that is where things stand today. The remaining 20% of pages that can still pass PageRank are therefore the Web’s influencers. But while Google can control which pages get PageRank it cannot determine where those pages will link. One obvious issue they had to struggle with was what to do with PageRank sinks: pages that don’t link out (or pages that don’t exist and therefore cannot link out).

Traditional academic papers dealing with PageRank sinks propose that pages with no outbound links should be treated as if they link to every other page on the Web. For all we know, Google has been doing this for years and continues to do it. But with dead links the issue becomes more complex: Google either has to pretend those outbound links don’t exist, and therefore risk creating more PageRank sinks, or else suggest an alternative destination for the PageRank that would be passed to non-indexed pages.
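The academic treatment of sinks can be sketched as follows. This is the textbook fix, not a claim about what Google actually does; the `fix_sinks` flag, the damping factor of 0.85, and the three-page graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100, fix_sinks=True):
    """PageRank by power iteration over {page: [outlinks]}.

    With fix_sinks=True, a page with no outlinks is treated as if it
    linked to every other page. With fix_sinks=False, whatever rank a
    sink collects is simply lost each round.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                targets = outlinks
            elif fix_sinks:
                # Pretend the sink links to every other page.
                targets = [q for q in pages if q != page]
            else:
                # The sink passes nothing on.
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page Web with one sink.
web = {"a": ["b", "sink"], "b": ["a"], "sink": []}
fixed = pagerank(web)
leaky = pagerank(web, fix_sinks=False)
print("total rank with the fix:   ", sum(fixed.values()))
print("total rank without the fix:", sum(leaky.values()))
```

Without the fix, the total rank in the system drains away through the sink. With it, the total stays at 1 and the sink’s share flows back to the other pages.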

We do actually see Google pass link anchor text to non-existing pages, when URL-only listings appear in search results because of the anchor text pointed at them. But what if orphaned pages which don’t link out reside on well-populated domains with good interlinkage? Why not simply pass all the PageRank to the root URL? In that case, could you not do the same thing with clear 404 links?

Voila! Suddenly, Google looks like it is passing PageRank from domain to domain when, in fact, it is only passing PageRank from existing pages to pages substituted for non-existing or orphaned pages. But what would be the benefit to Google (and its users) in passing PageRank to a root URL? Maybe it helps preserve PageRank — keeping it out of the Supplemental Results index. After all, many people report that their root URLs show up in the Main Web index while every other page on their site shows as Supplemental Results. It would be self-defeating if all the PageRank sinks were conferring value to every page on the Web.

It’s easy enough to test whether PageRank is now being passed to the domain rather than to a page. Just point a massive number of links at a domain and see if all its pages (orphaned and connected) come out of the Supplemental Results index. A site search on Matt Cutts’ domain today shows Supplemental Results around page 80.

I even see one recent blog post that is Supplemental:

Matt Cutts: Gadgets, Google, and SEO » Quick February hits
Quick February hits. February 4, 2007 @ 9:48 pm · Filed under Google/SEO. The website for this year’s Superbowl stadium, www.dolphinstadium.com, …
www.mattcutts.com/blog/quick-february-hits - 21k - Supplemental Result - Cached - Similar pages

Clearly, all the link love Matt gets isn’t pushing his domain out of the Supplemental Results. Notice the lack of Supplemental Results pages in his off-domain URL references (which may or may not all be backlinks). Disclaimer: Results may vary over time.

If things are so tough for Matt Cutts in today’s Google index, is it any wonder that spammers are having problems too?

JGreen 03.12.07 at 11:51 pm

A well conceived and compelling argument against the “PR passes from domain to domain” nonsense. Unfortunately though the masses will always choose a simple fallacy over a truth that requires thought.