Matt Cutts, Michael Martinez, PageRank, and Link Flow

by Michael Martinez on September 4, 2007

It’s not often that Matt Cutts directly comments on anything technical that I’ve posted. He caught my attention today with the following comment (emphasis is mine):

I agree with much of Michael’s ideas, but his habit of flatly asserting things can undermine his arguments. For example, he says “So when Google sits down to calculate PageRank for 20 billion pages, you automatically get a certain amount of that PageRank… The sum of your starting PageRank is X divided by 20 billion, but each of your pages has the same starting PageRank: 1 divided by X divided by 20 billion.” That’s just not how it works. So I took this article with a grain of salt.

The technical papers that I’ve read aside (that is a .PDF file — see my note below), there is a lot about PageRank that either hasn’t been published or which has been published where I haven’t read it. So the question then follows, how does it work?

It’s not clear to me what Matt means by “it” but I’ll assume he means “Google doesn’t divide 1 by however many pages for which it intends to calculate PageRank” — which, if that is the case, invalidates a lot of academic papers. But it wouldn’t be the first time academics were shown to be wrong about something. And while I feel it would be nice for Google to document how it calculates PageRank, I’m not going to agonize over the issue.

On the other hand, Matt also wrote:

Hmm. Michael says “SEO Myth: You can control the flow of PageRank on your site” and then later “You cannot control the flow of PageRank on your Web site but you can control your own internal link flow. Link flow is not PageRank.”

To me, the second statement (choosing how to link within your site) clearly does control the flow of PageRank on your site.

Nope. You’re just plain flat wrong, Matt, because as I said link flow is not PageRank. Not the way I defined it in both that article (“Link flow is comprised of the pathways you build between your pages”) and in Manage PageRank by managing link flow, where I wrote “Link flow is the pathway that links forge throughout your Web site or network”.

Since Google strips pages and/or links of the ability to confer link value (anchor text and PageRank), link flow may or may not pass PageRank. But Webmasters also strip pages and/or links of the ability to confer link value. Do Javascript links pass value? Do Flash links pass value? Does Google index pages that are blocked by robots.txt? (NOTE: I use both Javascript and robots.txt to prevent Google from crawling both internal and external links.)

The June article I cite above specifically focuses on how to use Link Flow to influence (manage) PageRank in exactly the way Matt suggests. You use your internal linkage to point to your most important pages. But whereas you can stop Google from following your own links you cannot stop those links from leading from one point to another.

In other words, Link Flow exists regardless of whether pages are indexed by any particular search engine and regardless of whether any particular search engine allows those links to pass value. “Rel=’nofollow’” stops your page from conferring link anchor text and PageRank but it does not destroy the link pathway. As long as there is a link pathway there is link flow. Furthermore, as I pointed out previously, if you put “rel=’nofollow’” on some of Page A’s links, you’ll beef up the PageRank on the rest of those links but that PageRank will still eventually get down to the rest of your site unless you block off every pathway — and on a large content site that is like trying to plug holes in a sieve.

Link flow is not a good euphemism for PageRank because both Google and Webmasters intentionally block the flow of PageRank. I currently have blocked Google from crawling almost 200 pages on Xenite.Org. I may block them from crawling more pages.

But I don’t block those crawls to control the flow of PageRank.

In the same Sphinn discussion Andy Beard wrote: “One factor that could be a problem with a number of Michael’s sites is how deep they are rather than wide. I have seen both Matt Cutts and Vanessa Fox talk about making sites wider rather than deeper.”

My sites are not nearly as deep as they may seem. There are many thousands of inbound links pointing to deep content on Xenite.Org. Architecturally (at a linking level) there is no difference between a page 6 directories deep from Xenite’s root URL and a page 6 levels deep from the root URL on a multiuser site like Geocities, Wordpress, Blogger, etc. I design many of the sub-directories on Xenite.Org the same way I would design a user Web site on a hosting service like Geocities, Angelfire, etc.

But I also strongly cross-promote my sub-directory sections across Xenite. And there are certain links (such as the Xenite home page, the Xenite news page, and the Xenite site map pages) that are found on every or nearly every page.

Why does Google occasionally dump pages from its index? I don’t know. It’s only happened 3 times in the past 2-1/2 years and the first two times those pages came back as Google subsequently crawled the Web. This week I am seeing an increase in the number of Xenite pages in the Google index. I have always assumed that Google would restore the pages. It’s not like I lost any significant rankings. The pages that have the most value on Xenite (to other people) remained well positioned in search results.

The bottom line is that, while I agree with many of Matt’s comments, his habit of flatly asserting things undermines his — oh, wait, that’s his criticism of me.

Actually, I wrote about this very sort of thing back in March 2001 in a Tolkien article titled A funny thing happened on the way to the canon. In that article I complained that:

The problem with defining a canon for Tolkien is that no one wants to share your canon. A few people have tried to be open-minded, but they inevitably get sidetracked when discussing someone else’s canon. “Well, you see, in my canon….” Tolkien didn’t make the task easy by any means. He kept starting and abandoning projects throughout his life, and because they all shared something in common (though one would be hard-pressed to identify many elements common to all the projects), there are people who glibly dip into one project to borrow material for discussing another project.

It’s the same thing in search engine optimization. No one wants to share my canon. But what is my canon? When I write about search engine optimization, I fall back into the same mode I use with Tolkien:

The Silmarillion is a book, composed or compiled by Christopher Tolkien. “The Silmarillion” is a story which J.R.R. Tolkien began working on about 1930. The story became the book, but the book is not the story. That is, the story was never completed, and has never been published. The Silmarillion is not even presented as an attempt to reconstruct the story. It’s an attempt to keep J.R.R. Tolkien’s fans happy. He had promised to publish The Silmarillion but no one really knew what that was. Tolkien himself never produced the Silmarillion because he would get only so far on a Silmarillion and then would start all over again. And there were so many associated texts which were never intended to be a part of the Silmarillion, but which inevitability became a part of The Silmarillion.

Confused? Now you know why I don’t try to define canons. Well, okay, I define them all the time. I wear them like disposable wrist-watches. I use them until the batteries run dry and then discard them. The canons I use today may look like the ones I used yesterday, but they are really different in some subtle, obscure fashion.

I have never changed the way I think (and write). Today I define one canon for “link flow” and tomorrow I define another canon for it. They may look the same but they may be very different. And for that reason people often conclude that I contradict myself in SEO theory.

It’s a reasonable conclusion if an incorrect one, because I have been schooled (lectured, browbeaten, however you want to describe it) into looking at the same issue from as many sides as possible. My professional career as a computer programmer and a search engine optimizer could be summed up by four words: That Is Not Acceptable.

No matter how hard you work, no matter how flexible you feel you are being, someone inevitably comes back at you and says, “That just won’t do. Make it better.”

Making my best efforts better for 30 years has forced me to assume that whatever I was thinking yesterday won’t work today. Being flexible doesn’t mean trying to explain the same thing in different words, it means explaining the idea from a completely different point of view. And people do this all the time, but most of the time we get it wrong when we try to explain something for someone else.

After all, people don’t share canons. We share ideas, not the frameworks in which we hold those ideas. My definition of “link flow” doesn’t necessarily agree with someone else’s definition of “link flow”. Now, if you search Google for PageRank and “link flow” you’ll find plenty of pages where people used them interchangeably long before the SEO Theory blog came along. So it’s perfectly valid to equate “link flow” with PageRank but I’m under no linguistic obligation to do that.

Equating PageRank with “link flow” is too rigid because PageRank just does not have anything to do with “link flow”. Link pathways are not measured in terms of PageRank (at least not in any public tool or paper I have found). You really could measure a link pathway in terms of PageRank but that’s not something anyone outside of Google is in a position to do. Don’t even hope you can use the Toolbar PR to do it. You don’t have enough linking data to assign the PageRank correctly.

Link flow has a place in managing PageRank. It also has a place in managing site design. It has a place in content promotion. These are distinct concepts, separate contexts. The canons are different.

I do my best to establish the canon — the context — for what I am saying in each article, but it’s not very helpful if you compare an article I write today to an article I wrote two months ago. Worse, if you try to hold me to a standard set by someone else (such as in the use of “link flow” where I have clearly provided my own working definition) you’re only going to draw the wrong conclusions.

Some people in the SEO industry wallow in drawing wrong conclusions. But I don’t think that’s what any of us really want.

NOTE ON TECHNICAL PAPERS AND PAGERANK
So, as far as how Google begins its PageRank calculation process goes, I can only say that they have to use some sort of non-zero values to begin with because adding, subtracting, and multiplying by zero only results in a zero value. I chose a paper for my illustrative link above on a random basis because Google Scholar makes it extremely difficult to find publicly accessible information in any sort of chronological time frame. The particular paper I linked to says:

You start with an arbitrarily guessed vector r (e.g. a vector of ones, all divided with number of pages present), that describes the initial PageRank value ri for all pages Pi.
Then you iterate the recursive formula until two consecutively iterated PageRank vectors are similar enough.

There are plenty of other technical papers that suggest the same starting values be used, but if I understand the math correctly, you could probably start with anything and just iterate your way through the process and eventually you’d arrive at some approximation of PageRank. And I should also point out that, so far as I can determine (I don’t have a Google employee roster), none of the authors of these technical papers work for Google.
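The "start with anything and iterate" idea can be tried on a toy graph. What follows is only a minimal sketch, not Google's method: the three-page graph, the function name `pagerank`, the damping value 0.85, and the stopping tolerance are all assumptions made for illustration, and every page is assumed to have at least one outbound link.

```python
def pagerank(links, d=0.85, start=None, tol=1e-12, max_iter=1000):
    """Iterate the normalized PageRank recurrence on a small link graph.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outbound link.
    """
    pages = sorted(links)
    n = len(pages)
    # Default guess: a vector of ones, all divided by the number of pages.
    rank = dict(start) if start is not None else {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}
        for page, outlinks in links.items():
            share = d * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        # Stop when two consecutive vectors are similar enough.
        delta = max(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

from_uniform = pagerank(links)
from_skewed = pagerank(links, start={"A": 1.0, "B": 0.0, "C": 0.0})

for page in sorted(links):
    print(page, from_uniform[page], from_skewed[page])
```

On this toy graph the two starting vectors settle on the same values to within the tolerance, which is the behavior the quoted paper describes. It says nothing about what Google actually does.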

How complex can the math get? Well, one paper documents three possible starting valuation methods for calculating PageRank in parallel (this is another .PDF file):

Another strongly investigated research area is the parallelization of PageRank. Existing approaches to PageRank parallelization can be divided into two classes: Exact Computations and Approximations. For the former ones, the Web graph is initially partitioned into blocks: grouped randomly (e.g., P2P PageRank [189]), lexicographically sorted by page (e.g., Open System PageRank [195]), or balanced according to the number of links (e.g., PETSc PageRank [111]). Then, standard iterative methods such as Jacobi or Krylov subspace [111] are performed over these pieces in parallel, until convergence. The partitions must periodically exchange information: Depending on the strategy this can expose suboptimal convergence speed because of the Jacobi method and result in heavy inter-partition I/O. In fact, as the Jacobi method performs rather slow in parallel, we modified the Gauss-Seidel algorithm to work in a distributed environment and found the best speed improvements so far (see Kohlschütter, Chirita and Nejdl [145]).

Did you get all that? They’re trying to distribute the computation of PageRank across multiple resources. I won’t even pretend to be able to explain that.
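For readers who want a feel for the two method names in that passage, here is a single-machine sketch of the difference between a Jacobi sweep and a Gauss-Seidel sweep. It is not the distributed algorithm from the paper, and it makes no claim about Google. The toy graph, the function names `sweep` and `solve`, and the damping value are assumptions. The only difference between the two modes is whether a sweep reads the previous vector (Jacobi) or the values it has already updated during the same sweep (Gauss-Seidel).

```python
def sweep(rank, inbound, outdeg, d, in_place):
    """Run one sweep over all pages and return the largest change seen."""
    n = len(rank)
    # Jacobi reads a frozen copy; Gauss-Seidel reads the values being updated.
    source = rank if in_place else dict(rank)
    biggest = 0.0
    for page in rank:
        total = sum(source[q] / outdeg[q] for q in inbound[page])
        updated = (1.0 - d) / n + d * total
        biggest = max(biggest, abs(updated - rank[page]))
        rank[page] = updated
    return biggest


def solve(links, d=0.85, in_place=False, tol=1e-12, max_sweeps=1000):
    """Sweep until the vector stops changing; return (ranks, sweeps used)."""
    outdeg = {p: len(targets) for p, targets in links.items()}
    inbound = {p: [] for p in links}
    for page, targets in links.items():
        for target in targets:
            inbound[target].append(page)
    rank = {p: 1.0 / len(links) for p in links}
    sweeps = 0
    while sweeps < max_sweeps:
        sweeps += 1
        if sweep(rank, inbound, outdeg, d, in_place) < tol:
            break
    return rank, sweeps


# Hypothetical three-page web with no dangling pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

jacobi_rank, jacobi_sweeps = solve(links, in_place=False)
seidel_rank, seidel_sweeps = solve(links, in_place=True)
print("Jacobi sweeps:", jacobi_sweeps, "Gauss-Seidel sweeps:", seidel_sweeps)
```

Both modes solve the same set of equations, so they should agree on the final values. How many sweeps each one needs depends on the graph and on the order in which pages are visited.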

So, in the future, when I make a comment about PageRank calculation, I’ll probably stay with the simple model even though it’s not the correct one (as Matt says) but I’ll disclaim myself a little better going forward.

NOTE TO GOOGLE: Here is an example of the queries I have to use in Google Scholar. I realize you may expect most Scholar users to have subscriptions to pay sites, but that’s really cost-prohibitive (even for someone like me, who could ask that the company pay for access). Would you PLEASE make it easier to search non-subscription sources only?

2007 google calculating pagerank starting value -site:acm.org -site:springerlink.com -site:ieee.org -site:wiley.com -site:computer.org -site:iop.org -site:ieeecomputersociety.org -site:sciencedirect.com

See also:
The PageRank control myth and the nofollow for SEO myth

How to screw your Web site with nofollow

10 comments

Halfdeck 09.05.07 at 5:13 am

“I understand the math correctly, you could probably start with anything and just iterate your way through the process and eventually you’d arrive at some approximation of PageRank.”

Exactly. You can start with a value of 0 or 1 and end up with the same set of PageRanks after a few dozen iterations.

Michael Martinez 09.05.07 at 7:23 am

You cannot start with 0. You have to start with a non-zero value.

dodito 09.05.07 at 8:38 am

Halfdeck in fact.. iterations are not THAT simple.. with a bit of bad luck it really matters how you start out (and as Michael said definitely NOT zero.. you can’t divide by zero for one.. ) and how stable or unstable the formalism is, and how much you will ignore before you call it “converged” etc etc..

Halfdeck 09.05.07 at 4:58 pm

“You cannot start with 0. You have to start with a non-zero value.”

Yes you can.

Michael Martinez 09.05.07 at 9:59 pm

Half, a document with no PageRank value won’t confer any PageRank value regardless of how many (or how few) outbound links it possesses. Hence, if every document starts with 0, there is no way any documents can accrue greater PageRank as none of them will have any starting PageRank for their outbound links.

Halfdeck 09.06.07 at 8:45 am

How do I explain this?

Here’s the pseudo code for calculating PageRank (assuming an imaginary, perfect world where Google calculates PageRank for every URL on the web, no paid links are discounted, and we’re back in 1999):

For each URL-X in a webspace {
    Determine its “Increment” value (PageRank passed per link, or PageRank of a URL / number of value-passing links on that page);
    For each URL that URL-X links to: {
        Add Increment to that URL’s PageRank.
    }
}

Loop that code block a few dozen times till PageRanks stabilize.

How do you determine the Increment value?

1. Figure out a URL’s PageRank.
2. Increment = PageRank / number of links on that page.

And how do you figure out a URL’s PageRank?

It depends on which formula you use, but if you use the original formula:

PageRank = (1 - d) + …….

Even with a starting PageRank value of zero, the dampening factor creates a padding (if d = 0.7, then you get .3) so that the result can never be a zero value.

You don’t need a default starting PageRank value, because a URL’s PageRank isn’t determined by the default value, it’s determined by the PageRanks of URLs linking to the URL. But you don’t know those PageRanks till you calculate them. And how do you do that? You calculate PageRanks of the URLs linking to the URLs linking to this URL…

In that iterative process, default PageRank value is irrelevant.
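The loop described in this comment can be made runnable as a rough sketch. Two things here are assumptions and are not stated in the comment: the elided part of the formula is taken to be d times the sum of the increments arriving from inbound links (the commonly published form), and the three-page graph is invented for the test. The name `iterate_pagerank` is hypothetical too.

```python
def iterate_pagerank(links, d=0.7, start=0.0, passes=50):
    """Repeat the increment loop a fixed number of times.

    Assumed formula: PageRank = (1 - d) + d * (sum of inbound increments).
    Assumes every page has at least one outbound link.
    """
    rank = {page: start for page in links}
    for _ in range(passes):
        incoming = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            # Increment = PageRank of the page / number of links on that page.
            increment = rank[page] / len(outlinks)
            for target in outlinks:
                incoming[target] += increment
        rank = {page: (1.0 - d) + d * incoming[page] for page in links}
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

first_pass = iterate_pagerank(web, start=0.0, passes=1)
from_zero = iterate_pagerank(web, start=0.0)
from_one = iterate_pagerank(web, start=1.0)

for page in sorted(web):
    print(page, first_pass[page], from_zero[page], from_one[page])
```

Under those assumptions, the first pass from zero leaves every page at 1 - d, and after a few dozen passes the runs started at 0 and at 1 differ only by a tiny amount on this graph.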

Michael Martinez 09.06.07 at 4:31 pm

Even with a starting PageRank value of zero, the dampening factor creates a padding (if d = 0.7, then you get .3) so that the result can never be a zero value.

And if d = 1, what do you get?

To start with a PageRank of 0, you have to zero out the value. Hence, allowing (1 - d) to be greater than 0 means you start out with a non-zero value.
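The d = 1 case is easy to check with a single pass of the same kind of update. This is only a sketch: the formula is assumed to be (1 - d) plus d times the inbound increments, and the graph and the name `one_pass` are made up for the example.

```python
def one_pass(rank, links, d):
    """Apply one update of the assumed formula (1 - d) + d * inbound sum."""
    incoming = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        for target in outlinks:
            incoming[target] += rank[page] / len(outlinks)
    return {page: (1.0 - d) + d * incoming[page] for page in links}


# Hypothetical three-page web, every page starting at zero.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
zeros = {page: 0.0 for page in web}

stuck = one_pass(zeros, web, d=1.0)     # no (1 - d) term to lift values off zero
lifted = one_pass(zeros, web, d=0.85)   # every page picks up 1 - d

print(stuck)
print(lifted)
```

With d = 1 nothing ever leaves zero, because zero is all that any link can pass along. With d below 1 the (1 - d) term supplies a non-zero value on the very first pass, and whether you call that a "starting value" is the point being argued in this thread.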

Halfdeck 09.06.07 at 6:26 pm

“And if d = 1”

I assume d’s gotta be always less than one.

“To start with a PageRank of 0, you have to zero out the value. Hence, allowing (1 - d) to be greater than 0 means you start out with a non-zero value.”

In my crawl tool (I’m not claiming PageRankbot is remotely similar to Googlebot), PageRank value of every URL is initialized to 0 (though it could be initialized to anything). During the first iteration, the tool calculates the first set of PageRanks. Because it doesn’t know the sum of inbound PageRanks to a URL, the PageRank formula is reduced to (1-d)/N[number of pages in the link graph]. So you can say the “default” PageRank value is (1-0.85)/N billion, not 0.

But to my mind, default value is a variable’s initialization value before any calculation is done.

Michael Martinez 09.06.07 at 8:39 pm

Half, your tool is not a PageRank tool. It’s your tool.

If you want to help people in the search engine optimization industry, you’ll help wean them off of Toolbar PageRank. It’s not providing any useful information anyway.

Halfdeck 09.07.07 at 5:40 am

“Half, your tool is not a PageRank tool.”

Where did I claim my tool calculates PageRank accurately? PageRank calculation is far more complicated than just running a bunch of urls through a PageRank calculator. First, just calculating PageRank on a single domain as if the rest of the web doesn’t exist will return inaccurate results. Second, with pages in the main and supplemental index, we don’t know what portion of a domain is included in the daily PageRank iteration. Third, we don’t know how much juice a link actually passes, because Google can devalue or discount PageRank flowing through a link depending on how much Google trusts a link. Fourth, to increase efficiency, Google might guess a URL’s PageRank instead of wasting resources crunching numbers. Fifth, we sure as hell don’t know what algorithm Google is using to calculate PageRank.

So no, I’d be nuts to claim that my tool generates accurate PageRank numbers. Like Alexa, it’s just an indicator. Take it or leave it.

“If you want to help people in the search engine optimization industry, you’ll help wean them off of Toolbar PageRank.”

Jeez. First, that tool has nothing to do with TBPR. It underplays TBPR. In fact, it drills into people’s heads that PageRank looks more like .00000000000123000105079 instead of 0-10. Some people asked me where the PageRanks were coming from because the PageRanks they saw sometimes didn’t mesh well with what they saw in their toolbar. Good. At least the tool’s got them thinking twice about what they think they know about PageRank.

Second, did you even try running the tool on your site? If you did, you’d realize you (used to have) inconsistent non-www/www references in your internal links and no 301 redirect in place. In your case I assume it’s completely intentional, but for someone else it could make a mess of things. I also had someone complain the tool isn’t picking up nofollow tags; when I looked the guy is using “rel-nofollow” instead of “rel=nofollow.” Seriously, if the tool crawled a little faster, I’d stop recommending Xenu to people and recommend my tool instead.

Third, try writing code that just parses robots.txt and returns “disallowed” or “allowed” when fed a URL. Writing just that piece of code forced me to reread Google’s documentation on robots.txt (e.g. when given user-agent: * Allow: /dir/ and user-agent: Googlebot Disallow: /dir/index.html Google will ignore the directive declared under user-agent:*; depending on allow/disallow declaration ordering, Googlebot will behave differently, not always the way you might expect; which regexp Google responds to and which one Google ignores). It’s not terribly hard coding, I admit, but writing stuff like that forces you to look at things in more detail. If the SEO community doesn’t appreciate that, at least I got something out of the experience.
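A small "allowed or disallowed" checker along those lines can be sketched with the robots.txt parser in Python's standard library. This shows how that parser behaves, not how Googlebot behaves: Python applies the first matching rule within a group, and Google's own handling of rule ordering and wildcards may differ. The helper name `verdict` and the sample file are assumptions built from the example in the comment.

```python
from urllib.robotparser import RobotFileParser

# Sample file modeled on the example above: a general group and a Googlebot group.
ROBOTS_TXT = """\
User-agent: *
Allow: /dir/

User-agent: Googlebot
Disallow: /dir/index.html
"""


def verdict(robots_txt, user_agent, url):
    """Return 'allowed' or 'disallowed' for a user agent and a URL path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return "allowed" if parser.can_fetch(user_agent, url) else "disallowed"


for agent, path in [
    ("Googlebot", "/dir/index.html"),
    ("Googlebot", "/dir/page.html"),
    ("OtherBot", "/dir/index.html"),
]:
    print(agent, path, verdict(ROBOTS_TXT, agent, path))
```

With this parser, a crawler that has its own named group is judged only by that group, and the rules under user-agent: * apply to everyone else.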

Fourth, you’re getting off the subject.