Why your link analysis methods suck

by Michael Martinez on December 24, 2008

Simple probability theory tells us that if we take two fair coins and toss them into the air, the chance of both coins landing heads up is 25%. You can illustrate this probability by listing all the possible outcomes of the toss: Heads + Heads, Heads + Tails, Tails + Heads, Tails + Tails. The probabilities for a coin toss’s results are often denoted by p and q, where p = the probability that a coin will land heads up and q = the probability that a coin will land tails up.

If you’re tossing two coins, you multiply the probabilities together. That is, the probability that a coin lands heads up = 0.5, so the probability that two coins land heads up is 0.5 times 0.5 — or 0.25 (which is a 25% chance of occurrence).

The probability that you’ll get a heads up and a tails up equals 50% (0.25 + 0.25) because you add the probabilities of heads + tails and tails + heads occurring.
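The two-coin case is small enough to check by brute force. Here is a minimal Python sketch, assuming fair coins; the function name `two_coin_probabilities` is made up for illustration.

```python
from fractions import Fraction
from itertools import product

def two_coin_probabilities():
    """Enumerate the four equally likely outcomes of tossing two fair coins."""
    outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT
    weight = Fraction(1, len(outcomes))       # each outcome has probability 1/4
    counts = {"two heads": 0, "one of each": 0, "two tails": 0}
    for first, second in outcomes:
        if first == second == "H":
            counts["two heads"] += 1
        elif first == second == "T":
            counts["two tails"] += 1
        else:
            counts["one of each"] += 1
    # Turn raw counts into exact probabilities.
    return {label: n * weight for label, n in counts.items()}

probs = two_coin_probabilities()
for label, p in probs.items():
    print(f"{label}: {p} = {float(p):.2f}")
```

The "one of each" bucket collects both Heads + Tails and Tails + Heads, which is why it comes out at 1/2 rather than 1/4.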

The math falls into a neat binomial expansion, p^2 + 2pq + q^2 = 1, which can be rewritten as (p + q)^2 = 1.

If you’re tossing three coins and want to know the probabilities, you work out the expansion of (p + q)^3.
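The same expansion works for any number of coins. The sketch below assumes independent tosses and uses a hypothetical helper, `heads_distribution`, to list each term of the expansion as the probability of getting exactly k heads.

```python
from fractions import Fraction
from math import comb

def heads_distribution(n, p=Fraction(1, 2)):
    """Return P(exactly k heads) for k = 0..n, i.e. the terms of (p + q)^n."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

three_coins = heads_distribution(3)
for k, prob in enumerate(three_coins):
    print(f"{k} heads: {prob}")
# 0 heads: 1/8
# 1 heads: 3/8
# 2 heads: 3/8
# 3 heads: 1/8
```

The terms always sum to 1 because p + q = 1.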

But what if you’re not working with coins? For example, suppose you’re working with a 3-sided thingamajigee that has a heads, tails, and sideways state. Assuming each state is equally likely, your basic p, q, and s probabilities work out to roughly 0.33 each. In other words, for every possible state or result of your thingamajigee toss, the probability of any precise state occurring is 1 divided by n, where n = the total number of possible outcomes.

Our thingamajigee doesn’t have to be something that is tossed. It could be, for example, a series of data items. Let’s say you have 100 tokens, each uniquely numbered from 1 to 100. What are the odds of, say, picking up any particular 5 tokens out of the 100? There are two possible outcomes for each token in this process: selected and unselected, so we can represent those outcomes by our old friends p and q.

Okay, that’s a little complex. However, we can turn to permutations without repetition to figure out the probabilities of choosing any particular 5 tokens from the 100.

The number of possible permutations of all 100 tokens is quite large: 100 factorial (100! = 100 * 99 * 98 … * 1). But to count the possible ordered selections of 5 unique tokens we need only work with (100 * 99 * 98 * 97 * 96), or 9,034,502,400.

So the chance of any specific 5 tokens being selected in one specific order = 1 divided by 9,034,502,400 (or about 1.1069e-10). That is a very, very, very small number.
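You can check that arithmetic in a couple of lines of Python. This is only a sketch: `math.perm` needs Python 3.8 or later, and the variable names are mine.

```python
from fractions import Fraction
from math import perm

# Ordered selections of 5 distinct tokens out of 100: 100 * 99 * 98 * 97 * 96
ordered_draws = perm(100, 5)
p_specific_order = Fraction(1, ordered_draws)

print(f"{ordered_draws:,}")               # 9,034,502,400
print(f"{float(p_specific_order):.4e}")   # 1.1069e-10
```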

But you may not want to require 5 different tokens. If the same token can be selected more than once, then given 100 tokens and 5 selections you would find 100^5 possible permutations, which comes out to 10,000,000,000. Let’s work with the second (larger) number of possible permutations because it’s a well-rounded number.

Now, let’s assume we create three lists of tokens from our 100 available tokens. The probability of any one of these lists being created randomly is 1 divided by 10,000,000,000 (or 0.0000000001). We can denote that value as 10^-10. That’s a pretty small number and, quite frankly, I don’t like small numbers.

So far these calculations assume that some specific order must occur when tokens are selected. There is a method for removing that requirement, thus reducing the possible number of permutations. Given n tokens where you are selecting r tokens from the group, you compute n! divided by the product of r! and (n - r)!.

Did you get all that? Sorry. I don’t know how to easily implement mathematical notation in a blog. But we’re dividing 100 factorial by the product of 5 factorial and 95 factorial. This is called a combination and can be denoted as C(n,r) (where n = 100 and r = 5).

Okay, let’s cheat on the factorial math.

And let’s change the number of selected tokens from 5 to 95. That is, we’re looking for C(100,95). According to the Stat Trek calculator, C(100,95) = 75,287,520.
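If you would rather not trust an online calculator, Python's standard library does the same cheating for you. The sketch below puts the three counting methods from this post side by side; the variable names are mine.

```python
from math import comb, factorial, perm

n, r = 100, 5

with_repetition = n ** r   # order matters, a token may be picked more than once
ordered = perm(n, r)       # order matters, no token picked twice
unordered = comb(n, r)     # order ignored, no token picked twice: n! / (r! * (n - r)!)

# Dropping the ordering requirement divides out the r! ways to arrange each selection.
assert unordered == ordered // factorial(r)
# Choosing which 95 tokens to take is the same as choosing which 5 to leave behind.
assert comb(n, n - r) == unordered

print(f"{with_repetition:,}")  # 10,000,000,000
print(f"{ordered:,}")          # 9,034,502,400
print(f"{unordered:,}")        # 75,287,520
```

That symmetry is why switching from 5 selected tokens to 95 does not change the answer: C(100,95) and C(100,5) are the same number.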

Is your head ready to explode? Well let’s take it to another level.

Let’s create a list of 60,000,000,000 things (call it the Master List). Furthermore, let’s derive 3 subsidiary lists from that list and call them List A, List B, and List C.

Now let’s make some assumptions about these lists:

  1. None of the lists contains all the items from the Master List. That is, each list has fewer than 60,000,000,000 items in it.
  2. List A is larger than List B.
  3. List B is larger than List C.

So here is the million dollar question: What is the probability of all of List C’s items being found in List A?

I’m not going to do the math for you, except to say that the probability of that happening is extremely small.
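For anyone who does want a rough number, here is a hedged sketch. It assumes something the scenario above does not state: that List A is a uniformly random subset of the Master List, chosen independently of List C. Under that assumption the probability is C(a, c) / C(N, c), where N, a, and c are the sizes of the Master List, List A, and List C. The function names and the example sizes (30 billion and 1,000) are hypothetical.

```python
from fractions import Fraction
from math import comb, log10

def prob_c_inside_a(master_size, a_size, c_size):
    """Exact P(every item of C is in A), assuming A is a uniformly random subset."""
    return Fraction(comb(a_size, c_size), comb(master_size, c_size))

def log10_prob_c_inside_a(master_size, a_size, c_size):
    """Same probability as a base-10 logarithm, usable for very large lists.

    Uses the product form (a/N) * ((a-1)/(N-1)) * ... with c_size factors.
    """
    return sum(log10((a_size - i) / (master_size - i)) for i in range(c_size))

# Token-sized example: 100 tokens, List A has 50, List C has 5.
small = prob_c_inside_a(100, 50, 5)
print(float(small))

# Hypothetical Master List-sized example: A holds half the items, C holds 1,000.
huge = log10_prob_c_inside_a(60_000_000_000, 30_000_000_000, 1_000)
print(huge)
```

Even when List A holds half of the Master List and List C has only 1,000 items, the logarithm comes out near -301, or about one chance in 10^301. Real lists are not chosen uniformly at random, so treat this as an illustration of scale and not as a measurement.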

Now, we can improve our odds of finding all of List C’s items in List A by reducing the size of the Master List and by setting a lower boundary for the size of List C. Or we can just narrow the differences between the sizes of the four lists. For example, we could assume that Master List = List A + 1, List A = List B + 1, and List B = List C + 1.

Now, that really takes all the fun out of the calculations, in my opinion, so let’s add a level of complexity. We’re going to add three List Pickers to our scenario, and they must follow these rules:

  1. Each List Picker can only choose from a temporary, randomly generated subset of the Master List, never able to see the entire Master List at any time no matter how often he tries to build his list.
  2. No List Picker may see what the other List Pickers’ choices are.
  3. If you ask the various list pickers to show you some of their choices, List Picker A will only show you a random sampling of choices; List Picker B will show you lots of choices but some of them will be fake; List Picker C will either show you a random sampling of choices OR it will show you lots of choices but some of them will be fake.

Now, let’s add one more level of complexity: Before each picker starts building his list, the Master List is randomly sorted. Hence, at no time can we be certain that the list pickers are choosing the same items — they cannot simply “start at the beginning of the list” and count through to whatever their arbitrary limits are.
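A toy simulation makes the effect of these rules easier to see. Everything in the sketch below is an assumption made for illustration: the 10,000-item Master List, the view and list sizes, the number of fakes, and the function names `build_list` and `run_trial`. It models only Picker B's style of reporting (lots of choices, some fake).

```python
import random

def build_list(master, view_size, list_size, rng):
    """Pick list_size items from a temporary random view of the master list."""
    shuffled = master[:]          # the Master List is randomly sorted for each picker
    rng.shuffle(shuffled)
    view = shuffled[:view_size]   # the picker never sees more than this subset
    return set(rng.sample(view, list_size))

def run_trial(seed=0):
    rng = random.Random(seed)
    master = list(range(10_000))
    list_a = build_list(master, 6_000, 3_000, rng)
    list_b = build_list(master, 5_000, 2_000, rng)
    list_c = build_list(master, 4_000, 1_000, rng)
    # Picker B's report: part of its real list plus 100 fakes
    # (negative ids never occur in the master list).
    shown_by_b = set(rng.sample(sorted(list_b), 500)) | {-i for i in range(1, 101)}
    return {
        "share of C found in A": len(list_c & list_a) / len(list_c),
        "share of A visible in B's report": len(shown_by_b & list_a) / len(list_a),
        "C fully inside A": list_c <= list_a,
    }

for label, value in run_trial().items():
    print(f"{label}: {value}")
```

With these made-up sizes, only a small fraction of List A can be reconstructed from what Picker B is willing to show you, and List C never lands entirely inside List A.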

If you’ve been following the SEO Theory blog for any length of time, you probably already know that Yahoo! cannot tell you what is in Google’s database. That is, there is no way you can reliably use Yahoo! to analyze backlink patterns in Google’s database because no search engine has complete knowledge of any other search engine’s data and algorithms.

That statement holds true for any link analysis tool you may favor: there are no tools available anywhere on the Web that will tell you what any major search engine knows about a given page’s (or site’s) backlink profile. They are all completely useless for that kind of analysis because the odds of their having the exact same data as any major search engine (like Google, Live, and Yahoo!) are microscopically small. If you’re guessing the probability of any link tool having reliable knowledge of any major search engine’s link data is in the billionths, you are WAY overestimating.

Now, that is not to say that all link analysis is bad. You can certainly use link analysis to see if someone has been building links. In fact, the more tools you use to evaluate linking data the better you will understand a site’s link profile. But if you limit yourself to using only one or two tools you’re pretty much wasting your time because you don’t know how much of a site’s link profile is being measured by any given tool.

The whole point of this exercise, however, is to show that all link analysis tools are extremely unreliable sources of information. They can only tell you what they know, not what other databases know. Of course, some people still feel strongly that if you find a site providing a link to one of your competitors you’d do well to also obtain a link from that site.

That’s one of the most inefficient approaches to link building, but the reasons why will have to wait for another day.

2 comments

Sean Revell 12.31.08 at 2:20 am

Excellent article, look forward to the follow up.

ericward 12.31.08 at 11:13 am

You wrote…

some people still feel strongly that if you find a site providing a link to one of your competitors you’d do well to also obtain a link from that site. That’s one of the most inefficient approaches to link building…

Getting links a competitor already has is the low hanging fruit of backlink analysis-driven link building. It’s why sites with no discernable differentiating content are fated/doomed to end up with similarly undiscernable undifferentiating inbound link profiles. They are like monkeys playing leapfrog inside an MC Escher painting, all equally exhausted and no closer to any reward.

Eric