Have you ever wondered how water finds its way through a sponge down to the surface beneath the sponge? Have you ever wondered why rain droplets form paths as they flow down your windshield, rather than just drop straight to the lowest point on your vehicle?
Percolation Theory attempts to explain some aspects of these phenomena.
There are two branches of Percolation Theory: Bond Percolation and Site Percolation. Bond Percolation deals with the connections formed by edges in a matrix, and Site Percolation deals with the patterns formed by connected nodes in a matrix.
In Percolation Theory you can take any system comprised of separate but equal parts (equality being essentially identical with uniqueness) and gauge the degree of connectivity between the parts. Suppose you can map all the parts in a matrix, each part being identified by its own node. We say the system has perfect connectivity or P = 1 if you can reach any side of the matrix from any node within the matrix. That is, there is at least one pathway connecting every node to every other node.
We say the system has no connectivity or P = 0 if the nodes are all completely disconnected — that is, you cannot reach any node from any other node in the matrix. More simply and intuitively put, when you have zero connections you have no connectivity.
Percolation Theory holds that in every system there must be a threshold point, a value Pc where things change. When measured for any system, if P is less than Pc you have broken connectivity. In other words, starting with no connectivity at all, you can connect nodes at random and not create a pathway from one side of the matrix to the other as long as you have fewer than Pc connections.
Conversely, if you start with a completely connected matrix where P = 1, you can break connections randomly and as long as P remains greater than Pc you’re guaranteed at least one pathway across the matrix (every node is still somehow connected to the rest).
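To see a threshold like Pc in a toy setting, here is a minimal Python sketch of Bond Percolation on a square grid. It is an illustration under assumptions, not a model of the Web: each edge between neighboring nodes is opened independently with probability p, and the names spans and crossing_rate, the grid size, and the trial count are all made up for this example.

```python
import random


def spans(n, p, rng):
    """Return True if open bonds connect the top row to the bottom row of an n x n grid."""
    # Union-find over the n*n nodes plus two virtual nodes (top and bottom).
    parent = list(range(n * n + 2))
    top, bottom = n * n, n * n + 1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Tie every top-row node to "top" and every bottom-row node to "bottom".
    for col in range(n):
        union(col, top)
        union((n - 1) * n + col, bottom)

    # Open each horizontal and vertical bond independently with probability p.
    for row in range(n):
        for col in range(n):
            node = row * n + col
            if col + 1 < n and rng.random() < p:
                union(node, node + 1)
            if row + 1 < n and rng.random() < p:
                union(node, node + n)

    return find(top) == find(bottom)


def crossing_rate(n, p, trials=200, seed=0):
    """Fraction of random grids in which a top-to-bottom pathway exists."""
    rng = random.Random(seed)
    return sum(spans(n, p, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for p in (0.0, 0.3, 0.5, 0.7, 1.0):
        print(p, crossing_rate(20, p))
```

With p = 0 no pathway can form and with p = 1 one always does. For this particular square grid the crossing rate is known to change sharply around p = 0.5 as the grid grows; networks with other shapes have different thresholds.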
Percolation Theory has been used to study materials and particle physics. It has also been used to explain the Fermi Paradox (which is a faulty scientific question that asks, “If there are X space-faring, colonizing civilizations in the galaxy, why haven’t they found us yet?”).
Brief Tangent – Why is the Fermi Paradox a faulty question? Because it presupposes that any space aliens visiting Earth would want to land in Washington, D.C. and make contact with our government. This assumption does not explain how space aliens would know where to land or why they would want to make contact. The question (the paradox) assumes that all UFO sightings are bogus.
If it’s not obvious by now, then let me say that Percolation Theory can be applied to the Web in several ways. We can use it to measure connectivity between Web documents, Web sites, and Web communities. We can use it to measure crawl pathways. We can use it to explain why search engine A is less likely to crawl and index the exact same set of documents as search engine B.
In short, Percolation Theory is the lens we must use to understand why two backlink profiles differ (or why they are essentially equal). You can easily see Percolation Theory in action on the Web by playing with tools like TouchGraph, which uses Google’s related: report to map connections between Web sites. (TouchGraph now offers reports for Amazon and Facebook, too.)
TouchGraph and similar tools provide you with a high-level view of a site’s connectivity. You cannot see a complete picture of the Web or even of the communities in which most individual sites participate but these tools provide approximations that, when compared to each other, reveal some interesting relationships. If you wonder how a site might be algorithmically classified as of “type A” versus as of “type B”, the relationships graph may provide you with some insight.
Of course, the analysis cannot be better than the data provided to it, so take all such analyses with a heavy dose of salt.
Percolation Theory can also be used to analyze social media use. For example, suppose you want to know how active I am in social media. You can set up a matrix of social media sites and plot where you find my activity. Do I appear on Sphinn, DIGG, StumbleUpon, LinkedIn, Facebook, etc.? Do my various social media profiles link to each other?
Let’s give this matrix of my social media profiles a name — let’s call it a Social Media Matrix. So the Michael Martinez Social Media Matrix can be evaluated in terms of connective saturation. That is, the more connections there are between my social profiles, the more saturated my connectivity is. Connective saturation is not necessarily a good thing. After all, a link farm displays a connective saturation value of 1 (if we look at connective saturation as a probability distribution).
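One simple way to put a number on connective saturation is to divide the links that exist between the profiles by the links that could exist. The Python sketch below does that. It is only one possible reading of the idea: the function name connective_saturation, the choice to count links in each direction separately, and the sample links are assumptions made for illustration, not real data.

```python
def connective_saturation(profiles, links):
    """Fraction of possible directed links between profiles that actually exist."""
    nodes = set(profiles)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep only links between known profiles; ignore self-links and duplicates.
    present = {(a, b) for a, b in links if a in nodes and b in nodes and a != b}
    return len(present) / (n * (n - 1))


# Hypothetical Social Media Matrix: three profiles, three of six possible links.
profiles = ["Sphinn", "DIGG", "StumbleUpon"]
links = [("Sphinn", "DIGG"), ("DIGG", "Sphinn"), ("DIGG", "StumbleUpon")]
print(connective_saturation(profiles, links))  # 0.5

# A link farm: every profile links to every other profile.
farm = [(a, b) for a in profiles for b in profiles if a != b]
print(connective_saturation(profiles, farm))  # 1.0
```

A value of 1.0 only says that every possible link is present. By itself it does not say whether those links came about naturally.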
In other words, we need to ask the question, “How likely is it that a set of connections between Web sites of a given class is artificial?”
Don’t fuss over what a class is. It’s just a collection of objects grouped together by arbitrary (as opposed to random) criteria. So you can define classes of Web sites all day long and measure their connectivity saturation. The more saturated a group of related sites’ connectivity is, the more likely that connectivity is artificial.
However, this is a thesis that has not been proven to hold true in all cases. For example, in a very young community where all the members know about each other, you should expect to have a connectivity saturation equal to 1 without any arbitration. Arbitration is the action of an individual (a person or an organized group). When individuals connect through mutual interest without deliberate coordination they are not being arbitrary, they are being random.
The distinction might seem very narrow at first glance but we can illustrate with some numbers. Suppose, for example, you teach a class over the Internet. Let’s say you have 30 students. There is a probability Ps that your students will gradually link to each other’s sites on their own simply because they get to know each other.
Now let’s say you have 300 students. It should be intuitively obvious that Ps is smaller for 300 students than for 30 students. Why? Well, for one thing it takes more time to index 300 Web sites and for another thing it’s not easy to get to know 300 strangers all at once.
In a natural situation, Ps decreases as the size of the class (group) increases. That means there is an inverse relationship between Ps and the size of the group for which it is measured.
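The inverse relationship can be illustrated with a deliberately crude model. Assume, purely for the sake of the example, that each student ends up linking to about five classmates no matter how large the class is. The expected saturation is then that link budget divided by the number of possible classmates. Both the function expected_saturation and the figure of five links are hypothetical.

```python
def expected_saturation(group_size, links_per_member=5):
    """Expected fraction of possible links, assuming a fixed link budget per member."""
    if group_size < 2:
        return 0.0
    # Each member can link to at most (group_size - 1) others.
    return min(1.0, links_per_member / (group_size - 1))


for size in (30, 300, 3000):
    print(size, round(expected_saturation(size), 4))
```

Under this assumption the expected value for 300 students is roughly a tenth of the value for 30 students, and it keeps shrinking as the class grows.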
Now, let’s suppose we can compare 3 or more classes, each with about 3000 students, and we find that 1 of those classes has a higher connectivity saturation than Ps predicts there should be. That is an example of an arbitrary connectivity saturation.
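A comparison like this can be sketched without knowing Ps exactly, by using the other classes of similar size as the baseline. The Python snippet below flags any class whose saturation is far above the median of its peers. The name flag_outliers, the factor of 3, and the sample numbers are assumptions chosen for illustration; a real analysis would need a cutoff justified by data.

```python
from statistics import median


def flag_outliers(saturations, factor=3.0):
    """Return names of classes whose saturation exceeds factor times the peer median."""
    baseline = median(saturations.values())
    return sorted(name for name, s in saturations.items() if s > factor * baseline)


# Hypothetical saturation values for three classes of about 3000 students each.
classes = {"Class A": 0.002, "Class B": 0.003, "Class C": 0.05}
print(flag_outliers(classes))  # ['Class C']
```

A flagged class is only a candidate for a closer look. As noted above, a small or very young community can be highly connected without anyone arranging it.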
It may seem like Percolation Theory offers more value to the search industry than to the SEO industry. After all, the potential benefit for Web spam analysis seems self-evident to me. Although Percolation Theory is more easily applied to infinite groups, the Web itself is sufficiently large that a major search engine should be able to develop some interesting modeling algorithms using Percolation Theory.
On the other hand, the SEO industry can apply simplistic Percolation Theory to gauge the naturality of its linking profiles, social media connections, and Web site neighborhoods. In other words, we can develop methods (and perhaps tools) to help us measure the extent to which we are relying on any specific resources or techniques. If, for example, you follow three main practices in your optimization and you use Percolation Theory to compare their connectivity saturation values, and one of the three has a noticeably higher value than the other two, then you may reasonably conclude you’re overdoing it.
It’s something to think about as we enter the new year.
4 comments
Ben_McKay 01.08.09 at 12:58 am
I think you’re totally right about the naturally occurring metrics used to analyse web behaviours.
One addition though, there are some heavyweight sites that must be close to that saturation point but whom would not be severely penalised. Using your examples Facebook and Amazon, these must have a very abnormal presence on the web regarding a variety of different saturation metrics.
I suspect that certain brands’ and companies’ websites have special treatment. Search engines could not afford to not serve certain results. For instance, in the case of BMW spamming in Germany – they were penalised for a day I think, whereas others are penalised permanently. (I think this was actually for hidden text but the point of special treatment still exists).
It could be inferred that there are two levels of algorithms or at least a trust metric or human intervention (or all of the above) for search engines to adjust to these heavyweight sites.
p.s. I’ve been reading your site more and more lately – really enjoying it! It’s hit my favourites list on my rss feeds, so thanks for putting all this great content out there!!
Michael Martinez 01.08.09 at 12:29 pm
Believe it or not, but Amazon has been laboring under a special algorithmic penalty or filter for years. Most SEOs have forgotten (or have been in the industry too brief a time to know) that Amazon once dominated most search results for which its listings were relevant across multiple search engines, including Google.
Some Web-aged news sites were also once very dominant. Google and the other search engines responded to complaints about the problem from people like Danny Sullivan and fixed their algorithms.
Although Googlers might object to my calling it a penalty, whatever keeps Amazon from dominating the search listings like it once did is equivalent to a penalty for Amazon.
You and I, had we lost all those search listings, would have considered it a penalty.
I think there are indeed layers of complexity in today’s search algorithms. The search engines must, I am sure, be using metrics the SEO industry has never heard of.
Ben_McKay 01.10.09 at 1:37 am
That’s a great reminder – thanks for that Michael.
So using that example, maybe there isn’t a two tier metric, and indeed search engines have factored in their ‘overly’ saturated presence (and metrics) to create more interesting SERPs as they would with any other site.
I hear people suggesting that they rank well for every primary keyword in their niche, but maybe search engines would consider penalising a site (regardless of how large they are) when they go beyond this niche and into the realms of mass markets…this appears to be what has happened with Amazon at least in previous years.
After all, the same results for multiple searches around a topic would not necessarily make the user happy, let alone when we are talking about mass markets.
Food for thought, thanks Michael!
Ben
Michael Martinez 01.12.09 at 10:06 am
Food for thought indeed. Many people have suggested that the search engines have identified a lot of niches. In fact, I’ve even read some technical papers proposing methods for how to algorithmically categorize the Web. But whether there is a general algorithm that limits all sites’ abilities to stretch beyond their niche or maybe just a “Super Large Sites” algorithm, I don’t know.
The search engines have had several years to refine the ways they handle these Super Large Sites, and except for Wikipedia they seem to do a pretty good job of protecting the search results from low-quality listings. It’s a pity the search engines don’t want to do anything about Wikipedia, which is one of the worst sources of information on the Web.
Imagine an encyclopedia sitting on your bookshelf where every time you open the page to your favorite article you have no idea of whether the “facts” are what they were the last time you opened it. Would you want the government to manage your information (titles, registrations, tax records, etc.) the same way?