You’ve heard it a million times. You’ve probably blogged about it until you’re sick to your stomach. PageRank is based on citation statistics. Every document gets a “vote”, and the “democratic process of the Web” allows documents to vote for the most important documents. Larry Page and Sergey Brin were not the only proponents of this idea. IBM’s Jon Kleinberg also suggested that citation-based measurements would ensure selection of higher quality documents.
As search engine optimizers, we know this is all hokum and black magic. In reality, if you point enough value-passing links at any empty document, any link-influenced search engine will promote that empty document to the top of search results for a targeted query, even though the empty document provides no value.
We could call this the Flash-bombing Effect, since the technique is most often used to promote Flash and other non-text content pages to the top of search results. The technique works so well because some search engines try to influence their relevance scores through citation-based measurements, what the mathematics community calls Citation Statistics.
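To see why pointing links at an empty page works, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production algorithm, and the page names, the tiny link graph, and the 0.85 damping value are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outbound links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outbound links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        # Every page splits its current rank equally among the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web: the empty page has no content and no inbound links.
web = {
    "useful_article": ["reference"],
    "reference": ["useful_article"],
    "fan_site": ["useful_article"],
    "empty_flash_page": [],
}
before = pagerank(web)

# Point twenty throwaway pages at the empty page and rank everything again.
bombed = dict(web)
for i in range(20):
    bombed[f"link_source_{i}"] = ["empty_flash_page"]
after = pagerank(bombed)

print("top page before:", max(before, key=before.get))
print("top page after: ", max(after, key=after.get))
```

Nothing in the calculation ever looks at what is on a page. In this toy graph the only thing that changes between the two runs is the number of links aimed at the empty page.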
In a landmark study published in June 2008, researchers concluded that citation statistics are misleading when it comes to determining quality.
But let me digress for a moment and point you to another study that is very interesting. Put your Game Theory hat on because we’re about to play ….
Public Good Games – Games that test the effectiveness of cooperation versus non-cooperation.
In general Public Good Games tend to show that cooperative strategies out-perform non-cooperative strategies. The classic Prisoner’s Dilemma illustrates this point very well. Two criminals are arrested and taken to separate rooms for questioning. Each prisoner is told that his buddy has confessed to the crime and that things will go poorly for him if he doesn’t confess, too. In reality, the police have no evidence and if they cannot obtain a confession from either crook they have to let both crooks go. So which strategy works best for both crooks?
The worst-case scenario is that both criminals confess and neither is given a break for helping solve the case. The middling-case scenario is that one criminal talks and walks and the other criminal serves the time. The best-case scenario is that neither criminal talks and both go free. That is an example of how cooperation offers the best payoff.
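If you want to see that ordering as numbers, here is a small sketch. The sentence lengths are hypothetical values chosen only to match the three scenarios described above; they do not come from any study.

```python
from itertools import product

# Hypothetical sentences in years, as (prisoner A, prisoner B). Illustrative only.
SENTENCES = {
    ("silent", "silent"): (0, 0),     # nobody talks, both go free
    ("silent", "confess"): (10, 0),   # B talks and walks, A serves the time
    ("confess", "silent"): (0, 10),   # A talks and walks, B serves the time
    ("confess", "confess"): (8, 8),   # both confess, neither gets a break
}


def combined_years(choice_a, choice_b):
    """Total years served by the pair for one combination of choices."""
    return sum(SENTENCES[(choice_a, choice_b)])


# Sort the four possible outcomes from best to worst for the pair as a whole.
outcomes = sorted(
    product(["silent", "confess"], repeat=2),
    key=lambda pair: combined_years(*pair),
)
for choice_a, choice_b in outcomes:
    print(choice_a, choice_b, combined_years(choice_a, choice_b))
```

With these assumed numbers, mutual silence costs the pair the least and mutual confession costs the most.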
In a citation-measured environment cooperation clearly works to the advantage of the people seeking citations. The more citations you obtain, the more credible your work appears. That’s the fundamental flaw of link-based search algorithms (which predate Google, by the way). The incentive to inflate citations is provided by the ranking mechanism.
However, what if everyone cooperated and provided only honest, truly earned citations? Would quality still rise to the top? The theory is that people who cooperate tend to succeed more often because they obtain more connections throughout the population; hence, they have more resources and allies to call upon than do people who don’t cooperate very often.
The act of cooperation is viral (as is the act of non-cooperation). Two competing strategies, cooperation and non-cooperation, can polarize a population. Public Good Game Theory holds that, in the long run, the cooperators will win out over the non-cooperators. Why? Because cooperators obtain more rewards than non-cooperators.
Think of a population passing through several phases. In the first phase the population divides into three groups: cooperators, non-cooperators, and everyone else. The non-cooperators will appear to achieve some early successes comparable to the cooperators’ early successes. In the second phase the rest of the population will join either the cooperators or the non-cooperators. However, the non-cooperators have few social connections; hence, the majority of the population joins the cooperators, thus shifting the bulk of their resources to the cooperative part of the population.
Now it just becomes a numbers game. It doesn’t matter what the mechanism for competition is; the end result will always be the same: the Union has more troops, ships, factories, and bullets than the Confederacy. It doesn’t matter how good the non-cooperative population is; the cooperative population outperforms the non-cooperative population.
In phase three the non-cooperative population dies out and the cooperative population survives.
In terms of search engine optimization, the more allies you bring to your campaigns, the better off you are. The fewer allies you work with, the less likely you’ll achieve much long-term success. Does that sound like social media squaring off against spam scripts? Sure it does, but it also sounds like link farms, SEO blogs and forums sharing tips and tricks, reciprocal linking, free-for-all pages, and other time-honored SEO strategies.
That part of the SEO community that bonds together tends to experience the most success. That part of the SEO community that eschews alliances tends to experience the least success. Now, this principle has nothing to say about what types of allies you obtain. It only shows that the more connected you are the more likely you are to reap the rewards of working with other people. In search engine optimization, that translates to obtaining more competitive rankings.
People emulate perceived success, and since cooperative strategies produce more successes overall, people will eventually build relationships with other people in order to obtain greater success. There is, of course, a levelling or smoothing effect. In fully social systems the rewards tend to be spread out pretty evenly over the long run.
One strategy that can help you stay on top is to practice cooperative techniques in the early phase, build a strong network, and then become less cooperative. In fact, that is what many of the so-called “top” SEOs do. People grow tired of their favorite dispensers of advice because the advice becomes jaded, repetitive, and uninformative. The leaders in the field start speaking more about “fundamentals” and provide fewer and fewer insights into achieving success.
Public Good Game Theory recognizes a middle group in more than one way. You could have non-cooperators (free riders), moderate cooperators (reciprocators), and full cooperators. Most people tend to be reciprocators (moderate cooperators) and their aggregate influence affects the rewards that the entire group earns.
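Here is a rough sketch of one round of a standard public goods game with those three player types. The 1.6 multiplier, the one-token endowment, and the fixed contribution levels are assumptions for illustration. In particular, a real reciprocator adjusts to what others give, which this sketch flattens to a constant half contribution.

```python
def play_round(contributions, multiplier=1.6):
    """One round: the shared pot is multiplied, then split equally among all players.

    Each player starts with 1.0 token and keeps whatever was not contributed.
    """
    share = sum(contributions) * multiplier / len(contributions)
    return [1.0 - given + share for given in contributions]


# Hypothetical fixed contribution levels for the three player types.
CONTRIBUTION = {"free_rider": 0.0, "reciprocator": 0.5, "cooperator": 1.0}


def group_payoffs(player_types, multiplier=1.6):
    """Return (type, payoff) pairs for one round played by the given group."""
    contributions = [CONTRIBUTION[kind] for kind in player_types]
    return list(zip(player_types, play_round(contributions, multiplier)))


mixed = group_payoffs(["free_rider", "reciprocator", "reciprocator", "cooperator"])
all_cooperators = group_payoffs(["cooperator"] * 4)
all_free_riders = group_payoffs(["free_rider"] * 4)

for label, group in [
    ("mixed", mixed),
    ("all cooperators", all_cooperators),
    ("all free riders", all_free_riders),
]:
    total = sum(payoff for _, payoff in group)
    print(label, [(kind, round(payoff, 2)) for kind, payoff in group], round(total, 2))
```

Under these assumptions the group made up entirely of cooperators earns the largest total, while inside the mixed group the free rider walks away with the largest individual payoff. That tension is what the three player types are reacting to.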
PageRank cannot cope with free riders. You have to filter them out. Nor can it handle reciprocators very well because their aggregate choices won’t be based on quality but rather on a mix of self-interest and the greater good. In testing, full cooperators constitute the smallest percentage of game-players (and the experiment I cite at the National Institutes of Health suggests that we are usually stable in our choices of type).
We thus have two areas of mathematics that challenge the assumption that link citation directs search engines toward higher quality documents. In fact, the numbers just don’t add up in favor of using link citation as a basis for determining quality. That might explain why Google has struggled to filter out link farms, faux directories, paid links, and link drop resources (like forums and blog comments).
So let’s go back to Citation Statistics. Let me cite the Executive Summary:
Executive Summary
This is a report about the use and misuse of citation data in the assessment of scientific research. The idea that research assessment must be done using “simple and objective” methods is increasingly prevalent today. The “simple and objective” methods are broadly interpreted as bibliometrics, that is, citation data and the statistics derived from them. There is a belief that citation statistics are inherently more accurate because they substitute simple numbers for complex judgments, and hence overcome the possible subjectivity of peer review. But this belief is unfounded.
- Relying on statistics is not more accurate when the statistics are improperly used. Indeed, statistics can mislead when they are misapplied or misunderstood. Much of modern bibliometrics seems to rely on experience and intuition about the interpretation and validity of citation statistics.
- While numbers appear to be “objective”, their objectivity can be illusory. The meaning of a citation can be even more subjective than peer review. Because this subjectivity is less obvious for citations, those who use citation data are less likely to understand their limitations.
- The sole reliance on citation data provides at best an incomplete and often shallow understanding of research—an understanding that is valid only when reinforced by other judgments. Numbers are not inherently superior to sound judgments.
Using citation data to assess research ultimately means using citation-based statistics to rank things – journals, papers, people, programs, and disciplines. The statistical tools used to rank these things are often misunderstood and misused.
- For journals, the impact factor is most often used for ranking. This is a simple average derived from the distribution of citations for a collection of articles in the journal. The average captures only a small amount of information about that distribution, and it is a rather crude statistic. In addition, there are many confounding factors when judging journals by citations, and any comparison of journals requires caution when using impact factors. Using the impact factor alone to judge a journal is like using weight alone to judge a person’s health.
- For papers, instead of relying on the actual count of citations to compare individual papers, people frequently substitute the impact factor of the journals in which the papers appear. They believe that higher impact factors must mean higher citation counts. But this is often not the case! This is a pervasive misuse of statistics that needs to be challenged whenever and wherever it occurs.
- For individual scientists, complete citation records can be difficult to compare. As a consequence, there have been attempts to find simple statistics that capture the full complexity of a scientist’s citation record with a single number. The most notable of these is the h-index, which seems to be gaining in popularity. But even a casual inspection of the h-index and its variants shows that these are naïve attempts to understand complicated citation records. While they capture a small amount of information about the distribution of a scientist’s citations, they lose crucial information that is essential for the assessment of research.
The validity of statistics such as the impact factor and h-index is neither well understood nor well studied. The connection of these statistics with research quality is sometimes established on the basis of “experience.” The justification for relying on them is that they are “readily available.” The few studies of these statistics that were done focused narrowly on showing a correlation with some other measure of quality rather than on determining how one can best derive useful information from citation data.
We do not dismiss citation statistics as a tool for assessing the quality of research – citation data and statistics can provide some valuable information. We recognize that assessment must be practical, and for this reason easily-derived citation statistics almost surely will be part of the process. But citation data provide only a limited and incomplete view of research quality, and the statistics derived from citation data are sometimes poorly understood and misused. Research is too important to measure its value with only a single coarse tool.
We hope those involved in assessment will read both the commentary and the details of this report in order to understand not only the limitations of citation statistics but also how better to use them. If we set high standards for the conduct of science, surely we should set equally high standards for assessing its quality.
Joint IMU/ICIAM/IMS-Committee on Quantitative Assessment of Research
Robert Adler, Technion–Israel Institute of Technology
John Ewing (Chair), American Mathematical Society
Peter Taylor, University of Melbourne
Well, that was a mouthful. This paper takes not just one but two pot-shots at the whole Google premise because it directly challenges the “source helps determine better wisdom” concept. That is, Google’s claim to superiority was that it didn’t simply count citations; it held that citations from more important sources of information were better arbiters of value.
Not so, says the Math Union. However, it’s not Google that is really getting slammed by these conclusions.
It’s the search engine optimization community that suffers the most. After all, search optimizers tend to value links on the basis of source (CNN-quality links versus spam-quality links) and search optimizers tend to trust Google’s Toolbar PageRank and Yahoo!’s link reports more than any other metrics, both of which are provably false purveyors of information. Toolbar PageRank doesn’t tell you if a page will pass value through its links; and Yahoo!’s link reports don’t show you which value-passing links Google knows about.
In short, the search engine optimization community is losing the game for three reasons:
- Most SEO metrics favor Toolbar PageRank and Yahoo! link counts (false information)
- Most SEOs fall into the reciprocators group (the group with mixed self-interest and altruistic goals)
- No major search engine, not even Google, weights search results solely by citation statistics
In other words, the SEO community sacrifices competitive knowledge in favor of false knowledge (Toolbar PageRank and Yahoo! link reports), practices self-defeating strategies (mixed self-interest and altruism), and doesn’t focus on the full spectrum of search ranking algorithmic factors.
Which is not to say that SEOs don’t look at other data. But you’ll have a hard time finding any well-known SEOs who consistently talk about factors other than PageRank and backlinks. You’ll also have a hard time finding anyone who openly shares all his knowledge and experience. And don’t even hope you’ll find out what all the ranking factors are.
Which leads us to where there may be some apparent incongruence between the applicable points. On the one hand, it’s easy to manipulate Google’s search results if you can point enough value-passing links at a given document. On the other hand, the search engines are not transparent machines and most of the people trying to influence them are doing so rather inefficiently.
Which means that the most effective competitive strategy in search engine optimization may be to lean mostly toward altruistic practices (be open and fully cooperative) until you have accumulated enough resources to take advantage of your allies and the reciprocators. In other words, share openly everything that is already shared openly, but find the quickest path toward acquiring knowledge or insight that is not shared openly and then keep it to yourself.
This Turncoat Strategy creates a springboard effect that propels former Cooperators ahead of the pack. The trick to pulling this off successfully, however, is to revert to your altruistic nature when there is a change in the environment. In search engine optimization, any new technology or methodology resets the clock. People who remain locked in Free Rider mode will eventually fall behind, whereas people who make the transition from Free Rider back to Altruist at the right time can rebuild their connections, reap the rewards of early cooperation, and then springboard ahead once again by reverting to Free Rider mode.
The best time to become a Free Rider is probably after all the previous Free Riders have died out.
In other words, people who promote social linking schemes (that don’t trip spam filters) the soonest reap the most benefit from the schemes. By retreating from the limelight as the SEO community adopts the social linking schemes, the early promoters move forward and begin developing new advantages. When the old advantages have played themselves out the early promoters can roll out the new advantages and start the process over again.
This principle works because, in reality, PageRank does not work. It cannot provide any guidance toward quality and search engines in general have to settle for the economic compromise of promoting content that is minimally satisfying. If search engines are classified as players in a Public Good Game, they have to be considered Reciprocators. They will never have an advantage because of their mixed self-interests and altruistic principles.
But why a social search engine (like Wikia) or resource (like Wikipedia) fails to provide quality results is a topic for another day.