Determining relevance, authority, and accuracy in search engine optimization

by Michael Martinez on June 27, 2008

There are many definitions for relevance, some of which have been proposed specifically for search engineering and/or search engine optimization. Any definition of “relevance” that speaks of quality or importance or authority misses the mark, in my opinion. Relevance has nothing to do with qualitative measurements with respect to how much a document satisfies a user’s query.

Let’s pretend you’re the Answer Man. You sit at a desk all day and people either call you, email you, or come by in person to ask you questions on all sorts of topics. Your job requires that you exercise due diligence and provide the most reasonably obtainable but satisfactorily correct answers to these questions.

A search engine, by contrast, is not expected to provide satisfactorily correct answers. That is, there is no minimum standard of truth, correctness, or authority to which any search engine can be held. The standard to which the search engines are held is best expressed as “this set of results contains the documents that we believe may be most likely to provide you with a satisfactory search experience”.

Ideally, I think every search engineer would tell you, “Yes, we want to provide our users with the absolutely best, most authoritative, easily understood and highly usable documents relevant to their queries.” But search technology is just not anywhere near that neighborhood. It needs some help.

Hence, people employ search optimization techniques in all three areas of the search environment: searchers modify their queries, search engines hedge their bets by serving multiple results and testing different algorithms, and content providers work to improve both the quality and apparent relevance of their documents.

We cut the search engines some slack for a variety of reasons. At some human level we accept that none of us is perfect and yet the search engineers do some amazing things with their technology. But we can also sit at the keyboard and type in endless variations on our queries until we find something that looks like what we are looking for.

As searchers and as search engineers we are not held to the same standard that we would be held to if we were the Answer Man. The Answer Man is expected to know or find the right source of information. He has to be better than anyone else at his job, otherwise he isn’t qualified to be sitting at the Answer Desk.

Hence, we ask no more of any search engine than that it provide satisfactory answers. We settle for “feel good” answers if we cannot find anything better than that (the so-called Wikipedia Principle). As search optimizers, we create answers of convenience rather than answers of authority because authority cannot be optimized or indexed.

That is, the number of links pointing at a document in no way establishes the correctness of the document, but it may (in a pseudo-scientific way) confer a communal badge of authority upon the document — provided the links are drawn from informed sources of opinion. Does that sound like a mish-mash of PageRank and HITS? It should.

In an ideal search environment you would indeed want to rely upon the informed opinions of expert sources of information to help you find the most authoritative resources about any topic. Unfortunately for Jon Kleinberg and Ask, J.K. Rowling identified a very serious issue that afflicts the Web in volume. In Harry Potter and the Goblet of Fire, (the fake) Professor Moody introduces the fourth-year students to the three Unforgivable Curses, including the Imperius Curse.

During Lord Voldemort’s reign of terror, Moody informs his students, many witches and wizards claimed they were doing “you-know-who’s bidding” only because of the Imperius Curse. “Here’s the rub,” he then says: “How do you sort out the liars?”

Search engines are dealing with millions of Web sites operating under Imperius Curses. I can make any Web site relevant to any topic by placing content on that site about the topic. I can make that Web site an expert on the topic by having it point to hubs about the topic. I can make those hubs recognize my expert Web site and give it preference.

I can create 100 experts and 100 hubs and have them all link to each other. Naturally, they will link out to other experts and hubs, but my experts and hubs will be the most important experts and hubs in this huge cloud of Web sites. I can do this with sites I control and with sites other people control. I am a master of the SEO Imperius Curse.

Search engines are only as good as their programmers at identifying my subservient sites, and every time a programmer creates a filter that banishes one of my slave sites to the nether regions I learn something about what that programmer knows about spam.

He, however, reveals just as much to me about what he thinks is relevant, and search is all about relevance.

So HITS is (in concept) an easily manipulated algorithm.
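To make the point concrete, here is a minimal sketch of the textbook hub-and-authority iteration run against a toy link farm. It assumes that "experts" correspond to what HITS calls authorities, and every page name and link count below is hypothetical. It is not how Ask or any other engine actually scores pages.

```python
from math import sqrt


def hits(links, iterations=50):
    """Textbook HITS over a dict mapping each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        hub = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                hub[src] += auth[dst]
        # Rescale both score sets so the numbers stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Hypothetical honest community: three hubs citing two real experts.
links = {
    f"honest_hub_{i}": ["honest_expert_1", "honest_expert_2"] for i in range(1, 4)
}
# Hypothetical farm: six hubs that all cite the same three farm experts.
farm_experts = ["my_expert", "farm_expert_2", "farm_expert_3"]
for i in range(1, 7):
    links[f"farm_hub_{i}"] = list(farm_experts)
# One farm hub links out to a real expert so the farm blends into the cloud.
links["farm_hub_1"].append("honest_expert_1")

hub, auth = hits(links)
for page in ("my_expert", "honest_expert_1", "honest_expert_2"):
    print(page, round(auth[page], 4))
```

In this toy graph the farm is simply bigger and more tightly interlinked than the honest community, so the iteration rewards it. Nothing in the arithmetic asks whether any of the pages is right about anything.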

PageRank, on the other hand, is apparently used to manage both trust and crawling. I can leverage that to my advantage too by building a lot of sites that link to a small number of sites consistently — or by placing links across a lot of sites. Most spammers favor the “put links on as many sites as possible” approach, as do most white hat SEOs. In essence, there is no difference between the white hat and black hat approach to manipulating PageRank since most people lob truckloads of links across the Web.

PageRank attempts to sort out the liars by figuring out who really is trustworthy. You can build only so many links on pages that already have PageRank. The rest of the pages have to earn their trusted status. It used to be much easier to slip links into the PageRank pool. Now you need to plan ahead and be patient, methodical, and intuitive.

Intuitive link placement is more efficient than volume link placement. After all, one link from CNN’s front page should confer more PageRank than 1,000 links from previously non-existing domains.
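A scaled-down sketch shows why one well-placed link can outweigh a pile of links from brand-new domains. It uses the published, simplified PageRank formula with an assumed damping factor of 0.85. The page names and counts are hypothetical, and whatever Google actually computes is certainly more involved than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of page -> outbound links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        # Pages with no outbound links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(pages, base)
        for src, targets in links.items():
            if not targets:
                continue
            # Each page splits its damped rank equally among its outbound links.
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


links = {}
# Hypothetical: 200 established pages already link to a trusted front page.
for i in range(200):
    links[f"established_fan_{i}"] = ["trusted_front_page"]
# The trusted front page gives site_a a single link.
links["trusted_front_page"] = ["site_a"]
# Hypothetical: 100 brand-new domains, with no inbound links, all link to site_b.
for i in range(100):
    links[f"new_domain_{i}"] = ["site_b"]

rank = pagerank(links)
print("site_a:", rank["site_a"])
print("site_b:", rank["site_b"])
```

Here site_a comes out ahead of site_b on one link against one hundred. The result depends on the numbers, though: the trusted page only has something to pass along because of the links it has already accumulated, and if you push the count of new domains high enough in this toy graph, site_b overtakes site_a.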

But does CNN really know any better than you or I whether a particular Web site is the best source of information on a topic? The news media, like the search media, have to operate within the limits of their knowledge and capability. The news media are not the Answer Media — they cannot be held to as high a standard as the Answer Man, although most people would agree that western news media are held to a higher standard than search engines.

CNN will indeed link to a site that is relevant to a particular topic but that site may not be very informative, or it may not be correctly informing. Link popularity in no way reflects who knows what they are talking about. PageRank, therefore, is not a reliable source of information about information. It’s just a rough indicator of where trusted content may be found on the Web.

The quest for relevance has been derailed by the quest for authority; or, rather, the quest for relevance has been made dependent upon the quest for reliably accurate information. However, the quest for reliably accurate information has been superseded by the quest for trustworthy sources of information: trustworthy in the sense that they are not being deceptive (or Imperiused to look like they are trustworthy), not trustworthy in the sense that they are right.

No Web site is automatically right about any topic. Just because you find taxpayer data on the IRS.gov site, for example, doesn’t mean that the data is correct. It could be someone tampered with the site, that the data is corrupted, that the data has been altered or truncated to protect taxpayer privacy, etc. We don’t know how to measure correctness, except through the weighted opinion of experts (which can be faked).

In short, there is no Answer Man for the World Wide Web. We have Answer Wannabes, or we are Answer Wannabes, and that is all the Internet gets.

Still, as Answer Wannabes we create our answers of convenience and then we project relevance onto them. We can measure the level of relevance only crudely (or with great precision if we choose not to take search engines into consideration). For example, we can speak of four types of general relevance:

  1. Strong Relevance
  2. Weak Relevance
  3. Presumed Relevance
  4. Null Relevance

A document has strong relevance to a given query if and only if the document’s primary topic precisely matches the query. The query expression must exist within the document and must be used within the document’s text sufficiently that any casual observer would reasonably conclude that the document’s main focus is on the topic described by the query expression.

But what if the query is a question? Should the document’s relevance be assessed by how much emphasis (and repetition) is placed on the question or by how much emphasis (and/or repetition) is placed on the answer?

You can only have strong relevance if there is a precise match between the document and the query. There is no precise match between a question and answer.

A document has weak relevance to a given query if and only if the document is relevant to the query but is not strongly relevant. That sounds like a Valley Girl definition to many, I’m sure. “Duh!”

Is there no middle ground between strong relevance and weak relevance? Relevance may be measured according to many different conceptual scales, but a document is either relevant to a query or it’s not. These definitions are not concerned with the question of whether a document is relevant but rather with how relevant the document may be.

For any given query all you can be sure of is that a document either precisely matches the query or it doesn’t. Hence, strong relevance tells us which documents precisely match the query and weak relevance tells us which documents do NOT precisely match the query. Nothing more, nothing less.

Within those two scopes, of course, you can imagine with complete justification scales of strength and weakness. A document would be completely relevant to a query if the document only precisely matches that query in its totality (not to the exclusion of matching other queries). In other words, if your query is “cats and dogs” and the only text in your document is “cats and dogs” then your document is completely relevant to “cats and dogs” (although it is also weakly relevant to “dogs and cats”).
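The distinctions so far can be sketched as a small classifier. Two simplifications are my own assumptions, not part of the definitions above: a document counts as a "precise match" if the query appears in it as a contiguous phrase (the primary-topic test is ignored), and a document counts as relevant at all if every query word appears somewhere in it. Presumed relevance depends on link data, so it is left out of this sketch.

```python
import re


def tokens(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_phrase(doc_tokens, query_tokens):
    """True if the query tokens occur as one contiguous run in the document."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def classify(document, query):
    """Label a document's relevance to a query under the stated assumptions."""
    doc, q = tokens(document), tokens(query)
    if not q:
        return "null"
    if doc == q:
        return "complete"  # the query is the document's entire text
    if contains_phrase(doc, q):
        return "strong"    # precise match on the query expression
    if set(q) <= set(doc):
        return "weak"      # assumed proxy: relevant, but no precise match
    return "null"


print(classify("cats and dogs", "cats and dogs"))
print(classify("cats and dogs", "dogs and cats"))
print(classify("Why cats and dogs fight", "cats and dogs"))
print(classify("A page about birds", "cats and dogs"))
```

The first two calls reproduce the "cats and dogs" example: the same document is completely relevant to one query and only weakly relevant to the reordered one.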

Complete relevance is rarely if ever a desirable state for any Web document, but you never know. Maybe there is some scientific or quotations database out there that produces Web pages which consist of nothing but single, entire expressions (say, a popup window). I reserve opinion on whether there can or should be documents that are completely relevant to real queries.

Most people in the SEO industry rely on presumed relevance. This is an aspect of algorithms like PageRank, HITS, CLEVER, ExpertRank, and Edison. Presumed relevance is conferred by link anchor text and/or by the text surrounding link anchor text. We look for evidence of presumed relevance in link profiles, although there are few if any link profiling tools (not provided by search engines) that show whether links use specific anchor text.

Even Google’s Webmaster Tools don’t tell you much about presumed relevance (you cannot see how many links are passing anchor text or which links are passing which anchor text).

Presumed relevance is a very crude tool for search optimization. Nonetheless, there are many situations where it is a necessary inconvenience. Many Web site operators, for example, simply refuse to allow SEOs to change or optimize their content; they insist the sites remain untouched, intact, and unoptimized. What’s an SEO to do in a situation like that, except choose between building links and moving on?

Third-party SEO also relies heavily on presumed relevance. This is a two-edged sword as it can be used to both harm and help. You can link bomb almost any site to the top of search results for any query. Most SEOs rely upon low-level link bombing for their campaigns. However, supposedly in some verticals you can also link bowl (or Google bowl) sites out of the search results.

Search reputation management employs third-party SEO when it promotes relevant, favorable content to the top of search results. Hence, a lot of search reputation management is based on presumed relevance.

Third-party SEO often appears in the form of grass-roots linking campaigns, such as when a community of people link to a specific Web site without regard for anchor text and search engines — they are simply promoting some site they like or want to help but in doing so drive a lot of PageRank and random anchor text toward the site.

Third-party SEO also occurs in the form of news media links, where news sites provide links to resources or sources of information about their stories. These links and stories help create and build interest in query spaces as well as drive PageRank and anchor text. Many SEOs use press release distribution services this way, too (although they don’t appear to achieve as much as news stories do).

Presumed relevance is the least trustworthy form of relevance because it cannot be measured. At least with strong and weak relevance you can see the words on the page and figure out quickly if the page is a document you want to trust.

Null relevance should be self-explanatory. A document has no relevance to a query. No search engine wants to show its users irrelevant documents but that sometimes happens. Why? One reason may be that a query is complex enough that one or more (but not all) of its search criteria are satisfied by at least some documents while no documents completely satisfy the criteria. Of course, such query results occur quite rarely, right?

Null relevance provides its own set of challenges and opportunities. You can use null relevance to design new query spaces. Search engines can use null relevance to test the accuracy of their algorithms. Searchers can use null relevance to prove points in arguments (keep in mind that queries can be structured to be misleading).

Yes, there are queries that mislead. They lie, deceive, and fool people. They are most often constructed to subtly omit highly relevant content. And deceptive queries do from time to time become popular. Try searching on “failure in the white house” and you’ll see that people must be linking to the White House Web site with anchor text of “failure in the white house”. Oh? That’s just another link bomb, right?

Technically, it just appears to be a side effect of the “miserable failure” link bomb. The page is strongly relevant to the expression “white house” and it’s presumably relevant to the word “failure”. You won’t find many pages in the search results that actually say “failure in the white house” and of the ones I checked none were linking with that anchor text. However, I cannot say for sure, as there are ways to hide links from casual queries.
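Here is a rough sketch of that split between on-page relevance and presumed relevance. The page text and the list of inbound anchor texts are made up for illustration, and treating each query word independently is an assumption on my part, not a description of how any search engine resolves the query.

```python
import re


def tokens(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def explain_match(page_text, inbound_anchor_texts, query):
    """Report, per query word, whether the page or only its inbound links supply it."""
    on_page = set(tokens(page_text))
    in_anchors = set()
    for anchor in inbound_anchor_texts:
        in_anchors.update(tokens(anchor))
    report = {}
    for term in tokens(query):
        if term in on_page:
            report[term] = "on page"
        elif term in in_anchors:
            report[term] = "anchor text only"  # presumed relevance
        else:
            report[term] = "unmatched"
    return report


# Hypothetical page text and hypothetical inbound anchor texts.
page_text = "Welcome to the White House"
anchors = ["miserable failure", "miserable failure", "the white house"]

report = explain_match(page_text, anchors, "failure in the white house")
for term, source in report.items():
    print(term, "->", source)
```

In this made-up case the words "white" and "house" come from the page itself, "failure" is supplied only by other people's links, and "in" is matched by nothing at all.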

Nonetheless, the “failure in the white house” query appears to be satisfied by a page on the White House Web site that is associatively relevant. That is, it is relevant because the search engine is associating unrelated factors with the query. This is an aspect of null relevance although some people might argue that there is a figurative connection between the query and the result — but I seriously doubt any search engine is capable of making such intuitive connections based on today’s technology. Even semantic search doesn’t promise that kind of query resolution.

There are people who want the search engines to be the Answer Man. There are people who want the search engines to think of them as the Answer Man. And there are people who want everyone else to think a third party is an Answer Man.

Sadly, there is no Answer Man and I doubt we’ll have one for a very long time. The technology is not yet there to help us become or create an Answer Man. And today’s technology has vulnerabilities that are exploited both intentionally and unintentionally by searchers, search engines, and search optimizers alike.

The lines between relevance, authority, and accuracy have been blurred. Those blurred distinctions help us see just how vague the lines between good guys and not-so-good guys can be on the Web. After all, even the good guys can be the Wrong Guys.


selectsplat 06.28.08 at 5:33 am

Outstanding article. In particular, I agree with the targeted, intuitive link placement approach, rather than mass spamming multitudes of sites out there to get a jump in PR and traffic.

One interesting thing I’ve noticed recently is that Google is actually de-indexing sites that seem to gather links too quickly. Perhaps this is because the site owner has used one of those mass spamming ‘publish one article with your link across 158 blogs’ or ‘submit to 17 zillion directories’ rip-offs, but I’m wondering whether the same thing would happen if you acquired these backlinks naturally. Is Google de-indexing sites that acquire links too quickly, or is it just a matter of duplicate content and links on less than trustworthy sites? In either event, Google seems to prefer a tempered approach to link building.

Select Splat

MODERATOR NOTE: This comment has been edited to comply with blog standards and style.