Hakia is a true semantic search engine
Posted by admin on December 10, 2006 in Search Engine Optimization
Digital Ghost — one of the big-time old-timers of search engine optimization — has taken up blogging again. A couple of days ago he posted about Hakia, a new semantic search engine. Oh yes, I said it: “semantic search engine”.
But before you get out your crusty old “Latent Semantic Indexing” paintball guns and start nailing me, these guys are not attempting to use LSI. Or rather, they say they are trying to simplify the process by taking a shortcut. Think of warping time and space to cut a path across the galaxy so that your trip takes only a few hundred years rather than tens of thousands of years.
That’s the concept they are pursuing.
I posted my first technical analysis of Hakia at Spider-food and then thought, “I should probably talk about this on SEO Theory.”
Sorry. Old habits die hard.
But let me take a different approach with this post so I don’t exactly replicate my efforts.
Let’s assume that Hakia is presently only concerned with semantically indexing the English-language content of the World Wide Web. The English language uses about 2 million words, of which most are very obscure jargon terms used almost exclusively in industrial or scientific discussions. Call these segments of English “professional semi-dialects”. A true dialect may utilize a slightly different grammar from a mother language. Eventually, when enough word sounds, meanings, and grammatical transformations have occurred, a dialect becomes a new language.
A semi-dialect is never going to leave the mother language.
But when you discard (at least temporarily) all the semi-dialectal words in English, you’re left with a much smaller subset of words. It’s still quite large and probably constitutes several hundred thousand words, although you and I do not know most of them. My personal vocabulary may extend to somewhere between 50,000 and 100,000 words at most. The majority of Americans get by with a vocabulary of 20,000 to 50,000 words. I would guess it’s similar for most other English-speaking peoples.
So a semantic search engine doesn’t have to address every word in the English language, and in fact it may be able to get by with a very small core set of words. Let’s say that we want to create our own semantic search engine. We can probably index most mainstream (non-professional) documents with 20-30,000 words. There will be a few gaps here and there, but we can tolerate those gaps.
But still, the task of computing relevance for millions, perhaps billions of documents that use 30,000 words is horrendously monumental. If we’re going to base our relevance scoring on semantic analysis, we need to reduce the word-set as much as possible.
Fortunately, in English we employ many synonyms and metaphors with similar meanings. Unfortunately, we also reuse many words and metaphors in many ways. So reducing word-sets is not as easy as it sounds. You cannot simply use straightforward substitutions. Take the word “dog”, for example. It can describe an animal, be an insult, a compliment, describe a toy, a cartoon, etc. It can also be used as a noun, an adjective (to modify a noun), a verb, and as an adverb (to modify a verb or adjective).
And “dog” is a pretty common, simple, easily understood word.
Semantic analysis theory holds that you can reduce the content of all documents in a collection to a matrix of relationships. Words become less important in themselves, so that you can actually perform meaningful substitutions. But the resources needed to build the matrix, transform documents into references to the matrix, and process queries against the matrix, even for 30,000 words, probably exceed by an order of magnitude the resources that any search engine today can bring to bear.
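To give a rough sense of what that matrix looks like, here is a minimal Python sketch of a term-by-document count matrix, the usual starting point for this kind of semantic analysis. The function name build_term_document_matrix, the three toy documents, and the one-billion-document figure are my own assumptions for illustration, not anything Hakia or any other search engine has published.

```python
from collections import Counter

def build_term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    # One row per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Toy collection (invented for this example).
docs = [
    "the dog chased the cat",
    "the cat ignored the dog",
    "stock prices fell",
]
vocab, matrix = build_term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:8} {row}")

# Scale check: number of cells for 30,000 words across an assumed
# one billion documents, before any analysis is attempted.
cells = 30_000 * 1_000_000_000
print(cells)
```

Even this toy matrix is mostly zeros. At the scale in the last two lines you are looking at 30 trillion cells, which is why reducing the word-set matters so much.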
And Google may be using as many as 1,000,000 computers as I write this to serve their search results around the world.
What I believe the founders of Hakia have done is borrow the concept of Lambda Calculus from compiler theory to speed the process of reducing elements on pages to their conceptual foundations. That is, if we assume everyone writes like me, then most documents can be reduced to a much smaller subset of place-holders that accurately convey the meaning of all the words we use.
Computer programming languages that are compiled from one “high level” language to a “low level” (machine) language are passed through such a reduction process. But the more complex the computer language is, the longer it takes to compile a sophisticated program. Various technologies have been developed through the years to make the process of compiling programs easier. Intermediate languages (often called “P-languages” after P-code, which was developed for Pascal in the early 1970s) have been used a great deal. So has object-oriented programming, where tasks and data sets are divided into very small “atomic” components that stand by themselves. Both Java and JavaScript incorporate object-oriented programming design.
But computer programming languages at most have only a few hundred “words”, maybe a couple thousand at the most. So scaling up compiler technology to analyze Web documents runs into the same limitations as scaling up semantic analysis algorithms. The whole process becomes extremely complex.
Computer scientists therefore set themselves the task of reducing the time it takes to reduce a high level command set’s meaning to a set of intermediate level components. The P-languages I referred to above can act as universal transformers. More than one programming language can be reduced to a P-language, and the P-language therefore simplifies the task of translating many programming languages into machine code.
Computer science turned to mathematics for help (as often happens) and implemented the concept of a Lambda Calculus which uses a three-step process to reduce instruction sets to a core group of meaningful components that are grammatically and syntactically independent of all programming languages. The hardest part of this process is called BETA reduction. But research and practice have helped computer scientists figure out a clean, quick way to perform BETA reductions on large command sets.
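For readers who have never seen one, here is a minimal Python sketch of textbook BETA reduction on tiny lambda terms. It is only meant to show what the operation does: applying a function by substituting its argument into its body. The tuple encoding and the function names (free_vars, substitute, beta_step, normalize) are my own choices for illustration. This is the classroom version, not Hakia's algorithm and not the optimized techniques compilers use.

```python
import itertools

# Terms are tuples: ("var", name) | ("lam", param, body) | ("app", func, arg)

def free_vars(term):
    """Return the set of variable names that occur free in term."""
    kind = term[0]
    if kind == "var":
        return {term[1]}
    if kind == "lam":
        return free_vars(term[2]) - {term[1]}
    return free_vars(term[1]) | free_vars(term[2])

def substitute(term, name, value):
    """Replace free occurrences of name in term with value, avoiding capture."""
    kind = term[0]
    if kind == "var":
        return value if term[1] == name else term
    if kind == "app":
        return ("app", substitute(term[1], name, value),
                substitute(term[2], name, value))
    param, body = term[1], term[2]
    if param == name:
        return term  # name is shadowed inside this lambda
    if param in free_vars(value):
        # Rename the bound variable first so value's free variable is not captured.
        used = free_vars(value) | free_vars(body) | {name}
        new_param = next(f"{param}{i}" for i in itertools.count()
                         if f"{param}{i}" not in used)
        body = substitute(body, param, ("var", new_param))
        param = new_param
    return ("lam", param, substitute(body, name, value))

def beta_step(term):
    """Do one leftmost-outermost reduction; return None if none applies."""
    kind = term[0]
    if kind == "app":
        func, arg = term[1], term[2]
        if func[0] == "lam":
            return substitute(func[2], func[1], arg)  # the BETA step itself
        reduced = beta_step(func)
        if reduced is not None:
            return ("app", reduced, arg)
        reduced = beta_step(arg)
        return None if reduced is None else ("app", func, reduced)
    if kind == "lam":
        reduced = beta_step(term[2])
        return None if reduced is None else ("lam", term[1], reduced)
    return None

def normalize(term, max_steps=1000):
    """Reduce repeatedly until no step applies or the step limit is hit."""
    for _ in range(max_steps):
        reduced = beta_step(term)
        if reduced is None:
            return term
        term = reduced
    raise RuntimeError("no normal form found within the step limit")

# (λx. λy. x) a b  reduces to  a
K = ("lam", "x", ("lam", "y", ("var", "x")))
example = ("app", ("app", K, ("var", "a")), ("var", "b"))
result = normalize(example)
print(result)
```

The step limit in normalize is there because some terms never finish reducing. That is one reason a fast, well-behaved reduction strategy matters once the inputs get large.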
And that is where the benefit comes in for full semantic analysis. With an efficient BETA reduction algorithm, you can expand the size of your word sets. Hence, it is now possible to quickly transform documents composed in a 30,000 word vocabulary into P-documents that preserve the meaning of the original documents.
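To make the idea of a P-document concrete, here is a deliberately naive Python sketch that swaps words for concept place-holders. Everything in it is hypothetical: the CONCEPTS table, the C_ identifiers, the stopword list, and the function name to_p_document are invented for illustration, and a plain lookup table is standing in for whatever reduction Hakia actually performs.

```python
# Hypothetical concept table: every word-to-ID pairing here is invented.
CONCEPTS = {
    "dog": "C_CANINE", "hound": "C_CANINE", "puppy": "C_CANINE",
    "car": "C_VEHICLE", "automobile": "C_VEHICLE",
    "buy": "C_BUY", "bought": "C_BUY", "purchase": "C_BUY",
}
# Small, assumed list of words that carry little meaning on their own.
STOPWORDS = {"a", "an", "the", "my", "i"}

def to_p_document(text):
    """Reduce text to a list of concept place-holders (unknown words pass through)."""
    tokens = [word.strip(".,!?").lower() for word in text.split()]
    return [CONCEPTS.get(t, t) for t in tokens if t and t not in STOPWORDS]

first = to_p_document("I bought a puppy.")
second = to_p_document("I purchase the hound!")
print(first)
print(second)
print(first == second)
```

Two sentences with no content words in common reduce to the same place-holders, which is the whole point. Note that a flat lookup like this ignores the “dog” problem described earlier, since it maps every use of a word to a single concept regardless of context.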
If I understand what Hakia is doing, these P-documents are indexed and queries are processed against them. That means their relevance scoring should be better than Ask’s (much less Google’s, Yahoo!’s, and Windows Live’s). They can ignore many of the window-dressing elements that the other search engines look at (HTML page structure components, bolding, italics, etc.) or they can at least give less weight to such artificial indicators of “importance”.
The technique is not fool-proof by any means. But it allows for a certain freedom in evaluating relationships between relevant documents. That is, they’ll know better which documents are relevant to a query and then they can use other factors to decide which documents are the most important in the chosen results.
But wait. Don’t get out your PageRank Toolbars just yet. Remember that I said they can reduce our 30,000 words to a P-language. There is no reason to stop with individual words. They can also reduce metaphors to logical P-components. And metaphors include hyperlinks and URL references.
Say good-bye to backlink counts and anchor text, folks. They just became irrelevant. Documents that simply refer to each other by URL, title, general reference, or through hyperlinks can all be evaluated on the basis of P-citations. Traditional search engines are definitely looking at hyperlinks and may be looking at URL references but they don’t go further than that.
Documents can still pass value by reference in a semantic index, but the mechanics of reference work differently. You have more options, so less sophisticated writers who don’t embed links in their text can have just as much impact on another document’s importance as professional search optimization copywriters. Paid links may not become a thing of the past very quickly, but you can bet your blog posts that buying references is going to be more sophisticated if this technology takes off.
That is what is so exciting about Hakia. They haven’t just figured out a way to produce a truly semantic search engine. They have just cut through a lot of the garbage (at a theoretical level) that permeates the Web. Google AdSense arbitragers who rely on scraping other documents to create content will eventually find their cash cows drying up. The semantic index will tell Hakia where the original content came from more often than not.
They still face many technological challenges. They’ll have to deal with issues such as URL canonicalization, privacy, session IDs, virtual content, dynamic content, building out their network of data centers, etc. And as search engine optimizers stumble across Hakia more and more often they’ll begin experimenting and poking around, looking for weaknesses in the algorithm to exploit. I am sure there will be possibilities for exploitation, but for now Hakia is ahead of the game.
We’ve got a new kid in town. I think it’s only a matter of time before the Big Four begin implementing similar technology. They may already be experimenting with it.
Merry Christmas, SEO community. There is something new in the stocking this year!