Gaming semantic search

by Michael Martinez on April 11, 2008

The impetus to adopt semantic markup standards in Web page content has faded, an inevitable consequence (I feel) of the clash between the proposal for structure and the reality of the chaos that is today’s World Wide Web. Although semantic Web advocates remain passionate about their cause, they have not yet persuaded a significant minority of Web site operators to adopt semantic standards.

In March Yahoo! proposed that search take the lead in supporting the Semantic Web. They almost got it right. Web search in general cannot compel people to adopt an awkward structure (anything new is awkward to people who have become comfortable with their ratty old shoes). The whole point behind Semantic Markup is to make the discovery experience as smooth and seamless as possible for searchers.

Where we need a breakthrough in semantic standards is site search rather than Web search. The fact that large ecommerce sites struggle with managing huge inventories suggests there is an opportunity for a large-scale site search tool that can leverage Semantic Web markup into its data management structures. Think of a general-purpose indexing application that specializes in identifying and cataloguing semantically marked up content.

Instead of putting a general purpose search tool on your 1,000,000-page retail site, you buy an off-the-shelf semantic search tool and use your database-driven publishing system to mark up your inventory listings with only the vital information that consumers actually need. Instead of searching random user-generated content like forum and blog discussions or product and vendor reviews, people will actually be able to search for products and vendors.
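To make the idea concrete, here is a minimal sketch of searching listings by their marked-up fields rather than by free text. The `Listing` record, its field names, and the `fielded_search` helper are all hypothetical, chosen only for illustration; a real tool would read these fields out of whatever markup the publishing system emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """One inventory record carrying only the fields a shopper searches on (hypothetical schema)."""
    sku: str
    name: str
    brand: str
    category: str
    price: float


def fielded_search(listings, **criteria):
    """Return listings whose fields equal every criterion (strings compared case-insensitively)."""
    def matches(item):
        for field, wanted in criteria.items():
            actual = getattr(item, field)
            if isinstance(actual, str):
                if actual.lower() != str(wanted).lower():
                    return False
            elif actual != wanted:
                return False
        return True

    return [item for item in listings if matches(item)]


# Invented sample data.
inventory = [
    Listing("A1", "Trail Runner", "Acme", "shoes", 79.0),
    Listing("A2", "City Loafer", "Acme", "shoes", 120.0),
    Listing("B1", "Rain Shell", "Borealis", "jackets", 150.0),
]

results = fielded_search(inventory, brand="acme", category="shoes")
print([item.sku for item in results])
```

The point of the sketch is that reviews and forum chatter never enter the candidate set, because only structured product records are being matched.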

Replacing general purpose search tools with semantic search tools doesn’t mean that people will lose the ability to find that random content. Rather, the general purpose tools can reside side-by-side with the semantic tools, or the semantic tools can be used to kick in general purpose tools. That is, reviews and discussions could be semantically tagged so that a semantic search tool knows it’s not conducting a database search but rather a general purpose search.
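One way to picture the side-by-side arrangement is a small dispatcher that looks at a content-type tag and hands the query to either a database lookup or a general purpose full-text search. This is only a sketch: the tag values, `STRUCTURED_TYPES`, and the two backend functions are assumptions made up for the example.

```python
# Content types assumed to live in structured inventory data (hypothetical tag values).
STRUCTURED_TYPES = {"product", "vendor"}


def choose_backend(content_type):
    """Pick a backend name from the semantic tag on the content being searched."""
    return "database" if content_type in STRUCTURED_TYPES else "fulltext"


def search(query, content_type, database_search, fulltext_search):
    """Route the query to whichever backend the content-type tag calls for."""
    if choose_backend(content_type) == "database":
        return database_search(query)
    return fulltext_search(query)


# Stand-in backends that only report which path handled the query.
def fake_database_search(query):
    return f"database lookup for {query!r}"


def fake_fulltext_search(query):
    return f"full-text scan for {query!r}"


print(search("rain shell", "product", fake_database_search, fake_fulltext_search))
print(search("rain shell", "review", fake_database_search, fake_fulltext_search))
```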

By empowering users to zoom directly into the product information they want (bypassing or reducing the endless scrolling through irrelevant site search results), retailers will drastically improve the consumer experience, making their sites more useful and valuable. Conversion rates should improve, so the economic incentive to incorporate semantic markup and search into on-site tools is already in place.

As large retail and wholesale inventory sites adopt semantic search the general purpose search industry can begin to incorporate semantic functionality into its own technologies at a gradual rate. The incentive to accelerate development of semantic search technology arises from the major search engines’ desires to remain at the top of their game, but an ecommerce semantic search tool would have an opportunity to create a brand value over a couple of years — probably enough that someone would buy it just to own both the brand and the technology.

In the meantime, forum software vendors can also look at semantic search from their own site search perspectives. Forum search tools generally suck and I often find myself resorting to Web search tools in order to find old forum discussions (a task that is complicated by the fact that none of the major search engines attempts to index every word in every forum discussion). What we need from the forum software industry is a combination of semantic and heuristic search technologies that have not yet been developed for commercial use.

Semantic markup has limited applicability for forum discussions because the content tends to be very random but it can be used to identify users, forum topics, thread topics, and other structural aspects of discussions. If forum software allows thread-starters to create tags (and posters to add tags), moderators would actually benefit from improved thread management opportunities (assuming the thread management tools took advantage of the semantic markup).

The heuristic technologies are needed just to improve forum search tools’ abilities to scan discussions for keywords. MySQL is a pretty rotten tool to use for text-based queries but virtually every forum product now relies on some sort of SQL back end. A heuristic tool would at least be able to scan the tags and other markup elements and make some guesses at which threads are most likely to contain what the user is searching for.
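A rough sketch of that kind of guesswork: score each thread by how many query words show up in its tags, title, and forum name, weighting tags most heavily. The weights and the thread dictionary layout are assumptions invented for this example, not anything a particular forum product provides.

```python
# Hypothetical weights: a tag match counts more than a title match, and so on.
FIELD_WEIGHTS = {"tags": 3.0, "title": 2.0, "forum": 1.0}


def words(value):
    """Lower-case word set from either a string or a list of strings."""
    if isinstance(value, str):
        return set(value.lower().split())
    return {item.lower() for item in value}


def score_thread(query_terms, thread):
    """Weighted count of query terms found in each structural field."""
    return sum(
        weight * len(words(thread[field]) & query_terms)
        for field, weight in FIELD_WEIGHTS.items()
    )


def rank_threads(query, threads):
    """Return ids of threads with a non-zero score, best guess first."""
    terms = words(query)
    scored = [(score_thread(terms, thread), thread["id"]) for thread in threads]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [thread_id for score, thread_id in ranked if score > 0]


# Invented sample threads.
threads = [
    {"id": 1, "forum": "Hardware", "title": "Best budget graphics card", "tags": ["gpu", "budget"]},
    {"id": 2, "forum": "Software", "title": "Driver crash after update", "tags": ["gpu", "driver"]},
    {"id": 3, "forum": "Off topic", "title": "Weekend plans", "tags": ["chat"]},
]

print(rank_threads("gpu driver", threads))
```

Nothing here reads the post bodies at all, which is the appeal: the expensive text query can be skipped or deferred until the tag-based guess fails.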

This would require the addition of a dynamic dictionary tool to the forum search application because the heuristic tool would have to learn how people associate words and expressions. Every forum develops its own idiom and an idiomatic heuristic algorithm would improve forum search applications considerably. You won’t always find what you’re looking for with the “best guess based on tags and semantic markup” approach but you’ll spend less time looking at 5 pages of forum thread listings.
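As a very simple stand-in for such a dynamic dictionary, the sketch below counts which words appear together in posts and uses those counts to expand a query with the forum's own associated terms. The `ForumDictionary` class and the `min_count` threshold are hypothetical, and a real tool would need stop-word handling and far more data than this toy version assumes.

```python
from collections import Counter, defaultdict
from itertools import combinations


class ForumDictionary:
    """Learns word associations from posts by counting co-occurrence (toy model)."""

    def __init__(self):
        self.cooccurrence = defaultdict(Counter)

    def learn(self, post_text):
        """Record every pair of distinct words that appear in the same post."""
        unique_words = sorted(set(post_text.lower().split()))
        for first, second in combinations(unique_words, 2):
            self.cooccurrence[first][second] += 1
            self.cooccurrence[second][first] += 1

    def related(self, word, min_count=2):
        """Words seen alongside `word` in at least `min_count` posts."""
        counts = self.cooccurrence.get(word.lower(), Counter())
        return sorted(other for other, n in counts.items() if n >= min_count)

    def expand(self, query, min_count=2):
        """Original query terms followed by their learned associations."""
        terms = query.lower().split()
        expanded = list(terms)
        for term in terms:
            for other in self.related(term, min_count):
                if other not in expanded:
                    expanded.append(other)
        return expanded


# Invented sample posts.
dictionary = ForumDictionary()
for post in ["nerf warlock", "warlock nerf again", "warlock guide"]:
    dictionary.learn(post)

print(dictionary.expand("nerf"))
```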

Semantic search does not have to be driven by markup language. In fact, a general purpose semantic search engine MUST rely upon heuristic algorithms and semantic analysis (I did NOT say “latent semantic indexing”). The Web speaks many languages, and within those languages there are many dialects, and within those dialects there are many slangs. We can approach the problem by slicing the Web into quadrants where specialized tools make more sense than general purpose tools.

Later, when we have the specialized technologies, we’ll be able to develop more generalized infrastructures to compose truly semantic search. The process will work best when driven from the grass roots level because it’s easier for one retailer or one forum software vendor to create a semantic infrastructure than it is to persuade everyone to do this at the same time (or even in some sort of vague staggered progression).

The risk we take as a community is that we’ll be presented with a classic technology competition (à la Betamax versus VHS or HD DVD versus Blu-ray) that will splinter Web markup standards for a few years. Eventually the technology will have to solve the problem by implementing compatibility modes. Unless we evolve into real-time search (which today’s technology is completely incapable of delivering), using offline or batch processing to convert incompatible markup standards into a common format should not disrupt the adoption of semantic tools and standards.

In the meantime, if we drive the adoption of semantic search from the Web site operator side we’ll be able to close the door (for a while) on semantic search manipulation. That is, the best semantic search tools will be those under the control of Web site operators, and Web spam teams will have neither the incentive nor the opportunity to develop semantic search spam until general purpose search tools incorporate semantic technologies into their products.

Google should literally be the last resource on the planet to adopt semantic search, because when they offer it we’ll be thrown back to the 1990s era of meta tag spamming. All you have to do to fool a semantic search engine is place false and misleading information in your markup language (or, rather, inside your markers). Link analysis won’t defeat semantic spam. Instead, the search indexes will have to look at agreement and congruence between what the markup claims and what the page actually says (which I believe they already do to a limited extent; certainly Ask is doing this).
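A crude illustration of a congruence check, under the assumption that it means comparing what the markup claims with what the visible page says: measure the fraction of claimed terms that actually appear in the page text. The `congruence` function and the sample data are hypothetical, and this is not a description of how any search engine does it.

```python
import re


def tokens(text):
    """Lower-case alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def congruence(markup_claims, visible_text):
    """Fraction of words claimed in the markup that also appear in the visible text."""
    claimed = set()
    for value in markup_claims.values():
        claimed |= tokens(str(value))
    if not claimed:
        return 1.0  # nothing claimed, so nothing to contradict
    return len(claimed & tokens(visible_text)) / len(claimed)


# Invented examples: one page whose markup agrees with its text, one that does not.
honest = congruence(
    {"name": "Trail Runner", "category": "shoes"},
    "The Trail Runner is a lightweight pair of shoes.",
)
spam = congruence(
    {"name": "cheap flights", "category": "travel"},
    "Buy pills online now!",
)
print(honest, spam)
```

A low score does not prove deception, but it flags pages whose markers describe something the reader will never see.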

Game theory tells us that we can create a simple model consisting of three players: the searcher, the indexer, and the provider. If the provider lies to the indexer can he still satisfy the searcher? Think about that carefully because this game has already been played out billions of times through the years. I’m referring to cloaking, or IP-based delivery. If you show the search engine an optimized page but show an unoptimized page to the searcher, can you satisfy the searcher?

Any search engineer would be quick to say, “Yes, as long as what you show to both the indexer and the searcher is conceptually the same thing”. That is, given two different content structures, both structures must be conceptually equal in order to satisfy the searcher, who was presented with a result based on information the searcher never saw.

Providers who lie to the indexer in order to deceive the searcher have to be thrown out of the game; otherwise, untrustworthy providers will win the game every time, and the searcher and indexer will both lose (but how often trustworthy providers win over untrustworthy providers is a different issue).

In our game the goal is to achieve a WIN+WIN+WIN scenario, but it would be acceptable if we achieved a WIN+LOSE+WIN scenario in which the loser is the indexer. That is, if the searcher has to work harder to find the right content from the indexer then the indexer loses because the searcher has a poor experience. The searcher and the provider will eventually be paired up but they won’t want to rely on the indexer again.

In our game the trustworthy provider only wins if the searcher wins, but the untrustworthy provider wins even if the searcher doesn’t win. In this scenario there is too little incentive for people to play the game. That is, neither searchers nor indexers will want to see the untrustworthy providers again. You either get rid of the cheating providers or you stop playing the game.

In a game where the indexer loses you still can have the game continue because the searcher and provider eventually win.
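The outcomes described above can be written down as a tiny table. This sketch assumes just two yes/no inputs (whether the provider deceives, and whether the indexer makes the content easy to find) and lists results in the order searcher, indexer, provider; the function names are invented for the example.

```python
def play(provider_deceives, indexer_effective):
    """Outcome for each player under the simplified rules of the three-player game."""
    if provider_deceives:
        # The untrustworthy provider wins while the searcher and indexer both lose.
        return {"searcher": "LOSE", "indexer": "LOSE", "provider": "WIN"}
    if indexer_effective:
        return {"searcher": "WIN", "indexer": "WIN", "provider": "WIN"}
    # The searcher works harder but is eventually paired with the provider.
    return {"searcher": "WIN", "indexer": "LOSE", "provider": "WIN"}


def game_continues(outcome):
    """The game survives as long as the searcher and provider both end up winning."""
    return outcome["searcher"] == "WIN" and outcome["provider"] == "WIN"


for deceives in (False, True):
    for effective in (True, False):
        outcome = play(deceives, effective)
        label = "+".join(outcome[player] for player in ("searcher", "indexer", "provider"))
        print(deceives, effective, label, game_continues(outcome))
```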

This means that the success of semantic search is not dependent upon the search tool but rather upon the content provider. Hence, the search tool can be replaced (thus transforming a WIN+LOSE+WIN scenario into a WIN+WIN+WIN scenario).

The only problem is that the provider cannot control the index unless the provider becomes both the provider and the indexer. That is, either we all stop publishing Web sites and let Google publish them all or else we all have to provide searchers with alternative search tools. Site search is where the playing ground is level (or should be). You can use a major search engine for your site search function but you won’t give the searcher a very satisfying experience. However, you have the option of installing your own site search tool, thus making you both the provider and the indexer.

As we play out these little WIN+WIN+WIN games across the Web the major search engines will develop technologies that help them win so they can play the game with us. They’ve already done this kind of straggling adoption. The earliest search engines were very limited in what they could index about any given site but some large content sites were able to develop progressive site search tools. That process may never fully go away.

The key to success is to keep semantic search pure for as long as possible. If semantic spam pollutes the indexes before they are widely adopted, searchers won’t want to play the game, indexers won’t be able to win the game, and we’ll all just give up and go back to what worked before.

It’s time for a new game, then, one in which people’s incentives to game search engines become incentives to help search engines. That transformation will take a little more thought and discussion than I can provide here.

6 comments

wibbler 04.11.08 at 5:59 pm

“heuristic”

What is that exactly?

wibbler 04.12.08 at 4:15 pm

Hey MM – whats “heuristic” mean?

Im not daft (generally) – I know a load about Special and General Relativity which you probably dont know – and infact most people on earth dont know – but you know something here which I dont know – whats “heuristic”????

For instance MM – do you know anything about the properties of the orbit of Mercury around the Sun?

SEO Ranter 04.13.08 at 5:56 am

Are there any existing semantic engines out there? You’re right, from a user experience point of view, it’s a complete win. Looking for genuine product reviews is really painful, as is using any kind of shopping comparison portal. The data simply doesn’t fit into a rigid, tabulated format, which is how these sites currently accept their items; even their slightly more descriptive XML formats are optional, ignored by many merchants, and are still limited by the in-house creator of the format.

Michael Martinez 04.13.08 at 2:45 pm

Wibbler, a heuristic algorithm determines a probable solution to a problem based on a set of rules or analyses of patterns. Heuristics don’t deliver perfect solutions but they can deliver workable solutions.

I cannot say I know much about Mercury’s orbit.

SEO Ranter, there are a couple of search services that claim to be semantic. Hakia is probably the most well-known service. I’m not thrilled with their user interface, however.

Mark 04.17.08 at 4:03 am

Autonomy have a product called IDOL that would fit the bill. Basically it’s a database & indexing engine with a series of access tools and APIs that allow rich corporations and governments to mine out all sorts of data. I say rich cuz it ain’t a cheap proposition, nor is it easy to install & configure. I believe Endeca is the other player in this market.

A little while ago WebSideStory acquired the Atomz search engine, & among other things they put together a neat solution for eCommerce site search. It was in no way heuristic search, but it was a good point solution. Now that WSS got acquired by Omniture, who knows where this technology is now.

Michael Martinez 04.17.08 at 8:17 am

Not to take anything away from the current product, I tried Atomz years ago and found it to be very limited. Indexing all the content on a Web site really takes up a disproportionate amount of resources. Nonetheless, I think that site search technology development can and should play a significant role in the evolution of search.

I might actually have an opportunity to look at IDOL. I’ll have to check it out if that opportunity is there. Thanks for the tip.