There are thousands of academic papers dealing with topics such as Web data mining, spam filtering, and improving the Web indexing and search processes in general. Many of these papers are clearly written in an isolated environment that fosters myth-making which in turn diminishes the potential value of the research being published. Keep in mind that many of these papers are peer-reviewed and published in professional journals.
I’m going to look at two examples, neither of which was published by a team of native English speakers (so please bear in mind that the non-American idiom used by these papers is not a subject for ridicule or criticism).
The first paper I want to examine is titled “Web mining techniques for automatic discovery of medical knowledge”. Published by two Spanish researchers (David Sánchez and Antonio Moreno), the paper proposed “an automatic and autonomous methodology to discover taxonomies of terms from the Web and represent retrieved web documents into a meaningful organization. Moreover, a new method for lexicalizations and synonyms discovery is also introduced.”
In brief, the researchers used the Yahoo! Web directory, AllTheWeb, and the Clusty meta search engine to test their thesis, that classes of documents can be identified by examining the text which surrounds a query’s keywords in the document. That is, they ran a query, examined the returned documents, and extracted associated expressions from text immediately preceding and following the keywords used in the documents.
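To make the idea of "text immediately preceding and following the keywords" concrete, here is a minimal sketch in Python. It is not the researchers' implementation; the function name `context_windows`, the `width` parameter, and the simple tokenizer are all assumptions made for illustration.

```python
import re

def context_windows(text, keyword, width=3):
    """Return (before, after) word tuples around each occurrence of keyword.

    Hypothetical helper: the tokenizer is deliberately crude and the
    window ignores sentence boundaries.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = keyword.lower()
    windows = []
    for i, word in enumerate(words):
        if word == target:
            before = tuple(words[max(0, i - width):i])
            after = tuple(words[i + 1:i + 1 + width])
            windows.append((before, after))
    return windows

sample = "Chronic kidney disease is common. Heart disease treatment varies."
for before, after in context_windows(sample, "disease", width=2):
    print(before, "| disease |", after)
```

Counting which neighboring words recur across many retrieved documents is one plausible way to surface candidate terms such as "kidney disease" or "heart disease". It also shows the weakness discussed next: a document that never uses the keyword contributes nothing.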
Using this data, the researchers identified semantic relationships to help them improve the quality of document selection (specifically with a medical research applicability). The idea on the surface seems very clever, and I have no doubt there are other research projects that have investigated this technique (in fact, I know of at least one other technical paper that puts forth a similar technique for semantic analysis).
However, the researchers’ work would not function well with Google’s search results where link-happy SEOs have been contracted to promote medical content in search results. The process is self-filtering, and valuable documents that lack the actual keywords in on-page text — or body copy that has been “optimized for search” — are at high risk of being ignored by this methodology.
When dealing with substantive information archives (such as published research, technical data, and other large bodies of useful information) search engine optimizers need to look beyond the consumer and retail marketplace. What if you are hired by a research clinic to make their work more accessible to the medical community? How many medical Web search tools have you actually optimized for? What do you know about the methodologies such tools may employ to extract very specialized information from the Web?
The proposed semantic analysis methodology will not work with mainstream search engines as they exist today because all four of the leading search engines (Ask, Google, Live, and Yahoo!) rely in part upon external factors for determining relevance and ordering search results. There are only a handful of commercial search engines that actually focus on on-page factors, and most SEOs have little to no experience in optimizing for those search engines.
Your basic link-building skills are not very helpful when it comes to optimizing for search expectations that are built upon research that has no concept of link-based indexing and ranking.
Another interesting paper was published in 2006, so its ideas are a bit dated by today’s standards (we are now dealing with Universal Search across all major search engines). “Tracking Web Spam with Hidden Style Similarity” was written by a team of French researchers (Tanguy Urvoy, Thomas Lavergne, and Pascal Filoche) who studied autogenerated pages in considerable volume.
Their challenge involved separating the wheat from the chaff; that is, they wanted to recognize legitimately unique and informative documents that share page design templates and boilerplate text (such as is found in Content Management System-based Web sites like blogs and news sites) and at the same time identify spam sites for filtering.
Unfortunately, the researchers are dealing with a faulty set of assumptions, including this gem: “one well known strategy to mislead search engines ranking algorithms consists of generating a maze of fake web pages called link farm.”
Being one of the people involved in the process of developing link farms, I am suitably qualified to say that they don’t know what a link farm is. Link farms are groups of Web sites where every member site links to all other member sites. There is nothing particularly autogenerated about a link farm — that is, a link farm is a link farm regardless of whether any or all of the member sites use some sort of autogenerated content.
It’s true that link farm management systems have produced some autogenerated templates, but if you want to figure out where the link farms are, you will miss the majority of them by looking only at which sites have autogenerated pages.
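The definition of a link farm given above (every member site links to all other member sites) says nothing about how the pages were generated, only about the link structure. A minimal sketch of that check in Python follows; the function name `is_link_farm` and the dictionary-of-sets representation of outbound links are assumptions for illustration.

```python
def is_link_farm(sites, links):
    """True if every site in `sites` links to every other site in `sites`.

    `links` is a hypothetical mapping of site name -> set of sites it
    links out to. Page content is never inspected.
    """
    members = set(sites)
    return all(
        members - {site} <= links.get(site, set())
        for site in members
    )

links = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"a.example", "c.example"},
    "c.example": {"a.example", "b.example", "other.example"},
}
print(is_link_farm(["a.example", "b.example", "c.example"], links))
print(is_link_farm(["a.example", "b.example", "other.example"], links))
```

The test is purely structural, which is the point: a filter that only looks for autogenerated pages never asks this question.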
The authors of the paper quickly stumble into another myth: “the only way to effectively create a very large number of spam pages is to generate them automatically”. Actually, that is far from true. Web sites have been producing large numbers of spam pages by utilizing user-generated content for years, long before Web 2.0 and Social Media became big buzz expressions. The earliest UGC spam sites I ever encountered (and I don’t know if they were the first) were term paper archives created by college students in the 1990s.
Web spam has been around almost as long as the Web, and the old methods of creating content still work as well as (sometimes better than) the new-fangled RSS-feed generated garbage. It’s much more difficult for a search engine to identify 1,000 high school and college papers as Web spam (particularly if they have not been replicated across Web sites).
The French team nonetheless address a topic most SEOs are unfamiliar with: Stylometry. Many of us have probably seen Stylometrists in action at one time or another in our lives. If you’ve ever determined for yourself or your friends that someone had “faked” a note or message, you were using crude stylometry. Stylometry is the study of linguistic or similar styling to determine authenticity or authorship.
There are search engines crawling the Web today that utilize stylometry in various ways; it’s not all about Web spam. Nonetheless, we can be reasonably certain that all the major search engines employ stylometry to some degree in their spam detection and filtering processes. It’s impossible to design an algorithm for that purpose without resorting to stylometry.
We actually saw how stylometric spam filters can hurt innocent Web sites in late 2005 when Google introduced an aggressive anti-spam filter that focused on faux directories and similar Web sites. I examined several dozen “innocent” (non-spam) sites that had been de-indexed along with all the so-called “SEO friendly” directories Google targeted and found they bore significant structural and stylistic similarity to the known spam sites.
The search engines probably use stylometric filters to find and delist blogs that rely upon random RSS feeds. The reason why you can see Markov-chain blogs vanish so quickly is that they create a very clear and obvious “fingerprint” that is algorithmically identifiable. The French team proposed a method based on extracting the “interesting” content from the boilerplate content, then compartmentalizing that content for a process called minsampling. They hashed their data.
Hashing is a very useful process. You can hash anything. Computer science borrowed the method from cryptography (and basic Algebra) where you substitute one thing for another. That is, a hash is a metaphor that represents a distinct piece of information. Hashing has its drawbacks, however, because you can produce collisions with virtually every hashing algorithm. A collision occurs when two otherwise distinct pieces of information convert to the same hash value. Hash tables have to allow for these collisions by chaining the references for distinct information together. Long hash chains make lookups inefficient.
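A small sketch in Python shows what a collision and a chain look like. The class name `ChainedTable` and its deliberately weak hash function (a sum of character codes) are assumptions chosen so that collisions are easy to produce; real hash tables use much better functions.

```python
class ChainedTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, buckets=4):
        self.buckets = [[] for _ in range(buckets)]

    def _index(self, key):
        # Weak on purpose: any two anagrams land in the same bucket.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))  # new key joins the chain

    def get(self, key):
        # Lookup walks the chain, so long chains mean slow lookups.
        for existing, value in self.buckets[self._index(key)]:
            if existing == key:
                return value
        raise KeyError(key)

table = ChainedTable()
table.put("ab", "first")
table.put("ba", "second")  # collides with "ab" under the toy hash
print(table.get("ab"), table.get("ba"))
```

Both keys share one bucket, yet each is still retrievable because the chain stores the original key alongside the value.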
However, what is detrimental for information storage and retrieval is actually useful for semantic analysis. If you analyze two apparently distinct texts with a semantic hashing algorithm and produce many collisions you probably have two very similar documents. One may be a slightly rearranged version of the other document. A fair amount of Web spam is produced by taking paragraphs and substituting words in them, switching sentences, and otherwise trying to mask the duplication without sacrificing basic human readability.
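One common way to turn "many matching hashes" into a similarity score is min-hashing over word shingles. The sketch below is a generic version of that idea in Python, not the French team's procedure. The function names, the three-word shingle size, and the use of SHA-1 with a numeric seed prefix are all assumptions for illustration.

```python
import hashlib

def shingles(text, size=3):
    """Set of overlapping word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size])
            for i in range(max(1, len(words) - size + 1))}

def min_signature(text, num_hashes=64, size=3):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    grams = shingles(text, size)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest(), "big")
            for g in grams
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree (approximates shingle overlap)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

original = "the quick brown fox jumps over the lazy dog near the river bank"
reworded = "the quick brown fox leaps over the lazy dog near the river bank"
unrelated = "search engines rank pages using many external signals"

base = min_signature(original)
print(estimated_similarity(base, min_signature(reworded)))
print(estimated_similarity(base, min_signature(unrelated)))
```

Two texts agree at a signature position only when they share the shingle that produced the minimum, so a word-substituted copy tends to score well above an unrelated text. The same property explains the false positives: legitimate pages that share a lot of phrasing will also score high.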
Markov-chain spam tries to approach the problem from the other direction. Instead of replacing words in replicated text, Markov-chain spam tries to generate human-readable text based on statistical relationships. It’s not a very efficient method but I am sure some Markov-chain software works better than most. I have seen improvements in the technology over the past couple of years, and I am sure the anti-spam gurus have noticed those improvements, too.
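For readers who have not seen one, a first-order Markov-chain text generator can be sketched in a few lines of Python. This is a minimal illustration of the statistical idea, not any particular spam tool; the function names and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=10, seed=0):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: the word never had a successor in the source
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

source = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(source)
print(generate(chain, "the", length=8))
```

Every adjacent word pair in the output already exists in the source text. That is why the output looks locally plausible, and it is also the kind of regularity that makes such pages fingerprintable.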
I cannot comment on how effective the French technique is in French-language spam detection, as I am not familiar with the French-language Web. However, the fingerprinting technique they describe is similar to several other techniques that have been proposed for English-Web spam detection. Such techniques have been challenged by alternative research for producing many false positives.
Search optimizers (and Web copywriters) leave their fingerprints all over the Web. It’s virtually impossible to NOT leave a fingerprint. In general, our fingerprints are fairly harmless. That is, just because we follow certain patterns doesn’t mean our Web sites (or our clients’ sites) will start tripping spam filters. However, from time to time relevance algorithm adjustments may produce sweeping changes in search results (especially site search results) based on fingerprints.
One cannot over-emphasize the need for unique, distinctive Web copy, page structure, taxonomy, and meta data (including page titles). People use site search regardless of whether a Web site provides it. Every major and many secondary search engines make it easy to search Web sites, and when searchers are confronted with more than a couple dozen pages they either employ site search or move on to a different query in order to find something that satisfies their internal search criteria.
Simple site-search tools are easily deceived by similar fingerprints. Just search any major news Web site that provides its own search function. You’ll also find that Google, Live, and Yahoo!’s site search functions are easily confused by similar content — especially content that shares page titles and meta descriptions.
A reasonable test for any Web site’s fingerprint confusion is to see if you can quickly identify appropriate content for a representative selection of its pages through site search without having to sift through false-positive duplicates. In other words, if you can use site search to get to exactly what you want on a large site in 5-10 low-traffic queries, the site probably has distinct document fingerprints. If, however, most of the test queries produce inappropriate or duplicate-cluttered results, the copy is not very organized.
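A rough automated complement to the manual test above is to look for pages that share the same title and meta description, since those are the ones simple site-search tools are most likely to confuse. The sketch below assumes you have already collected a `(url, title, description)` tuple for each page; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def duplicate_fingerprints(pages):
    """Group URLs that share the same normalized (title, description) pair.

    `pages` is an iterable of (url, title, description) tuples. Only
    groups with more than one URL are returned.
    """
    groups = defaultdict(list)
    for url, title, description in pages:
        key = (" ".join(title.lower().split()),
               " ".join(description.lower().split()))
        groups[key].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

pages = [
    ("/news/1", "Daily News", "Latest headlines"),
    ("/news/2", "Daily  news", "Latest headlines"),
    ("/about", "About Us", "Who we are"),
]
print(duplicate_fingerprints(pages))
```

This only catches exact matches after normalizing case and whitespace. It is a starting point for finding clutter, not a substitute for actually running the site-search queries.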
Search engine optimization should address the issue of making copy unique, distinct, and correctly identifiable. It may be as simple as getting enough PageRank for full indexing in Google but more often the task goes beyond link building. You need to understand that secondary search tools are looking at Web pages in relatively simplistic ways that don’t have anything to do with links.
The search theorists don’t always know what they are talking about, so you need to make sure their tools understand what you are talking about.