How lexical analysis can combat Web spam

by Michael Martinez on February 13, 2009

Have you ever searched for an article you’ve read before and found it sitting on an unfamiliar site? You might ask yourself, “Is this where I read that piece before?”

That brief moment of doubt has often tipped me off to the fact I was looking at a scraper site. I have found some pretty sophisticated scrapers hanging around the Web, including more than a few that replicate articles from SEOmoz and SearchEngineLand on a regular basis. Some scrapers, however, mix and mingle text from a variety of sources and that may ultimately be their undoing.

Let me ’splain.

Science Daily carried an article about a genome comparison method that was validated by using it to identify the authors of free eBooks. The Feature Frequency Profile method proved to be surprisingly good at grouping together eBooks by the same or similar authors, and this despite the fact that all of the eBooks came from Project Gutenberg, which includes a lot of boilerplate text in its eBooks.

Recapping the experiment, Science Daily writes:

In a test of free online books obtained through Project Gutenberg, they found that this method, which they called the feature frequency profile (FFP) method, was more successful at identifying related books – books by the same author, books of the same genre, books from the same historical era – than word frequency profile analysis. In fact, a good tree can be constructed by looking at a single “optimal” feature length, such as nine letters, where the “vocabulary” is very large, instead of looking at all possible lengths.

“I was just stunned when I saw this,” Kim said. One of the reasons this method works better, he said, may be that, while word frequency analysis treats each word independently, feature frequency analysis picks up syntax.

“Here, if I take a 9-letter window and slide it along the text,” he said, “I am actually picking up the relationship between the first and second words – the local syntax – which was impossible to pick up from the word frequency method. Apparently, that is very important.”
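To make the sliding window concrete, here is a minimal sketch of a feature frequency profile in Python. It is only an illustration of the idea Kim describes, not the researchers' actual code. The text normalization (lowercase, letters and single spaces only) and the use of cosine similarity to compare two profiles are my own assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and collapse every run of non-letters into one space.

    Assumption: this normalization is chosen for illustration only.
    """
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def feature_profile(text, length=9):
    """Return the relative frequency of every `length`-character window."""
    clean = normalize(text)
    # Slide the window one character at a time along the cleaned text.
    counts = Counter(
        clean[i:i + length] for i in range(len(clean) - length + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than one window
    return {feature: n / total for feature, n in counts.items()}


def cosine_similarity(p, q):
    """Compare two profiles; 1.0 means identical, 0.0 means nothing shared.

    Assumption: cosine similarity stands in for whatever distance measure
    a real implementation would use.
    """
    dot = sum(value * q.get(feature, 0.0) for feature, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)
```

Because a 9-character window usually straddles two words, the profile records which words tend to sit next to each other, which a plain word count cannot do.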

Sophisticated Web search just got easier. Imagine a tool where you identify a particular writer and then tell that tool, “Go find all articles by this writer”. The tool can search its database of articles for likely matches.

Can Rand Fishkin really contribute content to 900 Web sites? Maybe. After all, there are free press release and free ezine article archives where people pick up free-to-distribute content all the time. But there is boilerplate copy that usually identifies these types of redistributable items. If you filter those out, all that’s left is the authorized copy and the unauthorized copy.

The last piece in the complex puzzle is how you distinguish between authorized and unauthorized copies of Web content. One potential candidate mechanism is link validation. If I give you permission to use my content, you and I could link to each other to show that I have authorized you to use my copy and that you acknowledge that authorization.

No search tool can assume this is the preferred method but there are semantic markup techniques to help us establish a protocol for authorizing duplicate content. It’s as simple as two sites agreeing to use a microformat to identify each other.

Now, should a site only obtain copy by reproducing what other sites create (even with authorization), the chances are pretty good that it won’t be of very high quality. Some people might balk at the idea of embedding microformatted code on their pages that boldly proclaims, “I did not create this copy”. But you can always dress up the copy (without altering it) by supplementing it with your own copy, a mashup technique that is widely used and generally deemed acceptable by the search engines when it’s not used excessively or to game their search results.

No scraper site would be able to easily fake a validation scheme because there are two layers of protection: first, the copy is “fingerprinted” by the creator’s writing style. Second, the structure of the reciprocal linking method (two microformats acknowledging the authorized reproduction of copy) acts like a checksum.

That is, if you identify a body of articles written by Michael Martinez and those articles use the microformat to validate syndicated copies, any sites that try to fake the syndicated arrangement (by exchanging links between two controlled scraper sites) will fall outside the network. In effect, a scraper network would be hoist by its own petard. It would either have to rely entirely upon a single author (thus violating intellectual property rights and taking a heavy risk of being subjected to a takedown notice) or it would be blatantly counterfeiting an authorization link exchange that would stand out like a sore thumb.
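The “checksum” idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real search engine feature: `declarations` is a hypothetical table of which pages each page claims an authorization relationship with, and the set of the author's own pages is assumed to have been established already (for example, by the writing-style fingerprint).

```python
def mutually_declared(declarations, a, b):
    """True only when a and b each declare the other."""
    return b in declarations.get(a, set()) and a in declarations.get(b, set())


def authorized_publishers(declarations, author_pages):
    """Return the pages holding a two-way declaration with the author's pages."""
    authorized = set()
    for page in author_pages:
        for candidate in declarations.get(page, set()):
            if mutually_declared(declarations, page, candidate):
                authorized.add(candidate)
    return authorized


# Hypothetical data: every host name here is made up for illustration.
declarations = {
    "author.example/authenticate.html": {"publisher.example"},
    "publisher.example": {"author.example/authenticate.html"},
    # Two scraper sites vouching for each other.
    "scraper-a.example": {"scraper-b.example"},
    "scraper-b.example": {"scraper-a.example"},
    # A site claiming an authorization the author never granted.
    "faker.example": {"author.example/authenticate.html"},
}

approved = authorized_publishers(
    declarations, {"author.example/authenticate.html"}
)
```

In this toy data only publisher.example ends up in `approved`. The two scrapers validate each other, but neither connects to a page known to belong to the author, and the one-way claim from faker.example is never confirmed from the author's side.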

It’s not like people have to panic over these kinds of link exchanges (“Oh no! What would that do to my link profile?”). Either the use of the microformat would be sufficient (if acknowledged by the search engine) to keep the links from passing PageRank and anchor text, or a link attribute could be included (such as “rel=nofollow”). If a specialized attribute were used, such as “rel=authenticate:site-1.example.com/authenticate.html+site-2.example.com” where site-1.example.com/authenticate.html and site-2.example.com represent the syndicator and authorized publisher, it would be easier to track syndicated content through a variety of applications.
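For anyone wondering how an application might read such an attribute, here is a small Python sketch. Keep in mind that “rel=authenticate:…” is only a format I am proposing in this article, not an existing standard, so the parsing rules below (one “+” separating syndicator from publisher, other rel tokens separated by spaces) are assumptions, and the names `AuthLink` and `parse_authenticate_rel` are hypothetical.

```python
from typing import NamedTuple, Optional


class AuthLink(NamedTuple):
    syndicator: str  # the authentication page on the content owner's site
    publisher: str   # the site authorized to republish the copy


PREFIX = "authenticate:"


def parse_authenticate_rel(rel: str) -> Optional[AuthLink]:
    """Extract the syndicator/publisher pair from a rel attribute value.

    Returns None when no well-formed authenticate token is present.
    """
    # A rel value may carry several space-separated tokens, e.g. "nofollow ...".
    for token in rel.split():
        if not token.startswith(PREFIX):
            continue
        syndicator, plus, publisher = token[len(PREFIX):].partition("+")
        if plus and syndicator and publisher:
            return AuthLink(syndicator, publisher)
    return None


link = parse_authenticate_rel(
    "authenticate:site-1.example.com/authenticate.html+site-2.example.com"
)
```

Here `link.syndicator` is the authentication page and `link.publisher` is the republishing site, which is all a crawler or tracking tool would need to look up the matching link on the other side.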

A syndicator’s authentication page could carry one or many authenticating links. A publisher would have to embed the authentication link with the republished copy.

By providing Webmasters a means of identifying authorized replication, search engines could ignore the replicated copy’s value-passing links while still indexing the copy (there are situations where people WANT to find duplicate copy, so “omitted results” should work just fine for these types of authenticated duplicates).

The process would only work, I think, if search engines promoted it gracefully and in tandem. Implementation of the authentication links would have to be simple and quick (no doubt plenty of such tools would be developed and made available for free). The search engines would only have to assure people that using this kind of authentication would in no way result in a ban or penalty. It’s just a way of telling search engines, “Hey, I have a legitimate right or privilege to include this content that was written by someone else on my Web site”. The Web site would not be considered suspicious simply because it included many such authenticated articles.

No system is perfect but the more difficult we make it for people to scrape and reuse our content, the better off we’ll be. Scraped content only helps the scraper, who doesn’t have to work hard to get the content.

Maybe it’s time to raise the cost of scraping.

3 comments

devdotcom 02.13.09 at 10:37 am

There’s no way to combat spam, and Google is responsible for that. With Social Media, the problem is much worse.

You see Joe the SEO and all of his buddies with “The Best SEO Program Ever” Blah Blah Blah… So you have a million idiots telling people to create 100 social media profiles, submit 1,000 social bookmarks a day, and oh, don’t forget to use this handy dandy blog commenter that’ll blast out 5,000 comments at a time!

We’ve seen it before with Google when everybody was chasing links. Now link spam was bad but it’s nothing compared to some schmuck creating Squidoo lenses, profile pages, and blogs everywhere.

Google has a fundamental problem with indexation because of the rate at which content is growing in the social media sphere. The more “spam” that they index, the harder it’s going to be to deliver quality results.

Michael Martinez 02.13.09 at 10:49 am

Web spammers leech traffic from more legitimate Web marketing. I believe it’s important that we SEOs understand it and take measures to counteract it whenever it threatens our own campaigns and services.

Maybe I feel a strong personal connection to the Web spam issue because I’ve had to fix problems that spammers have created for me through the years. They aren’t all nice kids who are just looking for some extra income. Some of them are hardened criminals. They don’t care who they step on in the process of making a buck.

devdotcom 02.13.09 at 11:25 am

Excellent points.

I think that people who are looking for extra income don’t realize the difference between tactics and strategy. Creating profiles, blogs, and bookmarks only makes sense within the context of a good marketing strategy. There are ways to use Facebook, Twitter, Stumbleupon, and other sites as part of a good marketing strategy.

My definition of spam is something that serves no purpose or offers no value to the user. It’s not just off-site though. You see people who create spam on their own sites with 200 links on the homepage because they don’t understand Usability.