Amazon Web Services has become the largest source of spider spam in my server logs. Their cloud service creates the perfect environment for incubating armies of crawlers that go out and fetch content from every Website imaginable.
It is not easy to block Amazon Web Services. Many people have tried. Most are probably failing. They appear to be expanding their IP address pool to accommodate their growing list of clients. Perhaps they even realize they are being blocked by many Websites and are compensating for the blocks by obtaining new IP addresses. Or maybe they are just slowly rolling out IP addresses from a reserve pool, biding their time, waiting for IPv6 to go live across the Internet.
I have no idea what the people at Amazon Web Services think they are doing. I do know, however, that they are creating a huge problem for many independent Website operators, especially those of us who operate our own servers and are forced to manage spider activity on our own. The companies that send out these rogue crawlers pay absolutely no attention to the robots.txt exclusion standard. What’s worse, many of them trample intellectual property rights without a glance at the laws on the books.
As I mentioned on Twitter yesterday, I started blocking a large number of Amazon AWS IP addresses while I figured out what to do about Readability.com, who were republishing SEO Theory articles on their Website with what appeared to be CMS-injected copyright notices (in the name of Readability, not me). To their credit, Readability.com complied with my takedown request immediately.
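For anyone curious what that kind of block looks like, here is a minimal sketch using Apache 2.2-style access control directives in .htaccess. The CIDR ranges shown are illustrative placeholders on my part; Amazon’s actual allocations grow and change, so you have to look up their current published ranges yourself.

```apache
# Minimal sketch (Apache 2.2 mod_authz_host, e.g. in .htaccess):
# deny a few cloud IP ranges while allowing everyone else.
# The CIDR blocks below are illustrative placeholders, not a current list.
Order Allow,Deny
Allow from all
Deny from 72.44.32.0/19
Deny from 67.202.0.0/18
Deny from 75.101.128.0/17
```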
If you find your content is being republished by Readability with an improper copyright notice and you want to do something about it, try using their Opt Out of Readability Procedure before escalating any further. Given that Amazon Web Services has pretty strict Terms of Service, you might be able to get them to pull the plug on clients who are stealing content through the cloud, but I haven’t tested Amazon’s enforcement. Some ISPs turn a blind eye to customer misbehavior and demand you bring them a court order.
Amazon Web Services is supporting rogue crawling from a multitude of companies that may or may not be as compliant as Readability with takedown requests. In fact, Danny Sullivan took Readability to task in 2010 for their aggressive crawling. Not knowing the entire history of the service, I will give them the benefit of the doubt and say they have been flexible by creating a communications channel for publishers. That puts their responsibility on the side of “ask forgiveness afterwards,” whereas I, as a publisher who would otherwise have to monitor and police almost 200 million Websites for copies of my content, would prefer to see everyone else be responsible on the side of “ask permission first.”
There are many responsible companies using Amazon Web Services — at least, so they seem to me. URL shortening services like Bit.Ly and Awe.sm run through the AWS cloud, for example. So does Twitterfeed. I am sure that many social media startups get onto the AWS cloud or some competing service as soon as possible; the alternative to paying for expensive cloud hosting is investing in even more expensive data center construction.
Nonetheless, despite all the good things that are coming out of the cloud, the fact remains that the cloud is a huge pain in the neck for Websites that are being scraped. Unlike the typical low-budget scraper who is using his personal bankroll to lease server accounts from small hosting vendors, cloud customers usually have significant investment behind their operations. Those millions of dollars add up to hundreds of millions and billions of dollars quickly, and the cloud hosting companies (including HP, Rackspace, and a growing field of other competitors) are betting that cloud computing is the future of Web hosting. In other words, cloud services will only continue to grow and expand as more money is poured into them.
Economies of scale thus make it easy for any startup whose founders believe God created all the content on the Internet for the benefit of their personal financial success to start scarfing up blog posts, forum discussions, image galleries, and everything else that was created for “real people”, individual human visitors. The robots are now so pervasive they comprise over 50% of all the traffic that hits my server, and I AM blocking many robots, both rogue and well-behaved.
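If you want to estimate the robot share of your own traffic, a rough script like the following works against a standard Apache combined-format access log. The keyword list is a naive assumption of mine, and crawlers that fake browser user-agents slip right past it, so treat the number as a floor rather than a ceiling.

```python
# Rough sketch: estimate what share of requests come from self-identified
# robots by scanning an Apache combined-format access log.
import re
import sys

# Naive keyword list (an assumption, not an exhaustive census of crawlers).
BOT_HINTS = ("bot", "spider", "crawl", "slurp", "fetch", "curl", "wget")

# In the combined log format, the user-agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

total = robots = 0
with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        total += 1
        user_agent = match.group(1).lower()
        if any(hint in user_agent for hint in BOT_HINTS):
            robots += 1

if total:
    print(f"{robots}/{total} requests ({100 * robots / total:.1f}%) look like robots")
```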
Baidu is another source of heavy crawling activity. I sometimes block BaiduSpider and sometimes not, because even though I don’t yet get much traffic from Chinese Web surfers, I MAY if I allow Baidu to index my content. Yandex is becoming a significant player in my search referrals, and I expect the role of international search engines to become even more important to my personal Web marketing goals. Most companies can probably still afford to block Baidu, but Baidu is starting to move beyond the Chinese Web.
Still, when I look at the crawling activity on my server, it’s obvious that high school kids and college students are not my most frequent visitors. Some of the scrapers are pretty obvious; others are not. Why do they need to hit a Website several thousand times a day? They come from North America, Europe, and Asia. Heck, there are even some Comcast IP addresses I’m not sure about; there are just too many botnets out there.
Some days I just use robots.txt to ask Baidu and Bing to leave my server alone. I’ve even blocked Googlebot on occasion. When you have a lot of content that isn’t changing, it’s okay to cut back on major spider activity for a few days. I’ve gone as much as two weeks without allowing some major search spiders to crawl my content. They “throttle down” when I do that, giving my URLs a lower crawl priority and me some breathing space. Unfortunately, all the links people point to my Websites eventually send even the legitimate crawlers back from dozens of IP addresses every day.
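For reference, asking the well-behaved crawlers to stay away really is that simple; a robots.txt along these lines does it. The user-agent tokens below are the ones the engines have documented (Baiduspider, bingbot), but verify them against the engines’ current documentation before relying on this.

```
# Ask specific well-behaved crawlers to stay away.
User-agent: Baiduspider
Disallow: /

User-agent: bingbot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```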
But these cloud-based crawlers are not welcome, at least not the ones I don’t invoke myself. They don’t honor robots.txt, so I cannot ask them to politely back off. And if I block all the IP address ranges belonging to the cloud services, I block legitimate tools and maybe even some users.
I understand why people might want to use a service like Readability. As the mobile Web becomes more active and important, mobile users want to be able to read good content without having to fight their way through bandwidth-hogging crap like widgets and ads. I’m not yet so desperate that I HAVE to show you ads all day long. I use the WPTouch WordPress plug-in on all my blogs, so mobile users can read my content fairly easily. They don’t need services like Readability.
Most Webmasters have probably not yet gotten the memo on Mobile. I can’t do much about that, except to remind people constantly that we MUST pay attention to what is going on with the platforms of the users who visit our sites. One theme does not fit all. And if you cannot monetize your mobile display properly then don’t monetize it at all until you can get the right tools in place. Don’t block out the Mobile users just because you cannot drive them into your sales funnel yet.
Meanwhile, I still haven’t figured out what to do with all the crawlers coming from .amazonaws.com, .cloud-ips.com, and the other cloud services that have transformed the Web into a crawler heaven. If I have to subscribe to a robot management service for this server, my costs will increase dramatically. Since I’m not currently in business for myself, that just doesn’t make any sense.
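One possible stopgap (a sketch only, not something I have tested at scale) is denying by resolved hostname rather than by IP range, since Apache 2.2’s Deny from can match a domain suffix like amazonaws.com. Be warned that Apache verifies each match with a double-reverse DNS lookup, which adds its own cost on a busy server.

```apache
# Sketch: deny by reverse-DNS suffix instead of raw IP ranges.
# Apache confirms each match with a double-reverse DNS lookup,
# so this trades blocklist maintenance for per-request latency.
Order Allow,Deny
Allow from all
Deny from amazonaws.com
Deny from cloud-ips.com
```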
One possible alternative might be to pay for a Managed Server hosting service, but I don’t know if those types of services aggressively monitor for rogue crawlers. They should. If you want someone like me to give up root-level access to my server, the least you should be doing is filtering my traffic so I get scraped less often and don’t feed rogue startup companies that don’t respect intellectual property rights.
Fair use doesn’t cover crawling and scraping; no such concept existed when the courts developed their tests for fair use. But even if some legal experts want to argue that there is room for interpretation with social media scrapers, any safe harbor and legal ambiguity vanishes the moment a company embeds its own copyright notice on an unauthorized copy of someone else’s Web documents. That is felonious theft of intellectual property rights: felonious because it is Deliberate, Intentional, Commercial, and Systematic. You can kiss your sweet freedom good-bye if you steal my content like that; I WILL see to it that you are prosecuted.
The bottom line here is that as cloud hosting becomes more popular, more startups are going to be doing whatever they please with everyone else’s content, and sooner or later independent Web publishers will start shutting down because they cannot afford the expense of helping other people make their fortunes without compensation.
Independent Web publishing is threatened by social media in ways most people cannot imagine. It is not merely the theft of intellectual property rights that threatens independent Web publishing; it’s also the theft of bandwidth, the increased burden of monitoring and blocking rogue activity, the increased strain on leased or co-located server resources, and the added expense of administering activity that serves no need of the independent Web publisher.
The solution is not to join the cloud communities. The scrapers will continue to scrape, and cloud hosting is NOT cheap. Well, that’s not entirely correct. Amazon offers a free hosting option, believe it or not. I’m just not sure I would qualify. Heck, it takes half a gigabyte just to create a compressed backup of my SQL databases. And 1 GB of Regional Data Transfer? I might be able to fit one blog into that free service. The moment you cross a tier boundary, your costs of operation increase significantly.
Cloud hosting is probably a good solution for someone with a marketing budget, some solid financial backing, and a goal of building traffic quickly. It’s not an attractive choice for an independent Web publisher who actually NEEDS a dedicated server.
Which leaves me sitting squarely outside the cloud, watching all those rogue crawlers coming at me, scarfing up my content. That’s not acceptable, either. And I know I’m not alone, but like I said above, I still haven’t figured out what to do about all this.
Not yet, but I will. I have to.