When Google dumps the Web from its index

by Michael Martinez on August 29, 2007

Google has been dumping millions of Web pages from its index for at least two weeks, perhaps since the end of July. I have opened threads about the August 2007 Google Update (for lack of a better name) at both Highrankings and Spider-food.

People are welcome to private message me through either forum with Web sites that may have lost pages in Google’s index since the beginning of August 2007. I’d like to build an aggregate picture if I can. I won’t solicit business or share your information publicly.

This is not the first time Google has dumped the Web from its index, and I suppose it won’t be the last time. Search Engine Roundtable is following a discussion at a forum where people are reporting many missing pages. But the first time I saw something like this was in early 2005 (previous Google “disaster updates” had significantly different characteristics).

In February 2005 Google began dumping millions of pages from its index. For 2-3 months we saw many queries filled with URL-only listings, with no descriptive snippets or titles. The URL-only listings indicated that Google knew about the URLs but had not indexed them. At the same time, many of the cache entries for pages still in the index displayed old images, some as much as 2 years old (going back to early 2003).

When Google performed a similar house-cleaning last year they spared us the endless pages of URL-only listings. When I asked Matt Cutts (possibly on his blog) whether Google was going through the same procedure as had occurred in early 2005 he suggested the 2006 data dump was different without elaborating on how or why.

Nonetheless, many pages were dumped from Google’s index and it took some sites months to recover their lost search visibility.

It appears to me that another such broad data dump is under way, for reasons Google probably won’t disclose. In the past two massive dumps many spam sites were actually left in the search results and many hobbyist, small business, forum, and ecommerce sites lost page visibility. Some sites reported losing all but their home pages (root URLs).

It may be that the pages most likely to remain visible through this process will be the link-rich pages that Web sites sport. But during the first massive data dump even Xenite.Org lost significant visibility. Our traffic dropped by about 20% for 2-1/2 months and our Google search referrals were almost non-existent. It was nearly impossible for me to find any pages in the Google index. In 2006 Xenite fared better than most sites.

So far in this year’s data dump Xenite has lost about 40% of its Google indexed pages. Of the pages that I think may be in the Main Web Index, I think maybe 1/3 have gone missing. So most of the lost visibility is coming from my Supplemental Results pages. These pages don’t get much traffic from Google anyway so I haven’t taken nearly as much of a hit in traffic as I did in 2005.

That’s the good news.

The bad news is that I have client sites that have also taken hits, although not all of them. In fact, some client campaigns are soaring along. One could easily ask why anyone might complain, if one were to look only at those well-performing client campaigns. But any time a search engine drops pages from its index it will continue to serve up other content for queries. So naturally when you have many losers in the SEO game you’re going to see many transient winners, too.

Only time will tell if we see the Google search index return to its pre-August state. Frankly, I’ve been disappointed with Google’s query results for several months. Finding specific information has become increasingly difficult as they have moved more pages into the Supplemental Results Index. But through the past two massive data dumps the index results saw significant improvement in quality after Google finished churning.

In the meantime, what can you do if you are losing pages? Normally when a search engine launches an update I tell people not to analyze too much and not to agonize. It’s usually best to do nothing because the updates don’t normally last long. But this data dump has already gone on for 2 weeks or longer and I see no indication of a reduction in stress. So it may be prudent for people to try a few things.

First, even though Google is dumping old content, it is still adding new content to its index. We know that blogs, news sources, and high traffic sites will continue to be indexed even if their sites lose some older content. So it follows that any site should be able to launch new content with a reasonable expectation of seeing it indexed at some point. I don’t yet know if Google is suppressing XML sitemap content. I’ve put up some test sitemaps to see if I can get new content and maybe some deindexed content into the index.

If you publish an RSS or XML feed, try adding some of your dropped pages to it. That’s the experiment part of the SEO Method. Then wait a week and see if you can find your lost pages (or new pages). That’s the evaluate part of the SEO Method. If you don’t see any progress then re-evaluate your linking strategies. It may be that Google is not recrawling your links as much as it had.
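For anyone who wants to try the sitemap side of that experiment, here is a minimal sketch of building an XML sitemap from a list of dropped pages. It assumes the standard sitemaps.org 0.9 format. The `build_sitemap` helper and the example.com URLs are hypothetical placeholders, not anything from my own test sitemaps. Listing a page this way only tells a search engine the URL exists; it does not guarantee the page will be indexed.

```python
from datetime import date
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod=None):
    """Return a sitemap XML string listing the given URLs.

    Hypothetical helper for illustration: `urls` is any iterable of
    absolute URLs, `lastmod` an optional datetime.date applied to all.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
        if lastmod is not None:
            ET.SubElement(entry, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs standing in for pages that dropped out of the index.
dropped_pages = [
    "http://www.example.com/articles/old-page-1.html",
    "http://www.example.com/articles/old-page-2.html",
]

sitemap_xml = build_sitemap(dropped_pages, lastmod=date(2007, 8, 29))
print(sitemap_xml)
```

Save the output as a file such as sitemap.xml on your site, submit it, and then check back in a week as described above.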

New content has always worked best for me. I can bring people into my Web sites and send them to older pages if those pages are still relevant to current searcher interests. And though creating new content is time-consuming, anyone who has been putting off a major site redesign might want to take advantage of their lost visibility to do the dirty work. After all, if Google has dropped your pages, you’re not risking much by revamping your content.

Remember to handle URL changes carefully because old links and other search engines still matter.

And speaking of other search engines, whenever Google dumps lots of pages from the index many people look around to see what they can do to optimize for other search engines. One thing you can do is help your visitors to find content they want through other search engines: update your on-site search tools (remember, Google isn’t indexing your pages) to use search engines that provide better indexing; and mention search engines people can use to find your site in your copy, your advertisements, and your promotional literature.

Now might be a good time to seek out new relationships with other Web sites. I’m not talking about swapping links or cheap articles. I’m talking about forging legitimate business functions between two or more sites so that all parties benefit from the traffic they generate. It’s never too late to build a great network of partners who help you promote your products and services.

Finally, now might be a good time to take a vacation. Fretting over data dumps, updates, and algorithm changes isn’t really productive. Every now and then even old sites have to lose some visibility. People can spend the next few weeks recharging their emotional batteries, putting the finishing touches on their Fall campaigns, and developing new content ideas.

If you make Google less relevant to both your life and your business model you’ll find that occasional losses of visibility don’t matter that much.

{ 10 comments… read them below or add one }

deInternetMarketeer 09.03.07 at 3:03 pm

Michael,

Do you think this is a big update?
I remember BigDaddy and that was a period like you describe now with a big crash a month later with lots of sites in supplementals or totally gone.
(That perhaps was an unexpected error at Google.)

I’ve also seen some changes in indexed pages, and I’ve also seen people reporting something like this in a Dutch forum and elsewhere.

Michael Martinez 09.04.07 at 8:02 am

I’ll know it was a big update when it’s over. For now, I just have to sit and wait like everyone else. It’s always risky to analyze search results when the search engine is rolling out changes. It may be that all Google did was dump data from the index and start a rebuild.

deInternetMarketeer 09.06.07 at 3:01 am

If I look at some Dutch results I definitely see some big changes for some queries I checked. Mine mostly have improved :-)

So I think they did more than just dump, although you’re right and we have to see where this leads in a few weeks.

dodito 09.14.07 at 3:05 am

Michael,

it is difficult to tell since we also received a lot of new (good) links in June/July, but since July we have seen an uptake of our pages ranking well. Those were supplemental before, and the boom has continued to an explosion in August. We moved from 10-20 results per day from google to over 500. (we have many thousands of pages).

I can discuss more in private, if you wish.. (where to email ?)

On the other hand, when I did a page analysis I was really shocked to see some pages having a cache date as far back as 25 July or older. This includes pages from large museums. Now they are deeper etc., but I know for a fact that when I created that list this spring these pages were crawled OK, and certainly not in the supplemental. So if a page now suddenly has a cache date of 25 July, something has changed.

dodito 09.14.07 at 3:35 am

Michael,

where could I email you if I wanted to send an example ? The contact at 1stquery.com did not work for me..

Michael Martinez 09.14.07 at 12:24 pm

Most of my pages have come back since I wrote that post.

1st Query has had some internal email issues. An alternative way to contact me would be Xenite’s Contact Form.

dodito 09.15.07 at 12:33 am

Michael,

perhaps on here is OK as well, given your last comment. The museum link I mentioned is this one: http://www.nhm.ac.uk/research-curation/library/earth-science-library/index.html

It is cached 23rd July, but the page is in the navigation menu. I would therefore consider it well linked (or am I making a mistake by assuming that when it is in the navigational menu, it is as well linked as possible?) and the Natural History Museum in London is not exactly a poorly linked-to site either, nor an unimportant one… (subjective of course). This really came as a surprise to me since I know for a fact it was not in the supplemental before, this spring, and it was crawled more regularly (though unfortunately I did not have the time to keep track of all the caching dates).

Michael Martinez 09.15.07 at 8:18 am

I’ll try to take a look later this weekend. I can tell you that a lot of sites show cache dates from late July or early August, which is consistent with the timing of the reported data dumps.

I expect Google to recache most of the pages that seem to be stuck in July and August over the next few months, either in the Main Web Index or in the Supplemental Results Index.

They may have had other algorithmic reasons for dumping content or not freshening content from major, well-linked Web sites.

dodito 09.15.07 at 9:00 am

That was also my impression. I did a whole list of pages, and it was quite funny to see how often “July 26th” came up as a cached date.

John Slimak 11.23.07 at 2:38 pm

Thank you for the information. I have been on 3 forums and all over the web trying to find out why my site lost 20 pages. I will take the advice and relax and add some additional content.