Google Throttles Blog Search Indexing

by Michael Martinez on February 12, 2009

UPDATE: 2009-02-12 Every now and then I take off the kid gloves and just lay into Google over something I don’t like. Other people in the SEO industry occasionally criticize them harshly, too. I want to thank Googler Jeremy Hylton for his gracious reply (see the comments). I am what I am and try to improve each day. Ultimately, Blogsearch is Google’s tool to create, manage, and operate as they see fit. I always have the option (however impractical and unlikely to be invoked) of trying to create my own search tools — just like everyone else.

I have posted several complaints about the downturn in Google Blogsearch quality and indexing over the past couple of months. Matt Cutts has rightfully asked me to provide some examples for Google to evaluate to see if my observations are subjective or if I’m on to something.

Unfortunately, if I took snapshots of queries that I could publicly share on a regular basis just in case somewhere down the line a search engine reboots its algorithm, I’d run out of disk space very quickly (and I’d reduce the number of reported queries for the major search engines by a percentage point or two, as it takes time to do a screen capture and save it somewhere).

Nonetheless, it’s important that I do more than just criticize Google in some vague and ring-knocking way. I want to see Blogsearch improve, at least to get back to where it was a few months ago. So I’m taking the time to document the problem as best I can.

Keep in mind that I don’t normally run these types of queries, although I have used similar queries in the past to “clock” Google Blogsearch’s indexing rate. The sites I include in the screen captures below are all sites whose Blogsearch indexing I have clocked in the past, although I don’t have screen captures to prove it.

Also, the screen captures are only date-stamped in their names. I could not think of an easy way to provide a legible datestamp in the screen captures (I could certainly have experimented with putting a clock and calendar up there, but I have to shrink the screen captures and I wanted the search results to be at least barely legible). Maybe I should have uploaded full-size screen captures but I hate having to click on images to see a larger version. That’s just me.

Anyway, the four sites I used in this index clock test are SEO Theory, Sphinn, SE Roundtable, and Search Engine Journal. All four sites post new content every day. I used the site: search function and sorted by date to show what Google Blogsearch is reporting as the most recent posts. I took these screen captures between 9:00 AM and 10:00 AM PST on Wednesday, February 11, 2009.

You’ll see that there are no results from any of these sites for February 11. SEO Theory’s daily post is published at 8:00 AM PST. Search Engine Journal’s main section isn’t even indexed. All I could get to come up were job listings. Sphinn, as everyone knows, updates and pings by the minute. And SE Roundtable was already well into its February 11 story list by the time I took the screen capture.

None of these sites shows any content for today. Why?

Only a few months ago similar searches for these and other sites produced listings that were anywhere from a few minutes to no more than a few hours old. Something big changed at Google and, quite frankly, I don’t like it.

Google relaunched its Blogsearch service in October with a format similar to that of Google News Search. It appears to me that Google has sacrificed the quality of its index for the sake of being able to replicate a DIGG-like functionality for its default Blogsearch results. Marshall Kirkpatrick compared it to Techmeme at Read Write Web.

In November Googler Jeremy Hylton wrote in Google Groups:

We have changed the way we index blog posts to include the full content of the page. We’ve had occasional complaints about the use of the feed content, particularly the problem with partial feeds that you mentioned. The indexing change has improved the results for a lot of queries, both because we have the full content of the page and because we extract links that are missing from the feeds. The downside of this change is that we see more results that match only the blogroll and other parts of the page that are common to all of a blog’s posts.

We expected some problems from blogroll matches, but may have underestimated the impact on searches using the link: operator or where the query matches a blog or blogger’s name. We do expect to fix the problem you’re seeing. We’ll use the full page content, but exclude the content that isn’t really part of the post. I’m not sure if we’ll be able to make the change before the end of the year, but we are working on it and are pretty confident that it can be solved. We’ll post an update here when we’ve got a solution.

Yes, well, Google’s Blogsearch was always a poor link research tool anyway, but they have NOT improved the quality of searches because their full indexing fails to capture many page titles, in-body expressions, and other important content. Your blog content is now more invisible on Google Blogsearch than ever before.

I’ll offer a hypothesis after the screen captures below:

February 11, 2009 SEO Theory site search in Google Blogsearch.

February 11, 2009 Search Engine Journal Google Blogsearch site search.

February 11, 2009 SE Roundtable Google Blogsearch site search.

February 11, 2009 Sphinn Google Blogsearch site search.

So what happened? And will any Googler offer an explanation? More importantly, will Google fix this very serious problem?

I’ll answer the last question first: I don’t think Google plans to restore the quality of its Blogsearch results to pre-October performance. They had a superior product at the time but now that they have improved it with disastrous results I am convinced they will stick to their guns and stay the course and prevent Blogsearch from ever becoming a useful resource again.

Why do I feel that way?

Because today’s Google Blogsearch reeks of a PageRank-based filter such as they imposed upon the Main Web Index in early 2006 with the Bigdaddy (Google 2.0) Update. Ever since Bigdaddy, Google has consistently sacrificed the quality of its search results by burying the most relevant content beneath less-relevant PageRank-rich listings.

And now we’re seeing the same thing in Google Blogsearch. When you just open up Google Blogsearch you are treated to a “front page” of Blogsearch listings you didn’t ask for, which are clearly and obviously ranked by the number of blogs that carry a similar headline or maybe link to the featured listing for each topic. Call this methodology StoryRank (or TopicRank or ConceptRank), which appears to be a measurement of the number of blog posts about a specific topic (it must be an important topic if 100 blogs are all reacting to it within the last 12 hours — right?). NOTE: Citation-based statistics have been scientifically discredited as a metric of quality.
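
To make the hypothesis concrete, here is a minimal Python sketch of what a StoryRank-style score might look like: count how many distinct blogs posted about each topic within the last 12 hours and sort the topics by that count. Everything in it is an assumption made for illustration (the `rank_topics` name, the tuple layout, the example blogs and topics); it is a guess at the idea described above, not anything Google has published.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rank_topics(posts, now, window_hours=12):
    """Rank topics by how many distinct blogs posted about them recently.

    posts: iterable of (blog, topic, published) tuples (hypothetical layout).
    Returns a list of (topic, distinct_blog_count), highest count first.
    """
    cutoff = now - timedelta(hours=window_hours)
    blogs_by_topic = defaultdict(set)
    for blog, topic, published in posts:
        # Only posts inside the time window count toward a topic.
        if cutoff <= published <= now:
            blogs_by_topic[topic].add(blog)
    # Sort by distinct-blog count (descending), then topic name to break ties.
    return sorted(
        ((topic, len(blogs)) for topic, blogs in blogs_by_topic.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Made-up example data: three blogs react to one topic, one blog to another.
now = datetime(2009, 2, 11, 10, 0)
posts = [
    ("a.example", "canonical tag", now - timedelta(hours=1)),
    ("b.example", "canonical tag", now - timedelta(hours=2)),
    ("a.example", "canonical tag", now - timedelta(hours=3)),   # same blog, counted once
    ("c.example", "blogsearch lag", now - timedelta(hours=1)),
    ("d.example", "canonical tag", now - timedelta(hours=20)),  # outside the window
]
ranking = rank_topics(posts, now)
```

Note what a score like this rewards: the number of blogs echoing a topic, not the freshness or relevance of any single post.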

So, Web spammers, take solace. I have just given you part of the magic formula for appearing on Google’s Blogsearch front page.

But why does it take so long for obviously well-linked, frequently updated sites to get new pages into the Blogsearch index?

Part of the answer seems to be this new-fangled full-content indexing Google switched over to. While that may help them find which blogs are writing about the StoryRankiest topics, it’s preventing them from doing what they used to do very well: index the Blogosphere. It takes longer to process full pages than feeds, not only because they are sifting through more text but (apparently) because they are analyzing that text for the StoryRank Effect.

I have noticed other oddities, however. For example, given two blogs of about equal inbound linkage and posting quantity and frequency, one on Blogspot and one on Wordpress.com, the Wordpress.com blog takes longer to get into Google’s index than the Blogspot blog. (In fact, I consistently see the Blogspot posts appear in Main Web Search within minutes — well before Blogsearch shows the listings.)

Why? Perhaps because Google controls the servers for Blogspot and has already integrated the Blogger/Blogspot posts into a database that Google can readily digest. Or maybe just being owned by Google gives a spam-friendly blog site a leg up on the competition.

Other possible explanations that don’t require a so-called “StoryRank” solution include:

  1. Spam blogs are choking Google’s ping-and-crawl queue.
  2. SEOs have persuaded so many people to start their own blogs that Google’s crawling system is overwhelmed by the number of blogs it needs to crawl.
  3. Google has added more RSS feeds to its indexing queues, thus stretching out the priorities (they do index Web forums in Blogsearch, by the way).
  4. I now have so many test blogs that Google is personally zooming in on me through my cell phone and computer to block my analysis of their algorithm.

I don’t know why Blogsearch now sucks so bad. I just know it does. And though I could try to document more queries, I don’t think that’s really necessary. The Googlers know they made huge, significant changes at Blogsearch. They know better than I how those changes were implemented. Frankly, I don’t think they need me to document specific queries at this point in order to understand that they robbed from Peter to pay Paul.

Google has found another way to advance its PageRank mythology. It’s not serving people like me — who constantly search for blog posts for a variety of reasons — any good. In fact, it is detrimental to the quality of blog search results to favor sites that strike it rich with StoryRank-favored topics. Blogsearch should not be focusing on what’s popular, but rather on what’s being blogged.

There’s a huge difference between the two concepts because popularity is fleeting but new blog posts are constantly emerging (and now they are largely being ignored by Google).

{ 4 comments }

olmei 02.12.09 at 7:29 pm

Michael,
Not exactly related to your article but I’m currently in a heavier mode of research and data accumulation hell than usual and am itching to know your initial take on what effect the forthcoming ‘Canonical Concert’ of msn, google and yahoo might have.

Feel free to expand heavily with respect to the content repetition theme (which I also utilize and enjoy your take on), in copy, urls, etc…

GOOD ARTICLE by the way.
Though next time consider expanding your screen shots so I can better view your programs-of-choice icons.

Mike

Michael Martinez 02.12.09 at 9:54 pm

I think it will help some people having crawl issues. I think the SEO community will also find ways to make up nonsense about the tag. And maybe there is a small chance someone will discover an unforeseen use for it.

jhylton 02.13.09 at 8:29 am

Michael,

Thanks for your feedback on blogsearch.

I read your post about two hours after you published it, and it was
already in our index. (I checked our backend systems. It looks
like it took almost eight minutes for us to index it after receiving
your ping. I’m not sure why it took so long to get indexed.)

I tried the same queries you did and saw the same problems. There
were a few different causes for these problems. One simple problem is
that searchenginejournal.com was classified as a news site, so we
weren’t including it. That’s just a mistake, and I’ll make sure it
gets fixed today.

The problem with the other queries is a little more subtle. We have
some algorithms that attempt to eliminate very similar results, which
can be fairly aggressive at times. If you get to the last page of
results, you’ll sometimes see a message that says, “We have omitted
some entries very similar to the NNN already displayed.” There is a
link following the message that repeats the query with an extra filter=0
CGI param, which shows all the results.

For [site:seo-theory.com] and [site:seroundtable.com], we omitted some
results because we thought they were too similar. The results are
filtered on a per-query basis, so posts will still show up for other
queries. You can see the difference by adding the filter=0 param
yourself:

[site:seroundtable.com]

[site:seroundtable.com] with filter=0

We were already planning to disable the duplicate suppression
algorithm when you restrict your query to a single site or blog. I
think that change will be live in a week or two.

I didn’t get a chance to look at the Sphinn query yesterday, but it
looks okay today. I don’t know what might have gone wrong when you
ran your query yesterday. A variety of transient problems might have
caused us to briefly serve stale results, but I think they are rare.
It’s not unusual for a particular blog to be slow, either, and then
Googlebot may take longer than normal to crawl the new posts.

Jeremy Hylton
Google
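
The filter=0 parameter described in the comment above can be added by hand, or with a few lines of Python. The sketch below takes any search URL and sets `filter=0` on it, replacing an existing `filter` value if there is one. Only the `filter=0` parameter comes from the comment; the helper name `with_filter_disabled`, the example Blogsearch address, and the `q` query parameter are assumptions used for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filter_disabled(url):
    """Return the same URL with filter=0 set in its query string."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous "filter" value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "filter"
    ]
    params.append(("filter", "0"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Assumed example address; the real Blogsearch URL layout may have differed.
original = "http://blogsearch.google.com/blogsearch?q=site:seroundtable.com"
unfiltered = with_filter_disabled(original)
```

Comparing the results of `original` and `unfiltered` for the same site: query shows which posts the duplicate-suppression step had hidden.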

Michael Martinez 02.13.09 at 9:47 am

Jeremy, thanks for the timely reply and the explanations. I do see some improvement although I have not checked out all the queries I normally monitor. I watch many blogs in Blogsearch and the overall trend has been for delays in seeing their content appear in Blogsearch.

I’m still not happy with the change in the default behavior, but I realize I’m just one user out of millions. I appreciate your responsiveness to the issues I’ve been able to document.