Manage crawl-to-index lag times efficiently

Posted by Michael Martinez on September 20, 2007 in SEO Theory

Do you need to submit an XML sitemap to the search engines?

Not so long ago many people complained on forums and blogs that submitting XML sitemaps to Google “caused” their sites to fall out of Google’s index. These same people usually said that as soon as they deleted their XML sitemaps their sites were restored to the index.

Let’s assume for the sake of discussion that Google had some problems with its XML sitemap process. Maybe it worked according to Google’s engineering standards but perhaps the process nonetheless caused more than one site to mysteriously vanish from the index.

I don’t really see people complaining any more. Perhaps when Google rewrote itself either last year or this year whatever caused those mysterious losses of visibility was resolved. Maybe they still occur but people who lose their visibility don’t complain much.

It’s hard to say.

On the other hand, a lot of people now seem to be creating XML sitemaps, especially with the help of automated tools. I think the XML sitemap has become a standard part of the Web promotion toolkit. But people wonder if they are not wasting their time with XML sitemaps.

Let’s look at how a new Web site might appear in the search engines. If you create a 20-page site, odds are pretty good you’re only going to promote your root URL. Most people don’t feel they should promote their “About Us” page (me, I might just put it as the link in my social media profiles, but that’s just me). So you go out and announce your root URL on a few Web sites, maybe buy some links, maybe a friend blogs about it for you.

Statistically speaking, which of your 20 pages of content is most likely to be fetched by a search engine first?

That first fetched page should link out to as many of your interior pages as possible. The fewer of your internal pages the search engines find from that first fetched page, the fewer of your pages will appear in the search index right away. That’s what you could call a crawl-to-index lag of the second degree. That is, the time it takes to crawl and index a page that a search engine knows about is the crawl-to-index lag of the first degree. The time it takes a search engine to find out about a page, then go crawl it, is the crawl-to-index lag of the second degree.

For every page that a search engine has to fetch just to find out that there is another page out there, add about a week to 10 days to your lag time. So let’s assume you create a new domain today. The search engines tend to be pretty good about checking out new domain names so your root URL may be crawled within a day or two. It may then take a week to 10 days for the search engines to come back to crawl the next set of pages. It could happen sooner, but on average I see second-round pages show up about 10 days later.

So if your root URL only links down to 4 or 5 section headers, you have to wait another week to 10 days for those deeper pages to be crawled. So now you’re into about 3 weeks of just waiting for a search engine to find and index your pages. And if you know anything about search engine rankings, you know you may have several more weeks to wait before you see where those pages bubble up on the basis of on-page factors.
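The arithmetic above can be sketched in a few lines. This is only a rough illustration: the function name and constants are my own, the 2-day and 10-day figures are the observed averages mentioned above rather than anything the search engines promise, and it covers discovery time only, not the wait for rankings to settle.

```python
# Rough figures taken from the discussion above; observations, not guarantees.
ROOT_CRAWL_DAYS = 2   # a new root URL "may be crawled within a day or two"
DAYS_PER_ROUND = 10   # upper end of "a week to 10 days" per extra crawl round


def estimated_discovery_days(clicks_from_root: int) -> int:
    """Estimate days until a page at the given click depth is first fetched."""
    if clicks_from_root < 0:
        raise ValueError("clicks_from_root must be zero or greater")
    return ROOT_CRAWL_DAYS + clicks_from_root * DAYS_PER_ROUND


# Root URL, section headers, and the pages beneath the section headers.
for depth in range(3):
    print(depth, "clicks from root:", estimated_discovery_days(depth), "days")
```

Under these assumptions the pages two clicks below the root come out at roughly 3 weeks, which matches the estimate above.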

XML sitemaps don’t reduce the lag time for first degree crawl-to-indexing. Once the pages are fetched you have to wait for the search engines to parse and index the data. But if you submit an XML sitemap for a 20-page site you should see all the pages appear in the index (maybe with some occasional misses) in about 2 weeks. That has been a pretty consistent pattern for us.
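For a small site the file itself is simple to produce. Here is a minimal sketch that builds one with Python's standard library, assuming the urlset/url/loc layout and namespace of the sitemaps.org protocol. The function name build_sitemap and the example.com URLs are made up for illustration, and optional fields such as lastmod are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a str) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical 20-page site: the root URL plus 19 interior pages.
pages = ["http://www.example.com/"] + [
    f"http://www.example.com/page-{n}.html" for n in range(1, 20)
]
print(build_sitemap(pages))
```

The output would normally be saved as sitemap.xml at the site root and then submitted to each search engine through whatever interface it provides.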

With large content sites XML sitemaps can reduce a process that lasts anywhere from 2 to 6 months to just a few weeks. The more content you have the more XML sitemaps make sense. They don’t guarantee that every page will go into Google’s Main Web Index and they don’t guarantee that every page will appear in Ask’s search results. Microsoft’s Live Search and Yahoo! may grab the pages and not seem to do anything with them for weeks. But the sooner you get those pages into the search engines’ databases, the better.

Now, where you may fall down is in failing to deploy a lot of links across your content. If you have 10,000 pages of content you had better be putting a minimum of 10 internal links on every page. Why? Because you want the crawlers to be continually revisiting even your static content as part of their natural crawling patterns. Pages do drop out of the indexes and the frequency of crawls is most often the reason why. If you have pages that are only fetched once every 3 months, that is a problem.
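One way to audit this is to count the distinct internal links on each page. The sketch below does that for a single page using only Python's standard library. The class and function names, the sample HTML, and the example.com host are all hypothetical, and "internal" is assumed to mean "same host name as the page itself".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

MIN_INTERNAL_LINKS = 10  # the minimum suggested above


class InternalLinkCounter(HTMLParser):
    """Collect the distinct same-host URLs linked from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.internal = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and ignore #fragment differences.
        target = urljoin(self.page_url, href).split("#")[0]
        if urlparse(target).netloc == self.host and target != self.page_url:
            self.internal.add(target)


def internal_link_count(page_url, html):
    counter = InternalLinkCounter(page_url)
    counter.feed(html)
    return len(counter.internal)


sample = (
    '<a href="/a.html">A</a> <a href="b.html">B</a> '
    '<a href="http://other.example.org/">Elsewhere</a>'
)
count = internal_link_count("http://www.example.com/index.html", sample)
status = "OK" if count >= MIN_INTERNAL_LINKS else "below the suggested minimum"
print(count, "internal links:", status)
```

Run across a whole site, a check like this would flag the pages that give crawlers too few paths back into the rest of your content.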

Showing your content to search engines is as important as showing that content to people. If people have to click through six documents to get down to your deep content, you need to consider alternative means of helping people find that content. Asking people to click through six times is asking a lot. But if it takes six clicks for a person to reach a page, that means it takes a search engine 7 fetches to get to it. And 7 successive fetches can translate into more than 2 months’ time to show a page in an index.
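If you have a map of which pages link to which, you can measure click depth directly with a breadth-first search. The following is a minimal sketch, assuming the link structure is already available as a dictionary. The page paths are invented, and the "clicks plus one" fetch count simply restates the reasoning above.

```python
from collections import deque


def click_depths(links, root):
    """Breadth-first search: fewest clicks needed to reach each page from root."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: a chain of pages, plus one shortcut link from the home page.
site = {
    "/": ["/section", "/popular-article"],
    "/section": ["/subsection"],
    "/subsection": ["/archive"],
    "/archive": ["/popular-article", "/old-article"],
}

for page, clicks in click_depths(site, "/").items():
    print(f"{page}: {clicks} clicks, {clicks + 1} fetches")
```

In this made-up example the shortcut link puts /popular-article one click from the home page instead of four, which is exactly the effect liberal internal linking is meant to have.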

If you change your page layouts once a month but they only update in the indexes every 2 months, half your visibility changes will never appear in search engine results.

Some people rely on XML sitemaps and RSS files to ensure that their content is recrawled. But not every site is set up to create or update those files. It becomes a chore to have to generate large link lists every 1-2 weeks. So it’s much more efficient to use your pages to influence your large content sites’ crawl rates through internal linkage.

You can use those links to tell the search engines which pages are most important. Those are the pages you want your visitors to find if they use your site search. If your site search is powered by one of the major search engines then you want that search engine to be crawling and indexing your most important pages often enough that they appear in your site searches.

Now, some people have the means to automate XML sitemap and RSS feed updates and pings. If you can do that, you should, but keep in mind that once people find your pages through searches or links on other sites they still need to reach your most important pages quickly. Liberal internal linking is user-friendly and good for accessibility, usability, and searchability.

Building crawl efficiency into your site design is not only good search engine optimization, it’s good for your visitors. It also helps to make your pages useful resources that are more likely to attract those much-desired natural links. And you can help launch new sites by sprinkling links to them across your pages (you don’t need sitewide linkage if your pages are crawled frequently).

XML sitemaps can help you kickstart a Web site that has crawling issues, but your most viable long-term strategy is to design a crawling pattern that keeps the spiders (and your visitors) flowing smoothly through your site toward your most important pages.

About the Author

Michael Martinez is the Director of Search Strategies for Visible Technologies, Inc. A former moderator at SEO forums such as JimWorld and Spider-food, Michael has been active in search engine optimization since 1998 and Web site design and promotion since 1996. Michael was a regular contributor to Suite101 (1998-2003) and SEOmoz (2006).
