Crawl management is integral to search engine optimization. And yet crawl management is so poorly explained by the Web marketing community that many sites mismanage crawl. What should be one of the fundamental areas of search engine optimization has been clouded by nonsense and false expectations. It’s time to reset the conversation and correct these significant errors in the SEO lexicon.

Basic Crawl Management for SEO: Web marketing has overcomplicated the conversation. How do we scale it back?
There is no such thing as “crawl budget”, at least not as far as the search engines are concerned. You can lump several different concepts together under a collective label of “crawl budget” but they remain separate concepts. “Crawl budget” is one of those expressions that got out into the wild and became a favorite of SEO bloggers and conference presenters who have absolutely no idea of what they are talking about. It’s a phrase that search engineers comprehend better than those who made it a household trademark for Web marketing. You cannot manage crawl budget. If crawl budget existed it would be managed by the search engine. You have a better chance of proving that Bigfoot exists than you do of proving that you can manage crawl budget.
What you can do, as an SEO or a Website operator, is manage crawl. Crawl management from the Website is very different from crawl management from the search engine. You cannot force a search engine to fetch a specific URL, much less schedule it. You can only make a suggestion. On the other hand, you can request that the search engine not crawl a URL. Oddly enough, even Google ignores “robots.txt” on occasion. You can also prevent a search engine from crawling your site, or specific URLs. That’s crawl management but you’re not managing “crawl budget”.
What the Search Engines Tell Us about Crawl
This is an appropriate place for me to mention that Shari Thurow occasionally reminds people that search engineers used to speak about “crawl cap”. It’s not clear to me if they still do. The most recent use of the phrase that I can find in academic literature was published in 2013, but I did not conduct an exhaustive search. That paper defined “crawl cap” as the maximum allowed download limit, but the paper was written by and for Web archivists, not search engineers working at one of the major indexers.
In the January 2017 blog post I linked to above, Googler Gary Illyes introduced readers to a couple of expressions that Google uses internally:
Crawl rate limit is “the maximum fetching rate for a given site”.
Crawl demand is driven by “indexing” (which Gary uses to obliquely describe the systems or processes responsible for creating the indexes used by Google’s search algorithms).
You don’t set the rate limit for your site. Google calculates this to estimate when its crawling may create a negative experience for other visitors to that site. How would they do that? The simplest way would be to track the response times to Googlebot requests. If they maintain a history of crawling activity for the site they could, hypothetically, calculate an average response time, a standard deviation, and other metrics that could be used to adjust their crawling rate. Or they could do other stuff. All you and I can do is speculate (well, we could ask them what metrics they compute and use but they sometimes don’t want to tell us these things).
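To make that speculation concrete, here is a minimal sketch of how a crawler could, hypothetically, turn a history of response times into a fetch delay. Nothing here reflects what Google actually does. The function name, the thresholds, and the back-off rule are all assumptions made up for illustration.

```python
from statistics import mean, pstdev

def suggest_delay(response_times, current_delay,
                  slow_factor=2.0, min_delay=0.5, max_delay=60.0):
    """Hypothetical rule: back off when the latest response is unusually slow.

    response_times: seconds per fetch, oldest first (assumed to be tracked by the crawler).
    current_delay:  seconds the crawler currently waits between fetches.
    """
    if len(response_times) < 3:
        return current_delay               # not enough history to judge
    history, latest = response_times[:-1], response_times[-1]
    average = mean(history)
    spread = pstdev(history)               # population standard deviation
    if latest > average + slow_factor * spread:
        new_delay = current_delay * 2      # server looks strained: slow down
    else:
        new_delay = current_delay * 0.9    # server looks healthy: speed up a little
    return max(min_delay, min(max_delay, new_delay))

# A steady history followed by one slow response doubles the delay.
print(suggest_delay([0.2, 0.2, 0.2, 0.2, 1.5], current_delay=2.0))
```

The only point of the sketch is that the inputs to such a calculation live on the search engine's side. You can make your server respond faster, but you do not get to pick the numbers.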
Nor do you set the demand for your site. You can request some fetches through Google Search Console but the demand requirements from the indexing systems are beyond your control or direct influence.
Hence, ergo, therefore — if you cannot manage or set the crawl rate limit or crawl demand for a Website you cannot manage whatever you think “crawl budget” is to a search engineer. Q.E.D.
Other things Googlers have said about crawl include …
“PageRank drives crawl”, according to former Googler Matt Cutts at a conference long ago in a galaxy far, far away. Since Google still uses PageRank it’s probably a safe bet that their crawling systems take it into account in one or more ways.
“According to our analysis, having many low-value-add URLs can negatively affect a site’s crawling and indexing,” Gary wrote in the above-referenced blog post. I think this is an interesting way to put it. He seems to be implying that Google is not arbitrarily degrading a Website’s crawl. Rather, Websites unintentionally degrade their own crawl. He provided some examples of “low-value-add URLs”:
- Faceted navigation
- Session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Infinite spaces (like calendars)
- Proxies
- Low quality or spam content
That’s a lot of stuff to keep up with. Every SEO specialist in the world learns quickly to look for these kinds of problems. But “fixing” these issues doesn’t guarantee that the search engine’s crawl relationship with the site will improve. Whether that happens depends on various things. For example, just because you tell Google to filter out certain URL parameters in Search Console doesn’t mean it won’t find more URLs with those parameters. What happens inside the crawling and indexing algorithms with all those “found links”? The only correct answer from anyone outside of Google is you don’t know. There is no need for you to guess or to argue that your guess is probably right. You don’t know what happens to newly found URLs that contain faceted navigation components.
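What you can know is how many URL variants your own site exposes. The following is a rough sketch, assuming you already have a list of requested URLs (from server logs, for example). It groups URLs by path and reports which query parameters appear with each one. The function name and the sample URLs, including the "sessionid" parameter, are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def group_by_path(urls):
    """Map each path to a count of URLs seen and the parameter names used with it."""
    groups = defaultdict(lambda: {"urls": 0, "params": set()})
    for url in urls:
        parts = urlsplit(url)
        entry = groups[parts.path]
        entry["urls"] += 1
        entry["params"].update(
            name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
        )
    return dict(groups)

# Hypothetical sample: one product listing reachable through several facets.
sample = [
    "https://example.com/shoes?color=red",
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/shoes?sessionid=abc123",
    "https://example.com/about",
]

for path, info in group_by_path(sample).items():
    print(path, info["urls"], sorted(info["params"]))
```

A path with many URLs and many parameter names is a candidate for the kinds of "low-value-add URLs" listed above. The sketch tells you what your site is publishing, not what the search engine does with it.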
Fast Websites will be crawled more frequently because they can tolerate more frequent crawl without harming user experience. Gary said so, and hence only the most paranoid of people will reject this statement of fact.
There is no ranking factor or signal based on crawl. The speed and frequency with which a search engine crawls your site may have an indirect effect on your performance in the SERPs when you change your pages.
Generally, any URL that Googlebot crawls will count towards a site’s crawl budget. Consolidating duplicate content through canonicalization will not reduce the amount of crawl a search engine drives toward your site. In fact, I once found Google reporting crawl errors for non-existent URLs that it could only have gotten from misconfigured “rel="canonical"” declarations. If anything, your canonicalization may cause the search engine to allocate some of your “crawl budget” to fetching URLs it didn’t previously know about.
Google does not honor “crawl-delay” requests in “robots.txt” and it doesn’t guarantee it won’t crawl a URL that you have labeled as “nofollow” in one of your links.
Crawl Basics for Anyone New to SEO
These are Michael’s Laws for Crawler Behavior. I give them in no particular order.
- To a search engine, your Website has no “front” or “back” door. There is no “first” or “last” URL in a site.
- A search engine may begin crawling your site from any random URL.
- A search engine will fetch a URL as many times as it can and desires to.
- A search engine may choose not to fetch a URL no matter how badly you want it to fetch that URL.
- A search engine may fetch a URL from more than one IP address.
- Just because you see a search engine user-agent in the server log entry doesn’t mean that fetch came from a search engine.
- You cannot guarantee you will find all the URLs on a Website by crawling it.
- You cannot replicate a search engine’s knowledge of your Website.
- Search engines crawl URLs for years regardless of whether the URLs still exist or ever existed.
- Search engines can and do ignore “robots.txt” even when they say they honor it.
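The law about user-agents in server logs can be checked in practice. Google's published guidance for verifying Googlebot describes a reverse DNS lookup on the requesting IP address followed by a forward lookup on the returned hostname. Below is a minimal sketch of that check. The function name and the injectable lookup parameters are my own assumptions, added so the logic can be exercised without network access. The default suffixes cover Google only, so adjust them for other crawlers, and note that "gethostbyname_ex" resolves IPv4 addresses only.

```python
import socket

def is_verified_crawler(ip,
                        allowed_suffixes=(".googlebot.com", ".google.com"),
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Return True only if reverse and forward DNS agree for this IP address."""
    try:
        hostname = reverse_lookup(ip)[0]            # IP -> hostname
        if not hostname.lower().endswith(allowed_suffixes):
            return False                            # hostname is not the crawler's domain
        return ip in forward_lookup(hostname)[2]    # hostname -> IPs must include ours
    except OSError:                                 # covers socket.herror and socket.gaierror
        return False
```

A log entry that claims to be a search engine but fails this check did not come from that search engine, whatever its user-agent string says.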
Twice so far in this article I have said that search engines can and do ignore “robots.txt”. What does that mean? It means that I have watched Bing and Google request pages they should not be seeing. The “robots.txt” files are not malformed. I think it’s fair to say these fetches happen under special circumstances but I have never seen a complete list of those special circumstances.
What I find especially amusing is that Google somehow manages to pull down content from a page serving a 403 status code that my browser cannot or will not. Is there a flaw in the Apache server software or does Google know something we don’t know? Bing slips through the cracks even more than Google. Their unfettered crawling leads to problems in Website performance analysis because the crawlers show up as “ghost hits” to pages and scripts. That’s really annoying. Really very annoying indeed.
These are Michael’s Laws for Crawl Management. I’ll discuss important points below.
- Search engines that look for and honor “robots.txt” ALWAYS have a “robots.txt” to refer to.
- It is better to block a crawler through a server firewall (e.g. “iptables”) than by any other method.
- You cannot sculpt PageRank but you can waste it.
- The proper way to manage the flow of PageRank-like value through a site is through positive link placement.
- Tiered, hierarchical site navigation works better than flat site navigation.
- Every page on a site with more than 3 pages should have links from at least 3 other pages on the site.
- Links are for people first, search engines second.
- The search engine decides which links count, not your SEO experiments.
- The search engine decides whether a link’s anchor text helps with rankings.
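The law about every page having links from at least 3 other pages is easy to check if you have a map of your internal links. Here is a small sketch, assuming that map comes from your own publishing system (a CMS database, for example). The function name and the sample link map are hypothetical.

```python
from collections import defaultdict

def pages_short_on_inlinks(links, minimum=3):
    """links maps each page URL to the internal URLs it links to.

    Returns the pages linked from fewer than `minimum` other pages.
    """
    inbound = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            if target != source:              # a page linking to itself does not count
                inbound[target].add(source)
    pages = set(links) | set(inbound)
    return sorted(page for page in pages if len(inbound[page]) < minimum)

# Hypothetical five-page site: "/d" is linked from only one page.
site_links = {
    "/":  ["/a", "/b", "/c"],
    "/a": ["/", "/b", "/c"],
    "/b": ["/", "/a", "/c"],
    "/c": ["/", "/a", "/b", "/d"],
}

print(pages_short_on_inlinks(site_links))
```

This only counts links. Whether any of those links count for anything is still the search engine's decision.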
If you’re going to manage crawl you need to set proper expectations. If you’re trying to make the search engine’s job easier then you have the right expectations. If you are trying to improve your rankings then you have the wrong expectations. Implementing the best possible crawl management for a site guarantees you nothing in terms of indexing and ranking any part of the site in any search engine. Crawl is not the magic wand you’re looking for.
How can a search engine ALWAYS have a “robots.txt” to refer to? If a search engine doesn’t find a “robots.txt” file it assumes that such a file, if it existed, would allow the search engine to crawl every URL on the site. Hence, the search engine ALWAYS has a “robots.txt” file to refer to.
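Python's standard "urllib.robotparser" module behaves the same way, which makes it a convenient stand-in for illustrating the idea. This is a sketch of the principle, not a model of any search engine's parser. The helper name, the sample rules, and the URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(robots_lines, user_agent, url):
    """Apply a robots.txt (given as a list of lines) to one URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

missing_file = []                                      # no robots.txt content at all
blocking_file = ["User-agent: *", "Disallow: /private/"]

# With no rules, every URL is treated as allowed.
print(can_crawl(missing_file, "ExampleBot", "https://example.com/private/report.html"))
# With an explicit rule, the same URL is disallowed.
print(can_crawl(blocking_file, "ExampleBot", "https://example.com/private/report.html"))
```

An empty rule set and a missing file lead to the same outcome: everything is open to crawling unless you say otherwise.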
Why is the iptables firewall better for blocking a crawler? There are two reasons. First, if you don’t want the crawler on your site then you don’t need to be wasting resources on it. Second, some crawlers ignore “robots.txt” (either intentionally or inadvertently). The crawlers may hammer your site before they expire their cached images of your “robots.txt” file. In an emergency, when the crawler is hitting your server too hard, you should block it via iptables. Wait 1-2 days and then let the crawler back in. It will grab “robots.txt” and start assessing your site’s crawl priorities again.
This is a radical response to an overly aggressive crawler. It’s not a first-choice response but if you cannot get a crawler to behave well within 30 minutes just block it. Your search results won’t suffer just because the search engine gets timeout errors for a few hours.
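For anyone who has not blocked an address range before, the following sketch builds the iptables commands for blocking a crawler and later letting it back in. It only constructs the argument lists and does not run anything. The helper name is hypothetical, and the address range shown is a documentation placeholder, so you would substitute the range you see in your own logs.

```python
import ipaddress

def iptables_rule(source, action="block"):
    """Build the iptables arguments to drop (or stop dropping) traffic from `source`."""
    network = ipaddress.ip_network(source, strict=False)   # validates the address or range
    if network.version != 4:
        raise ValueError("iptables handles IPv4 only; use ip6tables for IPv6")
    flag = {"block": "-I", "unblock": "-D"}[action]         # insert or delete the rule
    return ["iptables", flag, "INPUT", "-s", str(network), "-j", "DROP"]

# Emergency block, then the matching command to remove it a day or two later.
print(" ".join(iptables_rule("192.0.2.0/24")))
print(" ".join(iptables_rule("192.0.2.0/24", action="unblock")))
```

Running the resulting command requires root privileges on the server, for example by passing the list to "subprocess.run". Keep a note of what you blocked so you remember to remove the rule.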
How do you waste PageRank-like value? Any SEO strategy designed to manipulate the flow of PageRank-like value through a Website operates blindly with absolutely no knowledge of how the search engines calculate these values. Whether you add more navigational links to a page or remove them doesn’t matter. You will never find the Goldilocks Zone in PageRank flow. You should not be seeking it.
Most Web marketers assume that PageRank “begins flowing from the root URL to the rest of the site”. That is complete and utter nonsense. It is a pseudoscientific assumption based on ignorance and total inattention to how links connect Websites. PageRank-like value flows to a document URL from the links pointing to it. If you have been earning or building links for your individual blog posts or product pages then all of those pages may be funneling PageRank-like value to the rest of your site.
So now look at all the models that SEO experts have shown you of PageRank-like value flowing from the root URL of a site … and laugh. You don’t know where the PageRank-like value is coming from or where it’s flowing to.
What You Need to Do to Manage Crawl on a Website
Don’t ever run an SEO crawler on a Website with the expectation that it will tell you anything useful about what the search engines find. You really need to grok the fact that search engines don’t crawl your site from “top to bottom” (because there is no “top to bottom”) and they can queue any URL of your site for crawling because of external links pointing to it. Your SEO crawls waste server resources and that creates a bad user experience. Worse, they miss a lot of things you should be looking for. Hence, SEO crawlers are wasting your time.
Here is how you should be collecting data about a Website’s crawl:
- Get the 404 errors from the server logs.
- Get the other 40x errors from the server logs.
- Get the 30x redirects from the server logs.
- Look in the database for redirects.
- Look at the .htaccess or equivalent files for redirects.
- Get the list of published URLs from the database (if there is one).
- Crawl the server-side directory tree to get the list of published URLs.
- Compare your lists of published URLs to the sitemap files.
- Get a list of sitemap files from the “robots.txt” file.
- Get a list of sitemap files submitted to search engines.
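The first three items on that list can be pulled from an access log with a few lines of code. Here is a minimal sketch, assuming the server writes the common or combined log format used by default in Apache and nginx. The function name, the bucket labels, and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict

# Matches the quoted request and the status code that follows it.
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')

def bucket_by_status(log_lines):
    """Count requested paths that returned 404, other 40x, or 30x status codes."""
    buckets = {"404": defaultdict(int), "other_40x": defaultdict(int), "30x": defaultdict(int)}
    for line in log_lines:
        match = REQUEST.search(line)
        if not match:
            continue                      # skip lines in an unexpected format
        status = match["status"]
        if status == "404":
            key = "404"
        elif status.startswith("40"):
            key = "other_40x"
        elif status.startswith("30"):
            key = "30x"
        else:
            continue
        buckets[key][match["path"]] += 1
    return {name: dict(counts) for name, counts in buckets.items()}

sample_log = [
    '203.0.113.5 - - [10/Oct/2017:13:55:36 +0000] "GET /old-page HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2017:13:55:40 +0000] "GET /moved HTTP/1.1" 301 0 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2017:13:56:01 +0000] "GET /admin HTTP/1.1" 403 199 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2017:13:56:09 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2017:13:57:12 +0000] "GET /old-page HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
]

print(bucket_by_status(sample_log))
```

If your server uses a custom log format, the regular expression will need to change to match it.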
Whole sections of a Website could be orphaned from the root URL of a Website and yet crawled frequently and completely indexed by the search engines because they found links on other sites or the URLs are listed in sitemap files.
If you’re running SEO crawlers on Websites you’re doing it wrong. You might as well be asking someone to tell you all the things they don’t know. The Website itself is the only authoritative source of information about what is being published, blocked, and/or redirected. Everything else only provides you a partial picture unless you crawl a perfect Website. And if it’s a perfect site then how do you land that cushy SEO project?
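Comparing what the site publishes against what its sitemap files declare is a simple set comparison once you have both lists. The sketch below assumes a standard XML sitemap using the sitemaps.org namespace and a list of published URLs you pulled from the database or directory tree. The function names and sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}

def compare(published, in_sitemap):
    """Report URLs that appear in one list but not the other."""
    published, in_sitemap = set(published), set(in_sitemap)
    return {
        "missing_from_sitemap": sorted(published - in_sitemap),
        "not_published": sorted(in_sitemap - published),
    }

sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/retired</loc></url>
</urlset>"""

published_urls = ["https://example.com/", "https://example.com/orphan"]

print(compare(published_urls, sitemap_urls(sample_sitemap)))
```

Both halves of the report matter. A published URL missing from the sitemap may be an orphan, and a sitemap URL that is no longer published is an invitation for the search engine to fetch an error page.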
Website Navigation Should Only Be Useful
“Useful” is very, very hard to define. You need to look at what users are actually using on a site to determine how useful stuff is. But should you look at 1 day’s worth of data, 1 week’s worth, 1 month, or a quarter’s worth of data? Some URLs are only required for short periods once a year. Some URLs are only required once. Hence, how useful a navigational link is depends on who might use it, when, and why.
Even so, you need to ferret out the most useful navigation and make sure it’s visible and easy to find. Accessibility is a great plus as well, but you may find yourself dealing with a Website design that has very visible but inaccessible navigation. That’s not really an SEO issue but it’s something you should be prepared to deal with.
Useless navigation should be removed. No one can ever rationalize useless navigation except by saying, “It’s there to improve crawl”. In fact, it degrades crawl. Worse, other algorithms MIGHT score a page poorly because it has too much useless clutter. Everything on the page contributes to an algorithmic assessment of how the page works, what it is relevant to, and how well it does its intended job (page quality).
Website Navigation Should Always Be Consistent
Shari Thurow and other usability experts hate the hamburger menu on mobile sites. I, personally, love it. Regardless of how you feel about the shape, placement, and format of any Website navigational widget, you should always use the same exact shape, placement, and format for your primary navigational widgets. Use the same exact layout and functionality on every page.
You do this for users and for search engine optimization. If you cannot easily find the primary navigation on a page how are you going to assess the search engine’s ability to do so? You need to be able to look at a page and intuitively understand how users move around it and from page to page on that site. That’s just for your own sanity.
It Is Okay to Have Redundant Navigation on a Page
I wish more sites would replicate their top-of-page menus at the bottom of their articles. It’s just easier for a user to navigate to another page when they have the option to do so at the end of whatever article they have read.
A few people, not many but some, have said to me (through the years) that search engines don’t want to see duplicate menu bars on a Web page. My eyes roll up in the back of my head when I hear or read this. I have no idea of who started teaching people that nonsense but they should have their hands stapled to their butts for a week. Try navigating through life like that and you’ll appreciate having some “lower down controls” much, much more.
The search engines don’t care if you replicate a menu at the top and bottom of the page. And don’t give me any crap about “the first link counts (the most)”. That’s pseudoscientific nonsense and it has nothing to do with creating a good user experience. Remember the law: “Links are for people first, search engines second.”
Let the search engines figure out if those bottom menu links are important to their algorithms. They will not penalize you for doing something user-friendly.
People will write a 5,000 word article full of filler text because some pop blogger says 5,000 word articles rank better but they won’t add a menu to the bottom of their pages for their visitors. Why? Because no one is promising better rankings.
As an aside: WordPress sites seldom have the option of displaying their primary navigation menu at both the top and bottom of the page. While there are clumsy workarounds that allow you to do this, it’s a basic deficiency in WordPress’ core design.
Where Possible, Use Section-specific Secondary Navigation
On a large Website this is a really efficient way to handle navigational needs. Your top-level menu remains small but each section within the site has its own sub-menu. It’s easier to achieve this in WordPress than it is to display the primary navigational menu at the bottom of the page. Install a plugin that allows you to display a secondary menu in the sidebar, filtering by category. The Advanced Sidebar Menu plugin is one example for you to consider but there are others. I don’t want to recommend one above all others.
There are several reasons for NOT using huge expanding menus as your primary navigation. First, no one is going to click on all those links. Second, it’s damned near impossible to get the styling to remain stationary while you move the mouse cursor to the right option. Those giant menus are NOT user-friendly. EVER. Third, the larger your primary navigation becomes the flatter your site structure becomes and flat site architecture degrades a search engine’s ability to identify the most important pages on a site.
Use Hidden Navigation Widgets Sparingly
I don’t want to say don’t do this. Sometimes it makes sense to hide content behind a CSS tab. And while Google has cautioned designers not to do this on the desktop, they say that it will be okay for the “Mobile First Index”. Hiding content on the page via styling that is controlled by the user actually makes sense and on some sites I really like it. If the search engine is following the links then do it. But try to find alternative ways to add navigation options to your deep content.
Include the Words You Want to Rank for On Your Page
Too many people still think they have to use link anchor text to get a page to rank for a desired search phrase. That has never been true. You don’t have to use link anchor text for anything. Link anchor text is an option, not a requirement. It’s better if you include your desired expressions on the page. The inbound link anchor text can complement your on-page content. Search engines really do want to see agreement between link anchors and on-page text.
If you feel the search phrase doesn’t belong on your page then that’s the wrong page to have ranked in the search results.
Crawl Management Begins with the Basics But …
There is more to good crawl management than I can include here. Too few people have mastered the basics. There is way too much discussion of unnecessary “advanced” optimization in the marketing community. Until people get the basic stuff right, concentrating on the more advanced techniques will lead to confusing and inconsistent results.
Web marketers need to stop talking about “crawl budget”. It’s a misleading topic and it’s a red mark against the entire Web marketing industry. Crawl management is too important to be sidelined by nonsense like “how to manage crawl budget”. Worse, the truly basic stuff is more about giving crawlers access (or denying it to them) than it is about resolving specific issues such as dead URLs, duplicate content, and such.
Crawl is integral to solving some SEO problems but those problems really stand outside of crawl. People need to stop complicating discussions of crawl management with all these other issues. Crawl is a tool, not an end. It’s time to get everyone to scale back their crawl rhetoric because the conversation is confusing everyone, Web marketing experts and clients alike.
