While not as divisive as topics like Subdomains versus Sub-folders, questions about 404 error code management usually receive multiple opinions and rationalizations on the best strategies. Because search engine optimization has no standards, there are few wrong answers and many good answers for questions on what to do with dead URLs.
I often ask people why these non-existent URLs are being requested. Most of the reasons are the usual, plausible, natural causes:
- Old content was removed
- Someone hacked the site and created a lot of spam
- The site was repurposed
- The permalink structure was changed
You should understand why someone is trying to fetch missing content. The reason matters, and it should shape how you respond.
What you don’t need to worry about is crawl budget. Unfortunately, most of the questions I see about how to handle 404 situations are predicated on the assumption that these dead URLs hurt a site’s crawl budget.
They really don’t. Most Websites have too little content to make crawl budget an issue.
Why You Don’t Need to Worry about Crawl Budget
Unless you’re optimizing a site with 100,000 or 1 million URLs, crawl budget isn’t anything you need to worry about.
The search engine computes crawl budget based on how much content it wants to crawl from a site in a specific time-frame, and on how responsive the server is.
A typical busy Website has peaks and valleys in its response times. Your (client’s) site might be slower in the morning (during your/their normal business hours) and faster 12 hours later. Or the site might respond fastest early in the day and slow down at night.
It depends on how many people and things are requesting content from the site at any given time. A site with relatively little traffic or a whole lot of processing power will generally be very fast. And the faster your site responds to search engine requests, the more crawl budget the search engine is likely to assign to your site.
But since most Websites have far fewer than 100,000 published, crawlable URLs, the search engines tend to fetch the same stuff over and over again. So you’re not using up your crawl budget as much as you fear you are.
In fact, many of the SEO bloggers and conference presenters who talk about crawl budget don’t know what they are talking about, especially if they tell you that you need to manage it, because you can’t.
You manage the site’s crawl; the search engine manages the site’s crawl budget.
Crawl is not crawl budget; crawl budget is not crawl. Crawl consists of all the fetch requests the server receives. Crawl budget is how many fetch requests a search engine assigns to a site during a given time-frame.
How Important are 404 Errors to Search Engines?
Many people correctly say the search engines usually don’t care about 404 status codes. You won’t be penalized for having non-existent URLs. Search engines follow links that may be broken, or they find links to old content that has been moved or removed.
But while a search engine won’t hold a 404 status against you – you can’t earn a penalty for having one – it could be missing important content. Maybe the most important link pointing to a page is mistyped, which means the search engine’s discovery process for that page is less efficient.
Worse, PageRank-like value doesn’t flow through dead links to the rest of the site. If you’re looking at SEO metrics that calculate site-wide value, they’re not telling you anything about what the search engines calculate.
So it’s fair to say that search engines care enough about 404 status codes to have their own rules for them. Those rules include refetching non-existent URLs on the chance that they’ll start serving content again in the future. And the rules include not passing PageRank-like value through dead URLs to some other destination (except maybe through a process Matt Cutts once described as evaporation).
What You Can Learn from 404 Errors
First, you should not be relying solely on SEO crawl reports, Google Search Console, or Bing Toolbox to tell you about your site’s dead URLs. There may be many missing URLs that those tools don’t know about.
Your server log files are the only 100% reliable source of information about which missing URLs are being requested and who is requesting them.
Instead of only crawling a site as part of your audit, you should also read the raw server log files. A crawl misses a lot of important information.
When you see a 404 status code in a server log file, the entry tells you everything you need to know about what is happening. But you may not yet have enough experience to understand what you’re seeing.
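As a rough illustration of the kind of log review described above, here is a minimal Python sketch that tallies 404 responses by URL and by user-agent. It assumes the common Apache/Nginx “combined” log format; adjust the pattern to whatever format your server actually writes.

```python
import re
from collections import Counter

# Matches the common Apache/Nginx "combined" log format (an assumption;
# adapt this pattern to your server's actual log format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def tally_404s(log_lines):
    """Count 404 responses by requested URL and by user-agent."""
    by_url, by_agent = Counter(), Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "404":
            by_url[match.group("url")] += 1
            by_agent[match.group("agent")] += 1
    return by_url, by_agent
```

Sorting the results with `by_url.most_common()` quickly surfaces the repetitive dead URLs worth investigating first.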
Many people mistake URLs they don’t recognize for evidence that a site has been hacked. But hackers tend to create obvious URLs when they inject content into a site, and typos are also usually pretty obvious.
But there are thousands of URLs you’re not accustomed to dealing with that hacker bots probe for. These URLs are script files, login addresses, and other pages that are found in plugins, themes, and content management systems you’ve never heard of.
The bots are looking for potential vulnerabilities in your site. These probes don’t directly affect your site’s crawl budget with any search engine, but they can affect the site’s performance – slowing it down.
Brute force dictionary attacks also affect site performance.
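To spot that kind of probing in your own logs, a small sketch like the following can flag clients that repeatedly request login or script URLs. The probe paths and threshold here are illustrative assumptions, not a definitive list; extend them with what you actually see in your logs.

```python
from collections import Counter

# Paths that hacker bots commonly probe for (illustrative, not exhaustive).
PROBE_PATHS = ("/wp-login.php", "/xmlrpc.php", "/administrator/", "/.env")

def flag_probing_clients(requests, threshold=10):
    """Given (client_ip, url) pairs, return the IPs that requested
    probe-style paths at least `threshold` times."""
    hits = Counter(
        ip for ip, url in requests
        if url.startswith(PROBE_PATHS)  # str.startswith accepts a tuple
    )
    return {ip for ip, count in hits.items() if count >= threshold}
```

A client that trips this kind of check is a candidate for rate-limiting or blocking at the server level, which helps protect the site’s performance.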
Some 404 errors are generated by recently updated plugin, theme, or CMS code. Although developers usually test their code before pushing it out to the market, small things can slip their notice. A missing component doesn’t always break a Website.
Unresolved 404 Status Codes Clutter Your Server Log Files
When you spend enough time reviewing server log files – and if you’re auditing sites you should be spending a LOT of time reviewing log files – you’ll become annoyed by repetitive error codes. You should fix them to declutter future reports.
You can’t make the fetch requests vanish. They’ll still happen. But doing something about the missing URLs changes the status codes the server responds with.
You can either publish new content at the missing location or redirect the request to another location. Either way, you’re changing a status code to something other than a 404.
You can also detect the user-agent and respond with a different code, or respond with a different code based on the URL’s structure. Some people convert their 404 status codes to 410 (Gone) codes.
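As a sketch of those options, a simple dispatch function might decide the response like this. The URL patterns, redirect destinations, and bot name below are all hypothetical placeholders; the point is only the decision structure.

```python
import re

# Hypothetical rules -- adjust to your own site's URL structure.
REDIRECT_MAP = {"/old-about": "/about"}      # typo or moved URLs -> 301
GONE_PATTERN = re.compile(r"^/blog/2009/")   # permanently removed section -> 410

def choose_response(url, user_agent=""):
    """Return (status, location) for a request that would otherwise 404."""
    if url in REDIRECT_MAP:
        return 301, REDIRECT_MAP[url]        # send the visitor to the right page
    if GONE_PATTERN.match(url):
        return 410, None                     # tell crawlers the content is gone for good
    if "examplebot" in user_agent.lower():
        return 410, None                     # user-agent-based response (illustrative)
    return 404, None                         # genuinely unknown URL
```

In practice you would implement these rules in your server configuration or application routing rather than in a standalone function, but the branching logic is the same.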
When you habitually resolve 404 status codes, every new 404 status code you see tells you something different and useful.
Use Redirection When You Don’t Want to Create More Content
A simple 301 or 302 redirect uses about as much bandwidth as serving a 404 status code. You’re not saving any server resources by redirecting dead URLs.
For URLs that hackers either link to (driving unwanted crawl to a site) or probe for (seeking vulnerabilities to exploit), I usually implement 301 redirects to www.example.com. There is no Website at that domain. If a spammer is linking to your site, you don’t have to worry about the site being harmed by the link if you redirect the hacked URL away from your site.
Again, this only replaces the 404 status in your server log files with a 301 status – but it makes it easier for you to find other problems in future server log reports.
For URLs that are obvious typos, I redirect to the correct existing content. On some sites where users mis-type things in the site search, I create a page for mis-spellings that shows them how to reach the correct destination. You can use a “noindex” on these kinds of pages if you don’t want Bing and Google to show them in their search results.
Create Content To Take Advantage of Inbound Links
Whether people mis-type their links or link to content that no longer exists, if you see random visitors coming through those links then you really should do something about that traffic.
Sometimes it’s good enough to create a default 404 document with a few general-purpose links.
Sometimes it’s better (in my opinion) to create content at that URL, even if it’s only to say, “Sorry, what you’re looking for is no longer here.”
The idea is to put something in front of the visitor, to give them a reason to click deeper into your (client’s) site.
You should never worry about whether search engines care about those old inbound links. It doesn’t matter. If you see real people clicking on those links then you should give them something for their effort.
It’s true that you can just ignore 404 errors if you have no purpose for the dead URLs. Although people with less experience may feel alarmed when they see a long list of 404 errors in Google Search Console, if the content was deleted because it’s no longer relevant to the site’s purpose then the errors really won’t hurt the site.
Over time the search engines tend to crawl dead URLs less frequently. They learn what to expect. They adjust their crawl queues to favor content they know is there and hasn’t gone missing.
But if the crawlers are finding links on other sites that point to dead URLs, then you’ll either have to live with the clutter in the 404 reports or you’ll need to think of a way to leverage those links.
The bottom line here is simple: when it comes to what you should do about 404 errors, it depends, and that’s not something anyone should agonize about. With a little practice you can develop your own strategies for how to handle 404 codes.
Better yet, if you change your mind in the future about the best way to handle these things, you can always go back and change the solution you implemented. Nothing is carved in stone.