Large website design and optimization theory

Posted by Michael Martinez on April 9, 2008 in Content Theory, Link Theory

We have an in-house working group that focuses on Large Web Site Design Theory. Although I cannot share many details about what the group does, I want to discuss Large Website Design Theory (did you notice I spelled “website” as “web site” and vice versa?). Every growing blog has the potential to become a large Web site. However, when we think of large Web sites we usually picture forums, article/message archives, and ecommerce sites.

Large Web sites behave differently from small to medium-size Web sites. The search engines also treat them differently (but don’t get out your little black SEO book and scribble down, “Martinez admits large sites have advantages” — see SEO Myth: Search Engines Favor Large Domains).

Large Web sites, in fact, have severe disadvantages with search engines that counterbalance their obvious and not-so-obvious advantages. And that means that large Web sites present us with unique challenges in search optimization. You can take all your duplicate content fears and pack them up for another day because duplicate content is not nearly as large a challenge for large Web sites as the SEO industry has made it out to be.

Rather, crawl management is the greatest challenge large Web sites face. Both the Web site operators and the search engines have to manage crawl for large Web sites. The operator has to ensure that the entire site is crawled with a reasonable chance of being indexed. The search engine has to ensure that it doesn’t crash the server hosting the large site.

Today’s SEO mythology teaches us that it’s best to design a site so that every page is ideally no more than three clicks from the home page. At least one search engine has even put that recommendation into its Webmaster guidelines. For small to medium-size sites the three-clicks-from-home rule is acceptable if inefficient. For a large site it’s a suicidal rule, an exercise in failure.

Let’s look at a 100,000 page Web site. Is it possible to arrange all the pages so they are no more than three clicks from the root URL? Sure. You could place 50 outbound links (to deeper content) on the root URL, and each of those 50 pages could link to 100 other deep content pages, and each of those 5,000 pages could link to 20 other pages. Voila! Every page on your site is three clicks from home. You can play with the numbers and reduce the outbound linkage on the root and intermediate URLs if you wish.
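The fan-out arithmetic is easy to check with a few lines of Python. This is only a sketch: it assumes a strict tree in which no two pages link to the same deeper page, and the function name is my own invention for illustration.

```python
def pages_within_clicks(fanouts):
    """Return the cumulative page count at each click depth of a strict tree.

    fanouts[i] is the number of new, deeper pages that every page at
    depth i links to (assumption: no two pages share a link target).
    """
    level_size = 1   # the root URL itself
    total = 1
    totals = []
    for fanout in fanouts:
        level_size *= fanout   # pages first reached at this click depth
        total += level_size
        totals.append(total)
    return totals


# 50 links on the root, 100 on each second-tier page, 20 on each third-tier page
print(pages_within_clicks([50, 100, 20]))  # [51, 5051, 105051]
```

The last tier alone holds the 100,000 deep pages; the root and the two intermediate tiers add another 5,051 URLs on top of that.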

But ask yourself how long it will take a search engine to crawl and index those pages if it has to reach them all through the root URL. Lag times kill you in every model unless you hammer the server. We can assume, for example, that it takes a search engine no more than 1 minute to grab a page, parse it for links, and queue them for fetching in a classic deep crawl. In this scenario, the search engine grabs our root URL and comes back a minute later to grab another page. Assuming it fetches a page every 2 seconds, it should take almost 56 hours (100,000 pages × 2 seconds = 200,000 seconds) just to grab all the pages in our structure.

That’s 56 hours of 2-second intervals from one source. Today’s dedicated Web servers should easily be able to handle that kind of traffic. It’s not unusual for low-end servers to max out at around 200 concurrent users, but a typical low-end server may only be configured to handle 100 concurrent connections. Concurrent connections don’t translate into every user hitting the server at the same time but that’s a pretty heavy load for your basic rack-mounted PC.

So with a search engine asking for content every 2 seconds your 99 other users may see their response times degrade a little. However, search engines don’t usually deep crawl from just one location. After all, they have multiple data centers so they can send their 2-second interval requests from, say, 20 locations. That means you’re sacrificing 20 of your 100 concurrent connections for deep crawl fetches coming in every 2 seconds from each location. You can do the math and conclude, “I’ll only have to suffer for 2.8 hours” but life should be so simple. You’ll be lucky if your server keeps running that long.
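If you want to play with the numbers yourself, here is a quick sketch of the arithmetic. It assumes perfectly even fetch intervals, no queueing delays, and a workload that splits evenly across crawl locations, none of which holds in real life. Note how sensitive the total is to the interval: 100,000 fetches take about 28 hours at one per second and about 56 hours at one every 2 seconds. The function name and parameters are hypothetical.

```python
def crawl_hours(pages, seconds_per_fetch, locations=1):
    """Hours needed to fetch every page once.

    Assumes each location fetches one page every `seconds_per_fetch`
    seconds and that the pages divide evenly among the locations.
    """
    total_seconds = pages * seconds_per_fetch / locations
    return total_seconds / 3600


one_source = crawl_hours(100_000, 2)
twenty_sources = crawl_hours(100_000, 2, locations=20)
print(f"{one_source:.1f} hours from one location")      # 55.6 hours
print(f"{twenty_sources:.1f} hours from 20 locations")  # 2.8 hours
```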

In other words, it’s not practical for a small server to be deep-crawled — especially not during peak periods of use. That’s why search engines rarely deep crawl the average Web site. Furthermore, in real search engineering the crawl queues are being loaded up with random URLs constantly. So when SearchEngineBot X.0/1 comes fetching your root URL, you need to assume that no matter how quickly it parses your page (and it could take days or weeks for that to actually happen, depending on what the search engine’s priorities are) the URLs will be dumped into random queues behind other URLs.

On average it may take 1-12 hours for a search engine to crawl 10-20 pages from a Web site. If you’re getting 50 fetches a day you’re doing well. Now you just need to sustain that fetch activity for the next 2,000 days and all your 100,000 pages will be crawled. Of course, because of internal linkage (you’re following all the good SEO advice about having every page link back home, right?) your crawling will settle into a cyclical pattern: fetch home page, follow some links, fetch home page, follow some more links, fetch home page, follow some more links.

The search engine will gradually start to recrawl deeper pages that also have a lot of links pointing to them from yet deeper pages. A smart search engine might separate the normal on-site navigation from other internal links and give priority to the new links, but that kind of scheduling optimization is a discussion for another day. Let’s just assume that every page that is fetched is parsed evenly and all the links are thrown into the first available queue slots.

That means your most heavily internally linked pages will be recrawled often.

That means it will take more than 2,000 days at 50 fetches per day for a search engine to cover a 100,000 page Web site.
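To see how quickly recrawling stretches the timeline, here is a rough model. The `recrawl_share` parameter is purely my assumption for illustration: it stands for the fraction of each day's fetches that go to pages the engine has already fetched. I have no measured value for it, so treat the second figure as an example and not as data.

```python
import math


def days_to_cover(pages, fetches_per_day, recrawl_share=0.0):
    """Days until every page has been fetched at least once.

    recrawl_share is the assumed fraction of daily fetches spent on
    pages that were already fetched (home page, heavily linked pages).
    """
    if not 0.0 <= recrawl_share < 1.0:
        raise ValueError("recrawl_share must be at least 0 and below 1")
    new_pages_per_day = fetches_per_day * (1.0 - recrawl_share)
    return math.ceil(pages / new_pages_per_day)


print(days_to_cover(100_000, 50))       # 2000 days with no recrawling at all
print(days_to_cover(100_000, 50, 0.3))  # 2858 days if 30% of fetches are recrawls
```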

Now, we know that large sites can be crawled better than that, but how does it happen? One explanation is that a good external link will help search engines find deeper content. The pages that external links point to thus act like doorways into your Web site (and this realization many years ago led me and others to propose the melding of doorway design with content design to create content-rich doorway pages). We can evaluate these entry pages in a different way and call them entry points. If you were to chart your 100,000 page site on a graph so that it looked like a big blotch of points, your entry points would be colored dots on the edge of the blotch.

Now imagine that color seeping inward toward the center of the blotch. If you could copy the graph onto a series of slides and have each slide extend the color deeper into the mass of dots (by coloring each dot that borders a colored dot in the previous slide), you would gradually see your blotch of uncolored dots change color. Now you should see the value of having external links pointing to deep content.
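In place of slides, the seeping color can be sketched as a breadth-first walk that starts from every entry point at once. This is a toy model under my own assumptions: links are followed in one direction only (the way a crawler follows them), each slide colors everything one click away from the previous slide, and the tiny `site` graph is made up for illustration.

```python
def coloring_slides(links, entry_points):
    """Return a list of slides; slide i lists the pages first colored at step i.

    links maps each page to the pages it links to.
    """
    colored = set(entry_points)
    frontier = sorted(colored)
    slides = [frontier]
    while frontier:
        newly_colored = []
        for page in frontier:
            for target in links.get(page, ()):
                if target not in colored:
                    colored.add(target)
                    newly_colored.append(target)
        if newly_colored:
            slides.append(sorted(newly_colored))
        frontier = newly_colored
    return slides


# Hypothetical seven-page site
site = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": ["e"],
    "d": ["e"],
    "e": ["f"],
    "f": [],
}

print(len(coloring_slides(site, ["home"])))       # 5 slides from the root alone
print(len(coloring_slides(site, ["home", "e"])))  # 3 slides with one deep entry point
```

Adding a single deep entry point cuts the number of slides from five to three in this toy graph, which is the whole argument for external links to deep content in miniature.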

And, yes, it would help if I had time to create sample slides to show you but I’ve got people coming into the office and talking about client stuff and other annoying things, so I have to ask you to use your imagination.

Now, how many entry points (aka access points) should a large Web site have? As many as possible, of course. But is there a way to measure how efficient your entry point network is? Actually, there are several ways to do this.

For example, you can look at the unique-internal-linking-pages-to-pages ratio. This is a simple, straightforward metric. You count up the number of pages on your site that include unique links to deeper content on your site. That means your on-site navigation links only count once (on many large sites on-site navigation is broken up into tiers so you have to form a very precise definition of what constitutes “on-site navigation” for each site).

Let’s say that 5,000 of our 100,000 pages contain unique links to internal content. We divide the 5,000 by 100,000 to determine our ratio (1:20), which can be written as an efficiency or percentage (.05). Is that good? Is that bad? Well, let’s just say that the higher the efficiency (that is, the closer it gets to 1.0) the better. Most Web sites will never even get up to .50.
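Here is one way the ratio might be computed from crawl data. It is a sketch, not a definition: I simplify by ignoring the shared navigation links altogether, so a page only counts if it links somewhere the navigation does not, and the function name, the `navigation` set and the toy `site` are all hypothetical.

```python
from fractions import Fraction


def unique_linking_ratio(page_links, navigation_links):
    """Share of pages carrying at least one internal link outside the navigation.

    page_links maps each page URL to the internal URLs it links to.
    navigation_links is the set of URLs the site-wide navigation points to.
    """
    if not page_links:
        return 0.0
    linking_pages = sum(
        1 for targets in page_links.values() if set(targets) - navigation_links
    )
    return linking_pages / len(page_links)


navigation = {"/", "/about"}
site = {
    "/": {"/about", "/a"},
    "/about": {"/"},
    "/a": {"/", "/about", "/b"},
    "/b": {"/", "/about"},
}
print(unique_linking_ratio(site, navigation))  # 0.5 (two of four pages qualify)

# The figures from the example above: 5,000 linking pages out of 100,000
ratio = Fraction(5_000, 100_000)
print(ratio, float(ratio))  # 1/20 0.05
```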

You can look at the number of access points (non-root URLs with external links pointing to them) and divide those by the number of pages on the site. Hence, if you have 50 access points you have an access efficiency of 50 divided by 100,000 (0.0005). Again, the closer to 1.0 your efficiency gets, the better. Most sites would be lucky to hit .1, in my opinion, but I’m just guessing. It’s not easy to calculate these kinds of efficiencies because you have to determine a framework for them.
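The same calculation as a sketch in Python. The list of external link targets is whatever link data you decide to trust, the URL patterns below are invented for illustration, and the function name is my own.

```python
def access_efficiency(external_link_targets, site_pages, root_url="/"):
    """Fraction of a site's pages that are access points.

    An access point is a non-root page of the site that at least one
    external link points to. Duplicate links to a page count once.
    """
    access_points = {
        url
        for url in external_link_targets
        if url in site_pages and url != root_url
    }
    return len(access_points) / len(site_pages)


# Hypothetical 100,000-page site: the root plus 99,999 deep pages
site_pages = {"/"} | {f"/page/{i}" for i in range(99_999)}

# External links hit the root, 50 distinct deep pages, and one of them twice
external_targets = ["/"] + [f"/page/{i}" for i in range(50)] + ["/page/7"]

print(access_efficiency(external_targets, site_pages))  # 0.0005
```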

The framework of an access efficiency is bounded by the limits you place on the linking profile. For example, you can calculate an access efficiency for a site in a “Google framework” (that is, according to all the links Google gives crawling credit to) or a “Yahoo! framework” or an “Ask framework”. Or, you can calculate an access efficiency for a site in a “Web framework” (without regard for which search engines give credit to the links). And you can apply other limits, such as calculating social media access efficiencies, forum post access efficiencies, blog post access efficiencies (note: I am distinguishing between links embedded in posts and links embedded in signatures).

Each access point on a large site can also have a reach efficiency. For example, you could determine that a reach efficiency exists for all pages that can be reached from an access point within three clicks. Hence, if a randomly selected access point only provides reach to 50 other pages, we can say that it has “a reach of 50 pages with an efficiency of 0.0005”.
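Reach can be measured with a depth-limited breadth-first search from the access point. The sketch below assumes you already have the site's internal link graph in a dictionary. The function names and the five-page `chain` example are hypothetical, and the three-click limit is just the cutoff used above.

```python
from collections import deque


def reach(links, access_point, max_clicks=3):
    """Count the pages reachable from access_point within max_clicks.

    links maps each page to the pages it links to. The access point
    itself is not counted.
    """
    seen = {access_point}
    queue = deque([(access_point, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_clicks:
            continue  # do not follow links beyond the click limit
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return len(seen) - 1


def reach_efficiency(links, access_point, total_pages, max_clicks=3):
    """Reach divided by the number of pages on the site."""
    return reach(links, access_point, max_clicks) / total_pages


# Hypothetical five-page site laid out as a simple chain
chain = {"p0": ["p1"], "p1": ["p2"], "p2": ["p3"], "p3": ["p4"], "p4": []}
print(reach(chain, "p0"))                # 3 (p4 is four clicks away)
print(reach_efficiency(chain, "p0", 5))  # 0.6
```

With a reach of 50 pages on a 100,000 page site the same division gives 50 / 100,000 = 0.0005, the figure quoted above.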

Large reach efficiencies are good to have and hard to achieve, although they do appear for most large Web sites. Somehow, some way, someone eventually puts some copy on a large web site that links out to lots of other copy on the site and people link to the linking copy. Although SEO Theory is not a large Web site, you can see how several other sites linked to my “Best SEO Theory Blog Posts of 2007” post.

That blog post became an access point with a reach efficiency (that I don’t have time to calculate for this post, but it should be larger than we would reasonably expect from a site with thousands of pages of content).

I’ll have to leave the discussion at this point, but you should be able to take away some lessons for small Web site design from what I’ve presented here. For example, if you calculate access points and reach efficiencies for a small site that has only 100-200 pages and find they are low you probably have a serious linking problem. Even small sites should have external links pointing to deep content, especially blogs.

But here is one last thought: Todd Friesen asked me to calculate how long it would take to get a 500,000,000 page site indexed (that is not the real number but it’s relatively close). My best estimate, knowing nothing about the site architecture, ran into years. Todd’s own estimate was similarly long. We then did a little research and found that the largest Web site we could identify was more than 10 years old and had only about 100,000,000 pages indexed (these are guesstimates) in major search engines.

If you think that these ratios I’ve described don’t have real world applications, you’re not optimizing large Web sites. Large Web site optimization theory is all about the numbers.

7 Comments on Large website design and optimization theory

By brill on April 9, 2008 at 11:33 am

I just want to clarify “deep content”… content not linked to from the root page?

By Michael Martinez on April 9, 2008 at 11:43 am

brill, “deep content” is — technically — any content OTHER than the root URL, but I think most people would probably qualify that in some fashion. For your purposes you can arbitrarily define deep content to begin at the nth click away from the root URL.

By tinpig on April 9, 2008 at 4:54 pm

what role do site maps play in this equation? that is, in the absence of in-bound external links, does submitting a site map impact how quickly a site is indexed? if i have a page that’s buried way down on my site with only one obscure internal link-path to get there but my site map lists it at highest priority, will it get indexed faster than page closer to the top?

By harrypxx on April 10, 2008 at 5:56 am

Interesting piece. I’m a big admirer of your blog. I have some responsibility for, and access to the stats for a significant number of medium to large websites - from 50k to 500k pages each - and I thought I’d share some experience with you and your audience. Crawl rates vary, but I have seen maximum day rates of >90k pages for a single site (verifying cross-referenced server logs and G Webmaster tools). Clearly this wouldn’t be achievable without decent bandwidth and servers, but as a publisher of genuinely large websites we have those in place.

My rhetorical question would be: if you have a genuinely large website, why don’t you have the bandwidth and servers to match? I think I know the answer. Many “large” websites do not actually consist of large amounts of content. If someone puts together a 100k page website on the basis of database permutation, scraped content and boilerplate text, it’s no surprise to me that they find it difficult to get the pages indexed.

My experience of Google’s crawl method for new or newly relaunched sites with large numbers of pages is that it will come in and hoover up a sample. If it likes what it finds (ie it doesn’t detect auto-creation of multiple pages but finds unique content), it will increase its crawl rate until either it does find duplicate (or near-duplicate) content, or it experiences speed issues with your servers.

Some of the sites that I mention have very good external linking to many different pages. Others, in very specialist areas, have good links to the home page but very little to other pages. I haven’t been able to establish significant differences in crawl rates between them. All of them have the vast majority of their pages indexed in the major search engines.

I’d suggest that anyone having trouble indexing their large sites look first to the content - do you have 100k genuinely unique, valuable pages? Or is there a chance that the search engines could be detecting some duplicate or near-duplicate content?

Just to be clear, I’m not disagreeing with anything in the post, just sharing a little of my own experience…

By Michael Martinez on April 10, 2008 at 7:45 am

tinpig: “what role do site maps play in this equation? that is, in the absence of in-bound external links, does submitting a site map impact how quickly a site is indexed?”

Michael: In my opinion, XML sitemaps work like external links as far as the discovery process goes but I don’t believe the search engines assign any value to them. That is, a link from an XML sitemap doesn’t convey any PageRank (but perhaps they are working on ways to estimate potential PageRank for such links; I don’t know).

The saturation edge for a link from an XML sitemap is probably much smaller than the saturation edge for a link that passes a lot of PageRank. By “saturation edge” I mean the depth of crawling that may be triggered by the link. That is a purely speculative concept on my part.

That said, I don’t believe a sitemap link is sufficient to ensure indexing will occur. The more internal links you point toward your pages, the more likely your pages will appear in a search index.

harrypxx: “My rhetorical question would be: if you have a genuinely large website, why don’t you have the bandwidth and servers to match? I think I know the answer. Many ‘large’ websites do not actually consist of large amounts of content. If someone puts together a 100k page website on the basis of database permutation, scraped content and boilerplate text, it’s no surprise to me that they find it difficult to get the pages indexed.”

Michael: In my experience you have two kinds of large Web sites: those that are planned and those that grow into large sites. Even among ecommerce sites there are many which grew from relatively small content sites into large content sites.

I agree that Google and other search engines will increase crawl rates for large sites they identify (provided they receive good response times from stepped up crawling). But if you’re launching a 20 million page site you probably have people asking you on a daily basis when every page will be indexed. 500K fetches per day is very respectable, but the very site structures we depend upon to help user navigation and crawling work against our crawl saturation.

Out of those 500,000 pages, how many have already been fetched? Improving the crawl rate is not sufficient. We want to improve crawl overall, and that involves everything from entry/access points to crawl reach to crawl frequency and more.

By Ethan on May 22, 2008 at 3:16 pm

Thank you for sharing insightful information in each of your blog posts.

I think I have found a slight error that you can quickly correct to further improve the quality of this wonderful post. In the following passage, I believe the ratio of “1:5” should read “1:20”:
“Let’s say that 5,000 of our 100,000 pages contain unique links to internal content. We divide the 5,000 by 100,000 to determine our ratio (1:5), …”

By Michael Martinez on May 23, 2008 at 7:46 am

Ethan, thank you for pointing out the error. I have corrected it.

About the Author

Michael Martinez is the Director of Search Strategies for Visible Technologies, Inc. A former moderator at SEO forums such as JimWorld and Spider-food, Michael has been active in search engine optimization since 1998 and Web site design and promotion since 1996. Michael was a regular contributor to Suite101 (1998-2003) and SEOmoz (2006).