Architecting Web sites - Design from the SEO perspective

by Michael Martinez on July 23, 2008

Update (2008-07-25) Gary Lee took this article’s PageRank timeline and made a chart complete with appropriate and/or interesting graphics. Check it out! Thanks, Gary.

In March 2008 I wrote “You need to think in terms of sculpting Web sites, not PageRank”. Shari Thurow suggested to me that it might be better to say, “You need to think in terms of architecting Web sites, not PageRank”.

Before I discuss architecting versus sculpting, let me recap PageRank and the Supplemental Results Index to provide a little context.

A Brief History of PageRank

  • 1998 - Larry Page and Sergey Brin introduce PageRank, a Web document citation ranking methodology based on link relationships.
  • 2000 - SEOs operating link farms and automated link exchange programs discover that Google’s search results are easily manipulated by these methods.
  • 2002 - Google launches Google News, which uses the Hilltop algorithm devised by Krishna Bharat and George A. Mihăilă.
  • 2003 - Bloggers coin the expression “Google bombing” and launch the widespread practice of pointing collaborative links at otherwise irrelevant documents to manipulate query results.
  • 2003 - Google introduces the Supplemental Results Index, a repository for documents Google doesn’t quite know what to do with. Many of these documents are duplicate content pages.
  • 2003, October - Google launches the infamous “Florida” update. Many SEOs wrongly conclude that Google has suddenly adopted the Hilltop algorithm.
  • 2004, March - Google adjusts its algorithm to look at links in a new way. An apparent unintended side effect is the so-called “Google Sandbox Effect”, whereby sites with few to no trusted links cannot rank even for their own names.
  • 2004, October or November - Google begins updating its index on a weekly basis (until now it had been updating monthly).
  • 2005, February - Mike Grehan publishes an interview with Ask’s Jim Lanzone and Apostolos Gerasoulis in which Gerasoulis claims that Google isn’t using PageRank.
  • 2005, February - Google dumps millions of documents from its index, resulting in many queries that are flooded with so-called “URL-only” results, although many older sites simply vanish altogether or show 2-year-old data. The search results return to normal around May.
  • 2005, July - Google begins de-indexing and/or ignoring links from faux directories and some Web sites that share certain design features of faux directories.
  • 2005, October or November - Google de-indexes more faux directories in volume.
  • 2005, December - Google begins rolling out the Bigdaddy update. Bigdaddy introduces a new infrastructure, a dual-index crawl, and re-engineers the Supplemental Index to absorb many low-PageRank documents.
  • 2006, January - Matt Cutts writes on his blog that PageRank is being used to determine the index to which documents are assigned.
  • 2006, September - Search Engine Roundtable reports that Supplemental Index pages are not fully indexed (this is later confirmed by Google).
  • 2006, December - An internal Federal Trade Commission letter discussing paid endorsements is leaked to the news media.
  • 2007, April - Matt Cutts suggests that paid links should be disclosed, implying that they may fall under the FTC paid endorsement policy.
  • 2007, May - Google follows in the footsteps of A9 and Ask by unveiling the Searchology Update, introducing “Google 3.0”, which emphasizes Universal Search (injecting results from News, Blog, Books, and other search indexes into Main Web search results).
  • 2007, June - Matt Cutts introduces the concept of “Peanut Butter SEO” at the first SMX Advanced conference in Seattle, explaining that “each site gets only so much PageRank” and it can only be spread so far like peanut butter on bread.
  • 2007, June - Dan Thies proposes using “rel=’nofollow’” on some internal pages like shopping carts.
  • 2007, July - Google says it is no longer useful to include the “Supplemental Result” label in search results.
  • 2007, September - Dan Thies revises his internal nofollow position to favor the use of nofollow on more internal pages. The “PageRank sculpting” controversy goes into overdrive from this point on as the SEO community divides.
  • 2007, September - I debunk Matt Cutts’ argument against paid links by pointing out that the United States government says links are not endorsements.
  • 2007, October - Google declares war on paid links.
  • 2008, January - I show that Google’s Supplemental Index is still alive and still preventing legitimate content from ranking above less relevant results in the Main Web Index.
  • 2008, March - Shari Thurow argues that using “rel=’nofollow’” to sculpt PageRank is a bad idea.

The SEO Community’s track record

History shows us that the SEO community doesn’t always get its facts straight. History also teaches us that the SEO community doesn’t always understand history. Link bombing really began in earnest when Brett Tabke engineered the first link farm, but it didn’t become a widely understood or practiced concept until after people began playing with so-called “Google bombs”. Since 2004, most people in the SEO community have obsessed over link building, believing that is what SEO is all about.

Search engine optimization is about much, much more than links and a quick perusal of even the most simple and unsophisticated blogs usually turns up plenty of articles about keyword research, metrics, Web site architecture, and other topics outside of link building. Through two surveys, Rand Fishkin has shown that many people in the SEO community believe that keywords in the title tag are more important than keywords in anchor text.

In fact, when I proposed that people NOT use keywords in their title tags last year (in 20 Hard Core SEO Tips), quite a few SEOs tried to argue that I was wrong (other SEOs pointed out, in my defense, that I was only proposing people do this to develop their other optimization skills).

If you search hard enough, you’ll find plenty of controversies in the SEO industry. Opinion is divided on almost every topic, except one: the importance of links. Although I have never argued that links don’t help, many people call me a “contrarian” because I have often pointed out that you can achieve top rankings, even for competitive queries, through on-site factors (including keyword emphasis, keyword repetition, and use of keywords in internal link anchor text).

The point is not that links don’t help you move documents higher in search results; the point is that links are not the only means of accomplishing this task. Quite a few people in the SEO industry don’t seem to care about real search engine optimization, however, as they feel they can just throw money at the problem and achieve good rankings.

Paid Links, PageRank, and Supplemental Results

Google drove a lot of people to buy links when it closed down many of their free linking resources. The glory days of free-for-all link pages, link farms, spammy automated reciprocal link programs, guestbook link dropping, forum link dropping, blog comment link dropping, and faux directory link dropping are gone. The usefulness of automated blogs and mass-generated doorway pages (and doorway domains) is also greatly diminished.

There are still plenty of people in the SEO community who use some or all of these tactics. Reciprocal linking will probably never die off because there is nothing wrong with simply exchanging links, but too many people still obsess over link exchanges. Google is doing its best to make paid links less effective. I think way too many SEOs naively believe that all they have to do is browse the broker inventories and just pick the sites they like. That was a simpler time, I suppose.

You need PageRank to promote your pages out of the Supplemental Results Index. Of course, it’s no longer possible to be absolutely sure which pages are in the Supplemental Index (in fact, it never was possible because many sites had pages that appeared in both indexes). Nonetheless, we know that the Supplemental Results Index exists because Google tells us it does; we also know that Google doesn’t want us to think about the Supplemental Results Index, because Google says it’s now “mainstream”, whatever that is.

Personally, I’m still waiting for Universal Search to randomly toss Supplemental Results onto the first page of competitive queries. But, hey, that’s just me.

How architecting differs from sculpting

Good link building really begins at home, on your own pages. And this is where we finally get down to brass tacks. Let’s talk about architecting a Web site versus sculpting PageRank. There is a world of difference between the two concepts.

If you check a dictionary you may find a definition like architecting - Verb., planning, organizing, and structuring as an architect (in fact, that definition only deals with the root). A different resource says that architecting is making design decisions or commitments that affect components across the system.

There are numerous definitions for system but let it suffice to say that a system is any collection of things which may be combined together to create another thing, distinct from the component things that comprise it. Throw 50 marbles into a bag and you have a bag full of marbles. Lay the marbles out in a circle and you have a system of marbles.

The difference is subtle. You can remove one marble from the bag and the bag remains a bag filled with marbles. But if you remove one marble from the circle you break the circle.

Web sites can be bags of marbles or circles of marbles. What makes a Web site a system is the fact that its component parts (the documents and files that comprise the site) work together to create something that is not complete without any of them. You have a broken Web site when one of the internal links doesn’t do what you intend it to do.

Some Web sites are not set up that way, however. For example, let’s say you register with a Web forum somewhere but you don’t participate in the discussions. You get a profile page, you may even fill it out with information about yourself, but the page is basically orphaned in a typical forum. No one knows about you when you’re not logged in, so a search engine is not likely to crawl your profile page unless it just happens to see your login link in a footer somewhere.

Some blogs make their comments uncrawlable. Did you know that? The posts appear in search indexes but the comments do not. So you can drop 100 links on a blog and none of them will appear in search results — nor will your name (or your screen name if you use one). Hence, the published articles on a blog help form a system with other pages on the blog, but the comments do not.

So a Web site can indeed behave like a system but still have semi-orphaned content that really isn’t part of the system, although some people would argue that even uncrawlable comments must still be part of some type of a system. That is, one Web site can be viewed as more than one type of system. Let’s call these systems architectural personas.
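To make the idea of semi-orphaned content concrete, here is a minimal sketch, assuming you already have a map of which pages link to which (the URLs and the `site_links` data are hypothetical). It walks the internal links from the home page the way a simple crawler might and reports the pages that exist but can never be reached that way.

```python
from collections import deque

def reachable_pages(links, start):
    """Return every page a link-following crawler can reach from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def orphaned_pages(links, all_pages, start="/"):
    """Pages that exist on the site but are not reachable from `start`."""
    return set(all_pages) - reachable_pages(links, start)

# Hypothetical site: the forum profile exists, but no crawlable page links to it.
site_links = {
    "/": ["/forum/", "/blog/"],
    "/forum/": ["/forum/thread-1"],
    "/blog/": ["/blog/post-1"],
    "/blog/post-1": ["/"],
    "/forum/profile/quiet-member": ["/"],
}
site_pages = [
    "/", "/forum/", "/forum/thread-1",
    "/blog/", "/blog/post-1", "/forum/profile/quiet-member",
]

orphans = orphaned_pages(site_links, site_pages)
print(orphans)
```

The profile page links out to the home page, but because nothing links in to it, it sits outside the system a crawler sees.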

More ado about “rel=’nofollow’”

Your user-facing architectural persona may be very different from your search engine-facing architectural persona. You may disable footers in your forums for visitors who are not logged in — search engines don’t log in, so they won’t see the footers. You may use “rel=’nofollow’” on links to your comments, so search engines won’t see the comments. One Web site can therefore legitimately be one thing to (logged in) registered people and something else entirely to every other visitor, human or machine.
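If you want to see which of your own links belong only to the user-facing persona, a rough sketch like the following can help. It uses Python’s standard `html.parser` module to split the links in a page into those carrying a `nofollow` token in `rel` and those without one. The sample markup and URLs are made up for illustration, and the sketch says nothing about how any particular search engine actually treats the attribute.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values, separated by whether rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens in any letter case.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

def split_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollowed

# Hypothetical blog template fragment.
sample = """
<p><a href="/blog/post-1">Post</a>
<a href="/blog/post-1/comments" rel="nofollow">Comments</a>
<a href="/about" rel='external NoFollow'>About</a></p>
"""
followed, nofollowed = split_links(sample)
print(followed, nofollowed)
```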

And there are also “private sections” on some Web sites, where only paying members or staff members have access. Some sites scroll their content chronologically, sort of “leaving it behind” in the past. The content pages still exist but it’s very hard to get to them so they eventually vanish from the search indexes and most visitors never look at them again.

The size of the site doesn’t matter in these architectural personas. Small sites can semi-orphan content just as easily as large sites. This kind of limited orphaning happens all the time. It may be intentional and it may be unintentional. Many Web software packages create orphan content by default to protect users from exploitation. Other Web software packages turn on all options by default, thus ensuring that nothing gets orphaned. Some people hack their Web software and in so doing may introduce inadvertent errors that orphan some content.

One global search and replace can break 10,000,000 links to your home page (I’ve done something like that, a time or two). A single mistyped character in a template can send your visitors crashing into an unpopulated section of your Web site. So “phantom documents” may also be part of the systems we create with our Web sites. Our links say the content is there but it’s really not. Search engines have been known to include phantom documents in their results for a variety of reasons.
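Phantom documents are easy to check for once you have a list of the pages that really exist. The sketch below is one simple way to do it, assuming the same kind of page-to-links map as any crawl would give you. The URLs, including the mistyped `/prodcuts/`, are hypothetical.

```python
def phantom_links(links, existing_pages):
    """Return (source, target) pairs whose target is not a real page."""
    existing = set(existing_pages)
    return sorted(
        (source, target)
        for source, targets in links.items()
        for target in targets
        if target not in existing
    )

# Hypothetical site where one template contains a mistyped URL.
template_links = {
    "/": ["/products/", "/about"],
    "/products/": ["/", "/prodcuts/tires"],
    "/about": ["/"],
}
real_pages = ["/", "/products/", "/products/tires", "/about"]

broken = phantom_links(template_links, real_pages)
print(broken)
```

Here the link says the content is at `/prodcuts/tires`, but the only real page is `/products/tires`, which in turn ends up with no links pointing at it.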

Whether you think of yourself as a Web site architect or not, every time you make a change to the structure or content of your site, you are architecting.

Dictionaries may tell us that sculpting is “to shape, mold, or fashion especially with artistry or precision”. People who advocate PageRank sculpting are suggesting that they can apply an artistic or precise shape to their PageRank. It is certainly mathematically feasible to propose that PageRank can be distributed in patterns or according to precise criteria.

However, to sculpt clay, rock, sand, or other materials you have to be able to see them, touch them, and change them. You cannot see, touch, or change PageRank.

To see PageRank you need a tool that queries Google’s secret PageRank database to find out what any given document’s currently assigned PageRank is. There is no such tool that is publicly available and I seriously doubt any such tool exists outside of Google’s resources. People are quick to substitute the Google Toolbar PageRank values, of course, but these are only derivative valuations that are computed on an infrequent basis and, according to Matt Cutts, published only after their base PageRank has been incorporated into Google’s algorithm.

But just because a document may be assigned PageRank does not mean it can pass PageRank. Matt Cutts has stated that Web sites can lose their ability to give PageRank to other sites. So you think you can see your document’s PageRank through the Google Toolbar but the Toolbar won’t tell you whether the document can actually confer PageRank on other documents.

That really pulls the rug out from under people whose link building campaigns rely upon spreadsheets that document Toolbar PageRank. In years past I laughed at people who compiled such data because it was absolutely useless. However, there may be some marginal value to collating Toolbar PageRank values in a matrix — not for “sculpting PageRank” or architecting a Web site, but to study patterns of valuation.

Patterns of valuation tell you something about what the thinking behind a metric may be. They don’t tell you whether the metric is useful for comparative analysis. If site A scores a 10 on Michael’s Metric and site B scores a 5, you have absolutely no insight into whether either site is more competitive in search engine results than the other. The metric has to be tied to the search results, and PageRank is only marginally tied to them.

We can be reasonably sure that pages that rank ABOVE more relevant documents probably have sufficient PageRank to be included in the Main Web Index.

We can guess that pages which rank BELOW less relevant documents probably don’t have enough PageRank to be included in the Main Web Index.

We can also guess that pages which rank ABOVE less relevant documents may have sufficient PageRank to be included in the Main Web Index.

When all you have to look at are the search results, you’re left guessing as to why the search results appear the way they do. Although Google claims to incorporate hundreds of factors into its ranking algorithms, we don’t actually know what those factors are or when they are used. Oh, sure, we can make some pretty good guesses. If you point 10,000 links at a document the odds are pretty good it will outrank something else in the search results.

But the point remains that we cannot know what any document’s PageRank is (although many of us feel comfortable substituting an after-computation derivative value for PageRank) and neither can we know whether a document confers PageRank.

If you cannot see PageRank and if you cannot touch it (cause it to flow to other pages), can you nonetheless change a document’s PageRank? That’s a very hard question to answer because anyone can take the cheap way out and say, “Sure — just point enough value-passing links at the document and you’ll change the PageRank!” In truth, you only need to point one value-passing link at a document — even a document that passes the least amount of PageRank possible — and you’ll change your target’s PageRank. Of course, you may not see that change reflected in the Google Toolbar PageRank.

How much does internal PageRank have to change in order to move the Google Toolbar PR metric up or down 1 point? No one knows, though I’ve seen quite a few nonsensical attempts to guess. Guesswork based on ignorance is neither educated nor helpful. Gut opinions can sometimes get you close to the truth but when you’re trying to understand something as precise and mathematical as PageRank your gut feelings are about as useful as a bowl of salt in a hot desert is to a thirsty man.

SEOs cannot measure PageRank, but they substitute bogus numbers from the Google Toolbar.

SEOs cannot move PageRank.

SEOs cannot change PageRank, except by accident.

Given two out of three feel-good options, can the SEO community contrive a way to agree on a third feel-good option? Enter the whole “sculpt PageRank through nofollow” argument. You can move PageRank by not conferring it. Yes, that actually makes sense. Let me explain.

Let’s say you have 10,000 pages on your site. The odds are pretty good that, if you have only followed accepted best-practice optimization, at least some of your pages have been assigned PageRank. You have some peanut butter to play with. Problem is, you don’t know where the PageRank really goes. Your gut instinct, however, says that it probably goes to the pages with the most internal links pointing at them (not necessarily true, but that’s usually a safer guess than some).

So let’s say you’ve got a page, call it “George”, which doesn’t link out to any of your other content and which really doesn’t have any content relevant to the queries you feel are most important (that is, “George” is not relevant to queries you feel will help you make money). Does it make sense to allow the other 9,999 pages to point potentially PageRank-passing links to “George”, who isn’t going to do a blasted thing with all that link love?

Of course not. So you decide NOT to allow the search engines to discover “George” through some or all of your links. Poor “George” is suddenly deprived of — of what?

How long does it take a search engine to come back, recrawl all your pages using nofollow on their George-links, and incorporate that new data into its index? In a blog post I still haven’t been able to find again (so I really don’t know how credible the anecdote is), someone supposedly asked an audience at the SMX Advanced 2008 conference how many people had used “rel=’nofollow’” on their internal links. According to the blog post I can no longer find, most of the people in the audience raised their hands.

And this same undocumented anecdote says that, when a follow up questioner asked how many people felt the nofollowing had helped, most of the raised hands went down. If that audience is in any way representative of the SEO industry, that means that most of you have tried using nofollow on internal links and you found no benefit from doing so.

But how long did you wait to see if your efforts would be successful? A week? Two weeks? A month? Six months? Frankly, if I were to attempt something like that, I would give it no less than three months and probably about six months. You’ll see a lot of algorithmic activity in any given 3-6 month period. There should be plenty of PageRank tweaks in that kind of timeframe.

But what about the people who claim to have experienced success? How long did they wait? And when they saw what they felt was a successful result of their internal PageRank manipulation, did they then remove the “rel=’nofollow’” attributes from their internal links and wait a reasonable amount of time to see if search results went back to the way they were before? Of those self-proclaimed successful sculptors, how many then re-implemented the “rel=’nofollow’” and found the same exact changes in search results?

I have yet to find any SEO who has documented this kind of test with any credibility. And I’ve read a LOT of nofollow test results blog posts. Most people use nonsense terms that search engines won’t recognize (these terms are not in any lexical databases). The problem with those tests is that no one knows what a search engine does with a brand new term it has never encountered before.

Sure, we’ve seen some tests where these nonsense terms cause sites to appear in search results, but what does that mean? None of you are qualified to explain what it means because none of you have ever devised a test to determine what it means.

Nor am I qualified to tell you, either.

I found one blog post where the tester claimed his test proved someone else wrong because the other guy’s site appeared in Google search results for its name. He got people to point 10-12 links at the site, some of which were nofollowed. I swear, that is ALL the information I found in the blog post. What did the test prove? Nothing.

You can publish all the nofollow test results posts you want, but your experiments have to be repeatable (with the same results) in order for them to be credible. People have to be able to verify what you’re doing or you’re not doing anything useful. I stirred the pot by briefly putting up some links, grabbing a few screen captures, and then pulling the links. What did I prove? That I can put up some links, grab some screen captures, and then pull the links.

There is a statistics-based argument that shows you can certainly starve documents of their otherwise rightfully earned PageRank (which was proposed as a measure of recognition, not intention). However, you cannot do this with any degree of precision.
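A toy calculation shows the direction of the effect, though not its real-world size. The sketch below is the simplified textbook form of PageRank (power iteration with a damping factor of 0.85), not Google’s production algorithm. It also assumes, purely for illustration, that a nofollowed link behaves as if it did not exist, which may not match how any search engine treats it. The four-page site is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration; scores sum to 1."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their score over every page.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Every page links to "george", who links to nothing.
before = pagerank({
    "home": ["products", "about", "george"],
    "products": ["home", "about", "george"],
    "about": ["home", "products", "george"],
    "george": [],
})

# Same site with the links to "george" treated as absent (the assumption above).
sculpted = pagerank({
    "home": ["products", "about"],
    "products": ["home", "about"],
    "about": ["home", "products"],
    "george": [],
})

for page in sorted(before):
    print(page, round(before[page], 3), round(sculpted[page], 3))
```

In the toy model “George” loses score and the other pages gain it. What the model cannot tell you is what any real document’s PageRank is, or by how much it changes, which is why precision is out of reach.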

But let’s talk about architecting a site rather than sculpting PageRank.

On with the architecting

When you design a site structure, you should have the user in mind. A well-designed Web site should help visitors move from one topic to another. Of course, that principle assumes that a Web site should have content about more than one topic. In practice, nearly all Web sites do cover at least two topics, and many of them cover hundreds, sometimes thousands or even hundreds of thousands of topics.

If you create a one-topic Web site, unless that site is strictly about you, you don’t want to include an “About me” page on it. I’m serious. In order to be a one-topic site, your site has to literally discuss only a single topic. So your “About me” page is absolutely irrelevant and unimportant to any one-topic site that is not about you.

But let’s say you’re in the tire selling business. How many topics should your Web site cover? Every different type of tire you sell is its own topic. Your instructions to people on how to find you, how to buy from you, and your helpful advice on the care and maintenance of tires are also separate topics. So the basic, average tire-selling site is a multi-topic site.

A Web site about nails would probably be multi-topic, too. Sure, you could exercise discipline and restraint and just write about one kind of nail, but most people are not going to do that. Hence, most sites are multi-topic sites.

Which is not to say that you have to tell people who you are. There are plenty of sites out there that don’t provide any information about who is behind them, or they make it hard for you to learn who is behind them. Stanford University’s Web credibility project suggests that we probably don’t place much credibility in such sites, but is that really true?

After all, how much do you know about the people who dump links on social media sites? They may post profile pages on those sites but are the profiles legitimate and credible? Have you ever tried to figure out who the Sphinners are on Sphinn? Not every profile over there points to a real site or represents a real person. The same is true of DIGG and many other social media sites. Nonetheless, all of these sites are extended considerable credibility even though we really don’t know who is behind the posts.

If you create a Web site where you’re the person primarily responsible for the site’s content, the odds are pretty good that your visitors will be curious about you (as long as they like your content). “About Us” pages receive a lot of traffic across all industries. Why? Because people want to know more about who is behind a Web site. They don’t have just one reason for wanting to know.

So if your objective is to build a credible Web site for which you are totally responsible, and if your desire is to earn respect and credibility in the online community for whatever your Web site tells people, you need to make sure people can find out who you are. That’s important. And it’s important enough that when people go searching for you they should find the content you created about yourself first, rather than some tail-biting pundit’s negative opinion of you.

The SEO industry is paying more attention to reputation management these days, and if you take on a reputation management client, the first thing you need to know is how that client represents himself on the Web. Shoot any RepMan client who comes to you and says, “I nofollowed links to my ‘About Us’ page because it was unimportant.”

An “About Us” page certainly has the potential to rank well in name-related queries, but a well-designed site should not be linking to the “About Us” page more than it links to the root URL and other important pages. Ideally, an “About Us” page should appear in sitelinks, which are becoming more common. If you want to get the “About Us” page into the sitelinks, then you have to point some value-passing links to it — more than you’ll point to most other pages on your site.

Architecting Web sites is not about ranking well in competitive money-making queries. It’s about organizing your information in such a way that people can find it (when appropriate) with the least amount of effort.

That is, you MUST make all your pages crawlable, discoverable, and rankable.

“But what about Peanut Butter SEO?” some people will ask. You only get so much PageRank for your site, so you cannot expect every page to rank well. Right? Wrong.

Remember, you can choose how to set up your site search. Google will now create a custom site index for you if you pay them money. Yahoo! and Microsoft won’t be nearly as fussy about indexing your content as Google if you DON’T pay them money. You have some choices.

Your architecting talents have to be flexible, however. That is, since you have a multi-topic site you probably need to build more than one architectural persona. In fact, what if you have content that is relevant to many different queries? Suppose you run a book site. You want people to find books at least by title and author, right? How do you propose to promote your content by title and author with a single architecture?
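One possible reading of “more than one architectural persona” is that the same book pages get reached through two separate sets of index pages. The sketch below builds an author index and a title index over a single catalog. The URLs and the `build_indexes` helper are hypothetical, and a real site would have to decide how these index pages link to each other.

```python
from collections import defaultdict

def build_indexes(books):
    """Group the same book pages two ways: by author and by title initial."""
    by_author = defaultdict(list)
    by_title_letter = defaultdict(list)
    for book in books:
        by_author[book["author"]].append(book["url"])
        by_title_letter[book["title"][0].upper()].append(book["url"])
    return dict(by_author), dict(by_title_letter)

# Hypothetical catalog entries; only the URLs are invented.
catalog = [
    {"title": "The Hobbit", "author": "J.R.R. Tolkien", "url": "/books/the-hobbit"},
    {"title": "The Silmarillion", "author": "J.R.R. Tolkien", "url": "/books/the-silmarillion"},
    {"title": "Dune", "author": "Frank Herbert", "url": "/books/dune"},
]

by_author, by_title = build_indexes(catalog)
print(by_author)
print(by_title)
```

Each book page ends up with links coming in from both structures, so neither route to it has to be sacrificed for the other.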

Sculpting PageRank won’t solve complex problems. Architecting sites gives you the tools to do just that: solve complex problems.

When you need the same pages to rank for multiple, not-necessarily related queries, disarming them by denying them PageRank and anchor text is suicidal.

Do you sell jewelry? Do you sell gold and silver jewelry? Do you sell necklaces and rings? How many different queries do you feel one site structure can support? How many different queries are you presently trying to support with just one site structure? Have you weakened that site structure by using “rel=’nofollow’” on your internal links?

The point is that if you felt some page was ranking that should not have been, you had two other options that could have helped you: you could have revised or added content on your existing pages, and you could have used that high-ranking page to help people find what they are looking for.

You know, I really need to say a lot more about architecting Web sites, but I’ve already trimmed this article twice and it’s getting long even by my standards. I’ll have to come back to this topic later.

2 comments

Halfdeck 07.26.08 at 3:36 am

The strongest argument against PageRank Sculpting is not that it’s ineffective or that real PageRank can’t be quantified but that there is a ton of other stuff higher up on the to do list. Do you want to spend 10 hours figuring out how to squeeze 12% annual interest out of $1000 or do you want to spend that time making $100,000?

If your site doesn’t have a lot of link equity no matter how you spread that peanut butter you’ll find you just don’t have enough peanut butter to spread. It’s better just to go out and get more peanut butter.

If you have a ton of peanut butter and it’s going to waste due to canonical issues, then you might think about sculpting PageRank by setting up a non-www/www 301 redirect (that consolidates PageRank), META robots, etc. Rel=nofollow isn’t the only way to move around PageRank.

Michael Martinez 07.26.08 at 6:13 am

I think you’re right, Half. If anything, the changes in the way Google handles PageRank over the past couple of years have offered people a more sensible reason to engage in ongoing link building than simply “I read it on an SEO blog or forum”. While I don’t like the way Google divides the Web (relevance should be their first priority), the now-justified need for PageRank should drive people to improve their selection of linking sources.