How many pages should a search engine index?

by Michael Martinez on March 30, 2010

One of the latest ongoing concerns in the SEO industry appears to dwell on the number of pages that a search engine indexes (most specifically Google, but this discussion really extends to other search engines).

Of the four major search engines, Ask and Bing are the stingiest when it comes to indexing deep content from Websites. Yahoo! is pretty generous with its indexing but pretty much everyone in our industry now fusses only over Google’s deep indexing.

Index depth is a low-calibre metric for the quality and search potential of a Website. That is, if you have a Website with 100,000 pages, knowing that only 2,000 of those pages are indexed doesn’t tell you much.

Some people might be quick to point out that those 98,000 unindexed pages could draw substantial traffic. What if each page — were it indexed — drew an additional 1 visitor per month to your Website? That’s 98,000 search conversions you’re not getting now, right?

Maybe. But those 98,000 search conversions are built on the worst of foundations: IF.

IF-based SEO is about as reliable as betting on a horse at a racetrack based on its handicap. “The odds are 100 to 1. IF this horse comes in, I’ll win a bajillion dollars!”

If (and I use the word carefully in this context) that is the way you want to do your SEO, good luck to you. You’ll need all that and more.

What if your 98,000 unindexed pages are autogenerated place-holder pages that consist of nothing more than boiler-plate templated text with a few injected keywords? If that is the basis of your SEO strategy, you have nothing to offer to those imaginary 98,000 search visitors. Even if they show up, the chances of their converting for you are pretty slim.

That’s the mentality that SpamAd site operators take. They work on low quality volume. So simply knowing how many pages of a site are indexed doesn’t tell you anything useful for search engine optimization. Does that sound familiar? It should, because I have pointed out through the years that simply knowing how many links point to a site doesn’t tell you anything useful, either.

The SEO industry wants to quantify things, and I’m not sure why. There is no real value in quantification for its own sake. To be useful, the quantification must be tied to a specific value scale. Who is paying money for getting X number of pages indexed? If there is someone out there with that kind of agenda, then you can certainly optimize your site to get more pages indexed. Job done.

The SEO’s job is determined by the needs of the end-user. Therefore the tools the SEO uses must be flexible and customizable. Link counts and Indexed Page counts are neither flexible nor customizable. They are random, inaccurate, search engine-specific numbers that provide you with neither insight nor advantage in search optimization.

So Barry Schwartz ran a poll asking SEOs whether they use the site: query operator or Google Webmaster Tools to determine how many pages are indexed (in Google, obviously). In recapping the poll results, Barry wrote: “…31% said they still use the Google Site Command. I am a bit upset to see so many people using the site command but I guess it is hard to teach an old dog new tricks?”

It may indeed be hard to teach an old dog new tricks but his comment leaves me wondering WHY the old dog should need to learn such a useless new trick.

I can easily find pages in the Google index that are not reported by Webmaster Tools. Why is that? I have no idea. I don’t care. It means the WT report is not a reliable source of information, so why has the SEO community suddenly fallen in love with a resource that — up until recently — was the SEO class’ kickaround kid? What’s up with that, homeys?

Here’s the thing: Over the past few months many people have complained in various Web forums and at Google’s support groups that their sites have lost page visibility in Google’s index. That is, they are counting “number of pages indexed” and flying into a panic when that number drops from 2000+ to 1200.

I’ve seen this happen with my own sites. For years and years Xenite.Org’s index count has shot up and dropped down in precipitous swings when you do a site: query. I haven’t noticed any correlating drops in search referral traffic. So if we assume for the sake of discussion that the changes in reported page counts have something to do with Google’s indexing, those extra pages aren’t helping much, are they?

On the other hand, on any day I can look at a page index count and drill down deeper to find that the numbers change radically. Starting at the root URL and clicking through to the end of the search results for most sites, the reported number of indexed pages drops radically. But if you then select a sub-domain or sub-directory from the site and drill down to the last search result page for that query, you suddenly find all sorts of pages that didn’t appear in the original query.

Why is that? I have no idea. But it tells me that the site: query operator is a special needs tool. It needs special understanding for proper use, not to mention some patience and common sense.

When I look at the data provided by Webmaster Tools I just want to gag. The dates don’t match cache dates provided by the search index, pages that seem to be missing from the WT reports show up just fine when I use the info: or site: queries, and many backlinks that go missing in the WT reports show up just fine in Google’s index.

I don’t know what the purpose of the Webmaster Tools data is supposed to be, but it’s not helping much with analyzing a site’s Google Index Health. Frankly, I’d rather use the site: query operator. In fact, I DO use the site: query operator — when I want to know if a specific set of pages has been crawled and what their apparent state of indexing is.

You can use either the search box or the Webmaster Tools interface to make some sort of pungent guess at what Google is doing but that is about it. From a search optimization perspective, unless you have been specifically charged with improving an index report count, you’re spinning your wheels looking at page counts anyway.

Site search is better utilized to determine which pages a search engine will return from a site for a given query string (aka keyword). Think about it: if the search engine won’t show that page for the keyword in a site search, doesn’t that tell you where to look when asking why the page doesn’t rank in a normal query?

Site search can help users find specific content in large content sites. It can help SEOs find out which pages have been fully indexed, which pages are being treated as if they are duplicate content, and which pages are NOT appearing in search results.

You can customize your site search by adding and changing terms in the query.
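As a rough sketch of what that customization looks like, here is a small Python helper that assembles site: queries scoped to a host, an optional sub-directory, and whatever keywords you want to test. The function names and the example.com host are made up for illustration, and the sketch only builds query strings and a search URL; it does not fetch or parse any results.

```python
from urllib.parse import quote_plus


def site_query(host, path="", terms=()):
    """Build a site: query limited to a host, an optional sub-directory, and keywords.

    Hypothetical helper for illustration; it only assembles the query string.
    """
    scope = "site:" + host
    if path:
        # Normalize the sub-directory so "blog/" and "/blog" both work.
        scope += "/" + path.strip("/")
    return " ".join([scope, *terms])


def search_url(query, base="https://www.google.com/search?q="):
    """Turn a query string into a URL you can paste into a browser."""
    return base + quote_plus(query)


# Whole site, then a sub-directory, then a sub-directory plus a keyword phrase.
whole_site = site_query("example.com")
blog_only = site_query("example.com", "blog/")
blog_keyword = site_query("example.com", "blog/", ["seo", "theory"])

for q in (whole_site, blog_only, blog_keyword):
    print(q, "->", search_url(q))
```

Running the same keywords against the root and then against each sub-directory is the drill-down described earlier: pages that never surface for the keyword in any scope are the ones to look at first.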

You’re pretty much stuck with the inaccurate, sometimes quite misleading data that Webmaster Tools provides you.

How many times have you wondered why you cannot find your site ranking for queries near the positions that Google reports in Webmaster Tools? I’ve given up trying to make sense of those idiotic reports. And apparently many people who complain about them in forums and support groups are noticing the same inconsistencies as me, so this is not just me grousing on the basis of personal experience.

It’s unfortunate that so many people in the SEO community try to quantify things without any purpose. Just counting the number of things a search engine will report to you tells you nothing. These are data points outside the graph. You need to pick the graph and understand what it is designed to do before you start plugging numbers into it.

Things I’d like to know that you cannot learn from page and link counts include:

  1. How many pages are fully indexed
  2. How many pages are being shown to searchers in clickable zones
  3. How many pages are being handicapped by poor on-site optimization
  4. Which pages are most valued by the search engine
  5. Which pages have the most value to pass to other pages
  6. Which pages are allowed to receive value from off-the-wall pages
  7. Which pages are being recrawled often
  8. Which pages are not being recrawled often

There is currently no tool or method for determining these things with any reliable accuracy. Don’t even start to tell me about your favorite tool. It doesn’t do the job.

But, more importantly (and to the point), you cannot begin to answer these types of questions by counting links and indexed pages. This is the kind of knowledge that empowers a search optimization specialist.

If the SEO community would adopt some real standards, nonsense metrics like backlink counts, indexed page counts, and Google Toolbar values could be openly questioned and challenged in a formal environment. People would have a better opportunity to learn just how useless this kind of fluff data really is, and hopefully learn that they don’t need to waste their time pursuing numbers that (in themselves) have no meaning or relevance to search engine optimization.


Can we talk SEO? In one SEO lexicon?

by Michael Martinez on March 9, 2010

In A Modest Proposal For SEO Standards I suggest that search engine optimization specialists (firms and consultants) include an SEO lexicon on their Websites.

There are many SEO glossaries around the Web already, but they rarely agree on their terminology. This lack of agreement has created a bizarre and humiliatingly divergent conversation between SEOs, their clients, the search engines, the academic community, and the media.

In short, none of us is talking the same language as anyone else. This failure to communicate reveals itself in numerous online publications both within and without the SEO community. For example, a recent academic paper titled The Role of Search Engine Optimization in Search Rankings pretty much butchers the jargon of the SEO world.

Here is the surprising thing about that paper. One of its authors, Ron Berman, worked for a venture capital firm that specialized in tech industries and Web startups. He is no stranger to the Internet, according to his LinkedIn profile.

The paper’s co-author, Zsolt Katona, has a strong technical and science background but no apparent Web marketing experience.

Neither author seems to have any experience with search engine optimization and their paper — which proposes a method built upon Game Theory for evaluating the value of search engine optimization to Publishers, Searchers, and Indexers — reveals their naivete.

Their theory reaches some right conclusions albeit in the wrong way and for the wrong reasons. That is, their axioms are flawed because they are based in an inappropriate mythology.

We all have mythologies — we use mythologies to explain our environment to ourselves and to each other. Every SEO specialist in the world has constructed an SEO mythology that is at best only poorly and partially articulated to others.

It is precisely because of this almost inaudible articulation from the SEO industry that no one else seems able to get it right. The academic community has seriously missed the mark in every attempt I have read to document what search engine optimization is and how or why it is employed (and all I can say in defense of my criticism is that I have read many dozens of academic papers that attempt to discuss search engine optimization).

I don’t blame students like Ron Berman for not knowing how to define common expressions like “white hat” and “black hat”. The Berman-Katona paper seems to view “white hat” SEO as dealing with on-page factors and “black hat” SEO as dealing with off-page factors. They also use the words “link” and “links” to refer to listings in search results.

The theoretical concepts they propose seek to measure the economic benefit of search engine optimization. The problem with their work is that it is not relevant to actual search engine optimization. There is far more going on in the paper than a mere misuse of terminology. It creates a symbolic world in which search engine optimization is distinguished from the creation of content, an idea that stands outside reality.

Although you can create unoptimized (even unoptimizable) content, you cannot optimize without content. On-page optimization, off-page optimization — it all has to revolve around promoting some form of content (even if it consists of nothing more than a domain name or non-existent document name) toward the top of search results.

SEO does not exist outside of or in spite of content. SEO is all about the content, just as search is all about the content. Neither search nor search engine optimization has any use or function without or in spite of content.

This lack of comprehension of what is actually happening in search among academics is a serious problem for the SEO industry because their papers, books, and presentations all go into the academic continuum where they will be ingested and digested by future marketers, decision-makers, search engineers, and journalists.

In 5-10 years we will be dealing with a large number of outsiders who think they have an idea of what search engine optimization is all about, when in fact they are only relating to a fantasy application that cannot function on the real Web or in the constantly evolving marketplace.

It’s not enough, really, that everyone publish a Website about their SEO expertise and services which includes an SEO glossary. We MUST acknowledge that other glossaries exist and ideally we should seek to come together on some kind of consensus.

Even here at Visible Technologies (which has almost 100 employees) I hear people casually drop SEO terminology into conversations that makes absolutely no sense. I can’t train everyone and I cannot prevent them from finding loosely jargonized expressions on Twitter, Facebook, blogs, and Web forums.

The problem exceeds epidemic proportions in that it has not only clouded communications within our industry, it has divided them into multiple conversations that essentially talk past each other.

People in the SEO industry assume we are speaking to an informed audience but we are not. We are largely speaking to an UNinformed audience, and that’s just when we speak to each other. We all have our own ideas of what constitutes link farms, blog farms, link circles, good content, site structure, Web spam, and more.

This lack of congruence in how we describe our activities, the activities of other people in the field, and everyone else exacerbates itself by an order of magnitude each year because we are continually developing new ideas, testing new expressions. Our conceptualization rolls out new buzz terms faster than any one person can document them.

What’s worse, we have no way of coordinating the discussion. Attempts to document the SEO jargon through social resources like Wikipedia have proven to be only partially effective and in some cases absolutely disastrous.

You can get a quick idea of how bad the problem has become by browsing the AIRWeb (Adversarial Information Retrieval on the Web) site. Their various papers look at Web spam from a search engineer’s perspective. The papers (and many others like them that you can find through Google’s Scholar search) often use terms that either don’t enjoy much frequency in the SEO field or which are often used in other ways.

We have ourselves to blame, of course, but the academics seem to do a very poor job of searching out the best quality resources. Their ideas are insular and barely resemble what is actually happening on the Web.

Web spam itself is a curious notion. The name implies that any spammy process entails excessive repetition, but while some Web spam may rely on repetition, other Web spam may rely on deception.

Although I don’t think many people would agree in detail on what constitutes white hat or black hat search optimization, it seems to me that black hat SEO is universally deemed to be unethical and usually entails deception and/or excessive replication (of links and/or content).

White hat SEO is really very difficult to nail down. Some people say it seeks to comply with all search engine guidelines. But what about practices that search engines neither endorse nor oppose? Are these to be relegated to so-called “grey hat” SEO?

If we ourselves cannot draw clear distinctions between the Good, the Bad, and the Sort-of-Good-but-may-be-Bad then how can we expect anyone else (especially people in academia, Web search, or the media) to get it right?

There is no right or wrong when it comes to how you talk about search engine optimization. That’s just wrong, and you know I’m right.


Essentials of Off Site SEO

by Michael Martinez on February 26, 2010

“Essentials of Off Site SEO” is an article I could have written, except I did not write it — I only wrote the original article on which “Essentials of Off Site SEO” was patterned.

It appears that a company in India, offering SEO services, has rewritten one or more SEO Theory articles and distributed them through services like Article Depot, embedding links back to their Web site (DimensioniSEO.com).

I won’t say bad things about that company. After all, if they are reading SEO Theory and following the principles I teach, I should feel flattered, correct?

And, technically, there is no copyright violation if you simply rewrite an article in your own words.

But there are other intellectual property issues which, perhaps, have not yet found a presence in either domestic (U.S.) or international law. For example, who owns the genesis of a concept? History teaches us that Alexander Graham Bell “invented” the telephone — but in a footnote you’ll occasionally stumble across Elisha Gray, who is mentioned as the man who failed to get his patent application for the telephone into the office before Bell.

That is, the idea of telephonic communication did not necessarily originate with Bell. We don’t need to revisit all that history to understand that there have been incidents in the past when people “co-invented” or discovered ideas. George Harrison maintained he had not deliberately copied the song “He’s So Fine”, but a court ruled he had infringed its copyright anyway by subconsciously incorporating its melody into “My Sweet Lord” (personally, I never felt the two songs sounded that much alike, but I digress).

The presentation of a discovery or theoretical concept as one’s own work is considered unethical at best and fraudulent at worst. Some scientists have lost their standing, sacrificed their careers, by failing to disclose sources for their work. It’s okay to incorporate other people’s ideas into your own research, but to rewrite a serious theoretical concept and not disclose where you got that concept from — that’s a serious breach of ethics.

It gives your field a bad name. It implies that people cannot trust you or your colleagues to be honest. Honesty and integrity may be perceived in different ways based on cultures and value systems but somehow I feel that if a leading scientist in India were to claim to be the author of one of Stephen Hawking’s papers, the Indian science community would disown that person (or, in the worst-case scenario, the community would become divided over what might seem like a plausibly alleged claim).

I’m not Stephen Hawking but I think it’s safe to say that I am well enough known in the SEO community that few people in our industry would mistake one of the theorems or formulas I’ve put forth for the work of some unknown entity halfway around the world.

Not that everyone has heard of me, but if you search on “SEO Theory” you’ll find this blog listed first in the major search engines (despite numerous attempts to dislodge it). It’s not my machinations that have ensured this blog’s dominance in that query but the recognition it has been awarded from around the world (literally).

So while it might seem clever to lift an article from SEO Theory and rewrite it as Essentials of Off Site SEO without advising your readers where you got your inspiration from, you’re still taking a big risk of being detected.

That particular article (whose actual title is “Fundamental Principles of Off Site SEO”) proposes a number of principles such as the Principle of Search Engagement and the Principle of Message Engagement. These are not terms that have caught on within the SEO community. Most SEO technicians couldn’t care less if someone has made a conjecture about how or why things work; they just want to know how to deliver good search referral traffic.

I’m not the only guy to coin a principle in SEO. Mike Grehan, so far as I know, explained the Filthy Linking Rich Principle. There are probably hundreds of principles that have been articulated either formally or informally in various search marketing blogs and newsletters. People do try to lay out their ideas in some sort of structured process.

But those terms were, so far as I know, first used here on SEO Theory. I don’t think it’s asking too much for an SEO firm to give credit where credit is due when adopting the theses put forth by other people. I have tried to do that much here and elsewhere when I have referred to other people’s ideas.

Here’s another fundamental principle of off-site SEO for people to think about: The Principle of Not Getting Caught. Sorry, I couldn’t think of anything better at the time I coined that term.

I doubt many people will search on “essentials of off site SEO”. Maybe after reading this article a few people will check out the query. Maybe not.

Hopefully, somewhere down the line this article will appear in the search results for essentials of off site SEO (although, technically, it is the original article “Fundamental Principles of Off Site SEO” that deserves to be there). In fact, I’ve updated the original article to include the appropriate language.

But I cannot do that every time someone decides to steal my thunder. Nor can you do it every time someone decides to steal your thunder.

I have to admit that I gave serious consideration this week to writing an article proposing some standards for the SEO community. It’s a good thing I decided not to because it would not have occurred to me to include giving credit where credit is due. But I think that would be a good one.

If you want to reuse other people’s ideas, do that. We all learn from each other. But don’t fool yourself into believing you can fool everyone else into thinking you came up with the ideas. I seriously doubt the author of the rewritten article could explain SEO theory as well as I can.

Long-time readers of this blog may (or may not) remember my story about how I wrote a computer program to read paper tape in the 1980s. Because of the tension between client and vendors over that project I graciously invited another programmer to take credit for my work.

The next day he found himself sitting in a room full of people with broken code on his hands. Boy did we both get into trouble over that deal!

You know, I learned my lesson that day. If you’re going to break the rules or cross the lines, you need to take responsibility for what you do. Many people on the Internet have not yet learned the value of that wisdom. More importantly, if you’re going to use an idea that you did not develop yourself, you should give credit where credit is due.

Call that the Principle of Not Getting Caught.

And by the way, it really IS one of the fundamental principles of off site SEO. Think about that.
