Optimization for the expanding Web
Posted by Michael Martinez on April 7, 2008 in SEO Theory
The Netcraft March 2008 Web Server Survey shows that Netcraft has identified more than 160 million domains. Netcraft first reported 100 million domains in November 2006.
Netcraft reported 50 million domains in May 2004, so it took 30 months for an additional 50 million domains to hit their radar. It subsequently took only about half as long (15 months) for Netcraft to document another 50 million domains.
This is not the first time we’ve seen an explosive growth trend in domain names but there has also been a sharp increase in active hosts — sites that appear to be adding content. The World Wide Web is growing so fast that we should top 200 million domains by the end of 2008 (barring another dot-com meltdown, of which no sign is yet in sight).
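For readers who want to check the arithmetic, here is a rough sketch of that projection. It uses only the three Netcraft figures quoted above and treats "more than 160 million" as exactly 160 million, which is an assumption. The function names are mine, and the two growth models (straight-line and compound) are simply the two most obvious ways to extend the last two data points, not anything Netcraft publishes.

```python
# Figures quoted in the post, in millions of domains.
# "More than 160 million" is treated as 160.0 (an assumption).
SURVEYS = [((2004, 5), 50.0), ((2006, 11), 100.0), ((2008, 3), 160.0)]


def months_between(start, end):
    """Whole months from one (year, month) pair to another."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def project(target=(2008, 12)):
    """Extend the last two survey points to the target month."""
    (d0, n0), (d1, n1) = SURVEYS[-2], SURVEYS[-1]
    elapsed = months_between(d0, d1)   # months between the last two surveys
    ahead = months_between(d1, target)  # months left until the target
    # Straight-line model: the same number of new domains every month.
    linear = n1 + (n1 - n0) / elapsed * ahead
    # Compound model: the same percentage growth every month.
    compound = n1 * (n1 / n0) ** (ahead / elapsed)
    return linear, compound


linear, compound = project()
print(f"linear: {linear:.1f}M, compound: {compound:.1f}M")
```

Under these assumptions the straight-line estimate lands a little under 200 million and the compound estimate a little over it, so topping 200 million by the end of 2008 depends on growth continuing to accelerate.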
I was reminded of the Internet’s phenomenal growth by an interesting (and somewhat depressing) article in the March 2008 issue of Scientific American called The End of Cosmology. Briefly, current scientific knowledge and theory hold that the universe is expanding so rapidly that — if nothing changes over the next 100 billion years — eventually most of what we can perceive in the universe today will slip beyond an event horizon surrounding our galaxy, leaving our star system and a few others (that will merge with ours) alone in a very dark sphere of nothing.
The universe is literally breaking apart, or so scientists currently believe.
The Internet is undergoing a similar expansion. The World Wide Web has become so large now that no search engine can possibly hope to index it. Not that search technology was ever capable of fully indexing the Web, but today’s Ask, Google, Live, and Yahoo! all have the technologies and resources that would have been very useful 5 years ago. They’ve caught up to where the Web was maybe 5-8 years ago.
When true search indexes emerged in the late 1990s they all had two things in common: they built up their inventories by “crawling” the Web and they all lacked the capacity to fully index that portion of the Web they knew about. Inktomi’s dual index sort of solved the problem because they could pretend they had indexed four to five times as many sites as they could actually search. The user experience nonetheless suffered primarily because users did not have access to most of the sites Inktomi knew were there.
As competition between search engines heated up they began making increasingly larger slices of the Web available to searchers. By contrast, however, today’s visible Web has actually collapsed from what was visible only a few years ago despite the fact that millions of new Web documents are created every month. The search engines no longer report estimates for how many pages they index but all the old external tests now show fewer raw results than before (except on Yahoo!). That is, queries for common English words like “the”, “and”, “what”, “that”, and “so” now return fewer raw results than they used to.
Search engineers have pointed out that those estimated raw hit counts we see are not very reliable. In fact, since no search engine will show us more than 1,000 listings in response to any query the size of the visible Web may seem almost meaningless. However, the search engines still have to contend with the fact that their resources cannot embrace the entire World Wide Web. They know internally what the growth rate of their capacities is but they don’t share that data so the competition for Capacity King is virtually dead.
Google’s inelegant solution to the problem has been to resort to Web Apartheid. As Inktomi did in FirstGen Search, Google divides the Web into those fortunate few pages that shall be fully indexed and the remainder of the Web, which is either not indexed or is only partially indexed. In Google’s defense it should be noted that a huge number of spam sites have been dropped out of Google’s index, but when all is said and done Web spam is a legitimate part of the World Wide Web because neither Google nor any other search engine has the authority to determine what is and is not part of the Web.
Anti-spam filtering apart, today’s search technology limps along at a snail’s pace compared to the rapid growth of the Web because more search resources are being invested in redundancy than are being invested in growth. Redundancy ensures that we have search tools and indexes around the clock and that people across the globe will be able to find approximately similar results for similar queries through their redundant resources.
For the search optimization community, then, the primary challenge today is the same as it was ten years ago: we have to ensure that our protected content is included in as many search indexes as possible. The more search indexes a given page appears in, the more likely the page is to be found by a searcher. We do that by passing trust and value tests and by seeding the Web with countless links to help crawlers find our content.
The more inbound links a page has, the more likely the page will be crawled and indexed. Simply having a page crawled and indexed is the SEO’s first priority. Getting the page to rank is the SEO’s second priority. But the very fact that SEOs must engineer inclusion implies that their efforts impede the search engines’ efforts to include as much information as possible. And redundancy forces SEOs to look at repeating the inclusion process over and over again, thus escalating the inefficiency of the dependencies between SEOs and search engines.
This is so for two reasons: first, every redundant link that a search engine crawls is a lost opportunity for content the search engine has yet to index; secondly, the more links we create, the more unreliable the search engines’ link-based algorithms become. It is as if the growing volume of “optimized” links is a hill the search engines cannot cross over. No matter how much more effort the search engines put into crossing the hill, the hill simply becomes larger.
The Discovery network has been running a series of documentaries called Download: The Real History of the Internet. As so often happens with media productions they get some of their information wrong. For example, in the episode on search they discussed how Google was created and though the historical events are probably documented accurately they overlooked a gross error of judgement that has yet to be acknowledged by Google.
Larry Page and Sergey Brin tested their PageRank hypothesis on the Stanford University Web site, where pages were not embedded with links designed to assist with crawling or commercial promotional links. Just because they were able to show that Stanford’s probable most important pages were better linked than other pages did not mean their model was relevant to the real World Wide Web. Google’s founders and investors failed to reconcile the wrong assumptions with reality.
By the time Page and Brin were stumbling into their multibillion dollar fantasy the real World Wide Web had already become permeated with millions (perhaps billions) of links whose only purposes were to either send traffic to sites being promoted or to help sites be included in the search indexes. The valuative citation model that Page and Brin believed existed in fact never existed at all, except possibly on academic and government Web sites.
The quality of Google’s search results was based more on the various relevance scoring factors that Page and Brin included in the technology from the start than on PageRank. Nonetheless, Googlers have touted Google’s PageRank technology as if it were a legitimate breakthrough in search indexing and many people have come to accept the propaganda as if it were more fact than fiction.
The World Wide Web cannot be accurately valued through its linking structure because the linking structure was never designed to be a valuation process. Links document their destinations but the collective link documentation is not representative of any valuative process. PageRank is thus a valid metric only in groups of documents where links are created primarily with valuative goals in mind. In most document clusters links are provided primarily for crawling and content is expected to be the primary factor in visibility — the “if you write it someone will read it” mentality that has driven the creation of millions of link-poor Web sites.
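To make the point concrete, here is a minimal sketch of the simplified, textbook form of PageRank (power iteration with a damping factor). It is not Google's production algorithm, which is not public. The damping value of 0.85, the iteration count, and every page name in the toy Web below are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outbound links spreads its score over all pages.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


# Hypothetical toy Web: "a", "b" and "c" cite one another on merit;
# "shop" is a promoted page that nobody links to.
editorial = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "shop": ["a"]}

# The same Web plus five throwaway pages whose only purpose is to link to "shop".
seeded = dict(editorial)
seeded.update({f"seed{i}": ["shop"] for i in range(5)})

before = pagerank(editorial)
after = pagerank(seeded)
print(f"shop before: {before['shop']:.4f}, after: {after['shop']:.4f}")
```

In this toy setup the score for "shop" more than doubles even though nothing about its content changed. The calculation has no way to tell a link created to endorse a page from a link created to promote it or get it crawled.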
The imbalance between the creation of links for crawling and the creation of content for visibility settled in before Google existed and Google failed to provide any incentive for valuative linking. Altavista had actually developed a better search index at the end of its productive cycle with the Raging Bull search engine. Unfortunately, Altavista had no capital to develop the technology and it was sold and resold to owners who failed to appreciate and exploit the superior technology.
Raging Bull was no perfect solution, though. At least, many SEOs guessed that Raging Bull used some sort of link evaluation in its algorithm. When the Teoma search engine (which used a different form of link evaluation) appeared many people were impressed with the quality of its search results but we noted that Teoma seemed not to be aggressive. After Ask integrated Teoma’s technology into its service it became clear that Ask/Teoma would never attempt to document the entire Web or even a majority of it.
The practical decision to exclude large portions of the untrusted/unconfirmed Web works well enough for today’s search engines because they are profitable (well, at least 3 of the 4 major search engines are profitable). As long as you’re making money you have little incentive to improve your technology.
Google has expanded its technology and redesigned its search engine at least twice since joining the search industry, but instead of improving its technology Google has steadily sought to integrate PageRank into everything it does. PageRank is now apparently more important to Google’s search indexing than ever before and the continually declining quality of Google’s search results seems to hinge on their passionate devotion to an irrelevant concept.
The World Wide Web does not value content through links, so it’s virtually impossible to determine which content is the most authoritative, reliable, or important through links. I point this out to underscore the fact that the search engines are facing the wrong direction. Instead of working to index more of the Web and make more Web documents available to their users the search engines have instead become fixated on monetization.
The search engines no longer care about the core concept of search itself. If they did care they would be offering better solutions than link-based algorithmic apartheid.
We need a revolution in search technology development for two reasons: first, a new set of ideas would help the search engines break out of the cyclical rut they’ve returned to so they could index more of the Web; secondly, it would give the search engines a much-needed advantage over search spammers.
Simply making the spam and trust filters more complex only serves to antiquate the entire search process. Spammers won that war years ago. They don’t need to keep their Web pages alive longer than it takes for them to make a small profit. Volume is their ally, time is their friend. They are miles ahead of the spam filters (which is why Google now routinely asks people to report spam).
The spit and packing tape that keep Google running may impress the non-discerning media but as a search optimizer I need a better system to work with. It’s not enough that I can create more links to pages that are not being indexed; I need to know that the search engines’ indexes DON’T require my special efforts to ensure that truly good content outranks link-heavy sites with poor quality content.
I linked to the Download site without “rel=’nofollow’” even though I know it’s not a good quality site. Why? Because the World Wide Web is not built on the artificial convention that nonendorsing links should be nofollowed. I just don’t believe in qualifying links that way. It’s inefficient, stupid, and it assumes that I will not abuse the system.
SEOs cannot drive search technology, they can only react to it. But the search optimization community has grown stale because the search engineering community has rested on its laurels and allowed the Web to outgrow useful index sizes. Searchers will eventually turn to new search tools that help them find what they are looking for and that can lead to a revolution in search technology.
Perhaps we’ll find a new generation of search engines that have embraced the core concept of finding content on the Web again and they will supplant today’s giants. Or perhaps we’ll simply augment today’s search tools with new tools that explore and index regions of the Web that today’s search technology ignores. There is a real danger of the Web splitting apart in a similar fashion to the universe’s eventual disintegration. If the Web is too large for any one search engine to document, we need many search engines to document all the parts of the Web.
Video and audio search actually represent two experimental approaches that went beyond what current search technology was capable of. When video search appeared it opened up huge segments of content that were previously invisible in major search indexes. The major search engines incorporated those technologies into their capabilities, thus reuniting the visible portions of the Web for users.
But there are other areas of the Web that are not being as well addressed as the Video Web. Blogs and forums are prime examples of traditional Web content that remains poorly indexed by all the majors. It’s virtually impossible to capture a decent footprint of all the discussion communities that embrace a particular topic. There are too many of them to be included in Main Web Search (a limitation found in all four major search services, not just Google).
Beyond the Forum and Blog Web are other areas that remain dark and poorly documented not simply for lack of capacity but also for lack of coherent presentation. That is, we don’t do a very good job of explaining to people (much less to search engines) what our data is doing on the Web and why people should be interested in it.
When it comes to building out the crawlable Web all we have to work with are links. The content already exists but the pathways become thin and poorly marked as you move away from the core media sites that everyone is so fond of linking to.
Search engine optimization would benefit from its own revolution, in which we find new ways to promote content through alternative search. As optimizers we can help shape the next generation of search tools simply by using them, sharing them with visitors to our sites, and helping other people learn about them.
That is, after all, how Yahoo! and Google became so popular.
2 Comments on Optimization for the expanding Web
By wyliet on April 8, 2008 at 1:55 am
Great post, thanks Michael.
The web is yet another victim of money. The internet’s initial focus was on the dissemination of information, but that no longer is its key focus. The web has become a glorified free ads leaflet, 1% useful information, 99% adverts. As long as it remains easy to make money on the web, and as long as the search engines focus their efforts on making even more money then the internet will fail to produce relevant and vibrant content. It’s a shame but I think because of this the internet itself is ultimately doomed. After all I’ve stopped reading ad heavy magazines, so why not stop using the internet? Maybe the real development waiting in the wings is not in search but in a new information platform: Internet2 anyone?
T
By cleyva on April 8, 2008 at 5:21 pm
Michael,
This is high quality writing with a significant amount of clarity. There is way too much “snake oil” out there regarding SEO, but that is the nature of the beast. Nonetheless, it is refreshing when I stumble upon an author with an expansive style and one that is willing to put a stake in the ground!