Hypertext Matching Is Alive And Well

by Michael Martinez on March 20, 2009

Google founders Larry Page and Sergey Brin married Citation Analysis to Textual Analysis in their foundational paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", the so-called Backrub paper that introduced Google.

While most people through the years have focused on the PageRank algorithm the paper introduced, it also discussed at considerable length the IR factors used to determine the relevance of documents. In 2007 Danny Sullivan wrote:

PageRank gets its name from Google cofounder Larry Page. You can read the original ranking system to calculate PageRank here, if you want. Check out the original paper about how Google worked here, while you’re at it. But for dissecting how Google works today, these documents from 1998 and 2000 won’t help you much. Still, they’ve been pored over, analyzed and unfortunately sometimes spouted as the gospel of how Google operates now.

At the time Danny rightly pointed out that a lot of things have changed since 1998-2000; however, Google still talks as if the original Backrub project remains the core of its technology.

I don’t mean just PageRank. I mean the combination of Citation Analysis and Textual Analysis, or what Google (and other IR resources) calls “Hypertext-Matching Analysis”. As in the Backrub paper, Google talks about “also [analyzing] page content. However, instead of simply scanning for page-based text …, [Google's] technology analyzes the full content of a page and factors in fonts, subdivisions and the precise location of each word. We also analyze the content of neighboring web pages to ensure the results returned are the most relevant to a user’s query.”

Notice the last sentence, which discusses the analysis of content on neighboring pages. Of course, we can also point to many Google queries where trust is now a significant factor.

People are sure to guess that there is a connection between Google’s PageRank algorithm and its Trust filters. For example, we know from Google’s admissions that PageRank is the primary factor used to determine whether pages go into the Supplemental Results Index or the Main Web Index. One might reasonably ask how much trust should be conferred upon a page that only receives links from Supplemental Index pages.

Hypertextual Matching Analysis, however, has now evolved into looking not only at how each document uses words but also at how its neighbor pages use words.
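To make the idea concrete, here is a minimal sketch of what scoring a page on both its own words and its neighbors' words might look like. Nothing here comes from Google: the field names, the weights in FIELD_WEIGHTS and NEIGHBOR_WEIGHT, and the functions page_score and hypertext_match_score are all hypothetical, chosen only to illustrate "where a word appears matters, and so does what the linking pages say".

```python
from collections import Counter

# Hypothetical weights; Google has never published real values.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}
NEIGHBOR_WEIGHT = 0.5
PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def page_score(page, query):
    """Count query terms in each field, weighted by where they appear."""
    terms = set(tokenize(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(page.get(field, "")))
        score += weight * sum(counts[t] for t in terms)
    return score


def hypertext_match_score(page, neighbors, query):
    """Combine the page's own score with the average score of its neighbors."""
    own = page_score(page, query)
    if not neighbors:
        return own
    neighbor_avg = sum(page_score(n, query) for n in neighbors) / len(neighbors)
    return own + NEIGHBOR_WEIGHT * neighbor_avg


page = {"title": "Blue widgets", "body": "We sell blue widgets and red widgets."}
related = {"body": "A review of blue widgets"}
unrelated = {"body": "Cooking recipes"}

print(hypertext_match_score(page, [related], "blue widgets"))
print(hypertext_match_score(page, [unrelated], "blue widgets"))
```

In this toy model the same page scores higher when the page linking to it also talks about blue widgets, which is the correlation between linking and destination content that the rest of this post is concerned with.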

There is really nothing new in all of this, except that we can say we’re living in the age of PageRank 2.0 (where not all links are permitted to pass PageRank), Trust 2.0 (where trusted sites are actually favored in at least some search results), and Hypertext Matching 2.0 (where a correlation between content on linking and destination pages is desired).
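The "not all links are permitted to pass PageRank" idea can be sketched with a small power-iteration PageRank in which each link carries a flag. This is an illustration under stated assumptions, not Google's implementation: the passes_rank flag is hypothetical, non-passing links are simply dropped from the graph (so a page's rank is split among its remaining links), and pages with no passing outlinks spread their rank evenly over every page. Google's actual handling of such links may differ.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank from (source, target, passes_rank) tuples.

    Links whose passes_rank flag is False are ignored entirely
    (an assumption made for this sketch).
    """
    pages = sorted({p for source, target, _ in links for p in (source, target)})
    out = {p: [] for p in pages}
    for source, target, passes_rank in links:
        if passes_rank:
            out[source].append(target)

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out[p]
            if targets:
                # Split this page's rank evenly among its passing links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # No passing outlinks: spread the rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


# Page "a" links to "d", but that link is not allowed to pass PageRank.
links = [("a", "c", True), ("b", "c", True), ("a", "d", False), ("c", "a", True)]
for page, score in pagerank(links).items():
    print(page, round(score, 4))
```

Flipping the flag on the ("a", "d") link to True raises the score of "d", which is the whole point of the distinction: a link that exists on the page is not necessarily a link that counts.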

Let’s assume for the sake of discussion that we can obtain all the PageRank-passing, Trusted links we desire. Is that sufficient to manipulate the rankings in search results? Or do we now need to devote significant attention to the content in which links are embedded?

6 comments

Wes Young 03.20.09 at 5:35 pm

I think it only makes sense to devote attention to the content where our links are embedded. But isn’t this just as important from the point of view of providing a quality user experience as it is for manipulating search results?

Michael Martinez 03.21.09 at 9:07 am

In a world without search, things would be much simpler.

An SEO must balance the needs of the Web site to achieve search visibility and strength against the needs of the visitors.

We really do make content for search engines. We succeed best when that content is equally useful for visitors.

The two concepts should work hand-in-hand. Information retrieval metrics and algorithms have been under development and review for decades, much longer than we’ve had the World Wide Web. They were designed to help find documents based on how those documents are structured and how people use them.

The basic idea of hypertextual analysis has never gone out of fashion with the search engines — just with the search optimization community.

olmei 03.23.09 at 7:23 pm

Michael,

Relative to the above mentions of contemporary pagerank and trust factors, how much influence would you give to a listing in dmoz and other directories?

Over the past several months I’ve seen boatloads of PR drops on sites I’ve been considering taking on as clients. Usually they are one-digit drops, but in some cases 2-3 a pop. How would you relate this, if in any way at all, to the current duplicate content craze?

Thanks

Michael Martinez 03.24.09 at 12:06 am

I don’t see any connection between drops in Toolbar PR and duplicate content.

My own personal sites have lost Toolbar PR but that hasn’t translated into lost search rankings or Google traffic referrals.

I don’t understand why you’re looking at Toolbar PR. Yes, I talk about PageRank occasionally, but I’m usually referring only to internal PageRank that we don’t see.

olmei 03.24.09 at 12:29 pm

Thanks for the reply

I guess I wasn’t thinking about Toolbar PR as much as the concept you implied in this response: the cumulative factors which Google considers when imparting rank.

Michael Martinez 03.24.09 at 12:33 pm

Sorry. I’m almost ready to run to the doctor for a second round of antibiotics. I’m just a little more testy than usual, I guess.