Optimizing Web sites through Ordered Pair-based link analysis

by Michael Martinez on May 19, 2008

In Mathematics an ordered pair is any two objects grouped together for a specific reason. Algebra students usually make the mistake of assuming all ordered pairs are numbers (specifically coordinate sets in Cartesian planes) so one of my math professors took great delight in denoting ordered pairs like this:

(doggie, kitty)

He usually got the point across (haha — I made a funny!) relatively quickly that we usually create ordered pairs from numbers but that we don’t have to do so. Although ordered pairs can be used in many disciplines, we traditionally refer to the first item in the pair as the first coordinate or X coordinate and we refer to the second item as the second coordinate or Y coordinate.

In fact, when you move into Set Theory you learn about algebras, which are ordered pairs. An algebra consists of a set of objects and a set of rules governing how those objects behave within the algebra. The algebra model was adapted by Computer Science to define objects — what I like to call “bounded data spaces”.

In Computer Science an object is defined by its data type and the functions and procedures (called methods) that can be performed upon that data type. CS students may spend a lot of time defining objects that do esoteric things no one really cares about, but the exercises teach us to see that everything in the universe is an object: it has properties that define what it consists of and how it behaves.

And all objects are ordered pairs. But we can deal with abstract objects — things that don’t really exist except in our imaginations — in order to analyze discrete objects (things that physically or otherwise exist).

That is, we can define ordered pairs that bring together concepts and specific properties we associate with those concepts. For example, take a Web page. Does it really exist if it can only be seen on a computer screen? Does it exist in more than one place simultaneously since it has to be copied down to your computer in order for your browser to render it? A Web page is more of an abstraction than a discrete thing because there may be many identical copies of a Web page at any moment and they may all cease to exist (except for one — the original copy).

We could call a Web page and its multitude of occasionally existing copies a quantum foam but that’s really neither here nor there. Let’s assume for the sake of discussion that a Web page is a discrete thing (an abstraction that allows us to treat an abstraction as if it were discrete).

We can look at the links from other Web pages that point toward our example Web page. Since we have defined our Web page to be a discrete thing it follows that the other pages linking to our page are equally discrete, and we don’t need to walk any further down that path. But the neat thing about Web pages and their inbound links is that they can form ordered pairs (page, links pointing to page). These ordered pairs can be grouped together in sets.

For example, we can look at all the ordered pairs consisting of the pages from a Web site — such as SEO-theory.com — and the links pointing to those various pages. These ordered pairs consist of a discrete item (the destination page) and a set of discrete items (the links on other pages pointing toward the destination page). If a destination page has no links pointing to it from other pages its Y coordinate consists of an empty set.

If you’re the kind of person who hates dealing with abstractions (much less abstractions of abstractions that pretend to be discrete things) you may, at this point, substitute some numbers for the coordinates in your ordered pairs. But you can’t use just any numbers — you have to use meaningful numbers and this is where the fun begins because you create the meaning for the first coordinates yourself. I cannot do that for you.

I would, however, suggest that you consider using a document ID system that assigns a unique integer value to each document on your Web site. Your root URL could be document 1, and you could assign document IDs arbitrarily or according to some random sequence or some ordered progression. It does not matter as long as all your document IDs are unique.

You probably see where this is going. The second coordinate for each ordered pair could easily be all the inbound links pointing to that particular document. You can qualify those links if you wish. You can look at just external links (links from other sites), or only links reported by Google (a random selection), or only links reported by Yahoo! (don’t make the mistake of assuming Yahoo! has any idea of what Google knows about), or only links you know you placed, or only links you know appear in all major search engines, etc.
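If you want to qualify your links by keeping only the external ones, a filter along these lines is one way to do it. This is only a sketch: the `is_external` helper, the link records, and every URL path in it are made up for illustration, and it assumes you have already collected your links as (linking page, destination page) pairs from whatever source you trust.

```python
from urllib.parse import urlparse


def is_external(source_url: str, site_host: str) -> bool:
    """Return True when the linking page is hosted outside the site."""
    host = (urlparse(source_url).hostname or "").lower()
    site_host = site_host.lower()
    # Treat the bare domain and any of its sub-domains as internal.
    return host != site_host and not host.endswith("." + site_host)


# Hypothetical (linking page, destination page) records.
links = [
    ("https://example.org/post", "https://www.seo-theory.com/"),
    ("https://www.seo-theory.com/about", "https://www.seo-theory.com/"),
    ("https://blog.example.net/roundup", "https://www.seo-theory.com/sitemap"),
]

# Keep only the links that come from other sites.
external_links = [
    (src, dst) for src, dst in links if is_external(src, "seo-theory.com")
]
```

Whether sub-domains count as "your site" is a judgment call. This sketch treats them as internal, but you could just as easily decide otherwise.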

What you should have when you’ve finished doing all your research is a set of ordered pairs that represent all the documents on your site and the number of inbound links (according to whatever criteria you’ve determined) that point to those documents. Something like:

(root URL ID, number of links pointing to root URL)
(HTML sitemap page ID, number of links pointing to HTML sitemap page ID)
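For a small site you could build that set of ordered pairs with a few lines of Python. The page paths and the list of link destinations below are invented placeholders. The sketch assumes you have one entry per inbound link you found, and it numbers the documents in list order with the root URL as document 1.

```python
from collections import Counter

# Hypothetical page paths for a small site; the root URL is listed first.
pages = ["/", "/sitemap.html", "/about.html"]

# Hypothetical research results: one destination path per inbound link found.
inbound_link_targets = ["/", "/", "/sitemap.html", "/"]

# Assign each document a unique integer ID, starting with 1 for the root URL.
doc_ids = {page: doc_id for doc_id, page in enumerate(pages, start=1)}

# Count links per destination; a page with no links gets a count of 0.
link_counts = Counter(inbound_link_targets)
ordered_pairs = [(doc_ids[page], link_counts[page]) for page in pages]

print(ordered_pairs)  # [(1, 3), (2, 1), (3, 0)]
```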

If you have a relatively small site this is a fairly short exercise. If you have 100,000 pages or some large number you’ll see quickly that this system won’t scale. Don’t panic. I’ll show you how to deal with large Web sites below.

In the meantime, you can now play with numbers. That is, you can plot your ordered pairs (if you wish) on a 2-dimensional table or chart (a Cartesian plane) to see if any trends develop. You may be surprised at how broken your lines are, how much the dots move around, how spread out they can be.

Or, if you wish, you can reduce your set of ordered pairs to consist only of those pairs which have non-zero (non-empty) values in the Y coordinates. You can then compare the number of ordered pairs that are robust (non-zero Y coordinates) to the number of ordered pairs that are thin (they have 0 Y coordinates).

The ratio of robust to thin ordered pairs would be an indication of value for your Web site. That is, if you create a lot of content and people don’t link to that deeper content, they don’t place much value in your site. If you have n documents on your site, then the closer the ratio gets to 0:n (robust pages : thin pages), the less value your site has. The closer the ratio gets to n:0, the more value your site has.
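Counting the robust and thin pairs is simple enough to script. Here is a minimal sketch, assuming your ordered pairs are (document ID, link count) tuples; the function name and the sample numbers are made up.

```python
def robust_thin_ratio(ordered_pairs):
    """Return (robust, thin): pages with at least one link vs. pages with none."""
    robust = sum(1 for _, links in ordered_pairs if links > 0)
    thin = len(ordered_pairs) - robust
    return robust, thin


# Hypothetical (document ID, inbound link count) pairs for a 5-page site.
pairs = [(1, 3), (2, 1), (3, 0), (4, 0), (5, 7)]

robust, thin = robust_thin_ratio(pairs)
print(f"{robust}:{thin}")  # 3:2
```

With n = 5 documents, 3:2 sits somewhere between the 0:5 and 5:0 extremes.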

You can use this analysis for your site or for someone else’s site.

So, that’s pretty easy to check if you have 10 pages but not so easy if you have 100,000 pages. What can you do with a large site?

You have to work with random samplings. The larger the sampling the better, of course, but you may want to practice with small sample sets at first because you have to learn how to understand a large Web site’s structure.
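Drawing a random sampling can look something like the sketch below. It assumes you at least have the full list of document IDs, so you only need to research link counts for the sampled pages. The `all_pairs` data is synthetic filler so the example runs, not a model of any real site, and `sample_pairs` is a name I made up.

```python
import random


def sample_pairs(ordered_pairs, sample_size, seed=None):
    """Pick a random sample of ordered pairs without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    size = min(sample_size, len(ordered_pairs))
    return rng.sample(ordered_pairs, size)


# Synthetic placeholder data: 100,000 documents with made-up link counts.
all_pairs = [(doc_id, doc_id % 5) for doc_id in range(1, 100_001)]

# Start small for practice, then grow the sample as you learn the site.
sample = sample_pairs(all_pairs, 1_000, seed=42)
robust_in_sample = sum(1 for _, links in sample if links > 0)
print(f"{robust_in_sample} of {len(sample)} sampled pages have inbound links")
```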

For example, a large site tends to NOT have a flat architecture (the pages are usually organized into groups by sub-domain or sub-directory). The flatter a Web site’s structure is, the less value it tends to have overall (you should be curious about why I say that, but this model shows you how to test my assertion). To accrue value well, a large Web site needs sink points and that brings us back to the concepts of crawl saturation and entry points.

If you’re designing large sites with flat structures you’re doing it wrong because you’re making it more difficult for people to find and link to entry points. Imposing a flat architecture on a Web site is like building a wall around it and allowing people to come in only through the door. A lot of people will either ignore a flat architecture and create their own entry points or they’ll just ignore the site altogether (because flat navigation makes a Web site unusable very quickly as it accrues pages).

You need to design large Web sites with tiers, entry points, and vertices in mind. A vertex is a point where two lines or line segments meet to form an angle. We usually describe them as “peaks” and “valleys” when looking at curves on 2-dimensional charts. Assuming your document ID numbers are contiguous positive integers, you should be able to draw a flat line segment on our 2-dimensional chart (with a slope of 0, meaning the line segment runs from left to right parallel to the X-axis).

If you adjust each point on that line segment up or down to show how many links are pointing to that document, you’ll produce a lot of vertices. In a natural large Web site there will be a decreasing number of upward-pointing vertices that rise above the other upward-pointing vertices. That is, fewer and fewer pages have any specific number of links pointing toward them as you raise the minimum number of links.

In other words, you’ll have more vertices at the 10-links-per-page level than at the 100-links-per-page level.
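You can check that pattern by counting how many pages clear each minimum. This is a small sketch with invented link counts; the thresholds of 1, 10, and 100 are just examples.

```python
def pages_at_or_above(ordered_pairs, min_links):
    """Count pages whose inbound link count meets the minimum."""
    return sum(1 for _, links in ordered_pairs if links >= min_links)


# Hypothetical (document ID, inbound link count) pairs.
pairs = [(1, 250), (2, 12), (3, 0), (4, 45), (5, 3), (6, 110), (7, 10)]

for threshold in (1, 10, 100):
    print(threshold, pages_at_or_above(pairs, threshold))
# 1 6
# 10 5
# 100 2
```

The count can only stay level or fall as you raise the minimum; how quickly it falls is the interesting part.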

You can look at random samplings of ordered pairs for a large Web site to determine approximately how evenly distributed the inbound links are. The distribution patterns may change at different levels — in fact, they should change at different levels. But the larger the sample you work with, the more accurate a graph you’ll be able to draw.

You want to start your analysis by finding out several things:

  1. How many non-zero vertices does the large site have? (How many pages have a lot of inbound links?)
  2. How many zero vertices does the large site have? (How many pages have no links pointing to them?)
  3. How many mid-level vertices does the site have?
  4. How flat is the link profile?
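Those four questions can be rolled into one summary. The sketch below is only one way to do it, and it rests on assumptions of mine: "a lot of inbound links" is taken to mean 100 or more (pick your own cutoff), and flatness is measured as the standard deviation of the link counts, where a value near 0 means nearly every page has the same number of links. The function and key names are made up.

```python
from statistics import mean, pstdev


def link_profile(ordered_pairs, high_threshold=100):
    """Summarize a non-empty list of (document ID, link count) pairs."""
    counts = [links for _, links in ordered_pairs]
    return {
        "high": sum(1 for c in counts if c >= high_threshold),    # question 1
        "zero": sum(1 for c in counts if c == 0),                 # question 2
        "mid": sum(1 for c in counts if 0 < c < high_threshold),  # question 3
        "mean_links": mean(counts),
        "spread": pstdev(counts),  # question 4: lower means a flatter profile
    }


# Hypothetical (document ID, inbound link count) pairs.
pairs = [(1, 250), (2, 12), (3, 0), (4, 45), (5, 3), (6, 110), (7, 10)]
profile = link_profile(pairs)
print(profile["high"], profile["zero"], profile["mid"])  # 2 1 4
```

Run it on a random sampling for a large site, and rerun it as the sample grows to see whether the numbers settle down.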

Large Web sites create a lot of opportunity for attracting inbound links. If you’re operating a forum and you find that relatively few threads get inbound links, you can analyze the discussions to determine why people prefer linking only to a small number of those threads. Maybe you can encourage your community to engage in more such discussions.

If you’re writing a blog and you find that most of your posts have few or no links, you can analyze the link-rich posts to determine what inspires people to link to your posts. You can strive to write more posts like those.

If you’re publishing news, managing an ecommerce site, etc. you should be able to see where the inbound links tend to gather and ask yourself why people value that content over other content.

In other words, you can use link analysis as a bio-feedback mechanism to help you improve the quality (and popularity) of your content. Understanding what people want to link to helps you manage your Web site’s visibility and crawl.

If you’re going to invest time in analyzing your links, stop wasting that time by drooling over PageRank and anchor text. Start leveraging the data you collect into creating patterns on charts that help you see your site as other people (including search engines) see it. This sort of analysis takes both time and practice but it’s very worthwhile.
