Web Informant #144, 28 February 1999:
Learning about web mindshare


One of the things that I like about the web is the ability to use it as one big programmer's paradise. Here is a report from Jon Udell, who has been in the technology trade press as long as I have. Take it away, Jon.

I'm continually surprised by the unexpected and powerful new views of the web that you can create when you use web sites as a series of networked software components. XML-enabled sites will make this even easier in the very near future, but even without XML you can do a lot with simple scripts that manipulate sites. For example, I was wondering recently about the continuing impact of www.byte.com, the site I built for my former residence, Byte magazine.

Most of us know that you can search for keywords using AltaVista or one of its competitors. But a feature not as widely known is AltaVista's ability to count the number of pages in its index that link to a specified site. I call this number "web mindshare" because these are the inbound links that drive traffic to your site from elsewhere around the net.

To use this feature, you type a command like this in the AltaVista search box:

link:strom.com -url:strom.com

This asks the search engine to find pages that link to strom.com, excluding pages on the strom.com domain itself. That's well and good, but what does a single number convey? Lacking context, not much. Having 665 pages pointing to strom.com sounds like a lot, but is it? How popular is that site, really, in the scheme of things?
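
As an illustration, here is a minimal sketch of how a script might build that query for any site. It is written in Python rather than Perl, the function names are invented for this example, and it only constructs and URL-encodes the query text; it assumes nothing about the address or parameters of AltaVista's search form.

```python
from urllib.parse import quote_plus


def mindshare_query(site: str) -> str:
    """Build the query text: pages that link to `site`, minus the site's own pages."""
    return f"link:{site} -url:{site}"


def encoded_query(site: str) -> str:
    """URL-encode the query text so it can be placed in an HTTP query string."""
    return quote_plus(mindshare_query(site))


if __name__ == "__main__":
    print(mindshare_query("strom.com"))
    print(encoded_query("strom.com"))
```

The encoded form is what a script would append to the search form's address, whatever that address happens to be.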

Numbers in context

What's missing is context. To provide it, I turned to Yahoo's directory. Among its many categories is a cluster related to computer-magazine sites. As I started plugging site addresses from these lists into AltaVista, though, I realized that the procedure was cumbersome: pick a site in Yahoo, capture its URL, feed it into an AltaVista query, write down the result. This gets old in a hurry. It's time to do some programming.

In effect, every web site is a scriptable component, and the web as a whole is a vast library of such components. You can invoke these individually from any scripting language that can issue HTTP requests and interpret the responses. What's more, you can join components to achieve novel effects. That's what I did to create my computer-magazine-site mindshare report.

I started by writing a Perl script to unroll Yahoo's /Computers_and_Internet/News_and_Media/Magazines category, that is, to walk the category and its subcategories and collect a long list of URLs. The raw list -- about 585 items -- was itself an interesting result. (There's a downside to Yahoo's compartmentalization. You don't get to see long lists of items related under a super-category. But that's a digression for another time.)
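
The original Perl script is not reproduced here. As a rough sketch of what "unrolling" a category might look like, the Python below walks a directory subtree and collects the outside sites it lists. Everything in it is an assumption made for illustration: the names `LinkCollector` and `unroll`, the rule that links under the starting address are subcategories while links to other hosts are listed sites, and the two made-up pages on `dir.example` that stand in for real directory pages.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional, Set
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def unroll(page_url: str, root_url: str, fetch: Callable[[str], str],
           seen: Optional[Set[str]] = None) -> List[str]:
    """Walk a category subtree depth-first; return the external sites it lists.

    Assumption: links that stay under root_url are subcategories to descend
    into, and links to any other host are listed sites. Links elsewhere on
    the directory's own host (parent categories, for example) are ignored.
    """
    if seen is None:
        seen = set()
    seen.add(page_url)
    parser = LinkCollector()
    parser.feed(fetch(page_url))
    sites: List[str] = []
    for href in parser.hrefs:
        target = urljoin(page_url, href)
        if target.startswith(root_url):
            if target not in seen:
                sites.extend(unroll(target, root_url, fetch, seen))
        elif urlparse(target).netloc != urlparse(root_url).netloc:
            sites.append(target)
    return sites


# Made-up pages standing in for a directory; a real run would fetch over HTTP.
ROOT = "http://dir.example/Magazines/"
PAGES = {
    ROOT: '<a href="Linux/">Linux</a> '
          '<a href="http://mag-one.example/">Mag One</a> '
          '<a href="/Other/">Up</a>',
    ROOT + "Linux/": '<a href="http://mag-two.example/">Mag Two</a> '
                     '<a href="http://dir.example/Magazines/">Back</a>',
}

if __name__ == "__main__":
    for site in unroll(ROOT, ROOT, PAGES.__getitem__):
        print(site)
```

Passing the page fetcher in as a function keeps the traversal logic separate from the HTTP details, so the same walk can be tried against canned pages before it is pointed at a live site.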

Next I extended my script to feed the URLs to AltaVista, capture its reference counts for each, and rank the results by reference count.
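
The ranking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original script: `rank_by_mindshare` and `SAMPLE_COUNTS` are names invented here, and the lookup is a dictionary holding three counts quoted in this article's nanotechnology list instead of a live search-engine query.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_mindshare(sites: Iterable[str],
                      count_links: Callable[[str], int]) -> List[Tuple[str, int]]:
    """Look up an inbound-link count for each site and sort, highest first."""
    counts = [(site, count_links(site)) for site in sites]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Stand-in for the search-engine lookup: counts quoted in this article,
# served from a dictionary rather than fetched over HTTP.
SAMPLE_COUNTS = {
    "nanozine.com": 159,
    "www.di.com": 1631,
    "www.foresight.org": 937,
}

if __name__ == "__main__":
    ranked = rank_by_mindshare(SAMPLE_COUNTS, SAMPLE_COUNTS.__getitem__)
    for position, (site, count) in enumerate(ranked, start=1):
        print(f"{position}. {site} {count}")
```

In a real run, `count_links` would issue the link query for each site and pull the count out of the response page.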

Exploring the results

What did I learn? I'll admit I was surprised (and pleased) to see that BYTE, which ceased publication last May and has been stagnant online since then, remains 12th on that list of 585 sites. That LinuxWorld ranks third (53290) behind CNet (95600) and ZD Net (83200) is certainly an eye-opener. On balance, the picture that emerged seems a credible representation of web mindshare for computer-magazine-related sites. Caveat: this is only the set of sites that were in the selected Yahoo category subtree, and it is only AltaVista's view of their mindshare. It's not a complete or perfect view -- nothing on the web is -- but it seems more than good enough to be useful.

Does this technique generalize? Yes and no. If you start with an overly broad category, you'll cut an overly wide swath through the directory. For example, Computers_and_Internet/News_and_Media was too broad. I gathered a lot more computer-related sites that way, but also veered into foreign territory -- for example, chemical and biological journals.

I had better luck when I focused on /Science/Nanotechnology. Here's an area that I know little about. Yahoo told me which sites it thinks belong under that category. AltaVista told me the mindshare of each of those sites. Working together, Yahoo and AltaVista gave me a quick read on the "important" sites in that category. Here were the top 10:

  1. www.di.com 1631
  2. www.foresight.org 937
  3. nano.xerox.com/nano 628
  4. www.zeiss.de 385
  5. www.lucifer.com/~sean/Nano.html 223
  6. nanozine.com 159
  7. nanocomputer.org 134
  8. www.molec.com 130
  9. www.physikinstrumente.com 114
  10. www.polytecpi.com 106

Is this the "right" top-ten list, based on an unrolling of Yahoo's /Science/Nanotechnology? It looks reasonable to my untutored eye, but only nanotech buffs can say for sure.

Of course, mindshare isn't everything. Having all the computer-magazine-related links in one place led me to explore a lot of interesting but lesser-known sites that I'd never come across, because I'd never fully plumbed that region of Yahoo.

Scripting the web like this is going to get a lot easier once XML becomes more pervasive. Why? Today, web sites have implicit APIs (application programming interfaces); in the near future, they'll have explicit APIs defined in terms of XML. But the fact is, you don't need to wait until this sea change is complete. Humble Perl scripts can already tap into the web's library of components, and do amazing and useful things with them.


N.B. Udell continues to update his "mindshare" report. Here are the most recent results.

David Strom
+1 (516) 944-3407
entire contents copyright 1999 by David Strom, Inc.
Web Informant is a registered trademark with the U.S. Patent and Trademark Office