Web Informant #172, 18 October 1999:
Preserving online archives

http://www.strom.com/awards/172.html

Years from now, researchers looking back at the dawn of the web era may have only a small fraction of the web to use for their research. Why? Because many of today's publishers, including the computer trade press, do a poor job of archiving old content. And as more publications fall by the wayside, their online archives disappear from view as well.

Even those publishers who have methods in place for obtaining back issues haven't consistently carried that policy forward into the web era. This seems ironic since we used to have more complete archives in the days before the web, when back issues were easily available on microfiche. There are some notable efforts to preserve web content from outside the publishing industry, including Brewster Kahle's Internet Archive project.

But for the most part all of our current digital technology doesn't much matter. Actually, what is involved in preserving archives isn't really a technical challenge -- it is mostly politics. Someone from On High has decreed that All Old Stuff Must Go.

It bothers me on several levels that this information is slipping from sight, and from sites. We're losing our rich techno-cultural history. I want to see what the pundits, reporters, and experts were saying years ago to learn from their mistakes. (Or maybe to poke fun at them in a subsequent column.) A few years ago I did a column looking back ten years in our industry. The only way I could do that research was to look through the printed paper archives that I gathered from friends and pack rats still working at the publications.

And storage is cheap these days. The cost to keep the old content can't be much. Granted, it can be a nuisance for web site operators to back up the files and remember the old links, especially if they change the file structure of the site and all those pages refer to now-broken links.
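As a rough illustration of how old links could be kept alive after a site is reorganized, an operator might keep a simple table that maps retired paths to their new locations. The Python sketch below is only an example under assumptions, not any publisher's actual setup: the paths, the MOVED table and the resolve function are all hypothetical.

```python
# Hypothetical table of retired paths and where those pages live now.
MOVED = {
    "/reviews/1996/modems.html": "/archive/1996/modems.html",
    "/archive/1996/modems.html": "/archive/print/1996/modems.html",
}


def resolve(path, moved=MOVED):
    """Follow the table until reaching a path that has not been moved."""
    seen = set()
    while path in moved:
        if path in seen:
            # Guard against a table that accidentally points in a circle.
            raise ValueError(f"redirect loop at {path}")
        seen.add(path)
        path = moved[path]
    return path
```

A web server could consult such a table before returning a "not found" error, so that a page moved twice still answers at its original address.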

Also, as a writer of a few of these words of previous wisdom, I want to see my stuff preserved for all to browse. I mean, that is one of the reasons I got started in this business 13 years ago -- to make a mark, however small, upon the world. And when ZDNet (not to pick on them in particular) eliminates my old Windows Sources articles because they no longer produce the publication, it bugs me. Not to mention that now all MY links to these pages from my web site are broken too.

A few years ago we had ZD's Computer Library, an expensive monthly CD subscription that was the industry bible. It had tons of full-text articles from many (even non-ZD) pubs, although going back only for the past year. (It still exists, both on CD and on the web at the address above.) Then when CMP, ZD and IDG started their web efforts, it was a joy to be able to search their sites and come up with articles. Well, maybe joy is too strong a word. But it was certainly easier than finding the current CD or digging up an old paper issue. Lately, though, computer publishers have begun to change their archives. The dead tree trade publishers are more interested in what is happening today than supporting what they said yesterday, let alone several years ago.

There are two particular items I want to cover here. First is being able to go to a page for your magazine's archive and find links to various issues going back several years. Second is indexing this content in such a way that most ordinary humans will be able to easily refine a search through these archives and come up with useful results.

The big three publishers fail on both scores. Only a few of the individual magazines have clear links to any archive pages. (Byte.com is a good example here, but their archives only go back to 1994.) All three publishers' home pages offer a "search" box that is all but useless in my opinion. You type in a few words and hit return, and what you get is usually too much, too little, or too unfocussed to really help you find what you are looking for.

Best of a Bad Lot

CMP is the best of a bad lot. They offer full-text archives of some publications going back to 1994, not far enough for me, but it is a start. You can refine your search by publication, date and other parameters from a screen that isn't too many clicks from the home page either. You can even search defunct publications.

ZDNet has taken the CD Computer Library and turned it into the Computer Magazine Archive going back three years with articles from several hundred publications (some are abstracts, not full text). It costs a few dollars per month for access. For free, ZDNet offers limited searches of their current publications. But these searches can be painful. You can't easily limit your search to particular publications unless you first do a simple search and then refine it.

With IDG, you can go to individual publication web sites and then use the search functions you'll find there. (Computerworld's archives, for example, go back to 1994.) IDG.net has a search function that will scour all its publications' web sites, but to use it you have to learn both the domain name used by the publication and the underlying Infoseek syntax. I would guess about ten people in the world could figure this out, even if they do come across the page explaining it all in rather gruesome detail.

Many Web-only publications don't fare much better when it comes to archiving content. The best example I found is John December's Computer-Mediated Communication magazine. Its archive page is a great example of how to place everything you might need about a publication together in one simple, single place. Too bad he stopped publishing the magazine last January. Still, all the old issues are available here.

And if we want to pay for content from the general press, NewsLibrary.com offers many years of archives of newspapers such as the San Jose Mercury News, the Boston Globe, the Washington Post and many others. The past week's archive and headline results are both free.

Of course, you might be wondering where I am going here. I have a page of links to all of my back issues, and another search page that covers all content I've published on the Web, not just Web Informant issues. I admit this could be better, and one of these days I'll get around to improving it. But at least I'll leave the old stuff around for you to enjoy (and poke fun at, too).

Self-promotions dep't

My latest article for Computerworld reviews two long-time contenders, Laplink and PCAnywhere. It is entitled Remote Control and File Transfer: Comparing the Two Champs.

I wanted to also let you know about a multi-city tour of one of the companies I advise, Delano Technology. The tour will focus on how to use Delano's eBusiness Interaction Suite of email/web applications products.

Afterword

Long-time reader Robert Stanley has this to say upon reading my essay, reprinted here with his kind permission.

Dear David,

This is precisely why scientific journals still publish on paper and basically will not sanction on-line publishing.

There is a subtler and nastier aspect at work here also, namely revisionism. If you don't have any accessible archives, then what you currently say on a web page must be what you have always said... Oh yes, the scientific community has already fallen prey to this: I have several prestigious papers in more than one version, where only the most recent exists in the world at large, and no reference is made to changes that have been introduced.

I'm pleased to state that, while what I think of as the electronic news media (the electronic arms of erstwhile or extant paper publishers) are failing, the digital media aren't doing so badly. There are a number of electronic newsletters that I subscribe to, yours being one of them. All (including yours) have taken care to address this issue.

For example:

TidBITS: < http://www.tidbits.com > and (archives): < ftp://ftp.tidbits.com/pub/tidbits/issues/ >

Cu Digest: < http://www.soci.niu.edu/~cudigest/ >

Risks: < ftp://ftp.sri.com/risks/ >

The TidBITS crew have gone to considerable lengths over the last decade to ensure that all earlier material is maintained in its original context, but is equally accessible to contemporary review.

I think there are actually two issues relating to the web publication problem. One is those sites dedicated to topicality, which is pretty much the whole news publishing business. The other is the whole question of digital heritage; I know that Simware could not reconstruct any earlier version of its corporate web site with any huge confidence. However, I have the ability to do it for them, right down to the last erroneous (and subsequently corrected) page. That's because I run one of the simple comparator programs, and archive every published page.
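For readers curious about what such a comparator-and-archive routine might look like, here is a minimal sketch in Python. It is only an illustration, not the program Robert runs: the function name archive_if_changed, the one-directory-per-page layout and the sortable timestamp strings are all assumptions. It keeps a new snapshot of a page only when the content differs from the most recent copy, so every published version (including an erroneous page and its later correction) stays on file.

```python
from pathlib import Path


def archive_if_changed(page_key, content, archive_dir, stamp):
    """Save `content` as a new snapshot only if it differs from the latest one.

    `page_key` names the page's folder, `content` is the page as bytes, and
    `stamp` is a timestamp string that sorts chronologically (an assumption).
    Returns True when a new snapshot was written, False when nothing changed.
    """
    page_dir = Path(archive_dir) / page_key
    page_dir.mkdir(parents=True, exist_ok=True)

    # Snapshots are named by timestamp, so sorting by name gives date order.
    snapshots = sorted(page_dir.glob("*.snapshot"))
    if snapshots and snapshots[-1].read_bytes() == content:
        return False  # identical to the last archived copy

    (page_dir / f"{stamp}.snapshot").write_bytes(content)
    return True
```

Run on a schedule against each published page, something along these lines would leave a dated trail of versions. It handles only static pages, which is exactly the limitation the next paragraph raises.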

But what do you do when the pages aren't static pages, but are generated on the fly from databases and other sophisticated systems? How do you capture the complete set of components that are combined to make up a web application? It certainly isn't sufficient to capture just the code base of the application itself.

Of course, this problem is not new; it has dogged humanity every time a new medium has been introduced. Where are all the copies of early printed works? Where are all the postage stamps and banknotes that were minted by the million? How long did it take for the Congress to not only create the Library of Congress, but enact the legislation that required copies of every publication to be submitted? Why wouldn't the act eventually be extended to cover electronic news? And what makes you think that the NSA doesn't already effectively maintain such an archive? I speak for the US in the above example, but other nations have corresponding organisations and legislation.

Finally, consider the other end of the problem, the need to develop extremely long term markers for important artifacts. I mean markers that can survive and be recognised and understood tens of thousands of years into the future. At worst, we need to achieve this to mark our hazardous waste, which will remain toxic over such vast spans of time. At best, we can use it to preserve the record of the achievements of our several societies. It is a noteworthy irony that all printed material predating "modern" acidic paper manufacturing processes has a far, far greater lifespan than the majority of contemporary printed material. It seems that homely microfiche, to which you alluded, is in fact the only publication medium from the majority of the Twentieth Century to have any likelihood of survival to even so near a goal as the twenty-second.

To subscribe, send a blank email to
webinformant-subscribe@egroups.com

To be removed from this list, send a blank email to
webinformant-unsubscribe@egroups.com

David Strom
david@strom.com
+1 (516) 944-3407
back issues
entire contents copyright 1999 by David Strom, Inc.
Web Informant® is a registered trademark with the U.S. Patent and Trademark Office.
ISSN #1524-6353 registered with U.S. Library of Congress.
