Introduction to Digital Archeology

As our civilization moves almost entirely to digital forms of storage and archiving for many aspects of life, it is important that this history is preserved for future generations in a way that accurately reflects how this civilization really is. This article details some of the difficulties that future archeologists will face, and how to overcome them now, for the sake of the future.


What Is Digital Archeology?

The term "Digital Archeology" can be used to mean several different things. I will attempt to define them below:

  • Social Digital Archeology

    This is the use of digital media and digital information to get a clearer picture of the society that uses them. This could be emails, newsgroup and forum postings, databases, digital pictures and videos, and so on.

  • Computer Digital Archeology

    This is the use of older material to research the history of early computer architecture, computer peripherals, operating systems, programming languages, systems administration tools, and the like.

  • Assisting Digital Archeology

    The use of digital imaging, and digital simulations and graphics to assist in the archeological research of ancient civilizations. This definition is explicitly excluded from the scope of this article.

Introduction

Only two centuries ago, our knowledge of ancient civilizations and cultures was limited to written history, relying on historians consciously writing annals, or chronicles for the purpose of being read by future generations.

However, by the 1800s C.E., additional sources added significantly to our understanding of ancient civilizations. The most important is archeological evidence, which has often changed our perception of people and events so much as to question the conventional historical record on the same people or events.

Fragments of the daily life of average people like you and me slowly emerged, and the mosaic became clearer and clearer as time allowed more discoveries to be made, as well as more research and study into existing material. So whether it was the Dream Stela of Thutmosis IV, a baked clay tablet with the epic of Gilgamesh on it, an Assyrian or Babylonian cuneiform tablet, the code of law of Hammurabi, the Rosetta Stone, or simply a Greek ostracon with a name on it, a list of goods, accounts, customs or taxes, or pottery shards used for writing practice in Hieratic, they all give us a glimpse into the workings of these civilizations.

Moving to Digital Storage and Archiving

Now, in the early 21st century and the third millennium, think about how people several centuries or millennia from now will get enough information about us to form a comprehensive picture of what was going on in our era.

In today's terms, think about your personal accounting records, and how much you paid for a chicken, a dozen eggs, or a car. This information is now kept on your personal computer, for example using Intuit's Quicken, Microsoft's Money or GnuCash. Also, the emails that you meticulously kept for 5 or 10 years would provide a glimpse into your life. Think about blogs, personal web sites, news web sites, Google Groups, and so on. All this information shows us civilization as it is today, warts and all.

Most of our data and information today is stored digitally in computers, whether on hard disks, CD-ROMs, DVDs, or magnetic tape.

Even on the personal level, outside the realm of computers, personal and family photos are stored on CDs, family movies on Digital Tape, and soon all will be on DVDs, with the advent of DVD Camcorders.

Pitfalls of Fragile Media

All these media suffer from serious flaws when we consider civilizations spanning centuries:

  • Technology
    These media are based on magnetic (hard disks, magnetic tapes, floppy disks) or optical technologies (CDs, DVDs).
  • Longevity
    These media are extremely short lived: The stated longevity is a few decades at best.
  • Durability
    These media are volatile: The method of recording is either magnetism on a magnetic surface, or optical laser on a plastic back.
  • Compatibility
    These media become obsolete quickly. Even if they do not physically fail, the technology to read them will be obsolete in a decade or two. For example, how many computers today can read the 8 inch floppy disks that were used in the early 1980s? How many today can read 5.25 inch floppy disks, which were in use up to the early or mid 1990s?

Here are some real digital data loss horror stories.

That is the stuff on your computer. What about the stuff you put on the net in one form or another? For example, that blog you set up? Or that web site?

Once you die, the PC eventually becomes obsolete or unusable. Chances are, your spouse or kids are not interested in what is on the computer, and it is gone. Your web hosting account will probably be terminated due to non-payment.

Before archeology, our only sources of data on past civilizations were historians. These were often professional people writing for posterity, and had some bias or other. After archeology came into play in the 19th century, our knowledge of past civilizations made a quantum leap, as we found artifacts that we could tie together to decipher many puzzles.

How Will We Be Perceived

What about the bigger picture? Not individuals, or families, but societies and civilizations.

All this meta data about humanity in the last few decades of the 20th century, and the 21st century is on perishable and fragile media. It is even volatile (web hosting account?)

How would people several centuries from now view this entire civilization? How would they gauge the reaction to, say, Sept 11, or the invasion of Iraq? Would they see the US population as pro or anti war, or divided evenly? How would Bin Laden and Bush be assessed? Blair? Aznar? How would they get a glimpse into people's daily life?

Remember that as things are happening, it is easy to think that the information you gather on the event/person/concept will always be clear and available. However, if you give it a decade or two, you yourself will not remember many details. How about people from a different culture/mindset/civilization/society? What would they think and how would they perceive you from the little they manage to recover?

The only hope here is the Wayback Machine at the Internet Archive and other specialised sites similar to it, such as Steve Baldwin's Ghost Sites. But will it endure? Is it enough?

People a thousand years ago had many of the same hopes and fears we have, but I worry about how future generations will know us. In the absence of any evidence of our culture, will they consider us uncultured (as we often think of cultures without a strong written history), or will they judge us as simply living in a time with brittle technology?

Context and Data

A digital medium is not only fragile, it is also not self-explanatory. For example, a TAR GZIP archive or a PDF doesn't give any clues in the file itself about how it should be used, nor does the CD or tape that the file comes on tell us that information. Compare that with ostraca, earthenware pots or glass containers: although they are not very durable, enough of them survive, even in a broken state, for us to know what they were used for and to glean other contextual information from them.

Real life case: Dennis Ritchie on UNIX

A case in point is what Dennis Ritchie, co-creator of UNIX, writes in one of his notes.

This shows how information is made inaccessible over time, even for those who created it.

Writings from the Past

Machine-readable versions of early Unix material are hard to come by,
even for us. "Backup" in those days (1969 through the early 70s) consisted
of punched cards, paper tapes, or uploading to a Honeywell machine. We no
longer have those cards, tapes, or the Honeywell.

When we got a PDP-11 around 1971, we did get DECtape, and did save some
material, though not enough. A few years ago Keith Bostic and Paul Vixie
resurrected a PDP-11 DECtape drive and offered to read any old tapes we
might have around, and I sent several to them. These notes are among the
small treasures discovered there.

Two files named "notes1" and "notes2" were on the tape labelled "DMR", and
their date, if I correctly interpret the time representation on the tape,
is 15 March, 1972.

I reproduce them below. I have no memory of why I wrote them, but they look
very much like something to have in front of me for a talk somewhere,
because of the references to slides. From the wording at the end ("the
public, i.e. other Labs users"), I gather that it intended to be internal
to Bell Labs. HTML markup and the corrections and annotations in [] were added
in September 1997, but otherwise it's original.

Dennis

Another Real Life Case: NCR VRX Operating System

I had a set of 5.25" floppies onto which I had copied some material from a now obsolete NCR operating system called VRX. These were copied in early 1989. In 2004, I "discovered" that I still held on to them, and was curious to know what was on them. I went searching for ways to read those floppies. They were formatted in MS-DOS and had a FAT file system on them.

The challenge was to find a drive to read them, since 5.25" floppy drives have been obsolete for a while now. A friend found a drive in his basement, but it was covered with dust, and I was hesitant to use it, lest it would ruin the floppies.

I eventually found someone who had an old 386SX PC with both 5.25" and 3.5" floppy drives and MS-DOS on it. I used the 16-bit version of Info Zip to create ZIP archives of each floppy.

Later, in July 2004, I found a garage sale with a 486DX PC that had a 5.25" drive in it. I got it for $5 CDN, and after a lot of experimentation (mainly with cable orientation and the like) managed to install that drive in a Pentium II Celeron 300 MHz PC with Linux on it. I was able to read most of the floppies.

Amazingly, of some 20 or so floppies, few had bad sectors. However, I found that some files had been corrupted during the original file transfer over a terminal emulator connection to the mainframe. Regardless, plenty of good information was left for analysis, and the results of this "digital excavation" will soon be published in the NCR History section of this web site.

Data Representation

Thanks to early efforts in the 1960s, a standard for information interchange was developed, which we now know as ASCII, the American Standard Code for Information Interchange. The intent then was not longevity, but rather the ability of different machine architectures to communicate with each other and exchange data, without the internal representation on each machine being an issue.
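As a small illustration of what such an interchange code provides, here is a minimal Python sketch. The sample string is an arbitrary choice for illustration; the point is only that every character maps to an agreed number, whatever machine reads it.

```python
# Each character has one agreed numeric code, independent of the machine.
text = "NCR"                        # arbitrary sample string
codes = list(text.encode("ascii"))  # the ASCII code of each character

# Any machine that knows the table can turn the numbers back into text.
restored = bytes(codes).decode("ascii")

print(codes)     # [78, 67, 82]
print(restored)  # NCR
```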

A similar effort was made in the early days of the internet, by using network byte order on the wire rather than the internal representation of each CPU architecture. So, it does not matter whether the source machine is Big Endian or Little Endian.
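The idea can be sketched in Python with the standard struct module. The value below is an arbitrary 32-bit number chosen for illustration, and this is only a sketch of the principle, not of any particular protocol.

```python
import struct

value = 0x0A0B0C0D  # arbitrary 32-bit integer, chosen so each byte is distinct

big = struct.pack(">I", value)      # big-endian layout
little = struct.pack("<I", value)   # little-endian layout
network = struct.pack("!I", value)  # network order, the same as big-endian

# The sender always writes network order and the receiver always reads it,
# whatever byte order either CPU uses internally.
(decoded,) = struct.unpack("!I", network)

print(network.hex())  # 0a0b0c0d
print(little.hex())   # 0d0c0b0a
```

Because both ends agree on the order used on the wire, neither needs to know anything about the other's hardware.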

Media Longevity

Much of our storage now is either on Magnetic or Optical media. Tapes, hard disks, and the like are magnetic. CD-Rs and DVD ROMs are Optical. Both media are fragile and volatile.

Media Readability

Even if the media is not damaged due to fragility or age, the challenge can be finding a drive that can read it. The example of Dennis Ritchie above is a case in point. He held on to the DECtape media, but there were no drives to read them until Keith Bostic and Paul Vixie were able to resurrect a unit.

A similar story is mentioned above (the 5.25" floppies containing some VRX related programs).

The vast majority of PCs today have a 3.5" floppy drive. However, this will not be the case for long. See the article "Floppy Disk Becoming Relic of the Past", and the Slashdot discussion that ensued.

Larger media are not faring any better. CDs are used for music and data. They are known to be unreliable, with a shelf life of only several years. This has been known for some time, but most people are unaware of it. For example, a Dutch PC magazine conducted a test, which was discussed on Slashdot. NIST has also conducted a study on the topic, which was also discussed on Slashdot.

Data Structure and Format

Reading the raw bits is not the end of the journey here.

Suppose a future digital archeologist finds a file in Comma Separated Values (CSV) format, at a time when this format has fallen out of normal use.

Assuming that a few centuries in the future 7-bit ASCII is still recognized as a valid character set, the challenges would be to:

  • Recognize that there are certain patterns in the data.
  • Recognize that the Newline (or Carriage Return + Line Feed) is a record delimiter.
  • Recognize that the comma is a delimiter for fields.
  • Recognize that quotation marks surround text fields.
  • Recognize that there are escape characters within a field used to escape quotation marks and commas.
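The steps above can be sketched in Python. The guess_delimiter helper is hypothetical and deliberately naive (it would be fooled by commas inside quoted fields), and the two records are taken from the sample data used later in this article. It is meant only to show the kind of inference involved, not a robust format detector.

```python
import csv
import io

# Two records taken from the sample data in this article.
raw = (
    '"NCR",49.82,"7/16/2004","4:04pm",-1.19,51.14,49.75,537900\r\n'
    '"IBM",84.28,"7/16/2004","4:01pm",+0.26,86.05,84.28,10441900\r\n'
)

def guess_delimiter(text, candidates=",;\t|"):
    """Hypothetical helper: pick the candidate character that occurs the
    same number of times on every line, preferring the most frequent."""
    lines = text.splitlines()  # LF and CR+LF both end a record
    best, best_count = None, 0
    for ch in candidates:
        counts = {line.count(ch) for line in lines}
        if len(counts) == 1:          # same count on every line
            n = counts.pop()
            if n > best_count:
                best, best_count = ch, n
    return best

delimiter = guess_delimiter(raw)

# Once the delimiter and quote character are known, the standard csv
# module does the rest (newline="" leaves line endings to the csv module).
rows = list(csv.reader(io.StringIO(raw, newline=""),
                       delimiter=delimiter, quotechar='"'))

for row in rows:
    print(row)
```

Note that even after this step every field is still just a string: the structure has been recovered, but not the meaning.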

Data Semantics Or Metadata

The above is the easy part. What is really hard is knowing what this data represents, and what it means!

For example, suppose our hypothetical future digital archeologist goes through all the above steps, and ends up with the following:

"NCR",49.82,"7/16/2004","4:04pm",-1.19,51.14,49.75,537900
"IBM",84.28,"7/16/2004","4:01pm",+0.26,86.05,84.28,10441900
"RHAT",15.15,"7/16/2004","4:00pm",-0.38,15.62,15.00,5455437
"YHOO",29.19,"7/16/2004","4:00pm",-1.06,30.61,29.15,18658800
"KO",50.58,"7/16/2004","4:00pm",-0.26,51.25,50.31,3724800

What sense can he make of this data, without any metadata to go with it? For example, a README file describing it, or even some code (hopefully commented!) in some programming language?

Perhaps he could deduce quite easily that the third field is a date in M/DD/YYYY format, and perhaps that the field after it is a time. Maybe he could work out that the first field is a stock ticker symbol. The rest of the fields would be very difficult (they are the stock price, the change, the 52-week high, the previous close, and the volume). If we get this data in a tabular format, then the column headings tell us what each field means:

Stock Symbol  Price  Date       Time    Change  52-Week High  Previous Close  Volume
NCR           49.82  7/16/2004  4:04pm  -1.19   51.14         49.75           537900
IBM           84.28  7/16/2004  4:01pm  +0.26   86.05         84.28           10441900
RHAT          15.15  7/16/2004  4:00pm  -0.38   15.62         15.00           5455437
YHOO          29.19  7/16/2004  4:00pm  -1.06   30.61         29.15           18658800
KO            50.58  7/16/2004  4:00pm  -0.26   51.25         50.31           3724800
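One modest way to keep the metadata with the data is to store the column names in the file itself. The following Python sketch assumes the column names from the table above and uses one record from the sample. It illustrates the idea only, and is not a recommendation of any particular archival format.

```python
import csv
import io

# Column names as given in the table above; without them the values
# are just unlabeled numbers.
FIELDS = ["Stock Symbol", "Price", "Date", "Time", "Change",
          "52-Week High", "Previous Close", "Volume"]

raw = '"NCR",49.82,"7/16/2004","4:04pm",-1.19,51.14,49.75,537900\n'

# Pair each value with its column name.
record = next(csv.DictReader(io.StringIO(raw), fieldnames=FIELDS))

# Write the record back out with a header row, so that the file
# carries its own description.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)

print(record["Volume"])
print(out.getvalue())
```

A header row is still only a hint: a name like "Change" does not say what it is a change from, which is why a README or similar description remains valuable.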

If the data were a log from an Apache web server, our digital archeologist would face a similar degree of confusion.

Case Study: BBC 1986 Domesday Project

In 1086 C.E., William the Conqueror ordered a general census to be taken of his newly conquered Britain. The result was known as the Domesday Book. To celebrate the 900th anniversary of this event, the BBC embarked on a project called the 1986 Domesday Project. It used video clips, pictures, and the like to chronicle what Britain was like in 1986.

However, they used a modified BBC Micro computer with a custom Laser Disc attached to it. After only two decades, there was no hardware left to read the wealth of information they had collected!

Eventually, the problem was solved, but only after a lot of searching, time, money and effort.

Alternatives: The Rosetta Project

In order to preserve 1,000 of the world's languages for posterity, the Rosetta Project has relied on the relatively simple technology of an analog disk that is readable with a microscope. It does not require electricity to access it, nor is it stored on fragile magnetic media.

Obviously, this approach avoids most, if not all, of the pitfalls that we mentioned above.
