As our civilization moves almost entirely to digital forms of storage and archiving for every aspect of life, it is important that this history be preserved for future generations in a way that accurately reflects what this civilization is really like. This article details some of the difficulties that future archeologists would face, and how to overcome them now, for the sake of the future.
What Is Digital Archeology?
The term "Digital Archeology" can be used to mean several different things. I will attempt to define them below:
- Social Digital Archeology
This is the use of digital media and digital information to get a clearer picture of the society that uses them. This could be emails, newsgroup and forum postings, databases, digital pictures and videos, ...etc.
- Computer Digital Archeology
This is the use of older material to research the history of early computer architecture, computer peripherals, operating systems, programming languages, systems administration tools, and the like.
- Assisting Digital Archeology
The use of digital imaging, and digital simulations and graphics to assist in the archeological research of ancient civilizations. This definition is explicitly excluded from the scope of this article.
Only two centuries ago, our knowledge of ancient civilizations and cultures was limited to written history, relying on historians consciously writing annals, or chronicles for the purpose of being read by future generations.
However, by the 1800s C.E., additional sources added significantly to our understanding of ancient civilizations. The most important is archeological evidence, which has often changed our perception of people and events so much as to question the conventional historical record on the same people or events.
Fragments of the daily life of average people like you and me slowly emerged, and the mosaic picture became clearer and clearer as time allowed more discoveries to be made, as well as more research and study of existing material. So whether it was the Dream Stela of Thutmosis IV, a baked clay tablet with the epic of Gilgamesh on it, an Assyrian or Babylonian cuneiform tablet, the code of law of Hammurabi, the Rosetta Stone, or simply a Greek ostracon with a name on it, a list of goods, accounts, customs or taxes, or pottery shards for writing practice in Hieratic, they all give us a glimpse into the workings of these civilizations.
Moving to Digital Storage and Archiving
Now, in the early 21st century and the third millennium, think about how people several centuries or millennia from now will get enough information about us to form a comprehensive picture of what was going on in our era.
In today's terms, think about your personal accounting records, and how much you paid for a chicken or a dozen eggs, or a car. This information is now kept on your personal computer, for example using Intuit's Quicken, Microsoft's Money or GnuCash. Also, the emails that you meticulously kept for 5 or 10 years would provide a glimpse into your life. Think about blogs, personal web sites, news web sites, Google Groups, etc. All this information shows us civilization as it is today, warts and all.
Most of our data and information today is stored digitally in computers, whether on hard disks, CD-ROMs, DVDs, or magnetic tape.
Even on the personal level, outside the realm of computers, personal and family photos are stored on CDs, family movies on Digital Tape, and soon all will be on DVDs, with the advent of DVD Camcorders.
Pitfalls of Fragile Media
All these media suffer from serious flaws when we consider civilizations spanning centuries:
These media are based on magnetic (hard disks, magnetic tapes, floppy disks) or optical technologies (CDs, DVDs).
These media are extremely short lived: The stated longevity is a few decades at best.
These media are volatile: The method of recording is either magnetism on a magnetic surface, or optical laser on a plastic back.
These media become obsolete quickly. Even if they do not physically fail, the technology to read them will be obsolete in a decade or two. For example, how many computers today can read the 8 inch floppy disks that were used in the early 1980s? How many today can read 5.25 inch floppy disks, which were in use up to the early or mid 1990s?
Here are some real digital data loss horror stories.
That is the stuff on your computer. What about the stuff you put on the net in one form or another? For example, that blog you set up? Or that web site?
Once you die, the PC eventually becomes obsolete or unusable. Chances are, your spouse or kids are not interested in what is on the computer, and it is gone. Your web hosting account will probably be terminated due to non-payment.
Before archeology, our only sources of data on past civilizations were historians. These were often professional people writing for posterity, and they had some bias or other. After archeology came into play in the 19th century, our knowledge of past civilizations took a great leap forward, as we found artifacts that we could tie together to decipher many puzzles.
How Will We Be Perceived
What about the bigger picture? Not individuals, or families, but societies and civilizations.
All this meta data about humanity in the last few decades of the 20th century, and the 21st century is on perishable and fragile media. It is even volatile (web hosting account?)
How would people several centuries from now view this entire civilization? How would they gauge the reaction to, say, Sept 11, or the invasion of Iraq? Would they see the US population as pro or anti war, or divided evenly? How would Bin Laden and Bush be assessed? Blair? Aznar? How would they get a glimpse into people's daily life?
Remember that as things are happening, it is easy to think that the information you gather on the event/person/concept will always be clear and available. However, give it a decade or two, and you yourself will not remember many details. How about people from a different culture/mindset/civilization/society? What would they think and how would they perceive you from the little they manage to recover?
People a thousand years ago had many of the same hopes and fears we have, but I worry about how future generations will know us. In the absence of any evidence of our culture, will they consider us uncultured (as we often think of cultures without a strong written history), or will they judge us as simply living in a time with brittle technology?
Context and Data
Digital media are not only fragile, they are also not self explanatory. For example, a TAR GZIP archive or a PDF gives almost no clue in the file itself about how it should be used, nor does the CD or tape that the file comes on tell us that information. Compare that with ostraca, earthenware pots or glass containers: although not very durable, enough of them survive, even in a broken state, for us to know what they were used for and to glean other context information from them.
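To make the point concrete, here is a minimal sketch in Python of how software today guesses what a file is: by comparing its first few bytes against a table of known signatures. The table and the function name `guess_format` are my own illustration, not part of any standard tool. The signatures themselves (gzip, PDF, ZIP) are the real ones, but note that the table is exactly the outside knowledge a future archeologist would be missing.

```python
import gzip

# Leading "signature" bytes of a few common formats. Nothing inside the
# file explains these values; the reader has to bring this table along.
SIGNATURES = {
    b"\x1f\x8b": "gzip-compressed data (possibly a TAR GZIP archive)",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def guess_format(data: bytes) -> str:
    """Return a best guess at the format of `data` from its first bytes."""
    for signature, description in SIGNATURES.items():
        if data.startswith(signature):
            return description
    return "unknown"


if __name__ == "__main__":
    # Hypothetical samples, built in memory for illustration only.
    print(guess_format(gzip.compress(b"hello, future archeologist")))
    print(guess_format(b"%PDF-1.4 rest of the file"))
    print(guess_format(b"just some plain text"))
```

Even a correct guess only says which decoder to try next. It says nothing about what the decoded content means.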
Real life case: Dennis Ritchie on UNIX
A case in point is what Dennis Ritchie, co-creator of UNIX, writes in one of his notes.
This shows how information is made inaccessible over time, even for those who created it.
Writings from the Past
Machine-readable versions of early Unix material are hard to come by,
even for us. "Backup" in those days (1969 through the early 70s) consisted
of punched cards, paper tapes, or uploading to a Honeywell machine. We no
longer have those cards, tapes, or the Honeywell.
When we got a PDP-11 around 1971, we did get DECtape, and did save some
material, though not enough. A few years ago Keith Bostic and Paul Vixie
resurrected a PDP-11 DECtape drive and offered to read any old tapes we
might have around, and I sent several to them. These notes are among the
small treasures discovered there.
Two files named "notes1" and "notes2" were on the tape labelled "DMR", and
their date, if I correctly interpret the time representation on the tape,
is 15 March, 1972.
I reproduce them below. I have no memory of why I wrote them, but they look
very much like something to have in front of me for a talk somewhere,
because of the references to slides. From the wording at the end ("the
public, i.e. other Labs users"), I gather that it intended to be internal
to Bell Labs. HTML markup and the corrections and annotations in [] were added
in September 1997, but otherwise it's original.
Another Real Life Case: NCR VRX Operating System
I had a set of 5.25" floppies onto which I had copied some material from a now obsolete NCR operating system called VRX. These were copied in early 1989. In 2004, I "discovered" that I still held on to them, and was curious to know what was on them. I went searching for ways to read those floppies. They were formatted in MS-DOS and had a FAT file system on them.
The challenge was to find a drive to read them, since 5.25" floppy drives have been obsolete for a while now. A friend found a drive in his basement, but it was covered with dust, and I was hesitant to use it, lest it ruin the floppies.
I eventually found someone who had an old 386SX PC with both 5.25" and 3.5" floppy drives and MS-DOS on it. I used the 16-bit version of Info-ZIP to create ZIP archives of each floppy.
Later, in July 2004, I found a garage sale with a 486DX PC that had a 5.25" drive in it. I got it for $5 CDN, and after a lot of experimentation (mainly cable orientation, etc.) managed to install that drive in a Pentium II Celeron 300 MHz PC with Linux on it. I was able to read most of the floppies.
Amazingly, of some 20 or so floppies, only a few had bad sectors. However, I found that some files had been corrupted during the original file transfer over a terminal emulator connection to the mainframe. Regardless, plenty of good information was left for analysis, and the results of this "digital excavation" will soon be published in the NCR History section of this web site.
Thanks to early efforts in the 1960s, a standard for information interchange was developed, which we now know as ASCII, the American Standard Code for Information Interchange. The intent then was not longevity, but rather the ability of different machine architectures to communicate with each other and exchange data among them, without the internal representation on each machine being an issue.
A similar effort was made in the early days of the Internet, by using network byte order on the wire rather than the internal representation of each CPU architecture. So, it does not matter whether the source machine is Big Endian or Little Endian.
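As a small illustration of why an agreed byte order matters, the following Python sketch packs the same number in both byte orders using the standard `struct` module. The helper names `encode_u32` and `decode_u32` and the sample value are mine, chosen only for this example. Network order is big endian, so a sender and a receiver that both use it never need to know each other's CPU architecture.

```python
import struct

value = 1972  # an arbitrary 32-bit unsigned integer, for illustration

big = struct.pack(">I", value)     # big-endian byte layout
little = struct.pack("<I", value)  # little-endian byte layout

# The same number, two different byte sequences on disk or on the wire.
assert big == b"\x00\x00\x07\xb4"
assert little == b"\xb4\x07\x00\x00"


def encode_u32(number: int) -> bytes:
    """Encode an unsigned 32-bit integer in network (big-endian) order."""
    return struct.pack("!I", number)


def decode_u32(data: bytes) -> int:
    """Decode four bytes in network order, whatever the local CPU is."""
    return struct.unpack("!I", data)[0]


if __name__ == "__main__":
    wire = encode_u32(value)
    print(wire.hex())        # 000007b4
    print(decode_u32(wire))  # 1972
```

A future reader who finds the four bytes but does not know which convention was used has two equally plausible readings of the same data.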
Much of our storage now is either on Magnetic or Optical media. Tapes, hard disks, and the like are magnetic. CD-Rs and DVD ROMs are Optical. Both media are fragile and volatile.
Even if the media is not damaged due to fragility or age, the challenge can be finding a drive that can read it. The example of Dennis Ritchie above is a case in point. He held on to the DECtape media, but there were no drives to read them until Keith Bostic and Paul Vixie resurrected a unit.
A similar story is mentioned above (the 5.25" floppies containing some VRX related programs).
The vast majority of PCs today have a 3.5" floppy drive. However, this will not be the case for long. Here is an article on the Floppy Disk Becoming Relic of the Past, and the Slashdot discussion that ensued.
Larger media are not faring any better. CDs are used for music and data. They are known to be unreliable, with a shelf life of only several years. This has been known for some time, but most people are unaware of it. For example, a Dutch PC magazine conducted a test, which was discussed on Slashdot. Also, NIST has conducted a study on the topic, which was also discussed on Slashdot.
Data Structure and Format
Reading the raw bits is not the end of the journey here.
Suppose a future digital archeologist finds a file in Comma Separated Values (CSV) format, at a time when this format has fallen out of normal use.
Assuming that a few centuries in the future ASCII is still recognized as a valid character set, the challenge would be to:
- Recognize that there are certain patterns in the data.
- Recognize that the Newline (or Carriage Return + Line Feed) is a record delimiter
- Recognize that the comma is a delimiter for fields
- Recognize that quotation marks may surround a field
- Recognize that there are escape characters within a field used to escape quotation marks and commas
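The steps above are what a CSV reader does for us today without our noticing. Here is a minimal sketch using Python's standard `csv` module. The sample text and the function name `parse_csv` are made up for this illustration. It also assumes the common convention, which is the module's default, that a quotation mark inside a quoted field is escaped by doubling it. Other CSV dialects use a backslash instead, which is one more thing our archeologist would have to guess.

```python
import csv
import io

# Hypothetical sample: two records separated by CR+LF, fields separated
# by commas, with quoted fields containing a comma and a doubled quote.
RAW = 'name,remark\r\n"Smith, J.","said ""hello"""\r\n'


def parse_csv(text: str) -> list[list[str]]:
    """Split CSV text into records, and each record into fields."""
    # newline="" leaves line endings alone so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    for record in parse_csv(RAW):
        print(record)
```

Every rule in the list is baked into the library. Take the library away and each rule has to be rediscovered from patterns in the bytes.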
Data Semantics Or Metadata
The above is the easy part. What is really hard is knowing what this data represents, and what is the meaning of it!
For example, suppose our hypothetical future digital archeologist goes through all the above steps, and ends up with the following:
What sense can he make of this data, without any metadata to go with it? For example, a README file describing it, or even some code (hopefully commented!) in some programming language?
Perhaps he could deduce which field is the date, and that it is in M/DD/YYYY format, quite easily, and perhaps the time field after it too. Maybe he could work out that the first field is really a stock ticker symbol. The rest of the fields will be very difficult (they are the stock price, change, 52-week high, previous close and volume). If we get this in a tabular format, then the column headings tell us what each field means.
| Stock Symbol | Price | Date | Time | Change | 52-Week High | Previous Close | Volume |
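The difference that metadata makes can be sketched in a few lines of Python. The column names are the ones in the table header above. The data row is invented purely for illustration (it is not a real quote), and the function name `label_row` is my own.

```python
import csv
import io

# Column names taken from the table header; this list is the metadata.
COLUMNS = [
    "Stock Symbol", "Price", "Date", "Time",
    "Change", "52-Week High", "Previous Close", "Volume",
]

# Hypothetical record with made-up values, not real market data.
SAMPLE = '"XYZ",12.34,"7/15/2004","4:00pm",+0.56,15.00,11.78,1234567'


def label_row(line: str, columns: list[str] = COLUMNS) -> dict[str, str]:
    """Pair each field of one CSV record with its column name."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(columns):
        raise ValueError("record does not match the column description")
    return dict(zip(columns, row))


if __name__ == "__main__":
    for name, field in label_row(SAMPLE).items():
        print(f"{name}: {field}")
```

Without the `COLUMNS` list, the same row is just eight anonymous values. With it, even a reader who has never seen a stock quote has a starting point.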
If the data were a log from an Apache web server, our digital archeologist would face a similar degree of confusion.
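To see why, consider one line in Apache's Common Log Format. It can be split into named fields with a short Python sketch like the one below. The sample line is the well known example from the Apache documentation. The regular expression and the function name `parse_log_line` are my own approximation, not Apache's parser. The field names are again outside knowledge: nothing in the log file says that "200" is a status code or that "2326" is a size in bytes.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)


def parse_log_line(line: str):
    """Return a dict of named fields, or None if the line does not match."""
    match = CLF.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for name, field in parse_log_line(SAMPLE).items():
        print(f"{name}: {field}")
```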
Case Study: BBC 1986 Domesday Project
In 1086 C.E., William the Conqueror had a general census taken of his newly conquered England. The resulting record is known as the Domesday Book. To celebrate the 900th anniversary of this event, the BBC embarked on a project called the 1986 Domesday Project. It used video clips, pictures, and the like to chronicle what Britain was like in 1986.
However, they used a modified BBC Micro computer with a custom LaserDisc player attached to it. After less than two decades, there was hardly any working hardware left to read the wealth of information they had collected!
Eventually, the problem was solved, but a lot of searching, time, money and effort went into it. See the articles below for the details:
- Domesday Redux: The rescue of the BBC Domesday Project videodiscs
- BBC Article dated Dec 2, 2002 on How the digital Domesday project was saved.
- An article in the Register about saving the Digital Domesday.
- Britain's National Archives on the 1986 Domesday project
- A good web page on how the 1986 Domesday project was rescued
- BBC World article on the Domesday Project
Alternatives: The Rosetta Project
In order to preserve 1,000 of the world's languages for posterity, the Rosetta Project has relied on the relatively simple technology of an analog disk that is readable with a microscope. It does not require electricity to access it, nor is it on fragile magnetic media.
Obviously, this approach avoids most, if not all, of the pitfalls that we mentioned above.
- Wikipedia Terminal Event Management Policy (TEMP)
- Google saves digital history about Google Groups being a repository for Usenet.
- A Sydney Morning Herald article titled The Digital Dark Age, where the perils of digital media are explored. This was discussed on Slashdot.
- Jason Scott runs the Text Files web site, which archives lots of early BBS files, hacker/cracker adventures, and the entire subculture of the Bulletin Board Systems in the 1980s. He gave a talk at the HOPE Conference, titled "Preserving Digital History - A Quick and Dirty Guide".
- New York Times: Whose Data Is It, Anyway? By Jeffrey Selingo - June 3, 2004
- The Dead Media Project is a witness to data communication and storage media across the ages that are no longer in widespread use. Think about Telegraph and Telex, which were in widespread use just a few decades ago.
- Another New York Times article, A Preservation Jam for the Digital Age. This was discussed on Slashdot.
- Tips to help executors clear the clutter
- Dan Bricklin, inventor of VisiCalc, the first commercial spreadsheet, has an article on his web site about Software that lasts 200 years. He touches upon the issue of the longevity of the software systems used in most record keeping by governments at all levels, and advocates a solution he calls "Societal Infrastructure Software". It was discussed on Slashdot.
- An article about someone building a cheap home RAID system. While this is technically interesting, the amount of confusion where RAID is thought of as a backup mechanism is quite disappointing. I have written up why cheap rewritable removable media is still the best way to do backups.
- New Scientist article on how old machines can be rebuilt. Addresses many aspects on how data can be made to be continually readable on current hardware. It was discussed on Slashdot.
- Another How To article on Slashdot about how to back up photos and video in the 100s of GB.
- Slashdot article on Archiving Digital History.
- Slashdot discussions on May 4, 2004, June 3, 2004, and July 3rd, 2013.
- My list of resources on Digital Preservation.