Computers cannot preserve data.
They are unparalleled tools for manipulating data, and that double-edged sword
is the heart of their downfall as archival tools. Computation is the mutation
of data, the antithesis of archiving it.
Computers can interact with data on secondary storage systems, which could
theoretically store data immutably. However, the most commonly used storage
media are mutable, so programs can alter archived data, accidentally or
intentionally.
If the data are mutated maliciously, a future viewer has no way to know.
Most digital data can be modified without leaving any trace of
interference, making digital archives untrustworthy. Bytes are bytes. You
cannot tell what bytes preceded them on the medium, even with
an electron microscope.
On first retrieving a dataset, you could generate a cryptographic signature
over it and use that signature to check, on each later retrieval, whether the
data have changed. However, such a signature is useless to a first-time
viewer. A signature
and public key stored with the data are easily replaced by an adversary, while
a signature and public key stored elsewhere are irrelevant to the viewer.
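As a sketch of that first step - a minimal signing-and-verification round trip
in Python, assuming the third-party `cryptography` package; the payload and
key handling here are illustrative only, not a real archival protocol:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# On first retrieval: sign the dataset with a freshly generated keypair.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

data = b"the archived payload"  # stand-in for the real dataset
signature = private_key.sign(data)

# On a later retrieval: check that the bytes are unchanged.
try:
    public_key.verify(signature, data)  # raises InvalidSignature if altered
    print("data match the original signature")
except InvalidSignature:
    print("data have been altered since signing")
```

The catch is exactly the one above: verification only convinces someone who
already obtained the public key through a channel they trust more than the
archive itself.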
Digital secondary storage is not vulnerable solely to computational mutation.
The most common storage media can fail in a handful of years, or just a few
months.
Even theoretically immutable media can degrade and fail quickly. Although
there are variants designed to last a millennium, those designs are only a few
years old. No digital storage system has existed for more than a few decades, and we
do not know how long any of them will last.
When digital media fail, the failure tends to be catastrophic. Even a small malfunction
or scratch can render the stored data unreadable, requiring expensive
experts to repair the damage.
Of course, the passage of a few centuries can render digital data unreadable
even if the storage medium is in perfect condition. The archive's discoverer
may not have a computer. If they do, they are still unlikely to be able to
read the stored data. Bytes are still bytes, and meaning is in the eye of the
beholder. A digital archive is just a set of ones and zeroes that could be
interpreted an infinite number of ways.
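As a small illustration, here is one four-byte sequence (a hypothetical
payload) read four plausible ways with Python's standard `struct` module:

```python
import struct

raw = b"\x42\x28\x00\x00"  # four bytes recovered from some medium

print(struct.unpack("<I", raw)[0])  # 10306       as a little-endian uint32
print(struct.unpack(">I", raw)[0])  # 1109917696  as a big-endian uint32
print(struct.unpack(">f", raw)[0])  # 42.0        as a big-endian float32
print(raw[:2].decode("ascii"))      # "B("        first two bytes as ASCII text
```

Without a record of the intended layout, no reading is more correct than
another.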
What custom hardware is needed to read the medium? Is there a working instance
of the hardware? If not, are there blueprints for building one? Are there still
extant drivers for this ancient device, and do they work on current operating
systems? What filesystem was used on the storage medium? How are the data in a
given file structured? Even if a file's contents are just "text", how is that
text encoded? EBCDIC? ASCII? UTF-8? UTF-32?
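To make that last question concrete, the same two bytes decode to entirely
different text under each encoding (Python, with code page 500 standing in for
EBCDIC):

```python
raw = bytes([0xC3, 0xA9])  # two bytes from a "text" file

print(raw.decode("utf-8"))    # "é"  - one accented character
print(raw.decode("latin-1"))  # "Ã©" - the classic two-character mojibake
print(raw.decode("cp500"))    # "Cz" - EBCDIC yields two plain letters
```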
In stark contrast, ink and paper have been used to preserve data reliably for
centuries, while papyrus scrolls have lasted for millennia.
Updating a book's contents with new information is very difficult, and for all
but the subtlest forgeries, any layperson can see the changes - a brand-new
page in a volume otherwise yellowed with age, or whited-out text with new ink
printed over it.
The written word's failure modes are forgiving and comprehensible, and recovery of
intentionally destroyed data is often feasible. A page can be dropped,
written over, torn, or even immersed in liquid, and still be readable.
Little technology is needed to read writing - just eyesight and knowledge of
the language. Even when the language is long dead and a mystery, it may yet be
deciphered, as Egyptian hieroglyphs were after the Rosetta Stone's discovery.
The pen is mightier than the program, and the digital age will vanish like
dust on the wind.