Tuesday, January 19, 2016

Two cool podcast interviews on bit rot, an unsolved problem

We all need to be aware of the problem of bit rot in our work.

One of my favorite podcasts, OnTheMedia, produced a program called Digital Dark Age (52:14) last year. It consists of several segments on the problems of protecting and archiving the vast amounts of data we are generating. If you don't have time to listen to the entire podcast, at least check out two segments, interviews of Vint Cerf on information preservation problems (6:28) and Nick Goldman on using DNA as a storage medium (9:02).

Let's start with the Goldman interview. He notes that applications like video entertainment and scientific research are generating immense amounts of data, which are already overwhelming today's optical and magnetic storage media. Goldman is experimenting with DNA as a very high-capacity, long-lived data storage medium. How might that work?

A strand of DNA is made up of strings of four bases, abbreviated A, T, C and G, as shown here:

The DNA strand is like a twisted ladder where
the "rungs" are either A-T or C-G bonds. (For
more, see this animation).

Biologists have developed equipment for sequencing (reading) the list of bases making up a strand of DNA and for synthesizing arbitrary strands of DNA. That makes it possible to store a copy of a binary file in a strand of synthesized DNA, for example by synthesizing strands of DNA in which binary 0s are represented by an A or C base and 1s are represented by a T or G. A DNA sequencer could then convert those A, T, C and Gs back into 1s and 0s.
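The mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the encoding Goldman or anyone else actually uses. The function names are made up for this example, and the rule for choosing between the two bases available for each bit (pick whichever differs from the previous base, so the same base never appears twice in a row) is an assumption added here to show one way the spare choice could be put to use.

```python
# Illustrative sketch only: 0 -> A or C, 1 -> T or G, as described in the text.
ZERO_BASES = "AC"  # either base stands for a binary 0
ONE_BASES = "TG"   # either base stands for a binary 1


def bytes_to_dna(data: bytes) -> str:
    """Encode bytes as a string of bases, one base per bit (hypothetical helper)."""
    strand = []
    prev = ""
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            bit = (byte >> shift) & 1
            choices = ONE_BASES if bit else ZERO_BASES
            # Assumption for this sketch: avoid repeating the previous base.
            base = choices[0] if choices[0] != prev else choices[1]
            strand.append(base)
            prev = base
    return "".join(strand)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a string of bases back into bytes (hypothetical helper)."""
    if len(strand) % 8 != 0:
        raise ValueError("strand length must be a multiple of 8 bases")
    out = bytearray()
    for i in range(0, len(strand), 8):
        byte = 0
        for base in strand[i:i + 8]:
            if base in ZERO_BASES:
                bit = 0
            elif base in ONE_BASES:
                bit = 1
            else:
                raise ValueError(f"unexpected base: {base!r}")
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    message = b"bit rot"
    strand = bytes_to_dna(message)
    print(strand)
    print(dna_to_bytes(strand))
```

With one base per bit, this scheme stores less than it could, since four bases could in principle carry two bits each. The trade-off in this sketch is that the extra freedom is spent on avoiding repeated bases rather than on density.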

Goldman thinks we will see expensive DNA storage devices in three or four years, and that they will be cheap enough for consumer storage in 10-15 years. DNA stored in cool, dry places will last hundreds of thousands of years and "all the digital information in the whole world everything that's connected to the Internet" will fit in "the back of a minivan." If Goldman falters, there are other DNA storage projects at Harvard (article, video) and Microsoft.

But even if DNA or some other storage technology gives us dense, cheap storage, there are other problems, as outlined by Vint Cerf.

Cerf talks about "bit rot." The simplest type of bit rot is media deterioration -- storage media becoming unreadable after 20 or 30 years. That can be overcome by creating a new copy periodically, but that is not enough. Let me give you a personal example.

In 2004 my wife had a small hole in the atrial wall of her heart repaired. In a 30-minute outpatient procedure, a skilled surgeon installed a small device in which tiny umbrellas were clamped over the hole, held there by a spring.

A balloon determines hole diameter (left), the device in place (right)

That was pretty amazing, so I asked for a video of the procedure, which was provided on an optical CD. The CD also included a program for viewing the video, the Ecompass CD Viewer. The CD media might be somewhat deteriorated by now, but I have transferred the program and data to magnetic storage, so I can still view it on my laptop.

But my laptop is running Windows 7, and the version of Ecompass CD Viewer I have was written in the Windows XP days. It still works, but will it be compatible with Windows 10 or later? If not, I could upgrade to a current version, but, as far as Google Search knows, Ecompass CD Viewer is no longer with us -- perhaps the company went bankrupt or dropped the program.

Ecompass CD Viewer works by stringing together clips stored in a video file format called SSM. If I could find an SSM viewer, maybe I could cobble together the entire video, since I have the SSM clips. That might work for these static files, but for something a user interacts with, like a spreadsheet, we need the functionality of the original program. Looking further into the future, Windows will disappear, along with Intel-processor machines.

Cerf does not see a complete solution to the bit rot problem, but he pointed to Project OLIVE at Carnegie Mellon University as a significant step in the right direction. OLIVE emulates old computers running old software on virtual machines. Below, you see an image from a demonstration of an emulation of a 1991 Macintosh running System 7 and HyperCard. The display, mouse and keyboard of the Dell PC are interacting with the simulated Macintosh, which is running on an Internet server.

Still image from a Project OLIVE demo video.

The OLIVE demonstration is impressive, but it is a research prototype that assumes standard input/output devices, and capturing the vast number of programs and hardware configurations that exist today would require a massive effort. Similarly, DNA storage is at the early proof-of-concept stage. Both feel like long shots to me, so, for now, we all need to be aware of the problem of bit rot in our work.

Update 3/9/2018

Check out this video putting DNA storage in context.