Vint Cerf speaks on Digital Preservation at NASA GSFC


-Our speaker today
is Vinton Cerf, who is vice president of Google, and their Chief
Internet Evangelist. He is also a visiting scientist
at JPL — the Jet Propulsion Laboratory. He is co-designer
of the TCP/IP protocols and the architecture
of the Internet, which qualifies him as one
of the fathers of the Internet. He has served
on so many positions in so many institutions that we really don’t have time to list them all if we expect
to listen to him today. But they include Stanford
University, DARPA, the Internet Corporation
for Assigned Names and Numbers —
ICANN — StopBadware — That’s an institution
I wish, certainly, all success to — and the Visiting Committee
on Advanced Technology for NIST, the National Institute
of Standards and Technology — he was chair of that committee. President Obama appointed him
to the National Science Board in 2012. His honors would likewise take
entirely too long to try to list all of them. They include
the U.S. Presidential Medal of Freedom, the U.S. National Medal
of Technology, the Queen Elizabeth Prize
for Engineering, the ACM Turing Award, Officer of
the Légion d’Honneur — that means “Legion
of Honor” in English — and 21 honorary degrees. He is a fellow
of the IEEE, the ACM, the American Association for
the Advancement of Science — the AAAS — the American Academy
of Arts and Sciences, and the Computer History Museum, and he is a member
of the National Academy of Engineering. With all that behind him,
it’s not surprising that in December 1994, People magazine
identified Cerf as one of that year’s 25 most
intriguing people. His personal interests include fine wine,
gourmet cooking, and science fiction. I can personally
vouch for that last, because it was when he was a speaker at Balticon, the Baltimore science-fiction convention, that my colleague Karen North spotted him and invited him to come speak here today. So let’s get on to that
with no further ado. Will you join me, please,
in welcoming Vinton Cerf? -Thank you. [ Applause ] Thank you very much. I always get nervous
when people clap before you’ve said anything. It feels
like I should just sit down, ’cause it won’t get
any better than that. Let me just verify by stepping
away from this microphone that the lavalier mike
is still working. You can hear me okay? The usual question
is “Can you hear me back there in the rear?” No, no — The answer
is “We’re not built that way.” But it’s okay. So,
I’m going to talk to you about a problem which has been disturbing me
for quite some time, and some other people
for even longer than that. Some of whom are here
in this auditorium. And it has to do
with the preservation of digital information
over long periods of time. I don’t pretend to be expert
at this, so why am I up
here lecturing is mostly just to raise the level
of awareness and sensitivity, not only
for our own personal interest like our photographs and important correspondence
and the like, but also for what we do
here at Goddard, and that is collect
very valuable, very expensive, and hard-to-get
scientific information which we want to preserve over
long periods of time, because the value may, in fact, increase as opposed
to decrease over time. And yet there are many,
many challenges in front of us. So I’m going
to try to touch on both the institutional
challenge of archiving
significant quantities of technical data accumulated from
our exploratory experiments, but also information that you
and I might hope is preserved
for our descendants over time. So, let me just
start by observing that static content has been
archived in a variety of ways, some of them quite robust
over periods of time, like the cuneiform tablets
which were not originally designed to be long-lasting — Many of them were just
transactional information about trade, for example — but they acquired
their longevity because fire sometimes burned
down the buildings and baked the clay tablets into very hard,
very long-lasting material. Papyrus,
on the other hand, was not designed to be
a long-lasting material, but it ended up in places that allowed it to last
for a long time like the caves in Qumran, for instance, where things
were very, very dry, and the papyrus
did not disintegrate, although it did dry out. But there is also
high-quality rag content — books that were published before 1800 often still
look pretty good today, because there was a high amount of cotton in the paper, as opposed to after the 1800s, when people figured out
how to make really cheap paper using a sulfur process which eventually caused
the paper to pick up, you know, moisture
from the atmosphere and turn it into sulfuric acid, which is why newspapers
turn yellow over a rather short
period of time. And then there’s vellum, which is basically
sheepskin or goatskin or some other animal skin, which is a highly
resilient material. There are vellum manuscripts which are well
beyond a thousand years old. You know, I’ve held
some of them in my hands, and if you can still
read, you know, ancient Greek or Latin
or something — and some of them
are beautifully illustrated. Of course,
I don’t come to you suggesting that we should use vellum as the storage medium
for our data, because a lot of dead sheep
would be required for that. You know,
you can imagine handing somebody a manuscript saying “Many sheep died
to preserve this information.” But nonetheless, it’s a lesson that is important to know that these older media for this kind
of material, some of which have lasted
for quite a long time. Now let’s look
at our digital media, and you’ll recognize
many of these things, and you’ll probably
also recognize that we don’t exactly have a lot
of reading equipment left to read 8-inch
Wang disc drives or 5 1/4-inch floppies or 3 1/2-inch floppies or VHS. I keep one VHS, you know,
player in the house so that I can still
watch old VHS tapes or convert them from VHS to DVD. You’re not supposed
to be able to do that, but, you know, there are ways
of achieving this. Unfortunately, even DVDs
are becoming less available. Those of you who have Macintosh
equipment will have noticed that the CD-ROM reader
has disappeared from the current brand
of MacBook. And there are external hard
drives and things like that, but the connectors turn out
to be a problem. So, we have the possibility
of storing digital information in media
that no longer can be read, even if the bits
are still there. And this leaves out,
of course, the other problem, which is even if
we pull the bits off of these media,
do we know what the bits mean? And this poses yet
another major problem for digital preservation. So, one thing
I wanted to emphasize — Oh, I see what’s happened. I’ve been flipping the charts
and you haven’t seen them, because I’m doing it
on my laptop and not over here, so those are the things
I was just talking about. Those are the things that
I was just talking about now. I apologize for that. Normally, my laptop
is connected to the projector, and it’s not, so I have
to be ambidextrous for today. So, I wanted to emphasize that this problem
is known by many people. I’m not the only person
to have noticed that we have a problem. And I wanted to draw attention to a book
which I was given, actually, about two weeks ago called
“Advanced Digital Preservation” by David Giaretta. This is actually
a very hefty 500-page tome, and it goes into great detail about the open archival
information system architecture. It digs very deep into concepts that would allow you
to build archives that have a reasonable hope
of lasting for long periods of time
or adapting over long periods of time
to new architectures, new digital media, new kinds of instruments, and new programming languages, and the like. One nice thing to know is that the Consultative
Committee for Space Data Systems has representation here
at Goddard in the form of the Digital Archive
Ingestion Group, some of whom I met just prior
to coming to the auditorium. The attention that’s being paid there,
I think, is very important, because there is
a great deal of value in the scientific data collected by robotic
and manned space exploration, and it’s almost without doubt that that data will be useful
10 years, 20 years, 100 years from now, possibly confirming
new theories, or possibly being reanalyzed
to discover things that we didn’t know
we should be looking for in that data in the past. The National
Science Foundation has also recognized
the importance of preservation
of digital information with a Research Data
Alliance effort, which is quite distributed. It’s led by Fran Berman
and John Wood, who co-chair the committee, and they are responsible
for allocating resources to support a wide range of efforts to preserve
digital information. I’m going to be in Iceland in the next week
to meet with the International Internet
Preservation Consortium. There were about 400 people
in the room the last time
I met with them, which made me feel better, that there were
hundreds of people who actually cared
about this stuff. And I invited Mike Kearney, who is part of — was, anyway — part of the CCSDS activity. He retired officially
and is now back in the fray with a contractor hat on. But Mike has been very vocal
about the importance of this kind of preservation. And I don’t mean to go through
every single one of these, but I want to draw
your attention to the fact that this is not an effort
which has gone unaddressed. It’s just that we still have
a great deal of work to do to achieve some kind
of long-term success. I do want to draw attention
to Brewster Kahle. For those of you
who track these sorts of things, Brewster
is sort of an uber-geek. He’s the guy that wired
the first Connection Machine that Danny Hillis
designed at MIT some years ago. But he recognized this problem
of archiving the World Wide Web, and so now he has crawlers that run around essentially ingesting web pages
the same way we do at Google, although our purpose
is to index. His purpose is to actually
capture the web pages and store them away. And so he’s been doing that
for about 15, maybe more, years in a facility
in San Francisco. He has a backup facility at the Library
of Alexandria in Egypt, not Virginia, and another one I think in Asia, but I forget where. It may have been Keio
University, but I’m not certain of that. So he’s been trying to somehow
absorb the World Wide Web, which, of course,
is an impossible task, considering its scale,
but he has a lot of it, and he has this Wayback Machine, so you can actually go
and look at web pages at a particular domain name as they looked 10 years ago or 15 or 20 years ago, so it’s actually quite helpful
when people have disputes over what happened when or what were people
capable of doing. He’s capable
of taking you back in time to see what the content
of the web pages were. And, oh, there’s one other
here at the bottom. NEON is the National Ecological
Observatory Network, which is a major effort
at the NSF to build observation
stations, towers, all the way across the country for both water
and atmospheric sensing, and to gather
all of that data concurrently and then to try
to make a model out of what it’s telling us about ecological conditions
in the U.S. and, ultimately,
they hope, elsewhere. So, this is not
just about preserving things by moving bits from one medium
to another, although that, too, will be required over time, just as we all don’t know
that we have problems. There I go again —
I forget to do this. You can just holler at me if I appear to be
speaking to a slide that isn’t up there,
and it will help my memory. So, we have to worry about, you know, what is the shape
of a digital object, you know? How is it structured? How is it represented? What vocabulary and standard terminology
should we use to describe this stuff in a way that other people
will understand it or that a program
could understand it, which is sometimes harder than getting people
to understand things? We need to have common
identifier spaces, so we can make reference
to digital objects and digital content. We have to be able
to refer to the registries where they are
or the repositories where they are. We have to be able
to resolve references to them. So, let me give you an example. Everybody uses
the World Wide Web. When you type a URL in, something happens called
a domain name lookup, that translates
into an IP address, and that takes you
to a website, typically, and then you download
whatever that web page is.
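As a rough sketch of the lookup chain just described (and of the two failure modes that come up shortly, the unresolved name and the 404), here is a minimal Python example using only the standard library; the hostname is just a placeholder:

```python
# Minimal sketch of "type a URL in": resolve the name, then fetch the page.
# Standard library only; the hostname below is a placeholder.
import socket
import urllib.error
import urllib.request

def fetch(host: str, path: str = "/") -> None:
    try:
        # Step 1: the domain name lookup, translating a name into an IP address.
        address = socket.gethostbyname(host)
        print(f"{host} resolves to {address}")
        # Step 2: download whatever that web page is.
        with urllib.request.urlopen(f"https://{host}{path}", timeout=10) as response:
            print(f"fetched {len(response.read())} bytes, status {response.status}")
    except socket.gaierror:
        # A lapsed or unregistered domain no longer resolves at all.
        print(f"{host}: unresolved domain name")
    except urllib.error.HTTPError as err:
        # The name resolves, but the page itself is gone: the familiar 404.
        print(f"{host}: HTTP error {err.code}")

fetch("example.com")
```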
There’s an interesting sad irony about the World Wide Web. It was developed originally
by Tim Berners-Lee at CERN in order to help
his physicist friends get access
to each other’s information — primarily, I think, papers, renderable materials, that were hard to come by. He wanted to make it as easy
as clicking on a hyperlink in order to pull up
this documentation. But every one of you,
I’m sure, has experienced
the click on a link and getting back 404 — Site Not Found
or unresolved domain name. What happened — and this is a real lesson,
I think, for those of us who believe that commercialization
is sometimes helpful because it creates
an economic engine to support the process — Domain names used
to be free of charge, run by a volunteer
named Jon Postel at USC Information
Sciences Institute who maintained a notebook of
assigned top-level domain names. What happened is
that somebody decided spending
research money from NSF to maintain the domain
name system seemed silly because it was mostly now
being used — around 1992 — in the booming, you know, dot com boom period. And so they said, “Why don’t you just
charge for this and let it pay for itself?” Well, in fact,
that’s exactly what happened. The trouble is that if somebody
fails to pay their monthly or yearly registration
for a domain name, it may no longer resolve,
in which case, all the things that it pointed
to may disappear. Even if they’re still physically
on the net, you can’t find them, because the reference
no longer resolves. And so
this really elegant design, the commercialization of it
has led us to this fragility which we need to overcome. It’s pretty clear,
reading this book and meeting with the DAI group
earlier today, that ingestion of data needs to be really thoughtful
and very rigorous. A lot of information —
representation information and the like —
all the metadata — Where did the data come from? How were the instruments
calibrated? How do I tell
that it’s authentic data? And so on.
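To make that bookkeeping concrete, here is a small, hypothetical sketch of an ingest record in Python; the field names are illustrative and are not taken from the OAIS or CCSDS vocabularies:

```python
# Hypothetical sketch of the metadata an ingest step might capture.
# Field names are illustrative, not OAIS or CCSDS terminology.
import hashlib
from dataclasses import dataclass, field

@dataclass
class IngestRecord:
    source_instrument: str        # where the data came from
    calibration_reference: str    # how the instrument was calibrated
    representation_info: str      # how to interpret the bits (format, units)
    provenance: list = field(default_factory=list)  # processing history
    sha256: str = ""              # fixity value, so authenticity can be checked later

def ingest(path: str, **metadata) -> IngestRecord:
    # Compute the fixity value at ingest time, then record everything together.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return IngestRecord(sha256=digest, **metadata)
```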
All has to be accounted for if we’re going to have archives that are useful
both in the present and in the future. The other thing,
which one of my colleagues who gave me
this book is tackling, is the legal framework in which this kind
of preservation is supported. There’s other issues associated
with software, for example. Some people own the software and they don’t
want you to use it unless you pay them royalties for it or license fees
for some period of time. So it could be patent, which is 17 years, but those patents
often get extended in various ways,
or copyright, which is 70 years after
the death of the author, which I think is excessive, but, you know, that’s the way the intellectual
property community has pushed copyright law over the last 50 or 60 years. So, the question is whether or not we can get
special dispensation under the patent
and copyright laws, for example, to run software
on behalf of third parties or get access to information which would otherwise
be protected for purposes of preservation. And so, we don’t have too many
carve-outs. Libraries are special
in the copyright law, and so they have privileges which we might not have
as ordinary citizens or as corporations, and I think the same kind
of thing may be needed. And my colleague,
who is working in this space, has been looking
at a variety of areas where preservation
should be given authorities that we wouldn’t normally
get in the commercial sector. Then there’s this question
of who’s gonna pay for it all. And this is a non-trivial
problem, especially if you’re talking
about 100 to 1,000 years or something. I have been,
just for the heck of it, looking at corporations that have lasted
for a long time, and a few come to mind. How about the Catholic Church? That’s a couple thousand years. I’m not suggesting that they should become
the digital archive of record, but the fact
that they’ve persisted over a couple of thousand years
is pretty interesting. The only other
really long-term companies that I know about tend
to be breweries, you know, that have been around
for 500 or 600 years, and, you know,
maybe they have caves down in the brickwork and dug into the wineries. So I don’t know whether that’s indicative
of the kind of organization that we might have
to turn to for longevity, but it is kind
of an interesting observation. And in the worst case, we can always just
drink the beer and forget about the fact that we haven’t succeeded
in storing anything for long periods of time. But these are just
snapshot examples of the kinds of problems that are not
necessarily technical, but they have to be solved if we really are serious about preserving
digital information or any kind of information
for long periods of time. Now I have to remember
to go there. Okay.
So, here’s a couple of problems. One of them
is just scaling up — storing huge amounts
of information that a lot of our instruments
are producing, whether it’s
ground observatories or orbiting telescopes
or robotic space probes or the Large Hadron Collider,
or the South Polar IceCube. By the way, if you haven’t read about this,
it’s absolutely amazing, and your TDRSS system,
by the way, is helping us pull data
from the IceCube at the South Pole back to places
where it’s useful. This thing is literally
at the South Pole. They drilled, I don’t know,
5,000 holes in the ice, go down
about a mile and a half, and they put detectors
in the holes to sense the light that’s generated
by a high-powered neutrino that interacts with, you know, ice — with water. As you all know, neutrinos don’t interact
with much of anything, and so when they do,
it’s hard to detect. And these things
are very, very powerful — possibly coming
from outside the galaxy. I’m not sure
I have the exact metrics right, but there were three incidents that were detected
in the last year or so. One of them — I think it’s gigaelectron volts, but somebody here
might tell me I’m wrong. There were two incidents that were 12
or 13 gigaelectron volts, and they called that — Bert and Ernie were
those two incidents. And then
there was another detection later that was
like 8 gigaelectron volts, and that was Big Bird. So I don’t know
what’s gonna happen when they run
out of Muppet names in order to reference
these things. But the point
I want to make is that the data accumulating
is significant in scale and it’s also fairly complex
in structure, if you think about all
of the information that’s needed in order
to make sense of the numbers that those data represent. I was told that the Apollo Lunar
Surface Experiment Packages — the ALSEP packages — lost a lot of data, or the parties responsible for
archiving the data lost some of it, because they just
repurposed the tapes that the data was written on. There’s an analogue
of this in the vellum world, in case you don’t know. Some people considered vellum
to be more valuable than that which
was written on it, and so there have been attempts
made to scrub the data from the vellum and to write on top of it. There’s an example of this. It’s called the Archimedes
Palimpsest. If you have not read about it,
I would recommend it. It’s a fascinating story. The manu– Sorry? Yes. In fact,
the Walters Library did some beautiful
preservation work. The guy that owns it
is in Fairfax County, it turns out. Ha, ha, he’s in Virginia
and not Maryland, you know. Phbt. Sorry, I’ve lived in Virginia for about 40 years
and I know there’s this tension. Anyway,
what happened is this manuscript was copies
of Archimedean works in Greek and it was written
around 1000 A.D., kept in a monastery
somewhere in the Middle East. It might have been Jerusalem,
but I’m not certain of that. And then around 1200 A.D., the abbot
of the monastery decided that the vellum
was more valuable than that which was
written on it, so he had them scrub off
as much as they could, rotate the vellum 90 degrees, cut it in half, and then put sort
of an owner’s manual — a liturgy and other operating procedures for that monastery
on this vellum thing, and that vellum
stayed in the monastery for about 700 years until around 1900. Then it disappeared, and nobody knew where it went until it reappeared in the attic of some Frenchman’s house
in 1998 where it was
offered for auction. It was in terrible condition. It was, you know, all — water dripped on it and mold and it was all wrinkled
and everything else. And so my friend
took it to the Walters Library, spending about $1.6 million
to acquire the manuscript, and they have now
not only taken it apart, ’cause they had to literally
look in the binding in order to see the parts that had been written
before it had been cut up. They discovered at least
one Archimedean writing that talked about what we would
have called pre-calculus. Archimedes
was apparently familiar with the idea
of infinitesimals and the notion of area
under a curve. He never got quite
as far as Isaac Newton and the others
who developed calculus, but just think,
in 300 B.C., he was that far away
from the calculus, so if that manuscript
had been preserved and more widely available, you know, who knows
what we might have accomplished. But the point
I want to make is that these things
do disappear if we don’t make
a conscious effort to preserve them, and so this particular situation with the tapes is not new. So the efforts here to do a better job
of instituting preservation, I think, is very important. So, from reading David’s book, and the others
who helped author it, I was attracted to this notion, since I’m a computer
programmer by trade, to the recursive element
that was in here. You had to define
representations of things and then the description
of the representation might actually
be recursive ’cause it had to refer
to something else, and eventually it’s turtles
all the way down. The question is how do you stop
the recursion? And a very clever idea
was introduced and that is called
the Designated Community. And what happens is
that you do the representation and you represent
the representation — You keep going
until you get to the point where there is a community that actually understands what
the lowest level representation means and you don’t have
to go any further than that. Except for one problem — What happens if that Designated
Community dies out and nobody remembers
what that meant? So now we have this problem that we have to be conscious of whether there is longevity in the Designated
Community itself, and if there isn’t
for some reason, then the whole process of archiving
has to take that into account and provide
additional information so as to allow the data which is represented
in these complex ways to be correctly interpreted
by a less well-informed Designated Community. And when we’re talking
about hundreds of years or thousands of years,
it’s almost certain that the available
designated community will change with time and possibly have less knowledge than the one that triggered
the original archiving process. It’s also pretty clear that we have
quite a wide variety of information
and information structures that we may have to capture. And so these very complex kinds
of objects that get generated in the course
of measuring data have to be not only captured but described
with sufficient precision that first we can unpack them, but second we can understand
their semantics. And that, too,
is a very major problem — figuring out what vocabulary
to use to describe the semantics of a complex data object so that software can be written
to correctly interpret that is, again,
a fairly significant challenge. So, one thing which is certain
to be important is that we have
to be systematic about this. This cannot be
an accidental thing. This cannot be a casual thing. And what NASA
and others, especially
those working on OAIS, have done is to describe,
in this book and elsewhere, a very systematic approach
to figuring out whether or not
the archiving process is actually reliable or has taken into account that which would
make it reliable. There are also
recognizable pressures coming
from the federal government with regard
to government-supported research pressing the need
for preservation of the digital information
that’s generated. The President’s Committee
of Advisors on Science and Technology
and the Office of Science and Technology Policy
have mandated that all of
the federal agencies, including NASA, make certain that when they award contracts
outside or when they do work
inside of NASA that there is attention
paid to the capture and preservation
of digital content. And, of course,
this question of affordability, again, raises its ugly head, because if we’re talking about hundreds
to a thousand years or something, the business model that will retain
that information and maintain its availability
is a challenge. I also think
it will be very important not to rely on some big
central archive somewhere, because, again, longevity
is in question, and the distribution
of the archived material might turn out to be important, so if an archive fails, you haven’t lost everything. We learned that lesson
at Google very, very clearly and early on when we started building
multiple data centers. We replicate data
inside the data center so that if a portion of a data center fails,
we haven’t lost anything, and we replicate the data
across data centers in case we’ve lost
a data center, and the same thing can be true
for a serious effort at long-term archiving. One thing
which I think is very important that in addition
to having policies and practices,
procedures like this book outlines,
that it’s important that the implementations of these archives
be inter-workable. And the reason
that that’s important is that you don’t want
to wind up having to rely
on a particular archive fro a particular kind of data only to have
that archive disappear. And so having multiple archives that can handle
the same kind of information and can be used
by the parties relying on archives with equal facility
is very important. So inter-working
among the archives and the ability
of a designated community member to get access to data from any of a number
of archives is important. It’s especially true
for this concept of succession, where if an archive
is demonstrably going to go out of operation — if you know that
ahead of time — you want to be able
to migrate the information to other archives that are still
equally accessible and useful
to the Designated Community. And, of course,
whatever we do, taking into account
the example of PCAST and OSTP, we need policy frameworks that will create incentives
for the building and operating of these archives. So, I want to switch gears
a little bit to focus in
on a particular problem. This is not so much the problem
of the scaling and the complex data objects that NASA has to deal with, but it has a lot
to do with digital information that you and I might care
a lot about. For example, our family
photographs or videos on YouTube, Flickr, and Picasa. It turns out
that there’s a lot of software involved in rendering this kind
of digital material. And so I want
to give you a concrete example of something I wonder about. Some of you will have read
a book by Doris Kearns Goodwin called “A Team of Rivals.” It’s about Lincoln’s hiring of all of his rivals
for the presidency to become members
of his cabinet, hence the term
“Team of Rivals.” If you read the book, what you read
is a substantial amount of dialogue in the book
which sounds quite credible. And it’s kind of surprising that it sounds so credible, and I ask myself,
“How did she do that?” ’cause she wasn’t around in 1860 when all these events
were taking place. At least, I don’t think so. It turns out she went to — I don’t know how many libraries and I made up the number, but she had to have
gone to a lot of different libraries to get their correspondence to see what topics
they were talking about and how they expressed
their views. And then from this, she was able to reconstruct
credible conversations among the various parties. So then I got to thinking, “Well, what if one
of you is a 22nd-century — or one of your descendants
is a 22nd-century — Doris Kearns Goodwin wondering what it was
like in the 21st century?” And the question
is “Would this person have access to our e-mail, our tweets, our blogs, our web pages, the URLs that we referenced and so on?” And the answer
is “Doesn’t sound like it.” At least, not if we don’t do
something about it. So I think
that we are facing what I’ve been calling
kind of a digital dark age that our dependence
on digital content right now just as
ordinary people in our society, not as scientists,
is actually risky if the software that we are relying
on doesn’t exist anymore, doesn’t run on any of
the new operating systems or hardware that show up
in another 10, 20, 30 years, or even next year,
for that matter. So I think our 22nd-century
Doris Kearns Goodwin is gonna have quite a challenge if we don’t
do something about it. And then it gets even worse. Instead of static content, what about executable stuff
like games, for example,
or even something that’s as simple
as WordPerfect or Microsoft Word or any of the other
text-editing software that we have? When you look at
a complex data object emerging
from even those applications, and you ask yourself,
“How will I be able to maintain
correct renderability of that material, or manipulation of it
in case of a spreadsheet, over long periods of time?” And it may turn out
that the people who make this software
won’t make it backward-compatible
to older formats. I think many
of you will have already had the experience like I have. I have some 1997 PowerPoint
slides and I pull them up
in the 2011 version of Microsoft PowerPoint, and it basically said,
“What’s that?” And I said,
“It’s a PowerPoint slide set, you blankety blank,”
and it didn’t help. And I can’t blame Microsoft. I don’t think that it’s reasonable to expect that a commercial product
would necessarily always be made backward-compatible to something
that’s 20 years old. So we have a real challenge
here trying to deal with content that was very much bound up
to a particular application. Think about how much
energy you would have put into making
some of these things. I mean,
some people will, you know, write whole books
and, you know, all kinds of other artifacts
using software which may not run anymore
after a decade or two or even a few years. So these are just
some of the challenges which should
be absolutely obvious to everybody
sitting in the room, so I don’t propose to go
through here point-by-point. I mean, the first two
are pretty obvious, but they’re very important. But the thing
in the upper right, I also worry about. Companies go bankrupt. And I don’t know how much
you know about bankruptcy law. I didn’t used
to know very much about it, except I worked for
a company called MCI that was acquired by WorldCom that went bankrupt in 2002, so I learned
about bankruptcy laws. I learned more
than I wanted to know. And one of the things
you learn is that the bankruptcy court
thinks anything is an asset to be held onto
and sold to somebody else. And so imagine that you were
relying on software from some company
that has gone bankrupt, and you say, “Uh, can I get
a copy of the source code?” The bankruptcy judge says, “No. That’s an asset
that I plan to sell.” “Well, can I get a copy
of the object code?” “Well, no.”
You know, you got a whole series of things under bankruptcy law that might deny you
and me access to the very thing we might need in order
to keep using that software or running it
in new environments. I’ve already mentioned
the intellectual property rights and legal frameworks issues that arise. Oh, you might wonder what’s
this digital X-ray about. I’ll come to that.
It’s a tactic — just a tactic
for trying to capture an operating system, a piece of application code, the hardware instruction set, and the data files of a particular application in a kind of digital x-ray that you could use to slam down
into a virtual machine and run the old code on top
of this virtual environment. So, one of the projects that addresses
this specific problem — executing old code — is called OLIVE, like the thing
you stick in your martini glass. Developed at Carnegie Mellon,
Mahadev Satyanarayanan — I really had to practice, We call him Satya
for obvious reasons — developed this technique. It was funded by NSF. And basically
what he’s trying to do is to run old software on top of virtual machines running on new hardware which doesn’t necessarily have
anything to do with the hardware that originally
ran the application. So, there’s a whole lot
of moving parts in here to try to emulate
precisely the hardware that ran an old operating system that runs the application that interprets
the data correctly. In this case,
either renders the data correctly or maybe lets you manipulate it
like a spreadsheet. And so the digital X-ray is a way of
capturing all that information. Unfortunately,
you wouldn’t do this, but there were some people
in the reporter field who thought that I actually
meant you take the laptop with the running application and stick it
in an X-ray machine and you take a picture
and then that’s all it was. So much for my ability
to communicate. He decided that, you know, these virtual machines with all of that data
are really big, large pieces of code, and so he was trying
to find a way to make it easier to run. Instead of having everything
all into your laptop, for example, he had the idea that if he could run this kind of like the way we do Netflix. You know,
Netflix doesn’t necessarily download
the entire video that you’re watching —
It just keeps feeding you pages and it tries to feed you
pages ahead of time to keep you from getting
that delay signal. So that’s pretty cool, except that, in the case
of things like Netflix, it’s a linear delivery, right? You know when the next
page is due. On the other hand,
if you’re running a virtual machine
and not all of the information that you need is
in the virtual space — some of it’s out here
in the cloud — you have to be able to figure
out what I should pull in next, so you have to kind of predict
what pages are gonna be needed. It’s sort of like typical
virtual machine operations, where you’re trying to avoid
page faults by pre-loading pages in that you think the operating
system’s gonna need. This was Peter Denning’s
dissertation thesis around 1968, as I recall — the working set model.
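As a toy illustration of the working-set idea (keep, or prefetch, whatever was referenced in the last few steps), here is a short Python sketch; it is not Olive’s actual prefetcher:

```python
# Toy working-set calculation: the pages referenced in the last `window`
# steps are the ones worth keeping resident or prefetching.
# Purely illustrative; not Olive's demand-paging code.
from collections import deque

def working_set(reference_string, window):
    recent = deque(maxlen=window)   # sliding window of recent references
    sets = []
    for page in reference_string:
        recent.append(page)
        sets.append(set(recent))    # the working set at this point in time
    return sets

refs = [1, 2, 1, 3, 2, 4, 4, 5, 1]
for step, ws in enumerate(working_set(refs, window=3)):
    print(f"after reference {step}: working set = {sorted(ws)}")
```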
So, he’s trying to turn this problem
of working set model, but in a virtualized environment on the net. So, streaming is not so easy, and I really alluded
to most of the hard part, including the prefetching
and demand paging part, which he’s successfully
implemented, which I find pretty amazing. So, this is kind
of what he does. He’s got whatever the hardware
is at the bottom. So you imagine it’s 100 years
from now, and you’ve got
whatever the new hardware is — some new operating system — and he’s running
a virtual machine on top of that operating system that has its own bytecode. So it’s not emulating
the hardware of the machine
it’s running on. It’s emulating
a virtual machine code, and so that virtual machine code
is emulating the hardware that the original system ran on.
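To make the layering concrete, here is a toy Python interpreter for an invented two-register instruction set; the opcodes are made up, which is the point, since the host machine never executes them directly:

```python
# Toy emulator for an invented instruction set, standing in for the bottom
# layer that has to be reproduced. The opcodes here are made up.
def emulate(program, registers):
    """program: list of (opcode, a, b) tuples; registers: dict of values."""
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        if op == "LOAD":        # put a constant into a register
            registers[a] = b
        elif op == "ADD":       # add register b into register a
            registers[a] = registers[a] + registers[b]
        elif op == "HALT":
            break
        pc += 1
    return registers

# The "old application," expressed in the emulated machine's own terms.
old_program = [("LOAD", "r0", 2), ("LOAD", "r1", 40),
               ("ADD", "r0", "r1"), ("HALT", None, None)]
print(emulate(old_program, {}))   # {'r0': 42, 'r1': 40}
```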
It turns out that’s not as easy as it sounds. First of all,
it may be hard to get details of how the hardware
actually worked — what the instruction
set looked like, how it executed —
especially the ones that had bugs
in them, because sometimes the software only worked because of the bugs
that were in the hardware. I mean, this is one of those,
you know, my head is breaking. But he discovered — and he’s done this for,
I don’t know, a couple of dozen
different operating systems and machine platforms, and he’s encountered
these various “gotchas.” So, the whole idea is that after you get
this hardware emulator running on top
of the virtual machine, then you load
the operating system and the application,
and then process the data. One interesting side effect of this is that running
old machines’ operating systems, like DOS 3.1 or something,
with an application on it, in this new environment
had a timing problem because it ran faster
in this emulated environment than it did on the original
ancient machines. I’m sure
it could also go the other way, which is pointed out,
again, in David’s book. But for the cases where it actually runs faster because the modern hardware
is faster, it means that some of the games you might try to play,
you can’t win because the machine
is running 100 times faster than it did before. I mean,
that’s yet another little nuance in all of this. But this basic idea
is a pretty powerful one, and he’s been able to show
that it can be made to work for quite a variety
of real cases, and so you can sort
of imagine, you know, he’s got this linearized thing with the description
of the hardware written in XML, which is interesting, and then the disk image
of the operating system and the application
and the actual data. So that gets all downloaded
either all at once in some big local machine or paged in through the net. So, this is one way
to do it, where everything, including the virtual machine
is running local to the user, and we’re just pulling pieces
of the virtual machine or the executable in from the net. What he did in this case
was make sure that all of the fetches are done using standard World Wide Web HTML,
HTTP fetches so that for
all practical purposes, the Internet sees this as just
another web-based application, even though, of course, it’s
doing something pretty unusual in the machine
that’s receiving all this. And he also did the same thing by exporting
the whole virtual machine into the cloud somewhere and then just pulling images
in, sort of like X Window, for some of you might remember
that from MIT years ago. This has a very interesting
property that you can imagine
using the cloud as the mechanism for running old software on
old-emulated hardware in a cloud which has the ability
to expand capacity in order to run
many of them or to run one at adequate speed. So, this is where the program
ended up at Carnegie Mellon. It’s sort of in maintenance mode
at this point, because the NSF funding
has run out. Personally, I think
this is an important element of preservation
in the long term and that somehow it should
be revived, but now we’re back to questions
of business model and how to achieve
that objective. So, there are a lot
of technical challenges here, and, again, I don’t want
to take up too much time with the details as much as to get to some
of your questions and comments. But Satya
is very quick to say that there’s nothing simple
about doing these emulations, and it can be hard
to get them exactly right, especially if the hardware maker
is not interested in giving you
all the details of exactly how the hardware worked
and what its proclivities are. So, there are other projects
that come to mind. I’ve already
mentioned the Internet Archive that Brewster is running,
and the Computer History Museum out on the west coast
is also accumulating software as well as artifacts, machines, some of which still run, and other computing things. And, of course, at Google,
we’ve been doing book scans, and we also have this thing
called the Cultural Institute that we set up in Paris in the building that used to belong
to the French national railroad. So, I don’t know
how we acquired that, and maybe I don’t even
want to know, but we ended up
with this lovely old building with all these beautiful
ceilings, but in that building
is a wall the size of this screen covered
with high-resolution displays. We are accumulating
literally bazillions of images coming from museums
all over the world that they have either sent
to us or used our platform to instantiate all this so that people can visit museums
anywhere through the net or they can assemble
virtual art exhibits. Like if you want to see a major
exhibit of van Gogh, a lot of it’s
in The Netherlands, but a lot of it’s elsewhere, so you can literally
assemble a museum that doesn’t exist with
these virtual images available. Or, they’re real images, but it’s a virtual museum that you assemble using
the Cultural Institute. So, I’m going to stop there, I think, ’cause
that is the last slide, and ask if you have
any questions or issues you’d like to raise. And, of course,
if not, you all get to go home
early or something. But I would be happy to try to respond to questions
or comments that you might have, and in any case, I appreciate very much the time
to join you this afternoon. Thank you. [ Applause ] So, we have microphones, and there we already
have a question. -No, I’m just reminding people that if you have questions,
please come to the microphones, and if you are too far
in and can’t get out, just signal to us
and we’ll pass a microphone in for you. -So, before you ask,
let me just warn you — I’m hearing-impaired,
and I’m not the guy that came to speak
and didn’t want to listen. But we’ve got this repeater
here, which may work. If it doesn’t,
I’m gonna run down the stage in case I have to lip-read, which will cause
the guy who’s videotaping to go slightly crazy, but I promise I won’t bite
and I don’t spit. So, okay, ask away. Let’s see where we go. -You’ve mentioned particularly,
for instance, reading WordPerfect
1.0 documents. One of the problems there is that there is
no standard, say, word-processing format. I know that there’s OpenDoc
as an effort. My understanding
is OpenDoc is, shall we say,
not trivial to implement. It would seem as though between NIST
and the federal government that NIST might be a useful tool with your influence with them to declare some standard
document formats that then the federal government
would only buy programs that were able to read
and write those formats? -So,
I can imagine some companies wanting to have that happen. The trouble is the government
doesn’t always come up with the right answer
for stuff like this. And besides, that may only work
for a finite period of time. Imagine you’re the National
Archives. I know those guys, and what happens
every four or eight years is that people show up
with disk drives — disk drives! They just
hand them these things — “These are the records
of the U.S. government.” And these guys have to index
the disk drives, figure out what’s on there
and everything else. There are formats that have been adopted
by many, like PDF/A, which is
the archival PDF format, but I will guarantee you
that over time, the complex digital objects that we have are not
just simply renderable. You have to interact
with the software in order to make use
of the object that’s been created. And so
we should be really careful not to fall into the “it’s
just a document” trap. And I’m certainly
not accusing you of that, but I worry about trying
to pick a format only later to discover
that there are things I want to express that I can’t with
whatever that chosen object is. I’ll give you a concrete example
of this. The request for comments
that are used to document the Internet
standards were created in 1969. The format was ASCII text only,
so all the figures and everything else
had to be made with little X’s and 0’s
and dashes and everything else. And John Postel,
the editor, absolutely insisted
that we stick with that, because it had high probability of being renderable over
really long periods of time, and he was right,
because those documents have been available
for quite a long time. But now the community is saying, “I need to show things that are a lot more
complicated than that, and I don’t have time to try
to draw them using ASCII text,” and so they’re shifting
to PDF types of structures. So no matter what we do, we’re gonna end up with some variation
in document formats. While I think
that might help some, what would be even better
would be to figure out how we describe those things so that you could take
a piece of software using the description and automatically
generate something that correctly
interprets the object. That would be the ideal outcome,
from my point of view. Figuring out whether we can do that
is an interesting challenge. Some of you will have heard
the term compiler compiler. Some of us used to write
programming languages using compiler compilers that would create a compiler
for the language that you were designing, which then would compile
a program and then execute. I can remember having trouble
remembering whether — “Is this gonna happen
at compiler compiler time or compiler time
or at execution time?” And that was not always so easy. But I think our aspirations
should be in the direction of being able
to ingest an object if we know what it is — if we know
what its structure is — and be able to generate programs that know what to do with it. So there.
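In that spirit, here is a minimal, hypothetical sketch of what generating a reader from a description could look like, using Python’s struct module; the record layout is invented for illustration:

```python
# Hypothetical: a declarative description of a record layout, and a generic
# reader generated from it. The layout itself is invented for illustration.
import struct

RECORD_DESCRIPTION = [
    ("timestamp", "I"),   # unsigned 32-bit seconds
    ("sensor_id", "H"),   # unsigned 16-bit identifier
    ("reading", "f"),     # 32-bit float
]

def make_reader(description):
    """Generate a parser from the description instead of hand-coding one."""
    fmt = "<" + "".join(code for _, code in description)
    names = [name for name, _ in description]
    size = struct.calcsize(fmt)
    def read(buffer: bytes):
        return [dict(zip(names, struct.unpack_from(fmt, buffer, offset)))
                for offset in range(0, len(buffer) - size + 1, size)]
    return read

reader = make_reader(RECORD_DESCRIPTION)
sample = struct.pack("<IHf", 1700000000, 7, 3.14)
print(reader(sample))   # [{'timestamp': 1700000000, 'sensor_id': 7, 'reading': 3.14...}]
```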
-During your talk, you mentioned repeatedly the importance of business models and economic incentives, but I didn’t hear, or, at least, catch, any examples of working or even
workable options. -So, let me give you one. Well, there are a bunch. Let me start with an analogue. When Internet was built, we had no clue how it was gonna be supported. We knew that the government
was paying for it initially. That was the easy case. But we wanted
to become commercialized and self-supporting. And so in the end,
the architecture said, “We don’t care
what your business model is.” So some people
run not-for-profit pieces of it, some people
run for-profit pieces, some people run
personal machines like we do at home, some people are running
government-sponsored components, but it all works together, because no one is forced to use
someone else’s business model. So the first observation
I would make is that we should think in terms of multiple
possible business models for systems like this, some of which might be
very commercial and some of which are not. But I will give you
one very interesting example. There’s a guy in Ireland who has started
a company called emortal, and, basically, it is aimed at helping people save
all their family stories and photographs and videos and all these other things for,
you know, the next generations. And the question is,
“What’s the business model?” It’s actually
an interesting experiment. He’s gone
to the life-insurance companies and said, “You know,
you’ve insured your life and the legacy is money
going to your heirs. Why don’t you insure
your digital legacy, as well? Why don’t we make that part
of the service?” Life insurance — now it’s digital life
insurance product. And I want you to think
for a minute about what you know
about life insurance. Think about this business model. Here’s the deal — I’m gonna come down,
the life-insurance salesman, I’m coming to you and I’m selling you this thing which you’re gonna have
to pay me for until you die. Okay, first of all, that’s
a great business model, right? You know, they just keep paying
until you’re dead. It’s really death insurance, but you wouldn’t buy that, so
“life insurance,” is, you know, a clever naming tactic. So, the deal
is I come to you and I say, “I will not only
insure your life so that when you’re gone, your heirs will get some money, but I will also
insure the preservation of your digital story.” And, you know,
I don’t know if this guy is gonna
succeed or not, but of all the business models
I can think of, that’s the only one
I know of where you ask people to keep paying you
until they die, as opposed to keep paying
until the mortgage is paid off, whichever comes first. So, those are — That’s not a wonderful broad set of possible answers. It’s sort of like the domain
name world, which found various
and sundry ways of supporting itself. So, the answer is somebody out there is probably
going to invent some interesting
business models. Maybe it will be you. -Thank you. -Yes, sir. -In H.G. Wells’ book
“The Time Machine,” the traveler
goes way, way back — way, way forward in time. And he encounters
our descendants who have a meager library which basically
crumbles in his hands. How would you envision that? -It crumbles in his hands.
-Yeah. The books are so old
that they — -Oh, oh, oh.
Okay. Well, you know,
some of you might track this — physical media. There’s at least a report that comes out
every once in a while where somebody’s found a medium
that really lasts a long time. There’s at least one example — the claim is that it will
last 10 to the 20th years. Of course, they haven’t had 10
to the 20th years to test that, but the assertion is — It’s a photonic
recording system. I did see one other example
which I was fascinated by. Some people have now figured out
how to use DNA — literally,
the deoxyribonucleic chain — for encoding
of digital information. So they generate the chain, and it’s a specific sequence that is interpreted
as binary code.
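As a toy version of that idea, mapping two bits to each base and nothing more, here is a short Python sketch; real DNA-storage codes are considerably more careful about errors and repeats:

```python
# Toy encoding of bytes as a DNA base sequence, two bits per base.
# Real schemes add error correction and avoid long repeats; this only
# shows the idea of a sequence that is interpreted as binary code.
TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
TO_BITS = {base: bits for bits, base in TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(sequence: str) -> bytes:
    bits = "".join(TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

sequence = encode(b"GSFC")
print(sequence)                     # CACTCCATCACGCAAT
assert decode(sequence) == b"GSFC"
```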
The reason I bring this up is that I learned from them — They’re in the U.K. — that if
you remove water from DNA, it’s remarkably resilient. The stuff is really stable
as long as it’s not wet. Once it’s wet,
it’s a different story. So you can kind of imagine somebody
going to the trouble of encoding quite a bit of information
in a tiny amount of space, and you might also imagine
that the ability to get the sequence of DNA, now that we know
how to do this, will probably not be lost — If it is lost, maybe it’s not worth
remembering anything anyway — but those two things struck me
as very interesting, but they are all about
the preservation medium and the ability to read it. But that’s only getting bits
back. That still doesn’t address all
the other part that has to be in place in order to make sense
of what the bits are. But in terms
of long-term media, I’m beginning
to see potential for things that could last a very,
very long period of time and not crumble. -First, just an observation
or piece of information — I don’t know
if you’re aware that the D.C.
Public Library has just opened in its main library something
they’re calling — I think it’s a memory lab.
-No, that’s interesting. -Which is an open lab
with a whole lot of equipment for you to transfer information
from one medium to another. And you can reserve it
in three-hour blocks — which, you know, most of this stuff
takes forever to do — but it’s a public resource, and anybody can get
a library card even if you don’t live
in D.C. -That’s actually
a really interesting observation about libraries and their function
in our society. They keep finding new ways
to be useful, as opposed to a place where you stick
a bunch of books. So, I didn’t know that. That’s actually very interesting
and it’s encouraging. Of course, if everybody decided
they wanted to do that, we’d probably overwhelm
the facilities at the library, so that leaves us
the open question, “How do we scale this up
for people that care about their information?” But thank you.
That’s good to know. -Yeah.
They have a big initiative on maintaining
your own personal archives. -The problem with maintaining your own
personal archives, of course, is that you have to keep
rewriting stuff, and I don’t know about you, but I have boxes
full of VHS tapes which I someday plan to migrate
to something else, and I have not done well
at that. I did manage to get all
of the eight-millimeter videos transferred over into DVDs, so I was very proud of myself
for having done that. Of course,
now I can’t play them on my Mac, you know, so now
I’m all angry again. -Well, this brings me to my more
important comment. I’m involved in records
management here at Goddard, and in 2012, there was an executive order that was issued
by the government that government records would be
in a mostly electronic form by 2019. We generate a tremendous amount
of records here that we have to submit
for archive at NARA. Many of those records
are in databases, and NARA has not
really provided — They say, “Well,
we can ingest it. You know, if it’s SQL or Oracle
or whatever, we can ingest it.” But they’re
in proprietary databases. They have relationships,
they have structures, and it’s not really clear how that’s supposed to happen. -Well, your OAIS team has the right idea
for dealing with this, but we don’t have a set
of specific implementations that you can just turn to
in order to turn the crank. I always get worried
about these unfunded mandates. People come to me and say, “What should I do
with my digital photographs?” My honest answer is the ones you really care about,
print them, because we know prints last
at least 150 years. We don’t know anything
about how long any digital format
is likely to survive. I don’t know about you, but I have examples of pictures that I ingested into
a photograph-management system which suddenly,
in its latest version, doesn’t know what a TIFF
format is or doesn’t know what GIF format is, so I am now persuaded that printing stuff
out may turn out to be the most reliable medium
for that which could be printed. But as you point out,
databases — I mean,
how do you print a database? How do you print a spreadsheet? I mean, sure, you can print
an instance of a spreadsheet, but printing all of the formulas
and everything else — What in hell
would that look like? Ugh.
Yes. Okay.
Thank you for that. Yes, ma’am? -My question
will be about textbooks. -Okay. -I’m asking if you’re aware
of any projects to select — translate
for different languages, if needed — collection
of the best textbooks for all countries for all times and preserve them, instead of archiving
to beat old video games. I think in the case
if something will happen, collection of good textbooks
will be the replica of the footprint
of our civilization. Is it true? -Let me make sure
I’ve understood. You’re asking about the ability
to translate the textbooks. -Yes, because we have a lot
of printed textbooks in different languages. -Yes.
-And for example, we have a lot
of German books and Russian books —
beautiful textbooks — that are not available for
people in different languages. -It’s true. Let me say with my Google hat
on for a moment that we work really,
really hard to try to build translators
from one language to another. Some of you will have used
our Google Translate, and you see what
we’re able to do and what doesn’t necessarily
work very well. For technical textbooks, it’s a challenge to get
the translation to work right. On the other hand, for mathematics books,
the language is so stilted and so regular that that turns out
to have been easier to do than some of the, you know,
more diverse kinds of text. We’ve gotten past the point where I remember being in 1960, when we thought translation
would be a matter of putting, for example, an English-Russian dictionary
into the computer and just running the words
and translating them. We tried this at Stanford, and we thought
we would challenge the system by putting in an expression, “Out of sight, out of mind,”
and what we got back was something in Russian, then we translated
that back into English, and it said,
“Invisible idiot.” [ Laughter ] And, okay,
I think we have a problem. So the answer
is I don’t know of a systematic
translation program, but I will say that for confined languages
like mathematics, automatic translation may actually
have some potential. But for the richer stuff,
I’m not so sure. The semantics are so critical
in a technical work, so that’s a big challenge. -Thank you, but I assume
that in English, at least, there’s some database of textbooks available
for everyone for free? -Well, this
raises another interesting issue about copyright
and who owns the text. The solution to this,
by the way, in the case of software,
has been open-source software, which often is just given away like we do with Android and some
of our other applications. I don’t know
whether there is a strong effort right now
to generate free textbooks. I don’t know. Maybe
some of you are aware of this, but I don’t know of any. The closest I’ve seen anybody
come to that is not a textbook. It’s the Khan Academy, where those short videos of teaching people
how to do mathematics have been successful
for a large number of people. For those of you
who haven’t noticed this, young people these days,
millennials, for example, when they go looking
for information, don’t actually go
to the Google web search. They go to YouTube, because there’s bound
to be somebody who actually did a video showing you how to do X
for some value of X. I was quite surprised
to learn that. So we should be thoughtful about whether the medium of transferring knowledge is going to continue
to be textbook, or whether it’s going
to something else. But in either case, if language is an issue, it’s a challenge, exactly
as you point out. -Thank you. -Let me get the one from over — Oh, he’s jumping the queue. Talk about a hack. How about that?
Okay. No, it’s all right.
Go ahead. That’s why we have
two microphones. But that may cause half
the people over here to run over there. Okay.
Yes? -I feel compelled to defend
my mother agency, NARA. -Yay. I’m sorry.
I didn’t in any way mean to suggest
anything negative. I think NARA
has a huge challenge, and they’ve been whacking away
at this for a long time. -Yeah. And the two comments. The first one is NARA’s
been accessioning databases for 45 years. We have a tremendous amount
of experience, and, in fact, databases
are one of the easiest things that we ingest. The second one is I think
you’ll be pleasantly surprised for the 22nd-century
Doris Kearns… -Yes. Tell me. -…because, as you pointed out,
every four to eight years, we get the President’s
records — it’s not the entire
federal government’s records — and we have all the e-mail
for every President who’s ever used e-mail going back to Reagan.
-Wow. -Officially, yes.
-Right. There we go.
-Their official e-mails, yes. I stand corrected.
-Zero, right, yeah. Very clever.
Did you get Hillary Clinton — Well, okay. -Official — yes. -Sir? -I’m glad you brought up
the business model. In fact, I’ve experienced
that the business model actually has been a problem
rather than a solution, in that many times
a modern business model is to change the format
very frequently and then hold
your data hostage… -Yes.
-…so that, in fact, I have lots of data that is not a thousand years old but, in fact, nine months old that I cannot access. And, in fact, not being a computer expert, often I am unaware
of which program I’m missing or which version of which
program that I’m missing. -Yeah.
-So even with the data, I’m often left trying
to read the bits and make up what the meaning
might have been. -Ugh! -And I’m wondering
if you have any comment or solution to this problem. -Well, you know,
this actually gets to something
I only lightly discussed, and that’s this question
of preservation rights and the notion that for purposes
of preservation — that is to say,
preservation of meaning — that you, as the user
of that software, should have the right to get access to older versions
of the software — that it needs to made available in some way
that’s useful to you by the people
who created that software. We don’t have any rules
like that right now, and with the advent
of cloud computing, it’s theoretically possible to make
the older software available and to run it against the data
that you have. And so the technical means to do some of these things may actually be
within our grasp, but the rules to force companies
to cooperate isn’t there. And I think that your examples and the examples that others might offer make,
in my view, a very compelling case for changing rules to allow preservation
to be part of the equation. It’s sort of like the rules
that said — a private copy of a movie, maybe, that you bought on a DVD,
that you were allowed to make a copy of that
for backup purposes. And eventually,
in order to avoid having you freely
make copies of everything, the companies that were
selling the DVDs, for example, also made a CD-ROM copy for you, and you often end up
with two things. So this argument that you owe it to the people that are relying
on your software to help them preserve
the utility of what you created
with it goes a long way with me. Now, whether it goes a long way
with the copyright councils of, you know, the world,
it remains to be seen, but I think we should make
a really loud noise exactly along those lines
for the reasons that you imply. Okay, are we running
over time here? -I’m gonna cut it off
after this next question, which raises
an interesting point that I hadn’t thought of,
so ask your question, and you can either
answer this fully or leave it hanging in the air
as you wish. And our speaker might have time afterwards
to answer individual questions. -Yeah, I’m happy
to chat for a little while after we’re officially done.
Yeah. -Last question. -Well, then I shall be quick. Let’s assume
that I’m a techno-optimist and I believe that technology
will continue to progress. Why shouldn’t I assume that this
problem will solve itself? -Okay, let me give you — You’re like that young twit
that was sitting in — My definition of young keeps
changing, but — I had a collection
of librarians, and we were discussing
this very thing. And this young fella got up and he said, “Oh, look,
this isn’t a problem. The stuff
that’s important will get copied into new formats and the stuff
that isn’t important won’t and nobody will notice,
so what’s the big deal?” It took me half an hour to get the librarians
off the ceiling, and the reason is
that they pointed out that you don’t know what’s important, sometimes
for a hundred years or hundreds of years, and then you realize
that this particular thing, if you only had it, would explain something
important to you, historically speaking. Or, in the case
of our scientific community, “If I only had the data
from that date and event, I could compare it
with what I have now.” So the answer
is it won’t solve itself, because it hasn’t solved itself. And if we don’t do something, if we don’t take action,
it will not solve itself. So, you may choose
not to believe that, but I will argue
that the people who work in this space, and my own experience
over the last 40 years — 45 years — is that the problem
has not been solved by itself over that period of time, and I do not see
any natural solution unless we really focus
on what’s getting in the way of getting something to work. So there. Okay.
All right. Thank you.
