“Preserving Our Past for the Future” at ACM Turing 50 Celebration


– Now, a topic dear to my own heart, preserving
our past for the future. I’m sure it’s no surprise that the pace of
change we are experiencing only seems to accelerate. Our next panel is devoted to one of the easily-overlooked
ramifications of that. Just exactly how do we preserve our past so
we can access it into the future? For this discussion, we call on our program
chair, Craig Partridge, and his panel to entertain us for the next hour and 15 minutes. Craig? – Thank you. – Round of applause? – Craig, you want to be here? – Yeah, I’m here. – One too few. – Well, you could sit in his lap too. – Okay, so, as Wendy said, this is about preserving
our past for the future. We have a wonderfully distinguished panel,
and I’m just going to introduce people quickly down the line, and then we’ll each start with
a brief discussion of why we’re each individually interested. Then I’ll start firing questions at them,
and I’ve encouraged them to be extremely lively. I think we have to counteract the lunch syndrome. At any rate, I’m Craig Partridge. I’m here as the moderator ’cause, as Wendy
said, if she wasn’t running the whole thing, she’d be running this, but she’s running the
whole thing, so I’m here. Then this is Vint Cerf. Vint, for years, has been an advocate of worrying
about the preservation of data in very vocal ways, and our need to think more about how
we preserve our digital past. Then, next to Vint is Brent Seales, who’s
actually, I think, generating some of the most interesting digital data that we’d like
to preserve, and can give you some insights into that. I contacted Brent early on about getting him
on the panel and said, "Of all these wonderful artifacts you've created, have you had trouble with retrieving some of them a few years later?" and he said, "Absolutely." I said, "Good." Then we have Natasa Milic-Frayling, who is,
like Vint, a very public advocate for preservation and has been working very hard with, it’s
UNESCO, right? – Yes, UNESCO. – Yes, correctly, UNESCO, on these sorts of
issues. Next to her is Brewster Kahle, who’s, in fact,
as he reminds me repeatedly, actually doing this and trying to preserve things, so he
has all sorts of insights from the trenches and other crannies of the process. Finally, we have Satya, and by the way, Satya
is known as Satya to all of us for years, Satya, who has been worrying about the very
complicated issues of ensuring that the program to access the data, or the computable piece
that goes with the data or is perhaps embedded in the data is still computable in the future. Starting with me, about why we’re interested
in this, I actually trained as a medieval historian before I got into computer science,
so I’m very used to a world in which you have very few sources. Actually, I can tell you, as a historian,
that’s a bad thing. I’ve continued to do history research for
fun on the side, and there’s a wonderful quote from a historian of Middle European Jewish
history, and if you think history in general is short on sources early on, you should try
Jewish history. What he said is, “The cross product of imagination
“and facts is constant.” If you don’t have the sources for the facts,
what you’ve just enabled is for people to fantasize about what happened in ever-increasing
volumes. As an editor for the IEEE Annals of the History
of Computing and as someone who’s actually tried to write some history pieces about what
went on in the early internet days, I can tell you that a lot of the artifacts that
we would hope existed from that time do not, so this encourages a certain level of fantasy
about how things happened, and what happened. There’s, in fact, an entire mailing list devoted
to trying to capture oral histories to make up for the fact that we’ve already lost the
digital artifacts. I’ll finish off with one last story, which
was from a friend of mine, Brian Carpenter, who was actually trying to track down the
early memos related to Tim Berners-Lee’s invention of the web, and the various memos that were
passed amongst people to approve his project, and to approve various things, and discovered
many of them were already gone within 10 years of the time that Tim had come up with this
wonderful invention. It’s a very serious issue, and I wanted to
make sure we expose it here. With that, Vint, I’ll let you go. – Let me just give you an anecdote. Some of you have read a book, maybe, called
A Team of Rivals by Doris Kearns Goodwin. It was about President Lincoln, who hired
all of his rivals for the presidency to become part of the cabinet. Of course, that's why it was called A Team of Rivals. What's interesting is that, when Doris Kearns
Goodwin wrote the book, she was able to reproduce the dialogue of the time in a very believable
way. The reason she was able to do that is that
she had access to the letters that had been written by the principals. She had to go to different libraries in order
to get access to them, but she assembled this information of, what did they talk about,
how did they talk about it, what positions did they take? I got to thinking, what if you were a 22nd
century Doris Kearns Goodwin, and you were trying to write about the early 21st century? What would you have access to? What would you be able to appeal to? Would our email be gone? What about the tweets? What about the blogs? What about all the other digitized objects? All the photographs, would they still be preserved? I worry greatly that, not only do we have
trouble maintaining the physical materials, and the ability to read them, I mean, 5 and
1/4 inch floppies may be stacked up in your drawer, accumulating dust, but no readers
to read them, but we also have the problem that Satya has been tackling, which is to
figure out how to get the software that knows what the bits mean to run again so that we
can correctly render the content. That’s what’s driving me, anyway, and I’m
hoping that we’ll discover some good ideas from this panel. – Bret. – Thanks, Craig. I didn’t know you were a medievalist. That’s interesting, yeah. This issue of letters, trusting the source
is an interesting question. I started my career as a computer vision specialist
and quickly got into digital libraries, which, at the time, in the ’90s, was a really important
movement. I quickly found that there were some things
that were actually really hard to digitize, that being the first step in making a digital
library. They were hard to digitize because of damage,
and so, in this entire world of libraries, I found that there was a small part of it
which I’ve come to call now the invisible library. It’s a part of the library that, because of
damage, becomes really hard to digitize, or to even read or make visible at all. The digital divide that we were crossing with
digital libraries has continued to be required for us to cross, but damage interrupts that,
and it became something that I got really interested in. Recently, we’ve had work where we were able
to digitize something that, in fact, will never be able to be opened. It’s physically impossible. We presented what the contents of that scroll
actually show, and it brings up some really interesting questions. If I tell you that this is what this thing
that you cannot see inside actually says, how do you believe that that is true? It turns out that the provenance chain between
the digitization process, which creates a digital surrogate, and the physical object
is one that’s completely wide open right now. We don’t have good solutions for how to guarantee
that the digital record that we have is actually faithful to the original record, and it’s
really hard to inspect, and it leaves us open, I think, for manipulation of that record. I’d like to also say that the durability,
once you get these digital surrogates, of how they persist, and how they can be surfaced,
and how they can be accessed is really questionable. I think Natasa’s gonna talk a little bit about
how, going forward, it’s really, really important to preserve the access in archives and even
some of the software executability around that. Overall, I would like to say that durability
and access, and then the ability to create more than just facsimile are things that I’m
really interested in. When we started, we were really just creating
digital facsimiles that looked similar to the originals, but I would say, now, digitization
has become more about data collection, and it’s a big data problem, and we really don’t
have any idea how to manage all of these digital surrogates over time, and how to interpret
what the data actually means about the original thing. That ability to store the information, interpret
it, and then understand how to connect it back to the original object in a way that’s
faithful and certifiable, I think those are really important issues. – I had started my journey towards digital
libraries and preservation while I was still at Microsoft Research, and now, I can’t believe
it’s already 10 years, a colleague, who is over here in the audience, Tony Hey was VP
of external research at the time, and he suggested that, in parallel to the innovation
that we are doing in text mining, data mining, machine learning, we actually start talking
to libraries and archives, who were suffering greatly because we had started handing them digital records. While they knew what to do with books (they have elaborate and very efficient processes for the curation of physical artifacts), suddenly, they were faced with this thing called digital, and it was really escaping the intuition and the nature of what they used to deal with. This parallel track I lived for 10 years, first through our PLANETS project and the SCAPE project; all of these projects were funded by the European Union to help libraries and archives deal with the notion of digital. It took a while to explain that digital is, in fact, computational. It is really hard to make the shift from somebody understanding that a file is actually a document to explaining that a file is an input to a program. For us computer scientists, that's clear, but over the years, people have been conditioned to look at the screen and see a document, and they think it's a document, but actually it is as persistent as a rainbow. While you have electricity, while you have computing, it lasts. You turn that button off, gone. I pull out the plug, gone. This ephemeral nature of digital needs to be
explained, and now, I think, with libraries and archives, with UNESCO as well, we have
a common language. My purpose today is to bring the message back
to the computer science because I have now learned what challenges they have and how
they perceive things. I would not want to talk about preservation. In fact, I would like to talk about the challenge
to computer science, which is about digital continuity. I’m not interested in preserving. I want to be able to use it now, forever,
and we need it not just for our panel here. We need it for our children. Thanks. – Brewster. – "Universal access to all knowledge," is
a motto of mine that I cribbed from Raj Reddy, Turing Award winner here. We wanted to build the digital Library of
Alexandria. Could we do this? The answer is, absolutely. Technologically, you just follow the curves. All the books, music, video, web pages, television,
software even, could do it. I said, “Okay.” In the heady days of the ’80s, we called it
building a global mind, or if we’re going to build our new overlords, let’s have them
read good books. This was the AI days. Now we know it’s a much more collaborative
enterprise where they’re taking over, but that was sort of the idea at the time. The web collection that the Internet Archive
has been building for the last 20 years is fundamentally a kludge towards trying to give
this digital library a memory. It’s a kludge ’cause the web, thank you to
Sir Tim, but the web is still kind of a kludge. We wanted it to go and have all the versions,
and be able to build something that you could learn from in aggregate. I think of Google as a first pass, Google’s
search engine as a kind of a cool hack, or the Wayback Machine as a cool hack, useful,
but think of what you could do if you knew what everyone was thinking about and over
time. This was the idea of building, well, my career
in trying to build the dataset that then we could go and use. Fortunately, the good thing is, the library
is, in large part, there, so let the fun begin. The problems that I thought were going to be the issue, the technical ones, bit rot, obsolescence, emulation, turned out to not be the things
that I spend most of my time dealing with, which are institutional issues and policy
issues, which is code word for copyright, piracy, privacy, trademark, all of those complications
towards building the global mind. – This is a topic that is perfect for this
recognition and honoring of Alan Turing, because the thing that Alan Turing invented was execution. Prior to his work and the founding of computing
as a field, knowledge was preserved statically, in books, or papyrus, or what have you, but
it was execution that is the distinct new capability that computing introduced. As I have watched the progress of science,
we’ve become increasingly dependent on executable content as the medium in which we do our science
and in which we communicate it. The ability to reproduce that, a decade, two
decades, many decades later has very deep implications for the scientific method. Let me give you just one example. In 2010, two economics authors, Reinhart and
Rogoff, published a very influential paper that was widely cited, widely read, and because
it came out of the depth of the recession, seriously influenced economic policy in many
countries, including Europe. The paper analyzed data from the economies
of countries since World War II and came to the conclusion that countries that did not
practice austerity during periods of recession ended up having extremely long recovery periods. It’s published, and many countries in Europe
actually followed it, and now it is understood that it has had some very severe consequences
for their economies. Three years after the paper was published,
a graduate student at UMass Amherst contacted the authors, obtained the Excel spreadsheet
macros that they had used, and redid the calculations, found a bug, and when he fixed it, the results
were quite counter to the claims of the paper. Now, the debate goes on as to whether it’s
a significant error, or whether it’s a misinterpretation. That’s a separate matter, but think about
the implications for the scientific method. If the student had been born 30 years later
rather than three years later, could he have run the Excel spreadsheet? Would he have an environment in which to execute
it? How would you do science and validation in
a world in which the software can no longer be executed? To me, this is a fundamentally deep problem. The good news is that we have a glimmer of
a solution in technologies that have been pursued for cloud computing in the form of
virtual machines. While there’s a long way to go, I agree completely
with the point that Brewster made, that the technical challenges are actually only a small
part of it. It is the legal, licensing, and other related
issues that are a big part of the problem. – We’re gonna start off, having had those
very interesting introductions, I’m gonna start off by asking each panelist to talk
a little bit about a particular example of a digital object that they know is of value
and is now hard to access, retrieve, use in whatever form. I’ll just toss out an initial one from my
experience a little bit. I have a friend who’s actually a planet scientist. Some years ago, she was curious to double
check that NASA had applied the proper corrections to its pictures of vegetation that were then
used to drive climate models at the time. It turns out, the question of the angle of
the sun and the angle of the satellite vis a vis the place being photographed changes,
to some degree, the spectrum of the plants that you pick up, and you have to do some
corrections and so on. She went back to NASA and said, "Can you give me the original raw satellite feed instead of the corrections that you did?" The answer was, "We didn't keep that. We were sure our corrections were right." This is just an example of a curation problem,
which is one of the obvious ways that you can lose things, that you may go back and
go, “We regret losing that.” At any rate, with that. – Let me give you a couple too. I recently discovered, I had some 3 and 1/2
inch floppy disks stuck in a drawer somewhere. I wondered what was on them, so I found a
3 and 1/2 inch floppy reader that plugged into the USB port of my Macintosh. Amazing, right? I stuck the little disk in, and I actually
pulled some files off, and lo and behold, they were a bunch of Word Perfect files. Of course, I didn’t have any Word Perfect
running on any of the machines that I have right now, and I didn’t pick up the phone
and call Satya, although I was tempted to do so. I wanted to make one other point related to
this. It has to do with the wonderful worldwide
web. The worldwide web is current. It’s not cumulative. Stuff comes and goes. Even the URLs that we use come and go depending
on whether the domain names have survived. Everybody’s got 404 Page Not Found. Brewster did something absolutely wonderful
to help with this currency problem because he’s got the archive. There’s a plugin that you can put into the
Chrome browser, and when you get a 404 Page Not Found, it automatically invokes a search of his Internet Archive, and it works, by the way. I found a bunch of pages that were gone except for your saving them. This notion that the web has everything in it is wrong, and we actually need to be very conscious of trying to create a cumulative environment, not the one we have today. So there.
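The 404 fallback described here boils down to a lookup against the Internet Archive's public availability endpoint. A minimal sketch of that lookup in Python (assuming the archive.org/wayback/available API and the requests library; the queried URL is only a placeholder):

    import requests

    def find_archived_copy(dead_url):
        """Ask the Wayback Machine whether it holds a snapshot of a URL."""
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": dead_url}, timeout=10)
        resp.raise_for_status()
        closest = resp.json().get("archived_snapshots", {}).get("closest", {})
        # 'closest' is empty if the Archive has never captured the page.
        return closest.get("url") if closest.get("available") else None

    # Placeholder URL for illustration:
    print(find_archived_copy("http://example.com/some-vanished-page"))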
– One example I thought of was, actually, my dissertation, which I wrote in device-independent troff, ditroff. – [Vint] Wow. – But you said we had to have examples that
had value, so I’ll move on from that. I have to say that, in my own lab, we’ve misplaced
data over the decades I’ve been working. It happens. Students come and go. I can’t read my Iomega disks anymore. Things like that happen, and evolving the
technology is one of those problems, but there’s a human problem too, and that’s that we’re
just not very good at being disciplined about organizing ourselves. Maybe something like the Internet Archive
can help impose an organization for us. I’d also like to say that libraries and museums
have had this problem from Alexandria on. Things get lost in libraries and museums. We digitized a manuscript a decade ago, and
in the flyleaf, we found another manuscript that was thought lost that had been bound
in when it was rebound in 1960. For those 40 years, no one knew where that
second manuscript had been until we digitized it, and then there it is. It happens. – I give a story that, I assume, relates to
almost everybody here in the audience. After a number of years working in industry, and now teaching at the University of Nottingham, having come back to science, I often get students who come with brilliant ideas and want to work on a project. One of them wanted to create a very personal communication between two individuals, something that looks like a private Twitter. I remember, we did a system in Microsoft Research
in 2007, and I just thought, “I should show them this,” because they will embark on a
group project that will involve lots of interaction in interfaces and architecture that they would
have to review. I asked myself, “Why am I accepting the fact
“that I can’t really show them the prototype?” Why is that we, in computer science, as opposed
to somebody who wrote a book, can go and show the book, can only show a PDF, or a published
paper, as opposed to whole system that was there? That sort of made me think that, in fact,
what we need to do among ourselves is change the expectations. If you change expectations, then you start
thinking harder. From us in computer science, we will create
practices that will probably spread to others. – How about Obama-era web services? We are now dependent on this worldwide web,
not just for retrieving pages that we could snapshot into the Wayback Machine, but actually
running web services. In the old days, we’d write papers or whatever,
and we’d hand them to a librarian, they’d put them on the shelves, and they’d pull them
out later and make them useful to another generation, another place that might want
to have access to it. We don’t have that anymore. We have websites that people do, and either
they graduate, or they retire, or die, or get voted out, and the website then goes away,
but they’re working websites. These things are living databases, running
systems. We don’t really have a mechanism of carrying
them forward. We did okay with the open source software. That was good for dealing with generation
change ’cause you could fork it. It was kind of dramatic, but you could do
it. Open access journal literature is a really
much better idea than the closed stuff that we’re still making happen. What do we do now around running web services? Can we build with these virtual machines maybe? There’s new work going on in the decentralized
web where you can actually fork living websites and be able to run them as if they were in
the old days. That is a possibility now. There’s a lot of energy behind this, the Data
Refuge Project to try to keep climate science going at a point where the sites are sometimes
going down, but also they’re getting defunded, but people might want to continue to populate
those databases so we have continuous services, but run by maybe a different country that’s
more climate-science-oriented, some things like this. Is there something we can do to go and preserve
running parts of our culture that are now executable, interlinked, web- or app-based
systems? – I already gave you an example from the scientific
world. Let me give you one closer to me personally. About five years ago, I got a phone call from
a lawyer. Always bad news. This guy had found out that my team had built
the Coda File System 20 plus years ago, and he said, "When was the earliest version that you released to the world?" I said, "Oh, it was about 1995." "Do you have a copy of that running?" I said, "No, are you kidding?" I said, "Why do you care?" He said, "Here's the problem. There is this patent troll which is trying to sue us for something that, as far as anyone can tell, was part of standard file system practice. None of your papers, and we have looked through all of them, actually bothers to say this, because it was such a standard practice. If we could actually get that version working, and show it, and it predates the date of the patent filing, then we'd be done." We had the source code thanks to careful archiving
in source code control systems, but getting that code to work, these guys managed to do
it. I told them all the versions of the compilers,
which are now long obsolete, et cetera, they managed to get it working. It occurred to me how much simpler it would’ve
been if we had been able to freeze dry a version of that system as of that era, and to thaw
it at a distant point in time for situations like this. That is, papers and scientific communication
can only document a subset of the real bits. The total amount of detail is far too large
to document in prose. It's the running system that is the ultimate documentation, and the ability to reproduce precisely that execution, which I called execution fidelity, is really a fundamental capability.
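The "freeze dry and thaw" idea maps naturally onto virtual machine archiving: keep a complete disk image of the system as it was, record the hardware it expects, and boot it later under an emulator. A minimal sketch of the "thaw" step in Python, assuming QEMU is installed and that coda-1995.qcow2 is a hypothetical archived disk image rather than a real artifact:

    import subprocess

    # Hypothetical archived artifacts: a mid-1990s i386 disk image plus the
    # modest emulated hardware it expects, recorded at "freeze dry" time.
    IMAGE = "coda-1995.qcow2"

    def thaw(image):
        """Boot the archived system under QEMU, isolated from the network,
        so decades-old vulnerabilities can neither reach it nor escape it."""
        subprocess.run(
            ["qemu-system-i386", "-hda", image,
             "-m", "64",          # 64 MB of RAM, generous for 1995
             "-nic", "none",      # no network: a protective bubble
             "-snapshot"],        # discard writes, keep the archived image pristine
            check=True,
        )

    if __name__ == "__main__":
        thaw(IMAGE)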
– Could I … – Yes, please, absolutely. – Just a question for Satya. You can imagine, like all of us, I've spent a lot of time looking at this whole problem. In the discussions I've had with some of my
colleagues, this notion of fidelity came up, and the possibility that you might accept
varying degrees of fidelity in the course of trying to archive things for future use. I just wondered whether that notion popped
up in your universe as well. I know you’ve worked very hard to get absolute
execution, precision, out of some of the old software, but the notion that you might not
be able to do that, but you might have some ability, to get some rendering out of it would
still be useful. – Yes, I think Vint makes a very good point. Some hardware devices, for example, no longer
exist. Just precision of software execution is not
enough. You may have to bring back certain kinds of
hardware if you really want that level of fidelity. Ultimately, the decision as to, how accurate
is accurate enough has to be driven by the use cases. In the case of these lawyers, they had a very
specific goal, that the central patent claim had to be defeated. If you could make it well enough that that
fact was established, the rest of it didn’t matter. You make an excellent point, Vint, that it
is very hard, at the time of archiving, to put a huge amount of effort up front to say,
100 years from now, it’ll execute with exact fidelity. Upfront, paying the cost for something that
nobody may ever look at again is a very difficult case to make. Some compromises that strike the right balance
may be needed, and if you need even greater fidelity, that may require much higher effort
to achieve. – Go, please. – I would like to add to this a slightly different
narrative and a different perspective on the same issue. At the moment, when we have software, if it stops being maintained, it stops working, so the challenge for us in computer science is to think about enabling software to gradually age, so it is not sudden death; it is a gradual way of aging. Virtualization of a computing environment is now becoming commonplace. It is the bread and butter of the cloud, for example, so industry is investing a lot in virtualization. Encapsulating software in these time capsules will probably reduce some of the functionality, but that's okay as long as we know, when we're designing the systems, what's gonna happen after the prime time in the marketplace is over. When the demand is too low, it is hard to maintain this; economically, it's not feasible. It's not that people just don't care. It's really not economically feasible, and whether this is commercial, or whether it's an open source community, there is not enough energy, enough hours to maintain everything. When that's too low compared to the demand, then we need to think about secondhand software, or a secondary market in which this aging can be controlled. Of course, the ecosystem moves to the next stage; as we continue to move forward, that's one thing that we cannot stop, and we don't want to stop. Obviously, there will be degradation, but if we can do this in advance, if we can plan, in principle, how that's done, then we won't be surprised, and we will feel better, in control, with set expectations, and then we can have lawyers informed about what they can expect, as well as what education can teach students. I can't teach my students anything. I can't show them anything at the moment. In computer science, there are only PDFs,
and that’s not good enough. – I’m gonna jump ahead to one of the other
questions we talked about ’cause this seems like a good one here. We’re a field very focused on the future. We’re always interested in bringing out something
that has new features, new capabilities. How does that interact with our ability to
preserve the past? The fact that we’re well into the teens on
PDF versions of formats at this point and so forth, how does this continuous drive to
have something new … now that Moore’s Law’s dead, maybe it won’t be quite so bad, but
the pressure to have something new for the latest computer that shows off its wonderful
features, or the latest version of your software that inevitably changes the document form,
how does that affect? – It interferes with backward compatibility,
which is a frequent problem. I tried to load a 1997 PowerPoint file into
my 2011 PowerPoint software, and it said, “What’s that?” – And your answer was, “You made it years
ago.” – Well, I won’t repeat what I said at the
time. “It’s a Microsoft PowerPoint file, you dumbass.” Backward compatibility suffers ’cause you
can’t keep everything compatible, and it could be that, in order to be backward compatible,
you can’t do some of the new features, as you were talking about. Yeah, it’s a problem. – [Craig] Brewster. – I think it’s actually a blessing. I’m gonna take the counter. The reason is, is because we’re living life
in fast forward. It used to be that, basically, we’d let the
guys die, and then they’d be shoveled over to the archivists and the librarians, and
they’d have to sort of pick through it and deal with it. That’s not our fate anymore. I was sitting next to a fellow that worked
on the original HyperCard program at Apple. It’s like, we just got old Macintoshes to
boot in JavaScript in your browser, which, cool, and the software archivists went and
collected up a lot of the top HyperCard stacks and put it into the virtual Mac that you can
boot. Why am I bringing this up? Because he’s still alive. We have an ability and we have a responsibility
to go and make sure that the best we’ve done or the best we know makes it to the next generation,
and if you don’t do it, there may not be anybody cleaning up your stuff. We have the ability, and we have the urgency
to go and make sure that the best of the things that we’ve done make it to the next generation. Because the cycle is happening so fast, we’re
now alerted to the issue, and we’re part of it. All of us have got some inner digital librarian
to us, that there’s something that we want to make sure happens. There can be empowering organizations, hopefully,
ours as well, but it’s really now a distributed problem as opposed to somebody else’s problem
to do. – I think we should invoke the medievalist
over here for just a second. I remember having this discussion with a bunch
of librarians, and I was concerned about preserving content, and the librarians were all agreeing
that was important. Then some kid got up and he said, “Oh, this
isn’t a problem. “The important stuff, people will convert
“into new digital formats, “and the stuff that isn’t important will go away “and nobody
will care.” It took half an hour to get the librarians
off the ceiling because they pointed out that, some things, you won’t know are important
for 100 years or so, and then you wish you had that. Mr. Historian, can you shed some light on
this? – I have a very good story about that. A colleague, when I was young, told the
story of going to a medieval library in Florence. There were two parts to the story. One was the tremendous importance, and this
is just for amusement sake, of having a letter from the president’s office of your university
with the appropriate seals on it. Literally, they preferred wax seals, and ribbons,
and everything else, and this was your letter of presentation to even get into this archive. The other thing he mentioned was that they
would then vet your research. The person who was going with him on this
trip, he had his particular research while the other person going was there to actually
study the evolution of Florentine architecture in the 13th century and wanted to look at
all of these old manuscripts which were designs by different early Renaissance architects,
and sketches, and designs. The library concluded this was not valid research
and wasn’t going to open its archives to the purpose yet, so it’s not even just that we
don’t realize it’s an issue. Sometimes, even the people who we task with
storing it, who actually even have it don’t know what they have and whether to value it. – Bleep. – Yes, anyway, so that side effect. I wanted to come back to the evolution, and
Brewster’s posited an interesting perspective. Anyone else? Yeah, Satya. – Yeah, if you took one of those 13th century
books and opened it, it may smell musty, the binding might crack, but typically, nothing
bad would happen to you. If I executed a piece of code from 15 years
ago with perfect fidelity, none of the vulnerabilities that have been discovered since then would
be fixed in that piece of code. You are, at that moment of executing it, both
potentially suffering from any vulnerabilities, and if it is being used as a vector of attack
for the rest of the internet, you’re also being a agent of destruction at that point. What is very interesting here is, perfect
reproduction of old code includes the reproduction of those vulnerabilities because, if the reason
I’m doing this is because I’m teaching a class on computer security, and preserving those
bugs is an important part of what I want the students to learn, the ability to execute
those things in a protective bubble becomes important. This is an interesting dimension that is only
present in execution, not in reading static material. – [Brewster] That’s a good point. – In my attempt to bridge the worlds between
the memory institutions, the library, the archives, museum, and computer science, I’d
just like to point to one thing that relates to the selection process. Memory institutions have developed very elaborate workflows for preserving stuff. When they get an additional object, they need to first characterize it, need to know what it is, and then they decide what the action is gonna be. What frightens them most is the amount of it
because, before, curation, meaning describing an object, was picking a book and describing
it, but now it’s about a couple of gigabytes of emails. They need our help to do the selection. They need our help to think how to preserve
the metadata. At the same time, they need to ensure access. One thing that’s very interesting, when we
create computer science tools, one tool can create millions of objects. Exactly that imbalance actually works for
us in digital preservation because, if you give me only one object and I can preserve it for 50 years, I can do it for millions. With the strategy that Satya is describing, preserving these environments, we can potentially help memory institutions with this nightmare of a selection process, because they feel guilty selecting and leaving the rest in a black
hole, practically, because they are now moving forward. Migration of formats has been the main strategy
that they have so far, but it’s expensive, and it has to be done again, and again, and
again. One important thing, if they come to you and
say, “Okay, I have moved things “from Word Perfect into PDF, “how well did I do? “Is it the same?” We say, “Well, it’s not the same. “One is Word Perfect, and one is PDF. “It’s not the same.” Then you unpack a little bit more. What do you want to be the same? Then it turns out it’s a layout in this instance,
but they all very much know that that’s for the documents, if you know, can read, but
what do we do with the 3D archeological models that we are creating now? How is that going to be seen through conversion
of formats? In a way, it is for us to explore this so
the even distribution, for one tool, we can cover many. Then we can help them then create these tools
to cover the many. – Wonderful. Brent, do you want to say something? – Yeah, one thing that's interesting about
ancient documents, that you might not get in trouble if you break, is that they contain
ideas, right? The ideas are actually maybe as powerful or
more than an exploit that’s possible in an old piece of code in the sense that, when
people see those ideas, then there are reactions, culturally and scholarly, that have impact. One thing that I’m interested in is making
sure that people have access to those ideas, but also that the adversarial nature that
is possible in this digital world doesn’t play to our disadvantage. That’s where the provenance chain is so important,
because old things that become digital and then are tampered with by an adversary could
create, for example, wild scholarly claims, or worse, and we would have no way, really,
to counter that. – [Vint] It’s called fake history. – Fake history. Fake news, fake history, there you go. – We brought tens of thousands of titles of
Apple II, Atari, Commodore 64, computer programs and games up on the net. There was a little bit of hand-wringing, like,
“Are we gonna get sued out of existence? or, “How are people gonna react?” It turned out, it was Oregon Trail. Oregon Trail is this game that people played
a lot, and they melted down our servers. It was like, “Oregon Trail? “I haven’t played that for years.” They just wanted to go through it again and
die of dysentery every time. This is a game that was almost impossible
to win. You’re struggling along, and you were trying,
and then you die of something, and you get attacked by Indians. Every time, you die. People loved doing it. It wasn’t just people reliving their 8-bit
past. There are now kids in classrooms using that
as a way of learning Oregon history. There’s also this kind of, I don’t know, maybe
it’s like the kids going into vinyl, there’s a little bit of understanding what that 8-bit
world was like, and it’s very important to bring it up and out. The answer is, we weren’t sued. What happened was, some of the publishers
came back and said, “You know, we’re still hawking Tetris and Pac-Man,” so we just went
bloop, bloop, bloop, bloop, bloop, bloop, and took those down, and the tens of thousands
stayed up. Since we’re interested in access, if they’re
still hawking it, great, we’re patient. They’ll die soon. Then we’ll put it up, and as long as that’s
sort of the mode, how we deal with, put things up, take things down that are sort of conflicting
with capitalism, that seems to all around work well. – Actually, this brings up a wonderful point,
and that is that the problem we’re talking about is not only technical, as you just illustrated. They’re the legal questions about, who has
the right to execute code? Who has the right to content? And the like. There’s one other, and that’s the business
model. If we’re really talking about preserving things
for 500 years, how do you build a business model that works? There aren’t very many institutions that last
that long. I can think of some beer breweries that lasted
that long, and some wineries. – [Craig] There’s the Church. – There’s the Catholic Church, yes, that lasted
a long time, and some others, but I haven’t figured out exactly how to connect … – And Martin Luther just got there, Lutheranism. – That’s it. Figuring out business models is also just
as important, as you know from your own experience running the Internet Archive. – We’re getting, actually, a large number
of audience questions, so I thought. – Great, go for it. – I could start moving into them a little
earlier than we’d planned. One of them is about the fact that we generate
large, great volumes of data. While we’d love to preserve things, what do
we do about a lot of the data that, in some cases, may be designed to be ephemeral? You know, Instagram, Twitter posts, Facebook,
some forms of sensor feeds. Do we really need to record all the traffic
light changes in a particular city in the United States or somewhere in the world? This data seems to have, would at least appear,
according to the question, to have rather short-lived value. Is there a value in preserving it? If so, how do we do it, and at what cost,
and how? – Well, wait a minute. We have official processes to preserve some
things that you might’ve thought should be ephemeral. Look at what the National Archives is responsible
for doing. I don’t know whether they’re now responsible
for preserving tweets from the president. – We crawl a lot of that for them under contract. – There’s one thing to keep in mind, that
what is important and what is not can have a huge contextual component to it. Here’s a trivial example. I can think of nothing more boring than video
captured on a dashcam pointing in front of me all day long, yet virtually every automobile
in Russia is equipped with a dashcam because your insurance claim is never going to be
honored unless you can provide the video to support your claim. All those amazing photographs of the meteors
streaking across Russia, you find them because the dashcams of many automobiles were on. It is an interesting question, what is worth
keeping? How long should we keep it? There’s a question of retention also. They’re related, but distinct questions. Is it perpetuity? Is it 30 days? – One thing I love about this side of the
digital divide is that there’s a democratization about that that didn’t exist in the past. Most of the things that made it, either it
was accidental, or someone made editorial decisions, and they may or may not have been
right. They may have suppressed a lot of really important
things just because of the culture, and the decision-making, and the thinking of the time. Now we’re in an era where we can do some things
that are automated. We can crawl things automatically, and we
can also allow that crowdsourced idea to see what floats to the top. For my Twitter feed, there may be crickets,
but for others, there could be real importance. – One interesting point I think is worth making
is that, as we create more and more digital objects, it’s my view that, if you wanted
to preserve it, you should have the option and the capacity to do so, with the technology
and so on. Even though someone else might not care about
it, you might. It’s not that I’m arguing that we should preserve
everything, but we should have the ability to preserve things that we would like to hang
onto for legacy reasons. – We’ve got a real privacy issue, though,
out there. In general, people just don’t want to feel
like they’re being taken advantage of. We’ve had, over the last 25 years, so the
era of the web, a tremendous sharing experiment going on, that, when my parents went to college,
they were warned, “Keep your heads down.” It was the McCarthy era. Bad things can happen to you if you join the
wrong club. My sister had a little journal that she had
a key, and she hid under her bed, basically to keep it from me. The equivalent of my sister, I think, has
that on her public Facebook page now. We have, basically, a real generation that
has grown up sharing. If we don’t live up to the requirements to
not take advantage of people, where Snowden went and showed that things were going wrong
in that direction, I think some of the business practices of some of these large platform
companies are showing ourselves going wrong in these areas. People are gonna clam back up again. The context will shift in such a way, having
private information about you, whether it’s dashcams or who you’re hanging out with, will
become worse than having it be exposed. Societally, we are the establishment; we've got to go and prevent society from making it so that sharing is going to be used against you. – That relates to another audience question. Oh, sorry. – That's okay. Maybe I can just comment a little bit more
on how to start thinking about what to preserve and not to preserve. Often, the question is asked very abstractly,
but even among us, the computer scientists, and just talking for myself, when I start
a project, what do I think about? It’s very hard to think, when you’re starting
something, about what’s gonna happen at the end of the project, and that’s reflected in
my behavior. You have to see how I write papers. You have draft one, draft two, draft 2A, draft
2B, 2C, meaning that I’m creating this whole stack of documents that I know are completely
transient, yet I do not dare press the delete button just in case. I’m accumulating lots of data that’s supporting
my activity. The question is whether, with this awareness of the costs involved, any digital thing I create is forever. It's like giving birth to a child that never grows up. It needs to be fed. It needs to be placed somewhere, taken care of. If you had that in mind, that would possibly change our practices, but organizationally, collectively, if you just go back to our universities and ask, "What is the strategy at the university level for each individual that's working there to have a life-long record of what they have done?" We don't have to go very far into the abstract. We start from ourselves. – I will just point out that, if you were
a major author from 100 years ago, all those different drafts would be the subject of deep
scholarly interest by somebody. Again, it’s the trade-off. Satya, you wanted to talk, go. – I was just going to say, in the abstract,
the question of what to save and what to throw away is a very difficult question and has
to be answered contextually. For this audience, I’d like to suggest one
goal. It sounds like a very humble goal, but I think it's very difficult. For a long time now, the PhD dissertations of every student have been preserved in the university libraries and in various other places. What would it take so that, from now onwards, the actual working code of every experimental PhD thesis is forever executable? Just think about it. That is a tiny fraction of the big problem, but just that is a point capture, the final version of the execution. We would be running Ivan Sutherland's original
code on an emulated PDP-1 instead of seeing the video that we saw. – I’m gonna jump backwards from that point. We can come back to it, but I wanted to, because
an audience question hinged on this question of preservation and raised an interesting
dimension, which is, the question was, when does data preservation actually become unethical? Don’t people have, in certain situations at
least, a right for their past online behavior, or indeed, even their private digital behavior
that somehow became public by accident, to be not remembered? What are our obligations as preservationists
in that regard? I’m seeing a lot of heads nodding. – This issue, actually, came up in a very
painful way in Germany recently. A young girl who was 13 years old passed away,
and the parents wanted access to her social networking account. I don’t recall which company. The parents were refused access to that account
on the grounds that she was 13 and not eight. I don’t know, 13 must be a special number
in Germany, but the point there was, it struck me that, here was a case where parents were
denied access to a child’s content. You can see, probably, both sides of that. It’s a good example of a worked problem in
that space. – Here’s a partial answer. I think the problem is extremely difficult. I think, in its absolute form, I think it’s
unsolvable. In a world in which caching is possible, you
do not even know how many copies have been made of these bits. – [Vint] That’s a good point. – I think, formulating this problem as a best
effort problem, that is, there is an entity. The entity, to the extent that it’s aware
of all existing copies, can do the best it can. I don’t believe there’s ever going to be a
perfect solution to it because, if somebody ever viewed it, those bits are in their browser cache or their file system cache. – Some stuff shouldn't just be kept in the
first place. Almost all IP addresses on web servers should
be thrown out before they’re written to disk. It’s toxic waste. This stuff is basically tracking the reading
behavior of millions of people, and they’re not really aware of it. If you have a security issue, then turn it
back on again and run it for a while, getting IP addresses, but in general, get rid of them. I think we need to countermand some of the
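As a concrete sketch of "throw it out before it's written to disk," here is one way a logging pipeline might truncate client IPs before a request line is persisted; the log format and the truncate-two-octets policy are assumptions for illustration only:

    import re

    IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.\d{1,3}\.\d{1,3}\b")

    def scrub(log_line):
        """Zero the last two octets of any IPv4 address so reading behavior
        can't be tied back to an individual machine."""
        return IPV4.sub(r"\1.\2.0.0", log_line)

    # Hypothetical common-log-format line; 203.0.113.42 is a documentation address.
    line = '203.0.113.42 - - [21/Jun/2017:12:00:00] "GET /details/oregon-trail HTTP/1.1" 200'
    print(scrub(line))  # the address comes out as 203.0.0.0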
original, we don’t know anything that’s going to be valuable, let’s keep all of it, I think
we need to take some judicious thought. We really enjoy, actually, not having IP addresses. It shortens conversations with certain people
that come make demands of the Internet Archive about reading behavior of our patrons. We say, “We just don’t have it,” and they
go, and turn around, and leave usually. Sometimes, we get national security letters. There’s real poison in some of this data that’s
going on, and I think we need to push back against some of our urge to save everything,
especially the things that are surreptitiously gathered. It’s a balance. We have to have these conversations in the
open, and often, they’re not. A lot of companies are kind of afraid of even
having researchers be able to look at search logs, things like that, because it is personally
identifiable if you work hard enough at it. It’s all nuance. It’s context, and the context can shift as
political environments, or in different countries, or where we may think that things are okay
here, but other people may not trust the United States. That kind of thing is very much endemic to
our whole world of digital librarianship. – It really relates to something that we computer
scientists should think about. It’s about the transparency in the design
that, in the name of simplicity and actually user interfaces, we hide a lot. It would be overwhelming for end users to
know everything that’s happening in the system. Then it’s about, who decides what is gonna
get into the interface? What should users know about? It’s not this thing about data, this digital
form. It, again, always has to be related to the tools and infrastructure being used to engage with it. When we are now at the technical and ethical
issues, I’d like to bring up one other that’s very related. If somebody has purposely designed a user
system to create the data and that is valuable, how do we, as creators of this technology,
ensure that they can continue to use that? If you remove the software, they have nothing. Without our software to use the data, interpret
data, analyze the data, the value of that data is zero. That brings us to another ethical issue, how
do we plan for the gradual aging of software so that we do not affect other people's assets? The ACM code of ethics says it nicely. It says that we are responsible that the systems we create should not harm end users and should not destroy any assets. – I point out that the issue has always been
a problem, even in antiquity, with, for example, correspondence: where someone sends someone else a letter, what is the responsibility of the recipient in terms of the content of that letter? Commentary on religion and philosophy contained in it can be explosive: Copernicus, Galileo. I think the social media example now is particularly
apt, and I think we’re responsible for the amplification that we’ve created with technology
because this technology is a huge amplifier that we can’t just create and then say, “Well,
there you are.” I mean, I think we have a responsibility to
have a dialogue and answer some of these questions. – You know that there has been legislation
in the European Union about something called the right to be forgotten where people can
assert that they want something taken out of the index of search engines. Of course, that doesn’t cause it to be removed,
in Satya’s point, it doesn’t cause it to be removed from the web server, so it might be
discoverable by other means. It’s a little difficult to chew on that, but
there it is. That’s one response that’s been made to that
question. – Other things? This is an interesting one. Part of preserving data is preserving the
meaning. How do we truly do that without requiring
a brain dump of the original owner? I find myself thinking about Satya’s example
of running Ivan’s software. The answer is, can you reconstruct Ivan from
that era? – That would be impossible, but maybe the
software is a good start. I don’t know. Meaning is such a deep, deep word with so
many layers of subtlety. I will settle for execution fidelity. That's hard enough. – In scientific papers, wouldn't it be terrific
if, instead of having a PDF, you had sort of a living mobile website that had the paper,
that had the data and the graphs, and the software was there, and the data that is there,
and it’s all encapsulated, and in such away that you could go and change some of the data,
and it would change the graphs, or you could change the software, and it’s basically an
executable versioned object? That kind of thing, especially if we could
make it portable so it could live in several libraries as opposed to just being in one
publisher that swears that they’ll keep it around or something, if we have an ability
to go and have executables that have the paper, which is kind of the explanation of what that
data means, or at least, this is an interpretation of one cut, that would be a step towards explaining
the data. When we receive just datasets that have rows,
and columns, and some titles, it’s almost completely useless to other people coming
along. It’s gotta be built into something where there’s
the software, there’s the text, and at least a use of it in such a way that people know
where to tweak and go with it. That’s an approach. – So there’s a worked example of that. In May this year at the Computer History Museum,
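One way to read this "living paper" idea is as a self-describing bundle: the text, the data, the analysis code, and a manifest recording what belongs together. A minimal sketch in Python; the file names and manifest fields are assumptions, not an existing standard:

    import hashlib
    import json
    import zipfile
    from pathlib import Path

    def pack_paper(out, paper, data, code):
        """Bundle a paper with its data and analysis code, plus a manifest of
        checksums so later readers can verify what they received."""
        files = [paper, data, code]
        manifest = {
            "format": "executable-paper-bundle/0.1",  # hypothetical identifier
            "entry_point": code,                      # run this to regenerate the figures
            "checksums": {f: hashlib.sha256(Path(f).read_bytes()).hexdigest()
                          for f in files},
        }
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as z:
            for f in files:
                z.write(f)
            z.writestr("manifest.json", json.dumps(manifest, indent=2))

    # Usage with hypothetical file names:
    # pack_paper("paper-bundle.zip", "paper.pdf", "results.csv", "analysis.py")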
– So there's a worked example of that. In May this year at the Computer History Museum, the National Academy of Engineering had a symposium, and a young man named Bret Victor
demonstrated exactly this capability. He showed how he would like people to write
papers for the scientific community which included not only the charts and the graphs,
but you could click on them, open up the database, run the program, change the spreadsheets,
do a bunch of other things as a way of reaching into history, in some sense, and exploring
what the capabilities were of the time, and maybe even injecting new data to find out
what they would have predicted or analyzed as a result. The answer is, it might be feasible to do
that. It’s a really cool idea. – There’s a subtlety here that needs to be
kept in mind. For example, giving the source code of something,
in theory, gives you the ability to execute it into the future. You can forward port it if the world around
it has changed, and then get it running again. I think we have to keep in mind how much time
and effort that involves. At the heart of this, we want the ability
for quick executability. If that indicates that the object is interesting,
then maybe you’ll spend the six weeks needed to forward port the source code, but implicit
in a lot of this, which I don’t think we have mentioned so far, is the upfront cost you
have to pay before you can get something working again. I think keeping that in mind is actually important. If you make something very difficult, people
simply won’t do it. On the other hand, if a piece of code is easily
executable, say in the context of the scientific method, people will actually try it out and
see whether they’re getting your results, which is overall an excellent thing for the
field. – I would like to say that meaning is always
gonna involve the interpretive user, and when the thing that you want to understand is behind
a paywall, or isn’t easily accessible, you limit some of that. I love the Internet Archive and the ability
to access things more freely, and I would love to see our scholarly work continue to
build these more robust objects, and then make them freely available so that people
can interpret them and ascribe meaning. – Just keep in mind that it costs money to
do that. There has to be a business model that makes
this work. – That’s why we have the government, so that
they can pay for the things that we need. – This is exactly what UNESCO is trying to
do. UNESCO has 200 member states and affiliate
states and the Memory of the World program. Objectively, it really is, to preserve the
digital artifacts together with physical, in a digital form now, so the future generations
can read it, can reason about the past, and they would interpret things, so they would
give meaning in their own contexts. The issue for UNESCO is finding, determining,
what are the business models that could invigorate the economy so that there is demand for the
past digital artifacts to be reinstated? That’s why we need to look at this holistically,
practically, and look across disciplines science, education, cultural institution, memory institution
together because the main problem with the past is it’s not accessed often. Business models that are based on frequency
won’t cut it. The question then is, what would be the right
business model to provide access for rare use, but possibly very important use? One of the things that UNESCO is doing now,
it put in place a program called Persist. As part of the Persist, the idea is to create
the bank of all software that is relevant to preservation. The idea is to create an international bank
where the environments in which the software would be hosted. The issue of cost is gonna be, who would be
able to pay? And what sort of services would such a bank
provide to the rest of the ecosystem so that it can sustain itself? There are a number of ways one can do this,
but one could be going to the governments and taxing people, actually, first taxing
the companies that are releasing the software. It's a very mild tax. When somebody releases a piece of software, perhaps they should use UNESCO as a retirement place for that software, just in case: if the company's not in business anymore, why wouldn't that software be in a bank? It has to be win-win. We cannot just assume that industry, commercial,
this is evil. No, they cannot sustain things. Economically, it’s impossible to sustain the
whole history of the software that they had. That’s why there must be something in the
ecosystem that would take on the secondary market, and then have the final resting place; we call it the final retirement place, which UNESCO may serve as. – I've got a counterpoint to that. I think we do need institutional support for
these things, for some of the long term aspects, but I think we need to empower the fanatics. The big heroes in our world have been people
that really care deeply about something that are outside of institutions, and there often
are these communities that gather these materials, some of it legal, some of it fringey. The idea of going and putting an umbrella
over some of these activities to go and say, "Look, you've got cover. Keep going." You may not be able to distribute it infinitely,
but some of the best software archiving that I’ve seen done, the best music library, I
think, ever built by humans was called What.CD. It was brought down last year by the French
cyberpolice, but it was unbelievably well done in terms of the metadata. It was an underground network of fanatics. These are people that are mostly in the industry
that want to make sure that the stuff survives. If we can go and build bridges to make it
feel safer, for Twitter to donate their archives to the Library of Congress and not feel like it's going to be a problem for them, or for these fanatics. I think, if we can empower in a decentralized
way, we can all prosper from this. We’ll need some of the UNESCOs, but asking
UNESCO to solve the problem, I think, is unlikely. – Can I just comment? It is not about UNESCO solving the problem. In fact, UNESCO is trying to provide this umbrella to empower individual organizations like this because, you wouldn't believe how often people come to UNESCO and say, "I've got this fantastic thing. I got it from Twitter, but I'm not sure that I would be able to maintain it for the next 10 years. Can you do something?" Then UNESCO gives them a stamp, and they say, "Go now and get money for it." In a way, UNESCO is not to become a centralizer, a depository of everything; it's really to enable people who need help, to get society aware of the issue, and hopefully to get lots of rich donors to give the money to sustain this for
the future. – Okay, we’re actually within a few seconds
of finishing. I think, with that hope for preservation,
we’re at just about the end. I wanted to give anybody who had one last
sentence or two they wanted to put in before we finish, just to say something. – Just one thought. Brewster’s comments suggest to me that we
should be building lots of tools that will allow people to preserve digital objects if
they wish to do so. – Exactly. Alrighty. – That’s it. – That’s it. – Okay. – Thank you so much. – Thank you, Brewster. Always a pleasure.
