ACDH Lecture 4.1 – Jennifer Edmond – What can Big Data Research Learn from the Humanities?

Thank you to the Center for having me
and thank you to Charlie for that lovely introduction. You may be wondering first of all, given my biography that was just given, I do
work a lot in digital research infrastructure, and particularly at the European level,
but when you work in research infrastructure and I think to a
certain extent these days when you work in the digital humanities,
you start to find that you’ve put yourself into sometimes uncomfortable positions. You
find that you’re creating systems and very large systems that may have an
impact on how we understand history, how we understand culture, and from that
slight dis-ease came for me an interest in questions of, well, how do the arts and
humanities help us to understand technology. We talk a lot about how technology
helps us understand the humanities, but I was interested in the
other side and that’s a bit of what you’re going to hear today – although if
you want to talk about research infrastructure afterwards
I’m always happy. Thinking about this question of, ok, well, if technology is helping us
understand the humanities what can the humanities help us understand about
technology, there are a few friends I’ve made, intellectually, along
the way. And certainly when Alan Liu started to talk about the cultural
singularity and how the lack of a cultural criticism was blocking the
digital humanities from becoming a full partner of the humanities, I felt that
this was an important moment. And I felt the same about perspectives such as the one
that Gary Hall puts forward about how it’s not just interesting what
computer science can offer the humanities but what the humanities can offer computer
science, and I thought there was a really interesting question there. And within my
institution I also wear another hat: I’m an investigator in a large national
computer science research institute. And this raises all these questions;
the conversations you have, as someone trained in literature, sitting in a
research institute for personalization and adaptive computing, do lead you to
ask these questions and come up with some answers. And the main project I’m
going to be talking about, the real context for what I’m gonna be telling
you about today is a project we call kplex or knowledge complexity. kplex is
very interesting – I don’t know if any of you have had the joy of applying for
European research funding – but this is what’s called a “sister project” –
you probably didn’t even know there was such a thing as sister projects.
And the sister projects are an instrument that was devised so that researchers coming
from an arts and humanities background could help to expose bias in
computational research. And I thought that’s the kind of thing we need to be
doing more of. So this is actually a project that’s affiliated with the big
data PPP – the public-private partnership in big data – which says to you that not
only is there a research imperative there but also a corporate imperative
that we’re resisting in our sister-project, bias-finding sort of way. And
what we proposed to do is to look at a number of things within the culture
of big data research that we thought might expose some biases and give us
some ways to look at possible interventions from a humanistic point of view
that could be made to improve the research in terms of its social impact and
in terms of its technological robustness, because we do believe that if
you improve the research generally, you can improve the technology as well.
So the things that we’re looking at primarily are first of all discourses
of data, how we talk about data, because how we talk about things –
I don’t need to tell people in this room – how we talk about things is important to
how we understand them. We talk about hidden data. Not necessarily things that are
hidden to keep them hidden, but hidden because of accidents of history. Largely
this comes from a perspective of looking at cultural heritage collections in
Europe where you would have many that are very very well exposed – the UK, France,
Germany – and then you’d have others that are essentially invisible from a digital
point of view, primarily, for example, Eastern Europe. We look at what we call
the epistemic marking of data. I’ll talk later about work about
how data is never raw. You always have someone who created the
data. And if they didn’t create the data, they created the instrument. And if they didn’t
create the instrument, they created the sensor. So data always comes from somewhere. There’s always a human bias in it. And, finally,
we’re looking at complexity, and the representations of complexity, and how in
technological systems sometimes these representations of complexity can be lost,
can be smoothed over to our detriment as users. And you can see we
have a number of partners. It’s not a big project, but it’s been a very influential
one. And it all started, I think, the day I saw this billboard in the London
Underground. It wasn’t exactly this billboard because my picture is not as good
as this. The fact that there was a data analytics company out there who could
imply that analyzing big data was the secret to living happily ever after
disturbed me greatly because there is this almost fetishization of big data.
And we know that big data can be powerful, we know that it can be deployed
for example towards public health crises and to answer certain kinds of
questions, but the idea that generically you could say, you could take the fairy
tale trope, you know, taking the literary trope – that’s my turf – so I felt like
we had to push back against this and find out, well, where is the real
intersection, because we know as well that AI and Big Data – obviously
they’re related phenomena, are quite different in some ways but they have
similar effects sociologically – have baked in prejudices, they have baked in
biases. So if we in this project were there to expose biases, then this was
certainly the place to start. So I’m going to take a few topics out of the
universe of our project and expand a bit about them. And the first one I want to
talk about is words – amongst humanists you can always talk about words. And we
started by extracting – and you don’t need to read all this it’s more there to show
you that this exists – we started extracting some of the definitions of
data that are out there – now this is actually from the scientific literature
about data. So these are the people who are actually writing so as to define
data. And you find that there’s such a variety there, that it’s
actually quite difficult to find any sort of coherence. You have data as
pre-analytical. It’s pre factual. It’s false. But data that is false is still
data. Data has no truth. It is resisting analysis. It’s neither truth
nor reality, but it may be facts. It’s a fiction of data. It’s
an illusion. It’s performative. It is a sort of actor, and it has a very distinct
set of properties for others, for example the difference between data and capta,
something that is given and something that is taken. So we knew, once we found
this kind of diversity even in the scientific discourse of
people who are studying science and looking at data, that we were going to
find more when we looked into the practice of this. So the next thing we
looked at is, we looked at the ways in which big data researchers talk about
data. And what we kept finding is: big data researchers talk about data all the time. You can see: 659 occurrences across, what do
we have here, two, four, six papers. And in fact the worst offender we found of
using the word data so much that it almost becomes empty was one paper – 21
pages long – in which the word data was used five hundred times. So when you look at that, you realize they can’t be meaning the
same thing every time. And what we did find – again, digging through some of these
papers – is that data can mean comparatively simple strings, or it can
be complex human created records. Same word, two very different phenomena.
It can be simple records or complex hybrid objects. So it can be
individual records, or it can be agglomerations of records. It can be
something newly drawn out of the environment, or it can be something previously available
for access analysis and navigation. It can be pre-epistemic.
It can be pre-processed. It can be of direct use to humans, or purely machine
readable. And, of course, it can inhabit all sorts of different
qualities. It can be relevant, contextual, various, external, complex, rich. Note that
none of these actually tell us what the data is, they just tell us more or less
how the researcher feels about it. And my researcher pulled out a couple of quotes
and was kind of stomping around the office one day with these simply because
she felt that this was really, these were indicative of the way in which not only
was this your normal sort of jargon, but the fact that the word data is so
prevalent in these statements that are made makes it almost obscure to anyone,
makes it obscure to understanding. So: “Data pretreatment module is outside
from online component and it’s done to preprocess stream data from the
original data which is produced by the previous component in the form of data
stream.” Or: “We calculate the standard deviation
for the entire data in the stream to check whether all of the data are of the same value or not.” Or: “Due to visiting data once during the processing data in stream, the performance of processing data is crucial.” Data, data, data. And we thought, okay, is this just the fact that we’re looking
at big data research, is there something here that is unique?
What’s interesting is you do have scenarios and schemas and standards for
how to talk about different levels of data, for example the NASA data
levels, but what we found interesting about this is the transformations that
occur as you work through data. So the cleaning, the scrubbing, the cleaning and
scrubbing, as opposed to the dirty data by the way – note the words there and
how some are positively valenced and some are negatively valenced. Well, what’s
interesting about NASA is that you can process data up to a level of maybe four,
up to a level of five and so then you have much more refined data. If another
researcher takes that data to use in a different context, it reverts back to
level zero. So even the more well-defined and well developed schemas for working
with data, they have a very different sort of way of viewing provenance and
the impact that the individual researchers may have on what they’re
doing to the data. Now I began to wonder if this was just an epistemic thing. So,
and this is based on some work I did a few years ago: does it
have to do with the way humanists – and in this case particularly, this was work done by historians – create data and how they view data? Because if you read
work on epistemic cultures, like Karin Knorr Cetina’s Epistemic Cultures, you
see that there’s a real difference. There’s a tendency to say that
humanists don’t collaborate, or that humanists, you know, the epistemic process
is entirely encapsulated in the writing, or you have others who say that humanists
don’t create knowledge at all, they make it all up. You know, you’ve heard all of
these better and worse conceptions about what makes the two cultures debate.
For me it stands in the instrumentation. So, for a physicist working at CERN
or at the European Spallation Source, the question of instrumentation will have to do with physical instruments. In a microbiology
lab it has to do with repeatable processes. For the humanists, it’s about
layering different kinds of source materials. So you have your primary
sources, your secondary sources; it’s more like building a dry stone wall in which
you’ll see gaps. But one of the things that you note is if you’re looking at
these kinds of things that go into that humanistic instrument, you’re not really
gonna find anything that you can even pretend to call raw data. The
fingerprints of the human beings who’ve come before are always front and center
within that kind of source material, which brought us to thinking about the
differences between this one epistemic culture where the word data was so
prevalent and our own culture, which leads me to the ‘you say tomato I say
data’ because all of these words that we were finding, which were so diverse in
humanities research, were actually quite the same. So every one of these can be
mapped to the word data in some ways in computer science research. So if there is
one sort of leitmotif, or ein roter Faden, for the work I’m presenting to
you today, it is that there’s a lot more confidence that we as
humanities researchers or those from a background in humanities research
can take when looking at technology. Because there’s a lot that we can see
and a lot that we can sense and a lot that we do differently in very positive
ways. So, what does all this mean when we come to big data? Well, big data
essentially magnifies these issues “bigly”. Any word that has
anything to do with big is very popular in my office right now.
Because obviously magnification of errors makes them bigger.
Magnification of misunderstandings makes them bigger. And when you have larger and
larger agglomerations the likelihood that these are going
to come in in ways that affect what can be done with the data rises. The
black boxes get deeper. They get blacker. And then there’s this risk of what we
call epistemological fallout, where if interdisciplinary work is grounded on
manifold unresolved and undocumented and potentially contradictory, aberrant
and idiosyncratic understandings of the term, then you can come to a point of
crisis. And I think anyone who works in the digital humanities has had that
conversation. I need the data. I gave you the data. What, do you want more data? Well,
I have the data. You can have these entire conversations where two people
think they agree but they mean something completely different
around single terms, single words, such as data. Data is only one. So you can have
problems coming out of this and one of the things we’re realizing is that the
problems are not just in research. The problems are also potentially social
because we know there are problems with how people out in the world deal with
their data in terms of privacy and in terms of how they develop their own
identities, how they interact with their worlds. I think a good example of this,
again, this is from the digital humanities, but I think it points in the
direction of the importance of what we call things. My engineering colleagues
have often said to me: “I have a problem to solve. I don’t want to talk about what we call things.
I want to solve the problem.” And I recognize that
that is almost a caricature of an engineering bias, but we need to be very
careful. And I don’t know if you know this Twitter back and forth between
Miriam Posner and Bethany Nowviskie. This came on the heels of a funding call,
the Digging Into Data Challenge, where the number of female applicants was so
low as to be very noticeable, and when it was queried, the funder said: “Well,
we’d love to have more female applicants, but there was no bias in the system.” And
the discussion here is, well, is there actually a bias? When you learn
those words like dig and mine, is there something
inherently masculine about that language that causes people maybe to pull back if that’s not how they see their research? So maybe it’s not that what girls really dig
is unicorns and sparkles and boys, but maybe it’s the whole digging-in trope. Is it not my personal brand of scholarship or a rhetorical turn-off? So, again, you have to wonder if words
like data can become maybe not a rhetorical turn-off but a sort of a turn-off
that leads people into a false sense that all data is the same and an
inability to differentiate between the data they don’t want widely shared and
the data that we maybe do want broadcast and widely shared. So that’s
my first topic. My second topic is about memory. Humanists talk a lot about memory.
I think memory and identity are probably two of the largest umbrellas under which
you can group research in the humanities, whether it be into literature, languages,
or culture. But in kplex, in this European sister project, we talk a lot about
memory as it is encoded. Memory as it is held in institutions. Memory as it is
made accessible, as cultural memory, to people who might want to research it or
people who might want to use it. And you may or may not know the ENUMERATE
survey, but when you look at the levels of how much cultural heritage material
in Europe is digitized, particularly if you look at archival material, it’s 13%
and, actually, if you dig into those numbers it’s even a little bit lower,
because a lot of what you find that has been digitized is more the administrative systems and records within the archives. That’s fine. But more and more
there’s going to be an expectation, if it isn’t there already, that in a
big data universe we will all be able to access Big Data approaches, that problems
will be able to be solved, questions will be able to be asked, but if the data is
not there, if the data remains hybrid between the analog and the
digital, what happens then? So we have questions there around how we deal with
the cultural memory of Europe and beyond. And I often ask about
provenance as being an important part of cultural heritage data and I’m often
told, well, there are W3C standards for provenance. And this is a colleague from
a library [who] actually tried to map some of that out. But when I think about the
provenance of cultural heritage, I think about things like this. So this was a
record I found about a collection in the West of Ireland, which is quite
interesting, because obviously the collection was related to papers of
Roger Casement, and it tells the whole story. And I don’t know what parts of this history,
of this narrative, of that particular data set, I
don’t know what parts are the most important. Is it that they relate to
Casement and that they’re in the Clare County Council archives and, by the way,
there’s no particular reason for Casement records to be in Clare. He had
no particularly strong link to Clare. Was it the fact of who they were found
by, the fact that they were kept under lock and key, the fact that the council
didn’t even know they had them? Well, he was a controversial figure. Was it the
fact that it came from a German U-boat or that he was on a German U-boat, was it
the fact that these records were handed over by a member of the European nobility?
I mean, what is important about this provenance and how does that
map on to a standard? How could this be standardized at all? So, again, there are
things that we’re going to remember and things that we’re going to forget. And we
in the digital humanities have always recognized problems with this. And I
think one of my favorite examples of how to really look at these problems is
Todd Presner’s article about the ethics of the algorithm, where he looks
very much at how the Shoah Visual History Archive was marked up in
a way to try and make it into a research resource,
but because it was marked up by humans, you’re always going to find human
fallibility and human interpretation in that. And I would encourage you to read
the article because obviously there are both things that make the algorithm
more ethical, because it allows you to not be distracted either by the
paradigmatic individuals or by the mass of something, the mass of the big data
related to the Holocaust, but also how that meso layer can be problematic
if you have human consciousness behind it. Which leads me to the question of the
European Open Science Cloud. Going from the Shoah Visual History Archive to the
European Open Science Cloud implies all sorts of things, which I’m not
necessarily going to dig into, but there is an expectation that research in
Europe in the next not even five years, in the next two years will become
underpinned by this cloud of data where we’re all going to share our data. Now, if
you are or were a humanist, you recognize that there’s a problem
here. Because I work a lot with historians and they don’t own their data.
They have a shared ownership of their data with the cultural heritage
institutions. I as a literary scholar don’t own my data. I share it with the
publishers, I share it with the authors. And it was really disturbing to me to
see in the programme for the governance of the European Open Science
Cloud, which is going into build phase, – this is coming – and it will be something
we will all have to use, which with my DARIAH hat on I do worry about. You look
at that list of stakeholders: Where are the publishers? Where are the libraries?
Where are the museums? Where are the archives? They’re actually not there. So
the whole idea that there would be research data that would have this kind
of complex social embeddedness is something that the European Commission
even looking at research data and trying to move us to the point where we can ask
questions and discover knowledge in the big data of European research,
even there, we’re finding blind spots that as a humanist seem, well, rather
obvious. And of course there are other assumptions that we make around big data
and memory. There’s a lot of people who think, well, the fact that we have the
Internet Archive is fine, the fact that we have the Wayback Machine means that
digital memory is protected, but I certainly was surprised when I first
realized not only that there’s a lot of link rot within the digital
archiving but also that the use of the Memento protocol, which allows sites to
be sampled at different times in different measures, means that you can
find sites that actually never existed, where the pastiche of pieces coming
together means that what you have is records of a history that never was. And
that’s a little bit scary as someone who has a deep investment in the
importance of historical research. Now of course there are social levels for memory
and forgetting as well. And I think in Europe we are in an interesting place
because obviously we are the place where you can have a public dialogue and
a court based, a legal dialogue about the right to be forgotten. And of course
we’re all looking towards the idea that the General Data Protection
Regulation is going to make our lives as scientists perhaps more
difficult but also introduce protections for people in the world of big data.
But on a more fundamental level, I think that there are things that we are outsourcing
about how we culturally remember and culturally forget. And I thought that
this quotation from Mayer-Schönberger was really interesting. The whole idea
that without some form of forgetting, forgiving becomes a difficult undertaking.
This is precisely that kind of human value that we’re seeing eroded
in the anonymity of the Internet. So the question is how can we build better
structures for both the remembering and the forgetting in the digital age?
And the third topic I wanted to talk about was complexity. I love it when computer
science researchers say: “We want to reduce complexity.” I say: “No, don’t take
away my complexity. I need my complexity, but I need a way
through it.” And, again, one of the starting points for me for thinking about this is
the fact that raw data really is an oxymoron. There is no such thing as raw
data. And one of the examples I like to give and one of the examples we’re
looking at as a sort of a place to investigate this in kplex is machine
translation. So, Google Translate, you know, a haiku, a Japanese haiku, a famous Japanese
haiku: Google Translate gives us “the sound of water to dive an old pond frog.”
Okay, so, it gives us a bit of a word salad, but what I think is more
interesting is what human beings have done with this in the past. “Old pond
frogs jumped in sound of the water.” Lovely. Lafcadio Hearn,
the Irish-Japanese patriot of two countries. “The old pond frog jumped in kerplunk.”
Well, that’s gotta be Allen Ginsberg. You know, a nice sense of the rhythm in the
sound of the language. And, of course, I do live not too far from Limerick in
Ireland. So we have: “There once was a curious frog who sat by a pond on a log
and to see what resulted in the pond catapulted with a water-noise heard round
the bog.” Each of these takes the culture underpinning furu ike ya and makes use of
it in a different way and exposes it and plays with it. How can that stand against
the word salad? Now, okay, so, maybe giving Japanese haiku to Google
Translate wasn’t fair but then I see things like this and I think, okay, we don’t play fair. So this was a Mark Zuckerberg post from the day when
Facebook released their deep learning algorithms underlying their machine translation. I’m gonna talk about deep learning in a
second. But I want to talk about hubris first. And, of course, he’s very pleased with himself. And it is good that Facebook was sharing
their algorithms. I have no question that this is good for computer science research.
But then we kind of get to the end of the post: “Throughout human history
language has been a barrier to communication.” I’d like to know what he’d
like to suggest we use instead. “It’s amazing we get to live in a time when
technology can change that. Understanding someone’s language brings you closer to
them and I’m looking forward to making universal translation a reality. To help
us get there faster, we’re sharing our work publicly so that all researchers
can use it to build better translation tools.” Knowing the translation of your
words does not mean that I am closer to you. That does not build intimacy.
It has a place, but I’m not sure this is it. And this is the question
where I start to think, ok, well, where are the boundaries? How can we start to
understand what technology can do and where technology can end? Because this is
the conversation I keep having. To come back for a second to those deep learning
algorithms. And I’m sorry for the quality of this. So one of the partners in the
kplex project is a Latvian SME, and they’re very committed to building
machine translation engines for smaller languages, like Latvian. And, so, they’re
working on a newer neural-network-based system. So, they put these four source
sentences into an engine. So: “Characteristic specialties of Latvian
cuisine are bacon pies and refreshing cold sour cream soup.” “Demand for mobile
telephones and Internet access has exploded.” “An insider’s guide to drinking
sake in Tokyo.” And: “Part bookshop, part gallery, NADiff highlights Japan’s
deep appreciation for art and design.” Okay, so, they’re doing this in a kind of
a tourism context. So far, so good. All four of those statements came back with
the same translation, which is there in Latvian, which translates back to
English as: fast wireless Internet is available free of charge in the guest
bedrooms. I’m so sorry. I’m so sorry. When I get excited, I start to speak too fast.
So we have the fast wireless internet available free in the guest bedrooms. So
the way this was explained to me is that somewhere in the black box of the
machine learning, there is a place where obviously, although they couldn’t say
exactly where, where obviously there was a connection made between those kinds of
sentences and that sentence. And that connection started to dominate what the
learning algorithm saw as a correct translation. And this is a real problem
with these kind of deep neural networks, because we don’t necessarily know, as
with a lot of machine learning or a lot of AI, we don’t necessarily know what’s
happening in the black box, which is really interesting, because in my mind,
once you get back to that question of not really knowing what happened and
having to make a judgment call, having to make an informed analysis of material
like that, you’re coming back to the humanities. But that’s another question.
We also have work going on about the emotional side of things. So, again, we
talk about culture. We talk about memory. We also talk about identity and emotion.
And this isn’t anything new. “Alone Together” has been out since 2011. But if
you go back, you can find all of this kind of techno-skepticism going back. But
one of the things we’re looking at and querying is whether AI can be emotional
or indeed ethical. Clearly, the developers of humanoid robots like PEPPER or PARO – I
don’t know if you know PARO. PARO is a fuzzy little fur seal which responds
emotionally to you. And on some level it’s interesting, on some level it’s
quite frightening and quite touching in a way. But the questions
that you find when you talk to AI researchers are not “Can it be done? Should
it be done?” but just how it can be done. And I really find the question of
whether, for example, you know, probably if you’ve studied philosophy, you know the
trolley problem: this question of if you have a choice to actually bring about the
death of many or the death of few. I mean, are there some deaths that mean more
than others in the face- ? I was told by AI researchers that that was an irrelevant
question, that this was never going to happen. And yet, Lexus was actually
exposed as looking into whether protecting the life of the driver of a
driverless car, of an automatically controlled vehicle, whether protecting
that driver at all costs was a corporate policy. So the trolley problem has become
real. And, of course, I think, when you start to say: I can program ethics, you
just need to tell me how. I fall back on the fact that an ethical stance is an
essentially human position. It is one fallible mortal being being able to take
responsibility for another. So there’s a lot of questions being raised there. In
the light of where we are and just to give you a sense of the project:
we’re about a year into a project that runs a year and three months. So we’re
actually going into our write up phase now and we’re looking towards the kinds
of recommendations we can make. And there’s been quite a lot of quantitative
work that will all be released in the fullness of time, but what I wanted
to do as a way of wrapping up this presentation is make five modest
proposals. And I think these are proposals for humanists but also for
digital humanists because if you work in the digital humanities you generally
occupy I don’t want to say a unique but a privileged position of being able to
understand both of those cultures, both of the epistemic cultures of the
humanities where there’s a certain prevalence and preference for sources
and for ways of thinking and ways of investigating but also the software
engineering side and the big data and the AI and the questions, the
conditions of possibility for knowledge creation in the two. So, I would say, these are things that we can investigate.
So the first modest proposal and this is specifically looking towards the
European Open Science Cloud but more than that. I think we really need a
discussion of if we’re going to create knowledge from Big Data, we need to talk
about what kinds of questions you can ask of Big Data. How do you learn to ask
research questions that can engage data from a sensor, an environmental sensor
and a literary text, and historical records? Do we know how to ask those
questions and if so, do we know how to ask them in a way that will engage the
data that we are going to be offered? And, of course, this question of shared
ownership. Because the shared ownership is not just important for archives and
researchers. Shared ownership also exists between you and me and Facebook and the
sensors that are taking our information and the new Amazon
grocery store where there’s no one at the till you just take what you want and
walk out and it knows what you have. There’s a shared ownership of data there
as well and this isn’t always respected. And one of the things we’re looking at
in kplex in particular in terms of shared ownership but also provenance is
the question of a data passport. So if data in a European Open Science Cloud –
obviously it’s going to have some metadata attached – but how can we get
that beyond a standard into something that really reflects where this data has
come from, what it has gone through and what has been done to it, how it’s
been transformed and what can be done with it going forward. And, again, the
Commissioner has said it. He believes that the most exciting and groundbreaking
work is happening at the intersection of disciplines. So if we
want to take him up on his offer of a European Open Science Cloud, we really
need to think about how to do it well, and that is for everyone, and I think the
humanists are in a good position to actually make a real impact there. But
also something that we’re doing in DARIAH, the European research infrastructure
that I mentioned, is we’re bringing together stakeholders
to try and develop a data reuse charter because we recognize that the individual
researcher does not feel empowered to necessarily reuse data. They don’t know:
the paper that they signed for the archive, does that mean that they can put
the data in an open repository? Does that mean that they have to keep it private
to themselves? What are the conditions for sharing data? Data would be better
available if it was shared more widely. It would be perhaps more sustainable if
it’s shared more widely, but there are still blockages in the cultures
especially between the researchers and the cultural heritage institutions, and
we’re trying to find ways of smoothing that over. So I think this is one of the
things we need to look at. Another thing, as I mentioned, a lot of the problems
that we’re coming to now really need a humanistic approach. And I know there is
science and technology studies, and I have a lot of respect for a lot of the
work done in science and technology studies, but it does tend to be very
social science based. That’s what it is. What about the cultural approaches? What
about the understanding that humanists have of human motivation, of
human values, of human activities, actions? I think there’s a lot to be had there.
And, again, I can’t necessarily recommend that book to you, which was written by
someone coming out of Stanford, writing about the fuzzy and the techie, how they
together make a perfect approach to technology. But it’s actually
interesting that the book exists at all. And the book in itself is interesting for
how it views the way you can get a better intelligence out of combining
these two approaches to knowledge creation. And it’s interesting, I mention fake science here because I was asked last week by someone in the
Commission: well, are historians worried about fake science? I thought this is a
really interesting question. I said, well, you really can’t prove a lot of things
in literary research or historical research. You can’t necessarily prove them right or wrong. So we’ve developed certain ways of
actually showing an argument, of showing a provenance, of showing a way through a
set of source material which may or may not have biases, but at least the biases are exposed. What post-modernism meant to me is that I had
to be careful about my own biases. So, the idea that there are also things that we
can say about the repeatability of science is another thing that has struck
me recently as an approach that humanists and particularly digital
humanists might take. So in this world where knowledge will become more overtly messy, then we need to approach it like a Beckett text, something with doodles and
scribbles and cross-outs and things that we know a lot about. Next,
we need to get past privacy protection and approach identity enrichment as a
goal for big data and AI. Privacy protection – this is a term I’ve taken
straight from the big data PPP. The companies are all on board for privacy
preserving technologies, which is putting, I think, the cart before the horse, but
it’s also ignoring the opportunity costs of what we allow ourselves to not be
exposed to, the ways in which the digital, and in particular the
social media platforms, are affecting identities by
not exposing us to culture. So there’s a gap there as well. So I’d love to see us
move from talking about privacy which clearly has a monetary place in the
minds of the companies to something that is more holistic and that
sees both the in and the out. Because you have people like this writer from The
Guardian who says: I’m a typical millennial; I’m glued to my phone; my
virtual life has fully merged with my real life. There’s no difference anymore.
If that’s gonna be the case, then it would be useful to think about what
kinds of identities are being built there. Two more, quickly. Problem solving
isn’t enough. We need to be thoughtful, imaginative and disciplined about our
engineering and how we speak about it. And here is a place where,
again, you can see very good work starting. I was quite inspired when I
first saw the Copenhagen letter. I don’t know how many of you know this, but it is
an open letter signed by, I think, about 5,000 people at the moment saying if we
are contributing to the building of technology, then we need to keep certain
things in mind. And I would highly recommend you go and look at it
as a move towards having a different kind of conscience within technology
development. And, of course, things like the PLOS Computational Biology
paper on ten simple rules for responsible big data research.
It’s not rocket science, actually. There are things that can be
done. It’s computational biology, of course. There are things that we can do
and if there are values that are going to be emerging – I mean, I’m glad to see
open science emerging as a value for science in Europe – but I’d love to see
things about protecting the user, protecting the individual. I would love
to see things like that emerge in the way we talk about big data and I’d like
to see more focus. I’d like to see more of what I would see as a scientific
rigor about the way we talk about this research coming through. And, finally, I
have a sense from the work I’ve been involved in that there’s always a
tendency towards convergence. We want to converge
everything. Everything will be digitized. Don’t worry. So all we need is the right
digital space. We need the right digital object. The right device. Well, I started
doing this ethnographic work and I started taking pictures of my work
spaces. These work spaces are not really gonna converge. They’re messy and, you know what,
they’re messy for a reason. They’re messy because the information I’m dealing with
is messy. They’re messy because they’re heterogeneous. They’re messy because I’m
working at different levels on different things at the same time, and if you want
me to put it into Microsoft speak, they’re messy because I’m chunking. I’m
micro tasking. You know, some of the stuff is very sexy in the tech world, but that
is something that I think we need to push for more. You know, don’t give me
another VRE, don’t give me a one-stop-shop, give me a
technical intervention that supports the way my research environment works.
Give me a technical intervention that helps the way my life works. And then I
think we’ll have a better chance of that more refined hybrid intelligence: not
artificial, not human, not biased in one way or the other
but able to check and balance itself. Because I do believe, in the end, in the fact
that I am a very human human, I suppose. We can’t feel data. This is why
seeing everything printed strikes you. We are physical creatures. We need
materiality. So, I suppose, to end the talk about big data with the words “we
need materiality” is a slightly strong stance to take, but I hope that we can
discuss it in the questions. Thank you.
