This interview is cross-posted here and on the Open GLAM blog.
At OKCon 2011, we had the opportunity to interview Brewster Kahle, a computer engineer, internet entrepreneur, activist, and digital librarian. He is the founder and director of the Internet Archive, a non-profit digital library with the stated mission of “universal access to all knowledge”. Besides the widely known “Wayback Machine”, where archived copies of most webpages can be accessed, the Internet Archive is also very active in the digitization of books and, with the “Open Library”, provides a free catalog that aims to describe “every book ever published”. Kahle and his wife, Mary Austin, created the Kahle/Austin Foundation, which supports the Internet Archive and other non-profit organizations.
As open data enthusiasts from the library world, we were especially interested in how the activities of the Internet Archive relate to libraries. We wanted to know how its general approach and service could be useful for libraries in Europe.
Brewster Kahle, what is the Internet Archive and what is your vision for its future?
The idea is to build the library of Alexandria version 2. The idea of all books, all music, all video, all lectures, well: kind of everything, available to anybody, anywhere that would want to have access. Of course, it’s not gonna be done by one organisation, but we hope to play a role by helping move forward libraries, ourselves and making as much technology as required to be able to fulfil this goal.
What are the obstacles preventing this from happening at the moment?
We see the world splitting in two parts: There are the hyper-property interests and then there are the hyper-open interests, and I’d say actually the hyper-open is much more successful, but it’s not slowing down those that want to clamp down, shut down, control.
What are the challenges faced by the Internet Archive regarding the digitization of books?
There are two big problems: there is going and building a digital collection, either by digitizing materials or buying electronic books. And the other is: how do you make this available, especially the in-copyright works? For digitizing books, it costs about 10 cents a page to do a beautiful rendition of a book. So, for approximately 30 dollars a book for 300 pages you can do a gorgeous job. Google does it much more quickly and it costs only about 5 dollars for each book. So it really is much less expensive, at lower quality, but they are able to do things at scale. We digitize about 1000 books every day in 23 scanning centers in six countries. We will set up scanning centers anywhere, or, if there are people that would like to staff the scanners themselves, we provide the scanners and all of the backend processing for free, until we run out of scanners, and we’ve got a bunch of them. So we’re looking either for people that want to scan their own collections by providing their own labour, or they can employ us to do it, and all told it is 10 cents a page to complete.
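The figures quoted in this answer can be checked with a little arithmetic. The following is a minimal sketch, not Internet Archive tooling: the function name is hypothetical, and applying the 300-page example to every book scanned in a day is an assumption made purely for illustration.

```python
from decimal import Decimal

# Figures quoted in the interview; everything else here is illustrative.
COST_PER_PAGE_USD = Decimal("0.10")  # "about 10 cents a page"
BOOKS_PER_DAY = 1000                 # "about 1000 books every day"


def scanning_cost(pages: int, cost_per_page: Decimal = COST_PER_PAGE_USD) -> Decimal:
    """Return the estimated cost in US dollars of digitizing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * cost_per_page


# A 300-page book, as in the example given in the answer.
cost_per_book = scanning_cost(300)

# Daily cost if every one of the 1000 books had 300 pages (an assumption).
daily_cost = cost_per_book * BOOKS_PER_DAY

print(f"Per book: ${cost_per_book:.2f}")
print(f"Per day:  ${daily_cost:,.2f}")
```

Using `Decimal` rather than floating point keeps the cent amounts exact, which matters once the per-page rate is multiplied across thousands of books.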
Also, part of building a collection is buying electronic books. But when I say buying, I really mean buying books, not licensing books, books that are yours forever. There are a growing number of publishers that will sell ebooks like music publishers now sell MP3s. That does not mean that we can post them on the internet for free for everybody, it means we can lend them, one person at a time. So if we have a hundred copies, then we can lend them out, it’s very much like normal book collections. This has a nice characteristic that it does not build monopolies. So instead of going in licensing collections that will go away if you stop licensing them, or they are built into one centralized collection like Elsevier, JSTOR, or Hathi Trust, the idea is to buy and own these materials.
Then there is the rights issue on how to make them available. Out of copyright materials are easy, those should be available in bulk – they often aren’t, and for instance Google books are not, they are not allowed to be distributed that way. But open books and libraries and museums that care should not pursue those closed paths.
For the in-copyright materials, what we have done is to work with libraries. We are lending them, so that the visitors to our libraries can now access over 100,000 books that have been contributed by libraries that are participating.
There are now over 1000 libraries in 6 countries that are putting books into this collection that is then available in all other libraries. And these are books from all through the 20th century. So, if there are any libraries that would like to join, all they need to do is contribute a post-1923 book from their collection to be digitized and made available, and provide the IP addresses for their library. Then we will go and turn those on, so that all of those users inside that library or those that are dialing in (for instance via VPN) can borrow any of the 100,000 books. Any patron can borrow five books at one time and the loan period is for two weeks. But there is only one copy of each book circulating at any one time. So it is very good for the long tail. We think that this lending library approach and especially the in-library lending library has been doing very well. We have universities all over the world, we have public libraries all over the world now participating in building a library by libraries for libraries.
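The lending rules described in this answer (five books per patron, a two-week loan period, one loan per book at a time) can be modelled in a few lines. This is a hypothetical in-memory sketch, not the Internet Archive's actual lending system: the class and method names are invented for illustration, and it assumes that a loan simply lapses once its two weeks are over.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, timedelta

MAX_LOANS_PER_PATRON = 5          # "five books at one time"
LOAN_PERIOD = timedelta(weeks=2)  # "the loan period is for two weeks"


@dataclass
class LendingLibrary:
    """Hypothetical model in which each title is a single lendable copy."""

    titles: set[str]
    # Maps a checked-out title to (patron, due date).
    loans: dict[str, tuple[str, date]] = field(default_factory=dict)

    def borrow(self, patron: str, title: str, today: date) -> date:
        """Lend `title` to `patron` and return the due date."""
        if title not in self.titles:
            raise KeyError(f"unknown title: {title}")
        self._expire(today)
        if title in self.loans:
            raise RuntimeError(f"{title!r} is currently checked out")
        held = sum(1 for who, _ in self.loans.values() if who == patron)
        if held >= MAX_LOANS_PER_PATRON:
            raise RuntimeError(f"{patron} already has {held} books on loan")
        due = today + LOAN_PERIOD
        self.loans[title] = (patron, due)
        return due

    def _expire(self, today: date) -> None:
        """Drop loans whose due date has been reached (assumed behaviour)."""
        self.loans = {t: (p, d) for t, (p, d) in self.loans.items() if d > today}


library = LendingLibrary(titles={"Book A", "Book B"})
due = library.borrow("alice", "Book A", date(2011, 7, 1))
print(due.isoformat())
```

A second patron asking for "Book A" before the due date gets an error instead of a copy, which is the one-loan-per-book constraint that keeps this close to how a physical collection circulates.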
You already talked about cooperating with traditional libraries with in-copyright books. How is the cooperation between the Internet Archive – which itself seems to be a library – and traditional libraries in other fields, with respect to the digitization of public domain books for instance?
The way the Internet Archive works with libraries all over the world is by helping them digitize their books, music, video, microfilm, microfiche very cost-effectively. So we just try to cover the costs – meaning we make no profit, we are a non-profit library – and give the full digital copy back to the libraries and also keep a copy to make available on the Internet for free public access. Another way that we work with libraries is through this lending program where libraries just have to donate at least one book – hopefully many more to come – to go and build this collection of in-copyright materials that can then be lent to patrons again that are in the participating libraries. So those are the two main ways that we work with libraries on books. We also work with libraries on building web collections, we work with a dozen or so of the national libraries and also archive it. We have our subscription based services for helping build web collections, but never do things in such a way that if they stop paying, they lose access to data. The idea is to only pay once and have forever.
Are there already cooperations between European libraries and the Internet Archive, and are scanners already located in Europe which could be used for digitization projects in university or national libraries?
Yes, we have scanners now in London and in Edinburgh – the Natural History Museum and the national library – where we are digitizing now. We would like to have more scanners in more places, so for anybody that would be willing to staff one and keep it busy for 40 hours each week, we will provide all of the technology for free, or we can go and cooperate and we can hire the people and operate these scanning centers. We find that a scanning center – i.e. 10 scanners – can scan 30,000 books in a year and it costs about 10 cents a page. It is very efficient and very high quality. This is including fold-outs and difficult materials as well. And it is working right within the libraries, so the librarians have real access to how the whole process is going and what is digitized. It is really up to the libraries. The idea is to get the libraries online as quickly as possible and without the restrictions that come with corporations.
A very important topic is also the digitization of very old materials, valuable materials, old prints, handwritten manuscripts and so on. How do you also deal with these materials?
We have been dealing now with printed material that goes back 500 years with our scanners, sometimes under very special conditions. The 10 cents a page is basically for normal materials. When you are dealing with handwritten materials or large-format materials, it costs a little bit more, just because it takes more time. All it really is, we are dealing with the labour. We are efficient, but it still does cost to hire people to do a good job of this. We are digitizing archives, microfilm, microfiche, printed materials that are bound, unbound materials, moving images, 16 mm as well as 8 mm films, audio recordings, LPs. We can also deal with materials that have already been digitized from other projects to make unified collections.
How do you integrate all these digitized objects and how do you deal with the specific formats that are used to represent and consume digital materials?
We use the metadata that comes from the libraries themselves, so we attach MARC records that come from the libraries to make sure that there’s good metadata. As these books move through time from organization to organization, the metadata and the data stay together. We then take the books after we photograph them, run them through optical character recognition so they can be searched and move them into many different formats: PDF, DjVu, DAISY files for the blind and the dyslexic, and MOBI files for Kindle users. We can also make them available in some of the digital rights management services that the publishers are currently using for their in-print books. So all of these technologies are streamlined because we have digitized well over one million books. These are all available and easily plugged together. In total there are now over two million books that are available on the Internet Archive website to end users to use through the openlibrary.org website, where people can go and see, borrow and download all of these books. But if libraries want to go and add these to their own collections they are welcome to. So if there are 100,000 books that are in your particular language or your subject area that you would like to complete your collections with, let’s go and get each of these collections to be world class collections. We are not trying to build an empire, we are not trying to build one database, we want to build a library system that lives, thrives and also supports publishers and authors going forward.
So there’s the Internet Archive and the Open Library. Can you make the distinction any clearer for those that don’t currently understand it?
Archive.org is the website for all of our books, music, video and web pages. Openlibrary.org is a website really for books. The idea is to build an open catalog system: One webpage for every book ever published. And if it’s available online, then we point to it. If it’s available for sale in bookstores, we point to those. If it’s available in libraries for physical taking out, we point to that. Openlibrary.org is now used by over 100,000 people a day to go and access books. Also, people have integrated it in their catalogs. That means, when people come to the library catalogs that we’re partnered with, they search their library catalog and pull down a list and either they’ve added the MARC records from the set into their catalog or, better yet, they just go and call an API such that there’s a little graphic that says: “You can get this online.” or: “This is available to be borrowed for free.” or: “It’s available for borrowing but it’s currently checked out.” So, those little lights turn on so that people can get to the book electronically right there from the catalog. We find that integration very helpful because we don’t want everyone coming to Open Library. They know how to use libraries, they use your library and your library catalog. Let’s just get them the world’s books. Right there and then.
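The “little lights” described in this answer amount to mapping an availability state onto one of three messages. The sketch below shows one way a catalog might do that mapping. It is an illustration only: the `Availability` states, the `availability` record key and the `status_light` function are all hypothetical and do not reflect the actual Open Library API or its response format.

```python
from enum import Enum
from typing import Optional


class Availability(Enum):
    """Hypothetical availability states; not actual Open Library API values."""

    ONLINE = "online"
    BORROWABLE = "borrowable"
    CHECKED_OUT = "checked_out"
    UNAVAILABLE = "unavailable"


# Messages quoted in the interview, keyed by the hypothetical state.
STATUS_MESSAGES = {
    Availability.ONLINE: "You can get this online.",
    Availability.BORROWABLE: "This is available to be borrowed for free.",
    Availability.CHECKED_OUT: (
        "It's available for borrowing but it's currently checked out."
    ),
}


def status_light(record: dict) -> Optional[str]:
    """Return the message to show next to a catalog entry, or None for no light."""
    state = Availability(record.get("availability", "unavailable"))
    return STATUS_MESSAGES.get(state)


# A record as a catalog might hold it after asking the remote service
# (the dictionary shape is an assumption).
print(status_light({"title": "Example Book", "availability": "borrowable"}))
```

Keeping the lookup in the catalog's own code means patrons stay in the interface they already know, which is the point made in the answer.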
You mentioned Google Books. There are agreements between libraries and Google for digitizing materials. What are the benefits for libraries of choosing the Internet Archive over Google Books?
Google offers to cover the costs of the labor to do the digitization, but the libraries that participated ended up paying a very large amount of money just trying to prepare the books and get them to Google. Often they spent more working with Google than they would have with the Internet Archive, and in the latter case they do not have restrictions on their materials. So Google digitizes even public domain materials and then puts restrictions on their re-use. Everybody that says that it is open has got to mean something bizarre by ‘open’. You cannot go and take hundreds of these and move them to another server, it is against the law, and Google will stop libraries that try to make these available to people and move the books from place to place, so this is quite unfortunate.
Is Google reusing Internet Archive books in Google Books?
They are not, but Hathi Trust, the University of Michigan, is taking many of our books. Google is indexing them so that they are in their search engine.
People at OKCon naturally are supporters of Open Content and Open Knowledge, but many libraries don’t like their digitized material to be open. Even public domain books which are digitized are normally kept on the libraries’ websites and by contracts or even by copyfraud they say: “You cannot do whatever you want with this.” What would you say to libraries to really open up their digitized content?
There’s been a big change over the last four or five years on this. None of the libraries we work with – 200 libraries in a major way and now we have received books from over 500 libraries – have put any restrictions beyond the copyright of the original material. If there’s copyrighted material, then of course it has restrictions. But neither the libraries nor the Internet Archive are imposing new restrictions. You are right that there are some libraries that may not want to participate in this but this is what most libraries are doing – except for the Google libraries which are locking them up.
Do all the libraries provide high resolution scans or do some choose to only provide PDFs?
All the high resolution, the original images, plus the cropped and deskewed ones, all of these are publicly downloadable for free so that all analysis can be done. There’s now over one petabyte of book data that is available from the Internet Archive for free and active download. About 10 million books each month are being downloaded from the Internet Archive. We think this is quite good. We’d like to see even more use by building complete collections that go all the way to the current day. I’d say we are in pretty good shape on the public domain in the English language but other languages are still quite far behind. So we need to go and fill in better public domain collections. But I’d say a real rush right now is getting the newer books to our users and our patrons that are really turning to the internet for the information for their education.
To be more concrete. For libraries that are thinking about digitizing their own collections: what exactly do you offer?
Either write to email@example.com or myself: firstname.lastname@example.org. We will offer digitization services at 10 cents a page. And if there’s enough to do, we’ll do it inside your library. If the library wants to staff their own scanner and start with one, then we can provide that scanner as long as it doesn’t cost us anything. Somebody has to go and cover the transportation and the setup; these costs will be borne by the library. But then all of the backend processing or the OCR is provided for free. In the lending system it’s at least one book, a set of IP addresses, contact information and you’re done. No contracts, nothing.
Ok. So that means you offer the scanner technology for free, you offer the knowledge about how to use it for free. Only these additional costs for transportation have to be taken by the libraries. With your experience in digitization projects, every library should – and can – contact you and you explain the process to the people, you say what you’re doing, you give your opinion on how you would do it and then, of course, the library can decide?
Absolutely. We’ll provide all the help we can for free to help people through the process. We find that many people are confused and they’ve heard contradictory things.
Have you ever tried a kind of crowdsourcing approach for library users to digitize books themselves, placing a scanner in the library and letting the users do it? Or does it take too much education for handling the scanners?
We find that it actually is quite difficult to go and digitize a book well, unfortunately. Though we have just hired Dan Reetz, who is the head of a Do-it-yourself bookscanner group. And we’re now starting to make Do-it-Yourself bookscanners that people can make themselves and the software automatically uploads to the Internet Archive. So we hope that there’s a great deal of uptake from smaller organizations or from individuals. In Japan, for instance, many people scan books and we receive those. People upload maybe one or two hundred books to us a day. So, people are uploading things often from the Arab world. They are digitizing on their own and we want those as well. So, we can work with people if they have PDFs of scanned books or just sets of images from either past projects or current projects or if they want to get involved. There are many different ways we would love to help.
Does the Internet Archive collaborate with Europeana in some way, for example for making material from the Internet Archive available in Europeana?
We’ve met with some of the people from Europeana and I believe they have downloaded all of our metadata. All of our metadata is available for free. I’m encouraged by some of what I’ve seen from Europeana towards being a search engine. To the extent that they may grow into being the library for Europe I think this is not a good idea. I like to see many libraries, many publishers, many booksellers, many, many authors and everyone being a reader. What we want are many winners, we don’t want just one. So, Europeana to the extent that it’s just a metadata service, I think is a good project.
You just mentioned the metadata. So everything that you have, not only the digitized versions of the books but also the enrichments, the metadata about it, the OCR result, for example, everything is free and open. So, if I would like to, I could take the whole stuff and put it on my own server, re-publish it in the way that I want?
Yes, absolutely. All the metadata, the OCR files, the image files are all available. There are a lot of challenges in maintaining the files over time and we are committed to doing this, but we don’t want to be the only one. So the University of Toronto has taken back all of the 300,000 books that were digitized from their collections to put them on their servers and they’re now starting to look at other collections from other libraries to add those. As we move to digital libraries we don’t necessarily just need digital versions of the physical books we own, we want digital books that are of interest to our patrons. Yes, it is all available and it’s forming a new standard for openness.
The MARC records you mentioned, they are of course also available. So it makes sense for a library to include not only their own books but every book in the Internet Archive in their own catalog. Because, in fact, it is available to all the patrons. So, you could think of it as a possession of every library in the world. Is that right?
Yes, we see this as building the ecology of libraries. The really valuable thing about libraries, – yes: there are great collections – but the real value are the librarians, the experts, the people that know about the materials and can bring those to people in new and different ways. That’s the real value to our library system. So, let’s make sure, as we go through this digitization wave we don’t end up with just one library that kills off all of the others, which is a danger.
Thank you for the interview.
Kai Eckert is a computer scientist and vice head of the IT department of the Mannheim University Library. He coordinates the linked open data activities and developed the linked data service of the library. He has given various presentations, both national and international, about linked data and open data.
Adrian Pohl has been working at the Cologne-based North Rhine-Westphalian Library Service Center (hbz) since 2008. His main focuses are Open Data and Linked Data and their conceptual, theoretical and legal implications. Since June 2010 Adrian has been coordinating the Open Knowledge Foundation’s Working Group on Open Bibliographic Data.
Acknowledgements: The introductory questions were taken from an earlier interview on the same day, conducted by Lucy Chambers.