Always Already Computational is a project bringing together a variety of perspectives to develop “a strategic approach to developing, describing, providing access to, and encouraging reuse of collections that support computationally-driven research and teaching” in subject areas relating to library and museum collections. This post is adapted from my position statement for the initial workshop. You can find out more about the project at https://collectionsasdata.github.io.

Earlier this year, I spent two and a half days at a workshop on the beautiful University of California, Santa Barbara campus, speaking with librarians, developers, and museum and library collection managers about data. Attendees represented a variety of respected cultural institutions, including the New York Public Library, the British Library, and the Internet Archive.

Our task was to build a collective sense of what it means to treat library and museum “collections”—the (increasingly) digitized catalogs of their holdings—as data for analysis, art, research, and other forms of re-use. We gathered use cases and user stories to start the conversation about how best to publish collections for these purposes. Look for further outputs on the project website: https://collectionsasdata.github.io. For the moment, here are my thoughts on the experience and how it relates to work at Open Knowledge International, specifically the Frictionless Data project.

Always Already Computational

Open Access to (meta)Data

The event organizers—Thomas Padilla (University of California Santa Barbara), Laurie Allen (University of Pennsylvania), Stewart Varner (University of Pennsylvania), Sarah Potvin (Texas A&M University), Elizabeth Russey Roke (Emory University), Hannah Frost (Stanford University)—took an expansive view of who should attend. I was honored and excited to join, but decidedly new to Digital Humanities (DH) and related fields. The event served as an excellent introduction, and I now understand DH to be a set of approaches to interrogating recorded history and culture with the power of our current tools for data analysis, visualization, and machine learning. As part of the Frictionless Data project at Open Knowledge International, we are building apps, libraries, and specifications that support the basic transport and description of datasets to aid in this kind of data-driven discovery. We are trialling this approach across a variety of fields, and we are interested in determining the extent to which it can improve research using library and museum collection data.

What is library and museum collection data? Libraries and museums hold physical objects that are often (although not always) shared for public view on the stacks or during exhibits. Access to information (metadata) about these objects—and the sort of cultural and historical research that depends on such access—has naturally been restricted by technology, geography, and time. Digitizing the detailed catalogues of the objects libraries and museums hold has surely lowered the overhead of day-to-day administration of those objects, but it has also provided a secondary public benefit: sharing this same metadata on the web under a permissive license allows a greater variety of users—researchers, students of history, and others—to freely interrogate our cultural heritage in a manner of their choosing.

There are many ways to share data on the web, of course, but they are not all equal. What is needed is a low-impact, open, standards-based set of approaches to sharing collections data that accommodates a diversity of potential use cases. To answer this need, many museums are publishing their collection data online, with permissive licensing, through GitHub: the Tate galleries in the UK, the Cooper Hewitt, Smithsonian Design Museum, and The Metropolitan Museum of Art in New York have all released their collection data in CSV (and JSON) format on this popular platform normally used for sharing code. See A Nerd’s Guide To The 2,229 Paintings At MoMA and An Excavation Of One Of The World’s Greatest Art Collections, both published by FiveThirtyEight, for examples of the kind of exploratory research enabled by sharing museum collection data in bulk, in a straightforward, user-friendly way. What exactly did they do, and what else may be needed?
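To make that bulk, user-friendly publishing concrete, here is a minimal sketch of the workflow in Python. It assumes the pandas library and that MoMA’s collection repository still publishes an Artworks.csv file with a Department column at the path below; treat the URL and column name as illustrative assumptions rather than guarantees.

```python
# A minimal sketch of the "bulk download" workflow for a museum collection
# published as CSV on GitHub. The URL and column name are assumptions based
# on the MoMA collection repository and may have changed.
import pandas as pd

MOMA_ARTWORKS_CSV = (
    "https://raw.githubusercontent.com/MuseumofModernArt/collection/"
    "master/Artworks.csv"
)

# Read the entire collection dataset into a dataframe in one step.
artworks = pd.read_csv(MOMA_ARTWORKS_CSV, low_memory=False)

# Ask exploratory questions directly, e.g. which departments hold the
# most catalogued works.
print(artworks["Department"].value_counts().head(10))
```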

Packages of Museum Data

Our current funding from the Sloan Foundation enables us to focus on this researcher use case for consuming data. Across fields, the research process is often messy, and researchers, even when they are asking the right questions, possess varying levels of skill in working with datasets to answer them. As I wrote in my position statement:

Such data, released on the Internet under open licenses, can provide an opportunity for researchers to create a new lens onto our cultural and artistic history by sparking imaginative re-use and analysis.  For organizations like museums and libraries that serve the public interest, it is important that data are provided in ways that enable the maximum number of users to easily process it.  Unfortunately, there are not always clear standards for publishing such data, and the diversity of publishing options can cause unnecessary overhead when researchers are not trained in data access/cleaning techniques.

My experience at this event, and some research beforehand, suggested that there is a spectrum of data release approaches, ranging from a basic data “dump”, as conducted by the museums referenced above, to more advanced, though higher-investment, approaches such as publishing data as an online service with a public “API” (Application Programming Interface). A public API can provide a consistent interface to collection metadata, as well as the ability to request only the records needed, but it comes at the cost of having the nature of the analysis somewhat preordained by its design. In contrast, with the data dump approach, an entire dataset, or a coherent chunk of it, can be easier for some users to access and load directly into a tool like R (see this UK Government Digital Service post on the challenges of each approach) without advanced programming. As a format for this bulk download, CSV is the best choice, as MoMA reflected when releasing its collection data online:

CSV is not just the easiest way to start but probably the most accessible format for a broad audience of researchers, artists, and designers.  

This, of course, comes at the cost of a less consistent interface to the data, especially in the case of the notoriously underspecified CSV format. The README file will typically go into some narrative detail about how best to use the dataset and flag some expected “gotchas” (e.g. “this UTF-8 file may not work well with Excel on a Mac”). It might also list the columns in a tabular data file stored in the dataset and the expected types and formats for values in each column (e.g. the date_acquired column should, hopefully, contain dates in one consistent international format). This information is critical for actually using the data. The automated export process that generates the public collection dataset from the museum’s internal database may try to ensure that the data matches these expectations, but bugs exist, and small errors can go unnoticed in the process.
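In practice, that means each researcher re-implements the README by hand. The sketch below shows the kind of ad hoc check a user might write with nothing but the Python standard library; the file name, column names, and the ISO 8601 date format are hypothetical stand-ins for whatever a particular README promises.

```python
# A rough sketch of the manual checks a researcher ends up writing when the
# only documentation is a README. The file name, column names, and expected
# date format below are hypothetical examples.
import csv
from datetime import datetime

EXPECTED_COLUMNS = {"id", "title", "date_acquired"}

with open("collection.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise SystemExit(f"Missing expected columns: {missing}")

    bad_dates = 0
    for row in reader:
        try:
            # The README promised ISO 8601 dates (YYYY-MM-DD); verify it.
            datetime.strptime(row["date_acquired"], "%Y-%m-%d")
        except ValueError:
            bad_dates += 1

print(f"{bad_dates} rows with unexpected date_acquired values")
```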

The Data Package descriptor (described in detail on our specifications site), used in conjunction with Data Package-aware tooling, is meant to restore some of the consistent interface provided by an API by embedding this “schema” information alongside the data. This allows the user or the publisher to check that the data conforms to expectations without requiring modification of the data itself: a “packaged” CSV can still be loaded into Excel as-is (though without the benefit of the type checking enabled by the Data Package descriptor). The Carnegie Museum of Art, in releasing its collection data, followed the examples set by the Tate, the Met, MoMA, and Cooper Hewitt described above, but opted to also include a Data Package descriptor file to help facilitate online validation of the dataset through tools such as Good Tables. As tools come online for editing, validating, and transforming Data Packages, users of this dataset should be able to benefit from those, too: http://frictionlessdata.io/tools/.
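For a sense of what that embedded schema looks like, here is an illustrative datapackage.json descriptor written out from Python. It is not the actual CMOA descriptor; the resource path, field names, and license are assumptions chosen to mirror the README example above.

```python
# Sketch of a minimal Data Package descriptor, written out as
# datapackage.json next to the CSV it describes. The resource path, field
# names, and license are illustrative, not the real CMOA metadata.
import json

descriptor = {
    "name": "example-museum-collection",
    "title": "Example Museum Collection",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "collection",
            "path": "collection.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "id", "type": "integer"},
                    {"name": "title", "type": "string"},
                    # In Table Schema, a "date" field defaults to ISO 8601,
                    # turning the README's informal promise into something
                    # machine-checkable.
                    {"name": "date_acquired", "type": "date"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```

With a descriptor like this sitting alongside the CSV, validation tools such as Good Tables can check every row against the declared types before a researcher ever opens the data.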

We are a partner in the Always Already Computational: Collections as Data project, and as part of this work we are collaborating with the Carnegie Museum of Art to provide a more detailed look at the process that went into the creation of the CMOA dataset, as well as to sketch potential ways in which the Data Package might help enable re-use of this data. In the meantime, check out our other case studies on the use of Data Packages in fields as diverse as ecology, cell migration, and energy data:

http://frictionlessdata.io/case-studies/

Also, pay your local museum or library a visit.


Dan Fowler contributes to various projects at Open Knowledge and currently serves as developer advocate helping to connect a community of makers and doers around open data with the technology work conducted by Open Knowledge International. He has a Master’s degree in Information and Communication Technologies for Development from Royal Holloway, University of London and a Bachelor’s degree in Psychology from Princeton University. Between degrees, he worked as a sysadmin for an investment bank in New York.