The following guest post is from Nancy Ide, Professor of Computer Science at Vassar College, Technical Director of the American National Corpus project and member of the Open Knowledge Foundation’s Working Group on Open Linguistic Data.

The American National Corpus (ANC) project is creating a collection of texts produced by native speakers of American English since 1990. Its goal is to provide at least 100 million words of contemporary language data covering a broad and representative range of genres, including but not limited to fiction, non-fiction, technical writing, newspaper, spoken transcripts of various verbal communications, as well as new genres (blogs, tweets, etc.). The project, which began in 1998, was originally motivated by three major groups: linguists, who use corpus data to study language use and change; dictionary publishers, who use large corpora to identify new vocabulary and provide examples; and computational linguists, who need very large corpora to develop robust language models—that is, to extract statistics concerning patterns of lexical, syntactic, and semantic usage—that drive natural language understanding applications such as machine translation and information search and retrieval (à la Google).

Corpora for computational linguistics and corpus linguistics research are typically annotated for linguistic features, so that, for example, every word is tagged with its part of speech, every sentence is annotated for syntactic structure, etc. To be of use to the research and development community, it should be possible to re-distribute the corpus with its annotations so that others can reuse and/or enhance it, if only to replicate results as is the norm for most scientific research. The redistribution requirement has proved to be a major roadblock to creating large linguistically annotated corpora, since most language data, even on the web, is not freely redistributable. As a result, the large corpora most often used for computational linguistics research on English are the Wall Street Journal corpus, consisting of material from that publication produced in the early ‘90s, and the British National Corpus (BNC), which contains varied genre British English produced prior to 1994, when it was first released. Neither corpus is ideal, the first because of the limited genre, and the second because it includes strictly British English and is annotated for part of speech only. In addition, neither reflects current usage (for example, words like “browser” and “google” do not appear).

The ANC was established to remedy the lack of large, contemporary, richly annotated American English corpora representing a wide range of genres. In the original plan, the project would follow the BNC development model: a consortium of dictionary publishers would provide both the initial funding and the data to include in the corpus, which would be distributed by the Linguistic Data Consortium (LDC) under a set of licenses reflecting the restrictions (or lack thereof) imposed by these publisher-donors. These publishers would get the corpus and its linguistic annotations for free and could use it as they wished to develop their products; commercial users who had not contributed either money or data would have to pay a whopping $40,000 to the LDC for the privilege of using the ANC for commercial purposes. The corpus would be available for research use only for a nominal fee.

The first and second releases (a total of 22 million words) of the ANC were distributed through LDC from 2003 onward under the conditions described above. However, shortly after the second ANC release in 2005, we determined that the license for 15 of the 22 million words in the ANC did not restrict its use in any way—it could be redistributed and used for any purpose, including commercial. We had already begun to distribute additional annotations (which are separate from and indexed into the corpus itself) on our web site, and it occurred to us that we could freely distribute this unrestricted 15 million words as well. This gave birth to the Open ANC (OANC), which was immediately embraced by the computational linguistics community. As a result, we decided that from this point on, additions to the ANC would include only data that is free of restrictions concerning redistribution and commercial use. Our overall distribution model is to enable anyone to download our data and annotations for research or commercial development, asking (but not requiring) that they give back any additional annotations or derived data they produce that might be useful for others, which we will in turn make openly available.

Unfortunately, the ANC has not been funded since 2005, and only a few of the consortium publishers provided us with texts for the ANC. However, we have continued to gather millions of words of data from the web that we hope to be able to add to the OANC in the near future. We search for current American English language data that is either clearly identified as public domain or licensed with a Creative Commons “attribution” license. We stay away from “share-alike” licenses because of the potential restriction for commercial use: a commercial enterprise would not be able to release a product incorporating share-alike data or resources derived from it under the same conditions. It is here that our definition of “open” differs from the Open Knowledge Definition—until we can be sure that we are wrong, we regard the viral nature of the share-alike restriction as prohibitive for some uses, and therefore data with this restriction are not completely “open” for our purposes.

Unfortunately, because we don’t use “share-alike” data, the web texts we can put in the OANC are severely limited. A post on this blog by Jordan Hatcher a little while ago mentioned that the popularity of Creative Commons licenses has muddied the waters, and we at the ANC project agree, although for different reasons. We notice that many people—particularly producers of the kinds of data we most want to get our hands on, such as fiction and other creative writing—tend to automatically slap at least a “share-alike” and often also a “non-commercial” CC license on their web-distributed texts. At the same time, we have some evidence that when asked, many of these authors have no objection to our including their texts in the OANC, despite the lack of similar restrictions. It is not entirely clear how the SA and NC categories became an effective default standard license, but my guess is that many people feel that SA and NC are the “right” and “responsible” things to do for the public good. This, in turn, may result from the fact that the first widely-used licenses, such as the GNU Public License, were intended for use with software. In this context, share-alike and non-commercial make some sense: sharing seems clearly to be the civic-minded thing to do, and no one wants to provide software for free that others could subsequently exploit for a profit. But for web texts, these criteria may make less sense. The market value of a text that one puts on the web for free use (e.g., blogs, vs. works published via traditional means and/or available through electronic libraries such as Amazon) is potentially very small, compared to that of a software product that provides some functionality that a large number of people would be willing to pay for. Because of this fact, use of web texts in a corpus like the ANC might qualify as Fair Use—but so far, we have not had the courage to test that theory.

We would really like to see something like Open Data Commons Attribution License (ODC-BY) become the license that authors automatically reach for when they publish language data on the web, in the way the CC-BY-SA-NC license is now. ODC-BY was developed primarily for databases, but it would not take much to apply it to language data, if it has not been done already (see, e.g., the Definition of Free Cultural Works). Either that, or we determine if in fact, because of the lack of monetary value, Fair Use could apply to whole texts (see for example, Bill Graham Archives v. Dorling Kindersley Ltd., 448 F. 3d 605 – Court of Appeals, 2nd Circuit 2006 concerning Fair Use applied to entire works).

In the meantime, we continue to collect texts from the web that are clearly usable for our purposes. We also have a web page set up where one can contribute their writing of any kind (fiction, blog, poetry, essay, letters, email) – with a sign off on rights – to the OANC. So far, we have managed to collect mostly college essays, which college seniors seem quite willing to contribute for the benefit of science upon graduation. We welcome contributions of texts (check the page to see if you are a native speaker of American English), as well as input on using web materials in our corpus.

+ posts

This post is by a guest poster. If you would like to write something for the Open Knowledge Foundation blog, please see the submissions page.

3 thoughts on “Opening up linguistic data at the American National Corpus”

  1. Great post Nancy — I can hardly thing of a better exemplar for open data that the king of annotated linguistic corpora you are creating.

    In fact, this example strikes as having interesting similarities with what goes on bioinformatics: there people are ‘annotating’ resources like the the genome and have a similar requirements for openness to be able to share easily.

  2. a commercial enterprise would not be able to release a product incorporating share-alike data or resources derived from it under the same conditions.

    That’s not true. A commercial enterprise could release a product incorporating copyleft/ShareAlike works, including their modifications under the same terms. Happens all the time. The enterprise might not want to, and it’s understandable why if you desire maximum immediate use you’d want to avoid copyleft, but “would not be able to” is misleading. Furthermore, copyleft requirements are typically only triggered when a derivative work is created. It’s perfectly possible for a proprietary software program to ship with copyleft content/data, and vice versa.

    We would really like to see something like Open Data Commons Attribution License (ODC-BY) become the license that authors automatically reach for when they publish language data on the web,

    Could you give an example of language data ODC-BY would be ideal for? Note that the license covers the database, not the database contents. I imagine language data is varied, but the first thing I think of regarding IP problems would be copyright on text one would one to include in a corpus — as you say later in the paragraph (re fair use (go fair use!)) “whole texts”.

    in the way the CC-BY-SA-NC license is now.

    I was not aware CC-BY-NC-SA is now widely used for language data. Please share examples.

    ODC-BY was developed primarily for databases, but it would not take much to apply it to language data, if it has not been done already (see, e.g., the Definition of Free Cultural Works)

    I’m scratching my head over what the above parenthetical means.

    Finally, the link in the last paragraph should be to (lowercase ‘C’).

Comments are closed.