
Version Variation Visualisation

Tom Cheesman - February 8, 2013 in Featured Project, Open Content, Public Domain, WG Linguistics

In 2010, I had a long paper about the history of German translations of Othello rejected by a prestigious journal. The reviewer wrote: “The Shakespeare Industry doesn’t need more information about world Shakespeare. We need navigational aids.”

About the same time, David Berry turned me on to Digital Humanities. I got a team together (credits) and we’ve built some cool new tools.

All culturally important works are translated over and over again. The differences are interesting. Different versions of Othello reflect changing, contested ideas about race, gender and sexuality, political power, and so on, over the centuries, right up to the present day. Hence any one translation is just one snapshot from its local place and moment in time, just one interpretation, and what’s interesting and productive is the variation, the diversity.

But with print culture tools, you need a superhuman memory, a huge desk and ideally several assistants, to leaf backwards and forwards in all the copies, so you can compare and contrast. And when you present your findings, the minutiae of differences can be boring, and your findings can’t be verified. How do you know I haven’t just picked quotations that support my argument?

But with digital tools, multiple translations become material people can easily work and play with, research and create with, and we can begin to use them in totally new ways.

Recent work

We’ve had funding from UK research councils and Swansea University to digitize 37 German versions of Othello (1766-2010) and build these prototype tools. There you can try out our purpose-built database and tools for freely segmenting and aligning multiple versions; our timemap of versions; our parallel text navigation tool, which uses aligned segment attributes for navigation; and, most innovative of all, the tool we call ‘Eddy and Viv’. This lets you compare all the different translations of any segment (with help from machine translation), and it also lets you read the whole translated text in a new way, through the variety of translations. You don’t need to know the translating language.

This is a radical new idea (more details on our platform). Eddy and Viv are algorithms: Eddy calculates how much each translation of each segment differs in wording from others, then Viv maps the variation in the results of that analysis back onto the translated text segments.
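
To make the idea concrete, here is a minimal Python sketch of an Eddy-like score and a Viv-like mapping. Everything in it is invented for illustration: the simple bag-of-words distance, the function names and the toy German segments are assumptions, not the project’s actual data or formulas, which are documented on our platform.

```python
# A rough sketch of the Eddy/Viv idea -- NOT the project's actual implementation.
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lower-cased word counts for one translation of a segment."""
    return Counter(text.lower().split())


def distance(a, b):
    """Dice-style dissimilarity between two bags of words (0 = identical, 1 = disjoint)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[w], b[w]) for w in set(a) | set(b))
    return 1 - 2 * shared / total


def eddy(translations):
    """Eddy-like score: mean pairwise distance among all translations of one
    segment. Higher means the translators diverge more in their wording."""
    bags = [bag_of_words(t) for t in translations]
    pairs = list(combinations(bags, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def viv(segments):
    """Viv-like map: attach each segment's Eddy score to its segment id, so the
    source text can be shaded by how much its translations vary."""
    return [(seg_id, eddy(translations)) for seg_id, translations in segments]


# Toy data: two segments, each with three invented German renderings.
segments = [
    ("1.3.1", ["Niemals erzähl mir das.", "Sag mir das nie.", "Erzähl mir nie davon."]),
    ("1.3.2", ["Still!", "Ruhe!", "Still!"]),
]
for seg_id, score in viv(segments):
    print(seg_id, round(score, 3))
```

The point is simply that once segments are aligned, a per-segment variation score can be computed mechanically and mapped back onto the source text for navigation.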

This means you can now read Shakespeare in English, while seeing how much all the translators disagree about how to interpret each speech or line, or even each word. It’s a new way of reading a literary work through translators’ collective eyes, identifying hotspots of variation. And you don’t have to be a linguist.

Future plans and possible application to collections of public domain texts

I am a linguist, so I’m interested in finding new ways to overcome language barriers, but equally I’m interested in getting people interested in learning languages. Eddy and Viv have that double effect. So these are not just research tools: we want to make a cultural difference.

We’re applying for further funding. We envisage an open library of versions of all sorts of works, and a toolsuite supporting experimental and collaborative approaches to understanding the differences, using visualizations for navigation, exploration and comparison, and creating presentations for research and education.

The tools will work with any languages, any kinds of text. The scope is vast, from fairy tales to philosophical classics. You can also investigate versions in just one language – say, different editions of an encyclopedia, or different versions of a song lyric. It should be possible to push the approach beyond text, to audio and video, too.

Shakespeare is a good starting point, because the translations are so numerous, increasing all the time, and the differences are so intriguing. But a few people have started testing our tools on other materials, such as Jan Rybicki with Polish translations of Joseph Conrad’s work. If we can demonstrate the value, and simplify the tasks involved, people will start on other ‘great works’ – Aristotle, Buddhist scripture, Confucius, Dante (as in Caroline Bergvall’s amazing sound work ‘Via’), Dostoyevski, Dumas…

Many translations of transculturally important works are in the public domain. Most are not, yet. So copyright is a key issue for us. We hope that as the project grows, more copyright owners will be willing to grant more access. And of course we support reducing copyright restrictions.

Tim Hutchings, who works on digital scriptures, asked me recently: “Would it be possible to create a platform that allowed non-linguist readers to appreciate the differences in tone and meaning between versions in different languages? … without needing to be fluent in all of those languages.” – Why not, with imaginative combinations of various digital tools for machine translation, linguistic analysis, sentiment analysis, visualization and not least: connecting people.

Opening up linguistic data at the American National Corpus

Guest - January 15, 2011 in External, Featured Project, Open Data, Open Knowledge Definition, Open Knowledge Foundation, Open/Closed, Releases, WG Linguistics, Working Groups

The following guest post is from Nancy Ide, Professor of Computer Science at Vassar College, Technical Director of the American National Corpus project and member of the Open Knowledge Foundation’s Working Group on Open Linguistic Data.

The American National Corpus (ANC) project is creating a collection of texts produced by native speakers of American English since 1990. Its goal is to provide at least 100 million words of contemporary language data covering a broad and representative range of genres, including but not limited to fiction, non-fiction, technical writing, newspaper writing, and transcripts of various kinds of spoken communication, as well as new genres (blogs, tweets, etc.). The project, which began in 1998, was originally motivated by the needs of three major groups: linguists, who use corpus data to study language use and change; dictionary publishers, who use large corpora to identify new vocabulary and provide examples; and computational linguists, who need very large corpora to develop robust language models—that is, to extract statistics concerning patterns of lexical, syntactic, and semantic usage—that drive natural language understanding applications such as machine translation and information search and retrieval (à la Google).

Corpora for computational linguistics and corpus linguistics research are typically annotated for linguistic features, so that, for example, every word is tagged with its part of speech, every sentence is annotated for syntactic structure, etc. To be of use to the research and development community, it should be possible to re-distribute the corpus with its annotations so that others can reuse and/or enhance it, if only to replicate results, as is the norm for most scientific research. The redistribution requirement has proved to be a major roadblock to creating large linguistically annotated corpora, since most language data, even on the web, is not freely redistributable. As a result, the large corpora most often used for computational linguistics research on English are the Wall Street Journal corpus, consisting of material from that publication produced in the early ‘90s, and the British National Corpus (BNC), which contains British English of varied genres produced prior to 1994, when it was first released. Neither corpus is ideal: the first because of its limited range of genres, and the second because it contains only British English and is annotated for part of speech only. In addition, neither reflects current usage (for example, words like “browser” and “google” do not appear).
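
For readers unfamiliar with what such annotation looks like in practice, here is a small illustration using the NLTK toolkit; it is not an ANC tool, the example sentence is invented, and the names of the models to download vary slightly between NLTK versions.

```python
# Illustration only: part-of-speech annotation with NLTK, not an ANC tool.
import nltk

# Model names may differ slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The browser crashed while I was googling the corpus."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # attach Penn Treebank part-of-speech tags
print(tagged)
# e.g. [('The', 'DT'), ('browser', 'NN'), ('crashed', 'VBD'), ...]
```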

The ANC was established to remedy the lack of large, contemporary, richly annotated American English corpora representing a wide range of genres. In the original plan, the project would follow the BNC development model: a consortium of dictionary publishers would provide both the initial funding and the data to include in the corpus, which would be distributed by the Linguistic Data Consortium (LDC) under a set of licenses reflecting the restrictions (or lack thereof) imposed by these publisher-donors. These publishers would get the corpus and its linguistic annotations for free and could use it as they wished to develop their products; commercial users who had not contributed either money or data would have to pay a whopping $40,000 to the LDC for the privilege of using the ANC for commercial purposes. For research-only use, the corpus would be available for a nominal fee.

The first and second releases (a total of 22 million words) of the ANC were distributed through LDC from 2003 onward under the conditions described above. However, shortly after the second ANC release in 2005, we determined that the license for 15 of the 22 million words in the ANC did not restrict its use in any way—it could be redistributed and used for any purpose, including commercial. We had already begun to distribute additional annotations (which are separate from and indexed into the corpus itself) on our web site, and it occurred to us that we could freely distribute this unrestricted 15 million words as well. This gave birth to the Open ANC (OANC), which was immediately embraced by the computational linguistics community. As a result, we decided that from this point on, additions to the ANC would include only data that is free of restrictions concerning redistribution and commercial use. Our overall distribution model is to enable anyone to download our data and annotations for research or commercial development, asking (but not requiring) that they give back any additional annotations or derived data they produce that might be useful for others, which we will in turn make openly available.

Unfortunately, the ANC has not been funded since 2005, and only a few of the consortium publishers provided us with texts for the ANC. However, we have continued to gather millions of words of data from the web that we hope to be able to add to the OANC in the near future. We search for current American English language data that is either clearly identified as public domain or licensed with a Creative Commons “attribution” license. We stay away from “share-alike” licenses because of the potential restriction for commercial use: a commercial enterprise would not be able to release a product incorporating share-alike data or resources derived from it under the same conditions. It is here that our definition of “open” differs from the Open Knowledge Definition—until we can be sure that we are wrong, we regard the viral nature of the share-alike restriction as prohibitive for some uses, and therefore data with this restriction are not completely “open” for our purposes.

Unfortunately, because we don’t use “share-alike” data, the web texts we can put in the OANC are severely limited. A post on this blog by Jordan Hatcher a little while ago mentioned that the popularity of Creative Commons licenses has muddied the waters, and we at the ANC project agree, although for different reasons. We notice that many people—particularly producers of the kinds of data we most want to get our hands on, such as fiction and other creative writing—tend to automatically slap at least a “share-alike” and often also a “non-commercial” CC license on their web-distributed texts. At the same time, we have some evidence that when asked, many of these authors have no objection to our including their texts in the OANC, despite the lack of similar restrictions. It is not entirely clear how the SA and NC categories became an effective default standard license, but my guess is that many people feel that SA and NC are the “right” and “responsible” things to do for the public good. This, in turn, may result from the fact that the first widely-used licenses, such as the GNU Public License, were intended for use with software. In this context, share-alike and non-commercial make some sense: sharing seems clearly to be the civic-minded thing to do, and no one wants to provide software for free that others could subsequently exploit for a profit. But for web texts, these criteria may make less sense. The market value of a text that one puts on the web for free use (e.g., blogs, vs. works published via traditional means and/or available through electronic libraries such as Amazon) is potentially very small, compared to that of a software product that provides some functionality that a large number of people would be willing to pay for. Because of this fact, use of web texts in a corpus like the ANC might qualify as Fair Use—but so far, we have not had the courage to test that theory.

We would really like to see something like the Open Data Commons Attribution License (ODC-BY) become the license that authors automatically reach for when they publish language data on the web, in the way the CC BY-NC-SA license is now. ODC-BY was developed primarily for databases, but it would not take much to apply it to language data, if it has not been done already (see, e.g., the Definition of Free Cultural Works). Either that, or we determine whether, because of the lack of monetary value, Fair Use could in fact apply to whole texts (see, for example, Bill Graham Archives v. Dorling Kindersley Ltd., 448 F.3d 605 (2d Cir. 2006), concerning Fair Use applied to entire works).

In the meantime, we continue to collect texts from the web that are clearly usable for our purposes. We also have a web page where you can contribute your writing of any kind (fiction, blog, poetry, essay, letters, email) to the OANC, with a sign-off on rights. So far, we have managed to collect mostly college essays, which college seniors seem quite willing to contribute for the benefit of science upon graduation. We welcome contributions of texts (the page explains who counts as a native speaker of American English), as well as input on using web materials in our corpus.

WordNet: A Large Lexical Database for English

Guest - June 30, 2010 in External, Featured Project, OKF Projects, Open Data, Open Knowledge Foundation, WG Linguistics

The following guest post is from Christiane Fellbaum at Princeton University who is working on a statistical picture of how words are related to each other as part of the WordNet project.

Information retrieval, document summarization and machine translation are among the many applications that require automatic processing of natural language. Human language is amazingly complex and making it “understandable” to computers is a significant challenge. While automatic systems can segment text into words (or tokens), strip them of their inflectional endings, identify their part of speech and analyze their syntactic (grammatical) function fairly accurately, they cannot determine the context-appropriate meaning of a polysemous word. Somewhat perversely, the words we use most often also have the greatest number of different senses (try to think of the many meanings of “check” or “case” for example).

WordNet organizes over 150,000 different English words into a huge, multi-dimensional semantic network. A word is linked to many other words to which it is meaningfully related. Thus, one sense of “check” is related to “chess,” another to “bank cheque” and a third to “houndstooth”. Based on the assumption that words in a context are similar in meaning to one another, a system can simply navigate along the arcs connecting WordNet’s words and measure how close or distant a given word is from another one in a text. Thus, if “check” occurs in the context of “draft,” WordNet will suggest that the appropriate sense of “check” here is “bank cheque”, as there are only a few arcs connecting that sense of “check” with “draft”, while there are more (or none at all) connecting “draft” with the chess or textile-pattern senses of “check.”
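
Princeton WordNet can also be explored programmatically, for instance through NLTK’s corpus interface. The snippet below is only a rough illustration of this path-based idea, using NLTK’s path_similarity measure on noun senses; real disambiguation systems typically combine more refined measures with wider context.

```python
# Illustration: exploring WordNet senses and path-based relatedness via NLTK.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# "check" is highly polysemous: list a few of its senses.
for synset in wn.synsets("check")[:5]:
    print(synset.name(), "-", synset.definition())

# Which noun sense of "check" sits closest to "draft" in the network?
draft_senses = wn.synsets("draft", pos=wn.NOUN)
for check in wn.synsets("check", pos=wn.NOUN):
    best = max(check.path_similarity(d) or 0.0 for d in draft_senses)
    print(f"{check.name():25s} closeness to 'draft': {best:.3f}")
```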


WordNet is a major tool for word sense disambiguation in many Natural Language Processing applications. A number of terminological databases build on WordNet as a general lexicon to which domain-specific terminology can be added. WordNet is furthermore used for research in linguistics and psycholinguistics, for language pedagogy (English as a First and Second Language) and it has been integrated into many on-line dictionaries, including Google’s “define” function.

Being freely and publicly available, WordNet is queried tens of thousands of times daily and the database is downloaded some 6,000 times every month from the Princeton website.

Wordnet image from W3

Work on WordNet continues with support from the U.S. National Science Foundation. We are currently annotating selected words in the American National Corpus, a freely available text collection of modern American English, with WordNet senses. The annotated corpus will illustrate the use of specific word meanings for study and applications by both human users and computers, which can “learn” from examples to better identify context-appropriate word meanings. Another goal is to increase the internal connectivity of the semantic network by collecting human ratings of semantic similarity among words. The similarity ratings, once integrated into WordNet, will create many more connections among words and senses and improve automatic sense discrimination and identification.
