The following guest post is from Christiane Fellbaum at Princeton University, who is working on a statistical picture of how words are related to each other as part of the WordNet project.
Information retrieval, document summarization and machine translation are among the many applications that require automatic processing of natural language. Human language is amazingly complex and making it “understandable” to computers is a significant challenge. While automatic systems can segment text into words (or tokens), strip them of their inflectional endings, identify their part of speech and analyze their syntactic (grammatical) function fairly accurately, they cannot determine the context-appropriate meaning of a polysemous word. Somewhat perversely, the words we use most often also have the greatest number of different senses (try to think of the many meanings of “check” or “case” for example).
WordNet organizes over 150 000 different English words into a huge, multi-dimensional semantic network. A word is linked to many other words to which it is meaningfully related. Thus, one sense of “check” is related to “chess,” another to “bank cheque” and a third to “houndstooth.” Based on the assumption that words in a context are similar in meaning to one another, a system can simply navigate along the arcs connecting WordNet’s words and measure how close a given word is to another one in a text. Thus, if “check” occurs in the context of “draft,” WordNet will suggest that the appropriate sense of “check” here is “bank cheque,” as there are only a few arcs connecting that sense of “check” with “draft,” while there are more (or none at all) connecting “draft” with the chess or textile pattern senses of “check.”
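The arc-counting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny graph below is invented for the example and is not taken from the WordNet database, and the helper names (build_graph, arc_distance, closest_sense) are hypothetical. A breadth-first search counts the fewest arcs between each sense of “check” and the context word “draft,” and the sense with the shortest path is chosen.

```python
from collections import deque

# Invented toy arcs for illustration only; these are NOT real WordNet links.
TOY_EDGES = [
    ("check (bank cheque)", "draft"),
    ("check (bank cheque)", "bank"),
    ("draft", "bank"),
    ("draft", "paper"),
    ("paper", "textile"),
    ("textile", "houndstooth"),
    ("houndstooth", "check (pattern)"),
    ("check (chess)", "chess"),
    ("chess", "game"),
]


def build_graph(edges):
    """Turn a list of undirected arcs into an adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def arc_distance(graph, start, goal):
    """Fewest arcs between start and goal, or None if no path exists."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_word):
    """Pick the sense with the shortest path to the context word."""
    best_sense, best_dist = None, None
    for sense in senses:
        dist = arc_distance(graph, sense, context_word)
        if dist is not None and (best_dist is None or dist < best_dist):
            best_sense, best_dist = sense, dist
    return best_sense


GRAPH = build_graph(TOY_EDGES)
CHECK_SENSES = ["check (bank cheque)", "check (chess)", "check (pattern)"]

print(closest_sense(GRAPH, CHECK_SENSES, "draft"))  # prints: check (bank cheque)
```

In this toy graph the bank cheque sense is one arc from “draft,” the textile pattern sense is four arcs away, and the chess sense is not connected at all, mirroring the “more (or none at all)” cases described above. Real systems typically use richer similarity measures than a raw arc count.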
WordNet is a major tool for word sense disambiguation in many Natural Language Processing applications. A number of terminological databases build on WordNet as a general lexicon to which domain-specific terminology can be added. WordNet is furthermore used for research in linguistics and psycholinguistics and for language pedagogy (English as a First and Second Language), and it has been integrated into many on-line dictionaries, including Google’s “define” function.
Being freely and publicly available, WordNet is queried tens of thousands of times daily and the database is downloaded some 6 000 times every month from the Princeton website.
Work on WordNet continues with support from the U.S. National Science Foundation. We are currently annotating selected words in the American National Corpus, a freely available text collection of modern American English, with WordNet senses. The annotated corpus will illustrate the use of specific word meanings for study and applications by both human users and computers, which can “learn” from examples to better identify context-appropriate word meanings. Another goal is to increase the internal connectivity of the semantic network by collecting human ratings of semantic similarity among words. The similarity ratings, once integrated into WordNet, will create many more connections among words and senses and improve automatic sense discrimination and identification.