Cataloguing Bibliographic Data with Natural Language and RDF
August 9, 2010 in Uncategorized
In the grand tradition of W3C IRC bots, I’ve started some speculative work on a robot that tries to understand natural language descriptions of works and their authors and generates RDF. It is written in Python and uses ORDF, the NLTK and FuXi. Before going into implementation details, here’s an example of a session:
The natural language parsing is somewhat simplistic, and the kinds of grammatical constructions it can understand are limited (but growing). The resolution of pronouns (e.g. he, they) only looks at the previous named subject, and it will get confused if more than one pronoun in the same sentence refers to different things. All of these things can be improved. Broadly, the process follows these steps:
and the constituent parts bubble up and return an RDF Graph that looks like this:

This sort of structure is the basis for the reasoning step. Provenance information, using OPMV, is also kept, pointing back to the original IRC message that was parsed, so the entire process should be repeatable.

I suppose since IRC is not necessarily the most accessible of media (though I can’t really see why), the same engine could easily be glued to a web server with a simple chat-like interface. Perhaps this is easier or more natural than web forms. Perhaps not. More research is needed.

In any case, I’m working on improving the natural language parsing and the inference rules as time permits, so hopefully the robot will become more and more clever.

Source code for the IRC bot is available at: http://bitbucket.org/ww/sembot

You can play with a live version of the bot by connecting to irc://irc.oftc.net/ and joining #okfn or engaging in a private chat with biblio. It understands the command “sembot help” and I’ll try not to break it too badly while anyone’s playing with it.
12:41 < ww> biblio forget
12:41 < biblio> ww: ok
12:41 < ww> Solzhenitsyn's name is "Aleksander Isayevitch Solzhenitsyn"
12:42 < ww> He was born on December 11th 1918
12:42 < ww> He died on August 3rd 2008
12:42 < ww> He wrote TFC in 1968
12:42 < ww> TFC's title is "The First Circle"
12:42 < ww> "YMCA"'s name is "YMCA Press"
12:42 < ww> They published TFC in 1978
12:42 < ww> biblio think
12:42 < biblio> ww: I learned 25 things in 0:00:00.218296
12:42 < ww> biblio paste
12:42 < biblio> ww: http://pastebin.ca/1913826
- (NLTK) Tokenise the sentence and classify for parts of speech
- Create references for named entities (capitalised words, URIs and phrases enclosed in double quotes)
- (NLTK) Create a lexicon, the part of a grammar that grounds it to individual words, and append it to the canned grammar that describes the structure of sentences. This is a feature grammar, not a context-free grammar.
- (NLTK) Parse the input sentences creating a syntax tree with the root at the main verb in the sentence
- The syntax tree is annotated with the logical structure of the sentence (see Analysing the meaning of sentences). This logical representation is cunningly constructed so as to also be runnable Python code (with eval). Running it transforms the syntax tree into an RDF representation.
- (FuXi) The “biblio think” command causes the RDF of the current session to be run through a number of inference rules that encode higher-level meaning. For example, if “X wrote Y” then X must be a person, Y must be a work and X must have contributed to Y.
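The “X wrote Y” rule from the last step can be illustrated without FuXi. What follows is only a minimal sketch in plain Python: it assumes triples are simple tuples, applies one rule in a single pass, and uses made-up names (`ex:wrote`, `ex:contributedTo`, `ex:Person`, `ex:Work`, `infer_authorship`) rather than the vocabulary or rule syntax the bot really uses.

```python
# Hypothetical vocabulary: these names are illustrative, not the bot's real terms.
RDF_TYPE = "rdf:type"


def infer_authorship(triples):
    """Apply one rule: X wrote Y => X is a Person, Y is a Work, X contributedTo Y.

    Returns a new set containing the input plus the inferred triples;
    the input collection is not modified.
    """
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == "ex:wrote":
            inferred.add((subject, RDF_TYPE, "ex:Person"))
            inferred.add((obj, RDF_TYPE, "ex:Work"))
            inferred.add((subject, "ex:contributedTo", obj))
    return inferred


# One fact, as it might be learned from "He wrote TFC in 1968".
session = {("ex:Solzhenitsyn", "ex:wrote", "ex:TFC")}
result = infer_authorship(session)
for triple in sorted(result):
    print(triple)
```

A real rule engine would keep applying rules until nothing new can be derived; a single pass is enough here because the rule's conclusions do not trigger the rule again.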
statement(
    predicate(
        bnode(
            rdf_type(umbel("Verb")),
            label("is"),
            racine("be"),
            tense(nlp("Present"))
        ),
        named("aHLIkuXm14335")  # "The First Circle"
    ),
    posessive(
        bnode(
            rdf_type(umbel("Noun")),
            label("title"),
            racine("title")
        ),
        named("aHLIkuXm14333")  # "TFC"
    )
)
entity:aHLIkuXm14333 a nlp:NamedEntity;
    rdfs:label "TFC".

entity:aHLIkuXm14335 a nlp:NamedEntity;
    rdfs:label "The First Circle".

[ a umbel:Verb;
    rdfs:label "is";
    lvo:nearlySameAs lve:be;
    nlp:directObject entity:aHLIkuXm14335;
    nlp:subject [ a umbel:Noun;
        rdfs:label "title";
        lvo:nearlySameAs lve:title;
        nlp:owner entity:aHLIkuXm14333];
    nlp:tense nlp:Present].
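The trick of making the logical form double as runnable Python can be shown with a small self-contained sketch. It assumes that each function in the expression either returns a value or records a triple as a side effect and passes its first argument up the tree. The helper `make_vocabulary`, the tuple representation of triples and the reduced set of functions (no `racine` or `tense`) are made up for illustration and are not taken from the sembot source, which builds real RDF graphs.

```python
import itertools


def make_vocabulary():
    """Return (namespace, triples): the functions eval() may call, and the list they fill."""
    triples = []
    counter = itertools.count(1)

    def bnode(*properties):
        # Mint a blank node and attach every (predicate, value) pair to it.
        node = "_:b%d" % next(counter)
        for pred, value in properties:
            triples.append((node, pred, value))
        return node

    def link(pred):
        # Build a two-argument function that records head --pred--> tail
        # and returns head, so the result can bubble up to the caller.
        def combine(head, tail):
            triples.append((head, pred, tail))
            return head
        return combine

    namespace = {
        "bnode": bnode,
        "named": lambda identifier: "entity:" + identifier,
        "umbel": lambda name: "umbel:" + name,
        "rdf_type": lambda cls: ("rdf:type", cls),
        "label": lambda text: ("rdfs:label", text),
        "predicate": link("nlp:directObject"),
        "posessive": link("nlp:owner"),  # spelled as in the expression in the post
        "statement": link("nlp:subject"),
    }
    return namespace, triples


# A trimmed version of the logical form for: TFC's title is "The First Circle"
expression = """statement(
    predicate(bnode(rdf_type(umbel("Verb")), label("is")), named("aHLIkuXm14335")),
    posessive(bnode(rdf_type(umbel("Noun")), label("title")), named("aHLIkuXm14333"))
)"""

namespace, triples = make_vocabulary()
root = eval(expression, dict(namespace))  # running the logical form emits the triples
for triple in triples:
    print(triple)
```

Because arguments are evaluated before the call that contains them, the innermost nodes are created first and each call hands its node up to its parent, which is one way to read “the constituent parts bubble up”. Note that `eval` should only ever be given expressions the program generated itself.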