Cataloguing Bibliographic Data with Natural Language and RDF

In the grand tradition of W3C IRC bots, I’ve started some speculative work on a robot that tries to understand natural language descriptions of works and their authors and generates RDF. It is written in Python and uses ORDF, the NLTK and FuXi.

Before going into implementation details, here’s an example of a session:

12:41 < ww> biblio forget
12:41 < biblio> ww: ok
12:41 < ww> Solzhenitsyn's name is "Aleksander Isayevitch Solzhenitsyn"
12:42 < ww> He was born on December 11th 1918
12:42 < ww> He died on August 3rd 2008
12:42 < ww> He wrote TFC in 1968
12:42 < ww> TFC's title is "The First Circle"
12:42 < ww> "YMCA"'s name is "YMCA Press"
12:42 < ww> They published TFC in 1978
12:42 < ww> biblio think
12:42 < biblio> ww: I learned 25 things in 0:00:00.218296
12:42 < ww> biblio paste
12:42 < biblio> ww: http://pastebin.ca/1913826

The natural language parsing is somewhat simplistic. The kinds of grammatical constructions it can understand are limited (but growing). The resolution of pronouns (e.g. he, they) only looks at the previous named subject, and it will get confused if more than one pronoun in the same sentence refers to different things. All of these things can be improved.
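As a rough illustration of the "previous named subject" approach to pronouns, here is a minimal sketch in plain Python. It is not the bot's actual code: the function name `resolve_pronouns`, the pronoun list and the rule that the first word of a sentence is its subject are all assumptions made for the example.

```python
import re

# Hypothetical pronoun list; the real bot may recognise a different set.
PRONOUNS = {"he", "she", "it", "they"}

def resolve_pronouns(sentences):
    """Replace a sentence-initial pronoun with the last named subject seen.

    Assumption for this sketch: the first word of a sentence is its subject.
    """
    last_subject = None
    resolved = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            resolved.append(sentence)
            continue
        if words[0].lower() in PRONOUNS and last_subject is not None:
            # Substitute the remembered subject for the pronoun.
            words[0] = last_subject
        else:
            # Remember this subject, dropping a possessive 's if present.
            last_subject = re.sub(r"'s$", "", words[0])
        resolved.append(" ".join(words))
    return resolved

session = [
    "Solzhenitsyn's name is \"Aleksander Isayevitch Solzhenitsyn\"",
    "He was born on December 11th 1918",
    "He wrote TFC in 1968",
]
for line in resolve_pronouns(session):
    print(line)
```

Because only one subject is remembered, a sentence with two pronouns referring to different things cannot be handled by this approach, which matches the limitation described above.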

Broadly, the process has the following steps:

(NLTK) Tokenise the sentence and classify for parts of speech.
Create references for named entities (capitalised words, URIs and phrases enclosed in double quotes).
(NLTK) Create a lexicon, the part of a grammar that grounds it to individual words, and append it to the canned grammar that describes the structure of sentences. This is a feature grammar, not a context-free grammar.
(NLTK) Parse the input sentences, creating a syntax tree with the root at the main verb in the sentence.
The syntax tree is annotated with the logical structure of the sentence (see Analysing the meaning of sentences). This logical representation is cunningly constructed so as to also be runnable Python code (with eval). Running it transforms the syntax tree into an RDF representation.
(FuXi) The “biblio think” command causes the RDF of the current session to be run through a number of inference rules that encode higher-level meaning. For example, if “X wrote Y” then X must be a person, Y must be a work and X must have contributed to Y.
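The named-entity step in the list above can be sketched with a regular expression. This is only an illustration under stated assumptions: the bot itself works on NLTK's tagged tokens, and the names `ENTITY_RE`, `find_entities` and the `entityN` identifiers are hypothetical (the real identifiers look like `aHLIkuXm14335`).

```python
import re

# Pronouns are capitalised at the start of a sentence but are not entities.
PRONOUNS = {"He", "She", "It", "They"}

ENTITY_RE = re.compile(r'''
    "(?P<quoted>[^"]+)"          # phrase enclosed in double quotes
  | (?P<uri>https?://\S+)        # http(s) URI
  | \b(?P<cap>[A-Z][A-Za-z]*)\b  # capitalised word
''', re.VERBOSE)

def find_entities(sentence, known):
    """Return (label, identifier) pairs for the named entities in a sentence.

    `known` maps labels to identifiers and is updated in place, so the same
    label gets the same identifier across the whole session.
    """
    found = []
    for m in ENTITY_RE.finditer(sentence):
        label = m.group("quoted") or m.group("uri") or m.group("cap")
        if label in PRONOUNS:
            continue
        if label not in known:
            known[label] = "entity%d" % (len(known) + 1)
        found.append((label, known[label]))
    return found

known = {}
print(find_entities('He wrote TFC in 1968', known))
print(find_entities('TFC\'s title is "The First Circle"', known))
```

A pattern this simple also picks up capitalised month names such as "December", which is one reason a real implementation would lean on the part-of-speech tags from the first step.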

The neat bit is really the way it generates RDF, translating a logical structure that looks like this:

statement(
  predicate(
    bnode(
      rdf_type(umbel("Verb")),
      label("is"),
      racine("be"),
      tense(nlp("Present")) 
    ), 
    named("aHLIkuXm14335") # "The First Circle" 
  ), 
  posessive(
    bnode(
      rdf_type(umbel("Noun")), label("title"), racine("title")
    ), 
    named("aHLIkuXm14333") # "TFC"
  ) 
)

When the structure is evaluated, the constituent parts bubble up and return an RDF graph that looks like this:

 entity:aHLIkuXm14333 a nlp:NamedEntity;
      rdfs:label "TFC".

 entity:aHLIkuXm14335 a nlp:NamedEntity;
      rdfs:label "The First Circle".

 [ a umbel:Verb;
      rdfs:label "is";
      lvo:nearlySameAs lve:be;
      nlp:directObject entity:aHLIkuXm14335;
      nlp:subject [ a umbel:Noun;
                    rdfs:label "title";
                    lvo:nearlySameAs lve:title;
                    nlp:owner entity:aHLIkuXm14333];
      nlp:tense nlp:Present].
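To make the bubbling-up concrete, here is a minimal sketch of how such functions could be written so that the logical structure is runnable with eval. It is not the sembot implementation: it uses plain tuples instead of an RDF library, leaves out the labels of the named entities, and the mapping of each function to a property (for example `racine` to `lvo:nearlySameAs`) is inferred from the sample output above rather than taken from the source code.

```python
import itertools

_ids = itertools.count(1)  # counter for blank node names

def umbel(name): return "umbel:" + name
def nlp(name): return "nlp:" + name

# Property builders return a (predicate, object) pair.
def rdf_type(cls): return ("rdf:type", cls)
def label(text): return ("rdfs:label", text)
def racine(lemma): return ("lvo:nearlySameAs", "lve:" + lemma)
def tense(value): return ("nlp:tense", value)

# Node builders return (node, triples) so triples can bubble upwards.
def named(ident):
    return ("entity:" + ident, [])

def bnode(*props):
    node = "_:b%d" % next(_ids)
    return (node, [(node, p, o) for p, o in props])

def predicate(verb, obj):
    vnode, vtriples = verb
    onode, otriples = obj
    return (vnode, vtriples + otriples + [(vnode, "nlp:directObject", onode)])

def posessive(noun, owner):  # spelling follows the structure shown above
    nnode, ntriples = noun
    onode, otriples = owner
    return (nnode, ntriples + otriples + [(nnode, "nlp:owner", onode)])

def statement(pred, subject):
    pnode, ptriples = pred
    snode, striples = subject
    return ptriples + striples + [(pnode, "nlp:subject", snode)]

expr = '''statement(
  predicate(
    bnode(rdf_type(umbel("Verb")), label("is"), racine("be"),
          tense(nlp("Present"))),
    named("aHLIkuXm14335")
  ),
  posessive(
    bnode(rdf_type(umbel("Noun")), label("title"), racine("title")),
    named("aHLIkuXm14333")
  )
)'''

triples = eval(expr)  # running the logical form yields the triples
for t in triples:
    print(t)
```

Each inner call returns its own node together with the triples it knows about, and each outer call adds the one triple that links its arguments, so the outermost `statement` ends up holding the whole graph.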

This sort of structure is the basis for the reasoning step. Provenance information, using OPMV, is also kept, pointing back to the original IRC message that was parsed, so the entire process should be repeatable.
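The reasoning step itself is done by FuXi, but the idea behind a rule such as "if X wrote Y then X is a person, Y is a work and X contributed to Y" can be sketched as naive forward chaining in plain Python. Everything here is an assumption made for illustration: the `ex:` vocabulary, the `RULES` format and the function names are hypothetical and are not FuXi's API or the bot's actual rules.

```python
# Each rule is (premise pattern, list of conclusion patterns).
# Terms starting with "?" are variables.
RULES = [
    (("?x", "ex:wrote", "?y"),
     [("?x", "rdf:type", "ex:Person"),
      ("?y", "rdf:type", "ex:Work"),
      ("?x", "ex:contributedTo", "?y")]),
]

def match(pattern, triple):
    """Return variable bindings if the triple fits the pattern, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A repeated variable must bind to the same value each time.
            if bindings.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return bindings

def think(triples, rules=RULES):
    """Apply the rules repeatedly until no new triples are produced."""
    facts = set(triples)
    while True:
        new = set()
        for premise, conclusions in rules:
            for triple in facts:
                bindings = match(premise, triple)
                if bindings is None:
                    continue
                for c in conclusions:
                    new.add(tuple(bindings.get(term, term) for term in c))
        if new <= facts:
            return facts
        facts |= new

facts = think([("ex:solzhenitsyn", "ex:wrote", "ex:tfc")])
for fact in sorted(facts):
    print(fact)
```

A production rule engine does the same job far more efficiently and against real RDF graphs; the loop above only shows why one input statement can turn into several learned "things".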

I suppose that, since IRC is not necessarily the most accessible of media (though I can’t really see why), the same engine could easily be glued to a web server with a simple chat-like interface. Perhaps this is easier or more natural than web forms. Perhaps not. More research is needed.

In any case, I’m working on improving the natural language parsing and the inference rules as time permits so hopefully the robot will become more and more clever.

Source code for the IRC bot is available at: http://bitbucket.org/ww/sembot

You can play with a live version of the bot by connecting to irc://irc.oftc.net/ and either joining #okfn or engaging in a private chat with biblio. It understands the command “sembot help” and I’ll try not to break it too badly while anyone’s playing with it.

Comments

Iain Emsley says:

August 23, 2010 at 18:28

I’ve been meaning to play around with NLTK on some other stuff. Looks like a fun project to look at and get my head around fitting it in with other stuff. Best, Iain

Augusto Herrmann says:

August 17, 2010 at 10:19

William,

I saw your post about this on the FuXi discussion list and I found your bot to be a very cool idea.

I’ve been meaning to do some playing around with nltk + rdflib + FuXi myself, but have just started to read the NLTK book. This will certainly be a nice example to follow. Thanks!

Best regards,
Augusto Herrmann

George Oates says:

August 9, 2010 at 18:49

Hi Rufus – You might be interested in some work being done with the Internet Archive full text corpus in a similar vein:

http://blog.openlibrary.org/2010/08/02/open-library-ore-a-mysql-data-dump-is-available/