Cataloguing Bibliographic Data with Natural Language and RDF
August 9, 2010
In the grand tradition of W3C IRC bots, I’ve started some speculative work on a robot that tries to understand natural language descriptions of works and their authors and generates RDF. It is written in Python and uses ORDF, the NLTK and FuXi. Before going into implementation details, here’s an example of a session:
The natural language parsing is somewhat simplistic. The kinds of grammatical constructions it can understand are limited (but growing). The resolution of pronouns (e.g. he, they) only looks at the previous named subject, and it will get confused if more than one pronoun in the same sentence refers to a different thing. All of these things can be improved. Broadly, the process runs through the following steps:
and the constituent parts bubble up and return an RDF Graph that looks like this:
This sort of structure is the basis for the reasoning step. Provenance information, using OPMV, is also kept, pointing back to the original IRC message that was parsed, so the entire process should be repeatable.

I suppose since IRC is not necessarily the most accessible of media (though I can’t really see why), the same engine could easily be glued to a web server with a simple chat-like interface. Perhaps this is easier or more natural than web forms. Perhaps not. More research is needed. In any case, I’m working on improving the natural language parsing and the inference rules as time permits, so hopefully the robot will become more and more clever.

Source code for the IRC bot is available at: http://bitbucket.org/ww/sembot

You can play with a live version of the bot by connecting to irc://irc.oftc.net/ and joining #okfn, or by engaging in a private chat with biblio. It understands the command “sembot help” and I’ll try not to break it too badly while anyone’s playing with it.
12:41 < ww> biblio forget
12:41 < biblio> ww: ok
12:41 < ww> Solzhenitsyn's name is "Aleksander Isayevitch Solzhenitsyn"
12:42 < ww> He was born on December 11th 1918
12:42 < ww> He died on August 3rd 2008
12:42 < ww> He wrote TFC in 1968
12:42 < ww> TFC's title is "The First Circle"
12:42 < ww> "YMCA"'s name is "YMCA Press"
12:42 < ww> They published TFC in 1978
12:42 < ww> biblio think
12:42 < biblio> ww: I learned 25 things in 0:00:00.218296
12:42 < ww> biblio paste
12:42 < biblio> ww: http://pastebin.ca/1913826
- (NLTK) Tokenise the sentence and classify for parts of speech
- Create references for named entities (capitalised words, URIs and phrases enclosed in double quotes)
- (NLTK) Create a lexicon (the part of a grammar that grounds it to individual words) and append it to the canned grammar that describes the structure of sentences. This is a feature grammar, not a context-free grammar.
- (NLTK) Parse the input sentences creating a syntax tree with the root at the main verb in the sentence
- The syntax tree is annotated with the logical structure of the sentence (see Analysing the meaning of sentences). This logical representation is cunningly constructed so as to also be runnable Python code (with eval). Running it transforms the syntax tree into an RDF representation.
- (FuXi) The “biblio think” command causes the RDF of the current session to be run through a number of inference rules that encode higher-level meaning. For example, if “X wrote Y”, then X must be a person, Y must be a work and X must have contributed to Y.
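The “X wrote Y” rule in the last step can be illustrated without FuXi. What follows is only a minimal sketch of the idea: it assumes triples are plain Python tuples, and the names `ex:wrote`, `ex:contributedTo`, `ex:Person`, `ex:Work` and `apply_wrote_rule` are hypothetical, made up for this example rather than taken from the bot's actual rule set.

```python
# Hypothetical vocabulary, for illustration only; the bot's real rules are
# written for FuXi and use its own namespaces.
RDF_TYPE = "rdf:type"


def apply_wrote_rule(triples):
    """Return the input triples plus everything implied by 'X wrote Y'."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == "ex:wrote":
            inferred.add((subject, RDF_TYPE, "ex:Person"))    # X is a person
            inferred.add((obj, RDF_TYPE, "ex:Work"))          # Y is a work
            inferred.add((subject, "ex:contributedTo", obj))  # X contributed to Y
    return inferred


facts = {("ex:Solzhenitsyn", "ex:wrote", "ex:TFC")}
closure = apply_wrote_rule(facts)

for triple in sorted(closure):
    print(triple)
```

A real rule engine would keep applying rules until no new triples appear; a single pass is enough here because the rule's conclusions never match its own premise.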
statement(
    predicate(
        bnode(
            rdf_type(umbel("Verb")),
            label("is"),
            racine("be"),
            tense(nlp("Present"))
        ),
        named("aHLIkuXm14335")  # "The First Circle"
    ),
    posessive(
        bnode(
            rdf_type(umbel("Noun")), label("title"), racine("title")
        ),
        named("aHLIkuXm14333")  # "TFC"
    )
)
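To make the “runnable with eval” idea concrete, here is a rough sketch of how such an expression could turn itself into triples. It is not the bot's implementation: the function names are copied from the expression above, but their bodies, the dict-based node representation, the `flatten` helper and the mapping of `racine` to `lvo:nearlySameAs` are assumptions inferred from the shape of the output graph. The sketch also skips the `nlp:NamedEntity` declarations for the named entities.

```python
import itertools

_ids = itertools.count()  # source of fresh blank node identifiers


def umbel(name): return "umbel:" + name
def nlp(name): return "nlp:" + name
def named(ident): return "entity:" + ident


# Each property helper returns a (predicate, object) pair.
def rdf_type(cls): return ("rdf:type", cls)
def label(text): return ("rdfs:label", text)
def racine(lemma): return ("lvo:nearlySameAs", "lve:" + lemma)  # assumed mapping
def tense(value): return ("nlp:tense", value)


def bnode(*properties):
    """Create a fresh blank node described by the given properties."""
    return {"id": "_:b%d" % next(_ids), "props": list(properties)}


def predicate(verb, direct_object):
    verb["props"].append(("nlp:directObject", direct_object))
    return verb


def posessive(noun, owner):  # spelling follows the expression in the post
    noun["props"].append(("nlp:owner", owner))
    return noun


def flatten(node):
    """Walk a node and its nested nodes, returning a flat list of triples."""
    triples = []
    for pred, obj in node["props"]:
        if isinstance(obj, dict):  # nested blank node: emit it, then link to it
            triples.extend(flatten(obj))
            obj = obj["id"]
        triples.append((node["id"], pred, obj))
    return triples


def statement(verb, subject):
    verb["props"].append(("nlp:subject", subject))
    return flatten(verb)


LOGICAL_FORM = '''statement(
    predicate(
        bnode(rdf_type(umbel("Verb")), label("is"), racine("be"),
              tense(nlp("Present"))),
        named("aHLIkuXm14335")
    ),
    posessive(
        bnode(rdf_type(umbel("Noun")), label("title"), racine("title")),
        named("aHLIkuXm14333")
    )
)'''

# Evaluating the annotated form runs the helpers from the inside out.
triples = eval(LOGICAL_FORM)
```

The inner calls run first and the outer ones assemble their results, which is the “bubbling up” of constituent parts. Note that `eval` is only reasonable here because the string is produced by the program itself; it should never be applied to text taken directly from a user.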
entity:aHLIkuXm14333 a nlp:NamedEntity;
    rdfs:label "TFC".

entity:aHLIkuXm14335 a nlp:NamedEntity;
    rdfs:label "The First Circle".

[ a umbel:Verb;
    rdfs:label "is";
    lvo:nearlySameAs lve:be;
    nlp:directObject entity:aHLIkuXm14335;
    nlp:subject [ a umbel:Noun;
        rdfs:label "title";
        lvo:nearlySameAs lve:title;
        nlp:owner entity:aHLIkuXm14333];
    nlp:tense nlp:Present].
Open Knowledge Foundation Blog
George Oates said on August 9, 2010
Hi Rufus – You might be interested in some work being done with the Internet Archive full text corpus in a similar vein:
http://blog.openlibrary.org/2010/08/02/open-library-ore-a-mysql-data-dump-is-available/
Augusto Herrmann said on August 17, 2010
William,
I saw your post about this on the FuXi discussion list and I found your bot to be a very cool idea.
I’ve been meaning to do some playing around with nltk + rdflib + FuXi myself, but have only just started reading the NLTK book. This will certainly be a nice example to follow. Thanks!
Best regards, Augusto Herrmann
Iain Emsley said on August 23, 2010
I’ve been meaning to play around with NLTK on some other stuff. Looks like a fun project to look at and get my head around fitting it in with other stuff. Best, Iain