OCRing Shakespeare Entry from Encyclopaedia Britannica 11th Edition
August 14, 2007 in Open Shakespeare, Technical, Texts
One of the next things we want to do for Open Shakespeare is provide an open introduction to his works. The obvious idea was to use the Shakespeare entry in the 11th edition of the Encyclopaedia Britannica, as detailed in this ticket:
http://p.knowledgeforge.net/shakespeare/trac/ticket/24
We’ve now written code to grab the relevant TIFFs from Wikimedia:
http://p.knowledgeforge.net/shakespeare/svn/trunk/src/shakespeare/src/eb.py
You can also find them online (28 pages) starting at:
The next step is to OCR this material (after that we can move on to proofing, whether by ourselves or via http://pgdp.net). When we first had a stab at this back in April we tried using gocr. Unfortunately the results were so bad that they were unusable. Recently an old OCR engine of HP’s has been released as open source under the name Tesseract:
http://code.google.com/p/tesseract-ocr/
We’re going to have a go using this, though if anyone out there has access to an alternative system we’d love to hear about it.
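For anyone who wants to try the same thing, here is a rough sketch of how the OCR step might be scripted. It is not the code in eb.py, just an illustration: it assumes the `tesseract` command-line tool is installed and on the PATH, that the page images are already sitting in a local directory as `.tif` files, and it relies only on the basic `tesseract <image> <output base>` invocation, which writes the recognised text to `<output base>.txt`. The function names and directory names are made up for the example.

```python
import subprocess
from pathlib import Path


def tesseract_command(tiff_path, output_dir):
    """Build the command line for OCRing one page image.

    The tesseract CLI takes an input image and an output base name,
    and writes the recognised text to <output base>.txt.
    """
    tiff_path = Path(tiff_path)
    # Hypothetical layout: one text file per page, named after the image.
    output_base = Path(output_dir) / tiff_path.stem
    return ["tesseract", str(tiff_path), str(output_base)]


def ocr_pages(tiff_dir, output_dir):
    """Run tesseract over every .tif file in tiff_dir, in name order.

    Returns the list of text files tesseract is expected to have written.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    text_files = []
    for tiff_path in sorted(Path(tiff_dir).glob("*.tif")):
        cmd = tesseract_command(tiff_path, output_dir)
        # check=True raises CalledProcessError if tesseract exits non-zero.
        subprocess.run(cmd, check=True)
        text_files.append(output_dir / (tiff_path.stem + ".txt"))
    return text_files
```

Calling something like `ocr_pages("eb_tiffs", "eb_text")` (both directory names are placeholders) would then leave one text file per page, ready for proofing. Keeping the command-building separate from the running makes it easy to check what will be executed before pointing it at all 28 pages.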