Improvements to the Concordance

One of the main items scheduled for v0.4 of open shakespeare is improvements to the responsiveness of the concordance. Using the v0.3 codebase, using just the sonnets as test material, loading up the list of words for the concordance alone took around 24s on my laptop. This is because even with a single text there are already over 18,000 items in the concordance and we were having to read through all of these to generate the list of words. Some recent commits (e.g. r:72) have gone some way to improving this responsiveness (loading word list is now 3s now compared to 24s) but the result is not entirely satisfactory (printing full statistics is 13s compared to 40s previously). One obvious way to go futher is to use caching — either of individual web pages or of particular key parts such as all the distinct words occurring in the concordance (caching works because the concordance only changes when new texts are added which will usually only happen once — when the system is first initialised).

Relatedly and r:74 is a first step on filtering the concordance — in this case to exclude roman numerals and various non-words. Doing this made me think about whether the concordance should be storing actual words or just stems — for example, it does not seem to make much sense to have different entries for kill, kills, killed etc. Using a stemming algorithm such as the porter stemmer (which I notice has a nice python implementation directly available) we can easily stem each word as we go along. This would have several benefits one of the most prominent being a dramatic reduction in the basic dictionary size (i.e. the number of distinct words in the concordance).