In collaboration with the School of History, we have applied OCR to printed French Bibles from the 16th century. We have observed that the spaces between words come in several sizes: closely linked words can be separated by narrower spaces than other words. Modern OCR technology, however, treats all spaces as equal.
We aim to automatically reconstruct the different sizes of spaces used in scanned text. This will also involve determining the widths of the pieces of type used for the letters. The results have linguistic applications and also give us insight into the printing technology of the period.
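As a rough illustration of the first step, the sketch below separates measured gap widths into two groups with a simple one-dimensional 2-means clustering. It assumes the gap widths (in pixels) have already been extracted from the scan, for example from word bounding boxes, and it assumes only two space sizes, whereas the real material may well have more. The function name `split_gap_widths` is hypothetical and not part of any existing tool.

```python
def split_gap_widths(widths, iterations=20):
    """Label each gap width as 'narrow' or 'wide' using 1-D 2-means.

    Assumption: `widths` holds inter-word gap widths in pixels,
    measured beforehand; exactly two space sizes are assumed.
    """
    if not widths:
        return []
    # Start the two cluster centres at the extremes of the data.
    lo, hi = float(min(widths)), float(max(widths))
    labels = ["narrow"] * len(widths)
    for _ in range(iterations):
        # Assign every gap to the nearer centre (ties go to 'narrow').
        labels = [
            "narrow" if abs(w - lo) <= abs(w - hi) else "wide"
            for w in widths
        ]
        narrow = [w for w, lab in zip(widths, labels) if lab == "narrow"]
        wide = [w for w, lab in zip(widths, labels) if lab == "wide"]
        # Recompute centres; keep the old centre if a cluster is empty.
        new_lo = sum(narrow) / len(narrow) if narrow else lo
        new_hi = sum(wide) / len(wide) if wide else hi
        if new_lo == lo and new_hi == hi:
            break  # converged
        lo, hi = new_lo, new_hi
    return labels
```

Clustering each gap in isolation ignores the fact that the physical width of a gap also depends on the type pieces on either side of it, which is one reason a sequence model may be a better fit.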
The techniques applied will probably include hidden Markov models (HMMs). Some background in NLP is therefore highly desirable.
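To give a flavour of how an HMM might be used here, the following minimal sketch treats the space size as a hidden state and the measured gap width as a Gaussian-distributed observation, then recovers the most likely state sequence with the Viterbi algorithm. The two states, the transition probabilities, and the emission means and standard deviations are all invented for illustration; in the project they would have to be estimated from the scans.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return (-0.5 * math.log(2 * math.pi * std * std)
            - (x - mean) ** 2 / (2 * std * std))


def viterbi(observations, states, log_start, log_trans, emission_params):
    """Return the most likely hidden state sequence for the observations."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{
        s: log_start[s] + gaussian_log_pdf(observations[0], *emission_params[s])
        for s in states
    }]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for x in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = (best[-1][prev] + log_trans[prev][s]
                         + gaussian_log_pdf(x, *emission_params[s]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative parameters only (assumptions, not measured values).
STATES = ["narrow", "wide"]
LOG_START = {"narrow": math.log(0.5), "wide": math.log(0.5)}
LOG_TRANS = {
    "narrow": {"narrow": math.log(0.4), "wide": math.log(0.6)},
    "wide": {"narrow": math.log(0.6), "wide": math.log(0.4)},
}
EMISSION_PARAMS = {"narrow": (6.5, 1.0), "wide": (13.0, 1.0)}  # (mean, std) in pixels
```

Calling `viterbi(gap_widths, STATES, LOG_START, LOG_TRANS, EMISSION_PARAMS)` on a list of gap widths yields one label per gap. A fuller model would probably also condition on the neighbouring letters and their type widths, which this sketch leaves out.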
The expected output is Java or Python code to analyse scans of printed texts.