A project funded by the Undergraduate Research Assistant Scheme has successfully completed the first stage of interdisciplinary work between the Institute of Mediaeval Studies and the School of Computer Science. The long-term aim is to digitise and analyse early French bibles.
In this pilot project, undergraduate student Gregor Haywood, under the supervision of Prof. Clive Sneddon and Dr. Mark-Jan Nederhof, explored the feasibility of large-scale OCR technology for early printed text. Scans of a French bible from 1543 were provided by the Special Collections of the University Library. Much of the project consisted of iterations of automatic transcription, manual correction, retraining, and evaluation of accuracy. In addition, the project investigated problems that arise specifically when OCR technology designed for modern printed documents is applied to early documents. Such problems include non-standard character sets, non-standard page layout, faded or smudged ink, and torn pages.
Despite these problems, it was demonstrated that error rates below 3% are achievable, which paves the way for a continuation of these efforts.
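
For readers unfamiliar with how OCR accuracy is typically measured, the following is a minimal sketch of one common metric, the character error rate: the edit distance between the automatic transcription and the manually corrected text, divided by the length of the corrected text. The report does not state whether the project's error rates were measured per character or per word, nor which tools were used, so the function names and the sample strings below are illustrative assumptions rather than the project's actual method or data.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up sample strings for illustration only; not taken from the 1543 bible.
    corrected = "au commencement"   # hypothetical manually corrected line
    automatic = "au comnencement"   # hypothetical OCR output with one wrong letter
    print(f"{character_error_rate(corrected, automatic):.1%}")
```

Under this definition, an error rate below 3% would mean fewer than three character edits per hundred characters of corrected text. The same calculation can be carried out over words instead of characters by splitting both texts into word lists first.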