Sunday, July 6, 2014

TILT recognises words in manuscripts, typescripts, books

The next milestone has been reached. TILT can now recognise words on the lines identified earlier with reasonable accuracy. What it does is pretty simple: it looks for black text in a strip on either side of each line identified in the previous step, then extends any black regions it discovers outwards. Finally, it draws a polygon around each discovered word.
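
To make the idea concrete, here is a rough sketch of that strip-and-extend approach in Python. It is not TILT's actual code: the function name, the `strip` and `max_gap` parameters, and the use of plain rectangles instead of polygons are all assumptions made for illustration. The image is assumed to be already binarised (1 = black, 0 = white).

```python
def find_word_boxes(image, baseline, strip=2, max_gap=1):
    """Return (left, top, right, bottom) boxes for words near one text line.

    image    -- list of rows, each a list of 0/1 values (1 = black)
    baseline -- row index of a line found in the previous step
    strip    -- rows to inspect on either side of the line (hypothetical)
    max_gap  -- widest run of white columns still treated as
                belonging to the same word (hypothetical)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    top = max(0, baseline - strip)
    bottom = min(height - 1, baseline + strip)

    # 1. Mark the columns that contain black pixels inside the strip.
    inked = [any(image[y][x] for y in range(top, bottom + 1))
             for x in range(width)]

    # 2. Group inked columns into runs, tolerating small white gaps.
    runs, start, last = [], None, None
    for x, has_ink in enumerate(inked):
        if not has_ink:
            continue
        if start is None:
            start = last = x
        elif x - last - 1 <= max_gap:
            last = x
        else:
            runs.append((start, last))
            start = last = x
    if start is not None:
        runs.append((start, last))

    # 3. Extend each run up and down while the adjacent row still has ink.
    boxes = []
    for left, right in runs:
        def row_has_ink(y):
            return any(image[y][x] for x in range(left, right + 1))

        ink_rows = [y for y in range(top, bottom + 1) if row_has_ink(y)]
        t, b = ink_rows[0], ink_rows[-1]
        while t > 0 and row_has_ink(t - 1):
            t -= 1
        while b < height - 1 and row_has_ink(b + 1):
            b += 1
        boxes.append((left, t, right, b))
    return boxes


# A tiny 6x12 test image containing two "words" on one line.
IMAGE = [[int(c) for c in row] for row in [
    "000000000000",
    "011000000110",
    "011010001110",
    "011110001110",
    "000000000100",
    "000000000000",
]]
```

Calling `find_word_boxes(IMAGE, baseline=3, strip=1)` picks up the ascenders in row 1 and the descender in row 4, even though the strip itself only covers rows 2 to 4.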

The next step will be to refine these words so they represent as closely as possible the words in the transcribed text. Then it should be easy to align them with the words in the transcription of the page (which we often already have) and hand over to Anna for development of the GUI. Here's a sample of TILT's current performance using polygons. These can be reduced to rectangles easily if desired.
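
Reducing a polygon to a rectangle just means taking the extremes of its points. A minimal sketch, assuming a polygon is stored as a list of (x, y) pairs (the representation and the function name are my assumptions here, not TILT's API):

```python
def bounding_rectangle(polygon):
    """Return (left, top, right, bottom) enclosing a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# A made-up five-point word outline.
word_polygon = [(3, 10), (12, 8), (15, 14), (9, 18), (2, 15)]
rectangle = bounding_rectangle(word_polygon)
```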

The main problem in recognising words turned out to be the different ways that spaces are used in printed and handwritten texts. Printed texts have lots of little inter-character gaps that mostly aren't present in manuscripts. Try as I might, I couldn't find a single setting that worked well for both. These images show the performance on a typescript, a manuscript and a printed page. The colours alternate to show where word divisions have been recognised. To get this performance in practice, the GUI will have to specify the image type.
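
The sketch below shows why one setting can't serve both kinds of text. It splits a row of "has ink / no ink" columns into words using a minimum gap width chosen by image type. The threshold values and names are invented for the example and are not the settings TILT uses.

```python
# Hypothetical minimum width (in columns) of a gap that separates two words.
MIN_WORD_GAP = {"printed": 4, "typescript": 3, "manuscript": 2}


def split_words(ink_columns, image_type):
    """Split a sequence of booleans (True = column has ink) into word spans.

    Returns a list of (first_column, last_column) pairs.
    """
    min_gap = MIN_WORD_GAP[image_type]
    words, start, end, gap = [], None, None, 0
    for x, ink in enumerate(ink_columns):
        if ink:
            if start is None:
                start = x
            end = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gap is wide enough to end the current word.
                words.append((start, end))
                start = None
    if start is not None:
        words.append((start, end))
    return words


# Narrow gaps between letters, and one slightly wider gap in the middle.
profile = [c == "#" for c in "#.#.#..#.#"]
as_manuscript = split_words(profile, "manuscript")
as_printed = split_words(profile, "printed")
```

With these made-up thresholds the same profile is read as two words when treated as a manuscript and as a single word when treated as print, which is why the image type has to be supplied from outside.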

The next stage has some ability to split and merge words, based on their alignment with a known text, but it would be better if good word identification could be attained at this stage.
