Wednesday, July 9, 2014

Getting word-spacing right

As mentioned in the previous post, the hardest thing to get right in TILT is accurately estimating the minimum space between words. A little reflection will show that manuscripts, typescripts and printed texts all use spaces between words in very different ways. How can you estimate word-spaces in manuscripts? It's hopeless, surely?

In fact, there is a trivial solution. A page-image is made up of 'blobs', that is, groups of pixels that are joined together. Wrapping each such blob in a polygon allows you to compute the distance between neighbouring blobs on a line. In a printed text there will usually be one blob per letter. In a manuscript, because of joined-up writing, there will often be several characters per blob, and every now and again there will be a gap between blobs that is not a word-space. So how can these informal gaps be distinguished from real word-spaces? Another problem is that there are columns where the inter-word gap is measured in hundreds of pixels. Just measuring the gaps in a line and averaging the result thus has no hope of success. How can these huge inter-word gaps be excluded?

Then I realised that the number of words on a page is already known, because TILT needs the text of each page to align it with the image. So all I had to do was find all the gaps on a page and sort them by decreasing size. Assigning one gap to each word in the text, starting with the largest, the last gap chosen gives the width of the minimum word-gap.
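The sort-and-count idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not TILT's actual code: each blob is reduced to its horizontal extent on a line rather than a polygon, and the names `gaps_on_line` and `estimate_min_word_gap` are made up for the example.

```python
from typing import List, Tuple

# Assumption: a blob is reduced to its horizontal extent (left, right) in pixels.
Blob = Tuple[int, int]

def gaps_on_line(blobs: List[Blob]) -> List[int]:
    """Return the horizontal gaps between consecutive blobs on one line."""
    ordered = sorted(blobs)
    gaps: List[int] = []
    # 'reach' is the right-most pixel covered so far, so overlapping
    # blobs do not produce negative or spurious gaps.
    reach = ordered[0][1] if ordered else 0
    for left, right in ordered[1:]:
        if left > reach:
            gaps.append(left - reach)
        reach = max(reach, right)
    return gaps

def estimate_min_word_gap(lines: List[List[Blob]], num_words: int) -> int:
    """Sort every gap on the page by decreasing size and return the
    num_words-th largest: the smallest gap still counted as a word-space."""
    gaps = sorted((g for line in lines for g in gaps_on_line(line)), reverse=True)
    if not gaps or num_words < 1:
        raise ValueError("need at least one gap and one word")
    # If the page has fewer gaps than words, fall back to the smallest gap.
    return gaps[min(num_words, len(gaps)) - 1]
```

Because the gaps are ranked rather than averaged, one enormous gap between columns simply takes the first place in the list and has no effect on the value that comes out at the end.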

This works so well that I have updated the test script to show it. It will work for any page from a manuscript, typescript, inscription or printed book. The only gotcha is that you must know the number of words, or use an estimate. Also, it can never be perfect. There is no one setting for a minimal word-gap, since an author can write two words with less than this separation and separate two halves of one word by more than this. But it's still an optimal solution.
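To make the limitation concrete, here is a hedged Python sketch of how such a threshold might be applied to one line. The function name `group_into_words` and the (left, right) representation of a blob are assumptions for illustration only. Any gap narrower than the threshold is treated as part of a word, which is exactly why two words written close together would be merged.

```python
from typing import List, Tuple

# Assumption: a blob is reduced to its horizontal extent (left, right) in pixels.
Blob = Tuple[int, int]

def group_into_words(blobs: List[Blob], min_word_gap: int) -> List[Blob]:
    """Merge the blobs on one line into word extents. A gap at least as wide
    as min_word_gap starts a new word; anything narrower joins the current one."""
    words: List[Blob] = []
    for left, right in sorted(blobs):
        if words and left - words[-1][1] < min_word_gap:
            # Gap too small to be a word-space: extend the current word.
            start, end = words[-1]
            words[-1] = (start, max(end, right))
        else:
            words.append((left, right))
    return words
```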

Just to give you some idea of how much the minimum word-gap varies between the test examples:

Type        Author      Number of words  Minimum word-space (pixels)
Printed     De Roberto  291              12
Manuscript  De Roberto  353              6

Now for the text-to-word alignment. That's the last stage.

Sunday, July 6, 2014

TILT recognises words in manuscripts, typescripts, books

The next milestone has been reached. TILT can now recognise words on the lines identified earlier with reasonable accuracy. What it does is pretty simple: it just looks for black text in a strip on either side of the lines identified in the previous step and then extends any black lines discovered outwards. It finally draws a polygon around the discovered word.
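The general idea can be illustrated in Python. This is a rough sketch and not TILT's implementation: it assumes the page is already binarised into rows of 0 (white) and 1 (black), that each line is given as a single y-coordinate, and that a bounding rectangle is good enough in place of a polygon. The name `blobs_near_line` and the `strip` parameter are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive

def blobs_near_line(image: List[List[int]], line_y: int, strip: int) -> List[Box]:
    """Find black pixels (1) within `strip` rows of line_y, grow each one
    outwards over connected black pixels, and return the bounding boxes.
    Assumes a non-empty, rectangular image."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    top, bottom = max(0, line_y - strip), min(height - 1, line_y + strip)
    for y in range(top, bottom + 1):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill over the 8-connected neighbourhood,
            # which may extend well beyond the strip itself.
            seen[y][x] = True
            queue = deque([(x, y)])
            l, t, r, b = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                l, t, r, b = min(l, cx), min(t, cy), max(r, cx), max(b, cy)
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((l, t, r, b))
    return sorted(boxes)
```

Black pixels that never connect to anything inside the strip are ignored, so stray marks between lines do not become words in this sketch.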

The next step will be to refine these words so they represent as closely as possible the words in the transcribed text. Then it should be easy to align them with the words in the transcription of the page (which we often already have) and hand over to Anna for development of the GUI. Here's a sample of TILT's current performance using polygons. These can be reduced to rectangles easily if desired.

The main problem in recognising words turned out to be the different ways that spaces are used in printed and handwritten texts. In the former there are lots of little inter-character gaps that are mostly absent from manuscripts. Try as I might, I couldn't find a single setting that worked well for both. These images show the performance on a typescript, a manuscript and a printed page. The colours are used alternately to show where word-divisions have been recognised. To get this performance in practice the GUI will have to specify the image type.

The next stage has some ability to split and merge words, based on their alignment with a known text, but it would be better if good word identification could be achieved at this stage.