Wednesday, July 9, 2014

Getting word-spacing right

As mentioned in the previous post, the hardest thing to get right in TILT is accurately estimating the minimum space between words. A little reflection will show that manuscripts, typescripts and printed texts all follow very different conventions for the spaces between words. How can you estimate word-spaces in manuscripts? It's hopeless, surely?

In fact, there is a trivial solution. A page-image is made up of 'blobs', that is, groups of pixels that are joined together. Wrapping each blob in a polygon allows you to compute the distance between neighbouring blobs on a line. In a printed text there will be about one blob per letter. In a manuscript, because of joined-up writing, there will be many characters per blob, and every now and again there will be a gap between blobs that is not a word-space. So how can these informal gaps be distinguished from real word-spaces? Another problem is that there are columns, where the inter-word gap is measured in hundreds of pixels. Simply measuring the gaps in a line and averaging the result thus has no hope of success. How can these huge inter-word gaps be excluded?

Then I realised that the number of words on a page is already known, because TILT needs the text of each page to align it with the image. So all I had to do was find all the gaps on a page and sort them by decreasing size. Assigning one gap to each word in the text, starting from the largest, the last gap chosen gives the width of the minimum word-gap.
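The idea of sorting gaps and counting down one per word can be sketched in a few lines of Python. This is only an illustration, not TILT's actual code: the function names `gaps_on_line` and `min_word_gap` are made up for the example, and blobs are assumed to be reduced to their horizontal extents (left, right) in pixels rather than full polygons.

```python
def gaps_on_line(extents):
    """Return the horizontal gaps between blobs on one line.

    extents: list of (left, right) pixel bounds, one per blob.
    (Assumption: a blob's polygon is summarised by its x-extent.)
    """
    gaps = []
    reach = None  # rightmost pixel covered so far
    for left, right in sorted(extents):
        if reach is not None and left > reach:
            gaps.append(left - reach)
        reach = right if reach is None else max(reach, right)
    return gaps


def min_word_gap(gaps, n_words):
    """Estimate the minimum word-gap for a page.

    Sorts all gaps by decreasing size and assigns one gap per word;
    the last gap assigned is the estimate. If the page has fewer gaps
    than words, the smallest gap is returned (an assumption made here
    to keep the sketch total).
    """
    if not gaps or n_words < 1:
        raise ValueError("need at least one gap and one word")
    ordered = sorted(gaps, reverse=True)
    k = min(n_words, len(ordered))
    return ordered[k - 1]


if __name__ == "__main__":
    # Gaps collected from every line of a hypothetical page.
    page_gaps = [2, 30, 1, 12, 3, 400, 11, 2, 10]
    print(min_word_gap(page_gaps, 4))
```

Note that a very large gap, such as the 400 in the example, simply uses up one of the word slots instead of distorting an average.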

This works so well that I have updated the test script to show it. It should work for any page from a manuscript, typescript, inscription or printed book. The only gotcha is that you must know the number of words, or use an estimate. Also, it can never be perfect: there is no single correct setting for a minimal word-gap, since an author can write two words with less than this separation and separate two halves of one word by more than it. But for a single page-wide setting it's still about as good a solution as you can get.

Just to give you some idea of how much the minimum word-gap varies between the test examples:

Type        Author      Number of words  Minimum word-space (pixels)
Typescript  Harpur      150              4
Printed     De Roberto  291              12
Manuscript  De Roberto  353              6
Manuscript  Capuana     205              7

Now for the text-to-word alignment. That's the last stage.
