Friday, June 27, 2014

TILT recognises lines in manuscript/print books

Perhaps the hardest thing to get right in the TILT design is to reliably recognise lines on a page where division into lines may be irregular. For example, you can easily have uneven line-spacing and inserted, warped or tilted lines, but in order to recognise the words on a page you first have to work out roughly where they are. TILT has shown, early in its life, that the basic idea for its line-recognition method works. There is a live demo here. It is slow only because the server is slow. TILT is actually pretty fast.

Once you've loaded a page you can click on some buttons to see how TILT processes a page-image. First it reduces it to greyscale, then to pure black and white, then it removes residual borders (which are ordinary OCR steps). Finally it searches the page for lines, using a grid of rectangles, about 25 across and 200 down. The reason for this strange proportion is that lines are pretty much shaped that way. So if lines slant down or up or curve, it should be able to track their progress across the page. So far it has demonstrated that it can discover small lines in-between the main ones. Ordinary OCR programs can't do this. They assume that text has evenly-spaced lines. TILT's test interface draws a line over the top of each line of text just for the demonstration but in the real product these lines will be invisible. Along this line it will later attempt to recognise words, and to align those words with the already transcribed text. That is still to come, but this step brings it much closer.
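To make the steps above a little more concrete, here is a minimal sketch in Python. It is not TILT's actual code: the function names, the fixed threshold of 128 and the greyscale weights are my own assumptions for illustration, and border removal and the tracking of lines across the grid are left out. It only shows the reduction to greyscale, then to black and white, and the counting of ink in a grid of about 25 by 200 rectangles.

```python
def to_greyscale(rgb):
    """Convert rows of (r, g, b) pixels to rows of grey values 0-255.

    The weights are the common luma approximation, an assumption here.
    """
    return [[(r * 299 + g * 587 + b * 114) // 1000 for (r, g, b) in row]
            for row in rgb]


def to_black_white(grey, threshold=128):
    """Reduce grey values to pure black and white.

    True marks ink (a dark pixel), False marks background.
    The fixed threshold is a simplification chosen for this sketch.
    """
    return [[value < threshold for value in row] for row in grey]


def ink_grid(bw, cols=25, rows=200):
    """Lay a rows x cols grid over the page and count ink pixels per cell.

    Each cell is wide and flat, roughly the shape of a piece of a text line.
    """
    height, width = len(bw), len(bw[0])
    grid = [[0] * cols for _ in range(rows)]
    for y, row in enumerate(bw):
        gy = y * rows // height          # grid row that pixel row y falls in
        for x, ink in enumerate(row):
            if ink:
                grid[gy][x * cols // width] += 1
    return grid
```

A line-finder could then look down each column of the grid for runs of inked cells and join neighbouring runs from column to column, which is how a slanted or curved line might be followed across the page.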

What this makes possible is the offline processing of large numbers of page-images, creating page-image to text links, which can then be uploaded. They won't be perfect without editing (which is what the graphical user interface is needed for) but they will suffice as a first pass.

Wednesday, June 11, 2014

TILT is born again

This is a fresh start for the text-to-image linking tool (TILT). TILT is a tool for linking areas on a page-image taken from an old book, be it manuscript or print, and a clear transcription of its contents. As we rely more and more on the Web there is a danger that we will leave behind the great achievements of our ancestors in written form over the past 4,000 years. On the Web what happens to all those printed books, handwritten manuscripts on paper, vellum, papyrus, stone, or clay tablets etc.? Can we only see and study them by actually visiting a library or museum? Or is there some way that they can come to us, so they can be properly searched and studied, commented on and examined by anyone with a computer and an Internet link?

So how do we go about that? Haven't Google and others already put lots of old books onto the Web by scanning images of pages and their contents using OCR (optical character recognition)? Sure they have, and I don't mean to play down the significance of that, but for objects of greater than usual interest you need a lot more than mere page-images and unchecked OCR of their contents. For a start you can't OCR manuscripts, or not well enough at least. And OCR of even old printed books produces lots of errors. Laying the text directly on top of the page-images means that you can't see the transcription to verify its accuracy. Although you can search it you can't comment on it, format or edit it. And in an electronic world, where we expect so much more of a Web page than for it merely to sit there dumbly to be stared at, the first step in making the content more useful and interactive is to separate the transcription from the page-images.

Page-image and content side by side

Page images are useful because they show the true nature of the original artefact. Not so for transcriptions. These are composed of mere symbols that, by convention, were chosen to represent the contents of writing. You can't use just text on a line to represent complex mathematical formulae, drawings or wood-cuts, the typography, layout, or the underlying medium. So you still need an image of the original to provide supplementary information, and not least because you might want to verify that the transcription is a true representation of it. So the only practical way to do this is to put the transcription next to the image.

Now the problems start. One of the principles of HCI (human-computer interaction) design is that you have to minimise the effort or ‘excise’ as the user goes about doing his or her tasks. And putting the text next to the image creates a host of problems that increase excise dramatically.

As the user scrolls down the transcription, reading it, at some point the page-image will need refreshing. And likewise if the user moves on to another page image, the transcription will have to move down also. So some linkage between the two is already needed even at the page-level of granularity.

And if the text is reformatted for the screen, perhaps on a small device like a tablet or a mobile phone, the line-breaks will be different from the original. So even if the printed text is perfectly clear, it won't be clear, as you read the transcription, where the corresponding part of the image is. You may say that this is easily solved by enforcing line-breaks exactly as they are in the original. But if you do that and the lines don't fit in the available width – and remember that half the screen is already taken up with the page-image – then the ends of each enforced line must wrap around onto the next line, or else they will become invisible off to the right. Either way it is pretty ugly and not at all readable. And consider also that the line height, or distance between lines in the transcription can never match that of the page-image. So at best you'll struggle to align even one line at a time in both halves of the display.

So what's the answer? It is, as several others have already pointed out (e.g. TILE, TBLE, EPT), to link the transcription to the page-image at the word-level. As the user moves the mouse over, or taps on, a word in the image or in the transcription the corresponding word can be highlighted in the other half of the display, even when the word is split over a line. And if needed the transcription can be scrolled up or down so that it automatically aligns with the word on the page. And now the ‘excise’ drops back to a low level.
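As a rough illustration of what a word-level link might look like as data, here is a small Python sketch. The names Rect, WordLink, link_at_offset and link_at_point are hypothetical and are not taken from TILT, TILE, TBLE or EPT. The one idea it tries to capture is that a word in the transcription can map to more than one region of the image, which is what lets a word split over a line be highlighted in both places.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """A rectangular region on the page-image, in pixels."""
    x: int
    y: int
    width: int
    height: int

    def contains(self, px, py):
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


@dataclass
class WordLink:
    """Links a span of the transcription to one or more image regions."""
    offset: int     # start of the word in the transcription text
    length: int     # number of characters in the word
    regions: list   # one Rect, or two if the word is split over a line


def link_at_offset(links, pos):
    """Find the link for the word under a position in the transcription."""
    for link in links:
        if link.offset <= pos < link.offset + link.length:
            return link
    return None


def link_at_point(links, px, py):
    """Find the link for the word under a point on the page-image."""
    for link in links:
        if any(region.contains(px, py) for region in link.regions):
            return link
    return None
```

With links like these, a mouse-over or tap on either side reduces to one lookup: find the link, then highlight its text span and all of its regions.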

Making it practical

The technology already exists to make these links, but the problem is, how? Creating them by hand is incredibly time-consuming and also very dull work. So automation is the key to making it work in practice. The idea of TILT is to make this task as easy and fast as possible, so we can create hundreds or thousands of such text-to-image linked pages at low cost, and make all this material truly accessible and usable. The old TILT was written at great speed for a conference in 2013. What it did well was outline how the process could be automated, but it had a number of drawbacks that can, now they are understood properly, be remedied in the next version. So this blog is to be a record of our attempts to make TILT into a practical tool. The British Library Labs ran a competition recently and we were one of two winners. They are providing us with support, materials and some publicity for the project. We aim to have it finished in demonstrable and usable form by October 2014.