A Software Tool for Historical Manuscripts.
School of Computer Science,
University of Birmingham,
Birmingham B15 2TT, UK.
Paper given at the 1995 AHC Conference at Montreal.
As the availability of digitised images of manuscript pages increases, so does the need for software tools to assist in their processing. The work described here relates to such a tool and is intended to assist in the production of a text copy of the content of a manuscript. It will always be possible to produce such a copy "manually", that is, by displaying the page on a computer screen, reading it, and then typing the text in on the keyboard. This is a slow process and can become very tedious. The software currently being produced will be considered a success if it is quicker, easier and more interesting than the manual alternative.
The choice of this research topic is to some extent an act of faith. It assumes that archives will offer more and more of their documents as digitised images rather than continuing to allow all users to handle the original documents. Since even the most careful user cannot avoid wear and damage to the fragile original, it seems obvious that those responsible for the preservation of these documents will wish to offer the digital alternative. This will only be possible if resources are available and the likely shortage of funding will make such conversion a slow process. These financial problems also make it less likely that the provision of text copies will keep pace with the digitisation of the documents. Hence the need for such software tools.
Conversations at the recent AHC conference at Montreal tend to reinforce this impression. It seems that many archives are in a state of crisis over funding, and those which are able to develop databases of digitised images of their documents are most unlikely to be able to provide text copies in addition to the images. This stresses the need for software tools to assist historians in the use of the digitised images: tools for enhancing the images, for obtaining text copies, and for studying the handwriting of such documents.
After digitisation of the original manuscript, there may be a need for computer enhancement to improve the image. Examples of such enhancement have been described in discussions of the Archivo General de Indias project (ref ???). Faded ink may be darkened and browned paper returned to its original whiteness. Fuzzy images may be sharpened and stains removed from the document. All of these techniques may be applied to produce the clearest possible image, and users may come to prefer and expect the improved quality of these images. None of this is seen as part of the software currently under production - this software is assumed to start with a high-quality image of a manuscript page.
The first process to be carried out by this package is the segmentation of the manuscript page, first into lines of text and then into individual words. The segmentation section converts one large image of a complete manuscript page into an ordered sequence of smaller images, each of which should ideally contain a single word. The user will have the option of reviewing and correcting this segmentation or of accepting the output of this process without checking. It will always be possible to return to this section and edit the segmentation if it becomes obvious at a later stage that some changes are necessary.
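The paper does not say how the segmentation is carried out. As an illustration only, the following minimal sketch uses projection profiles, a common technique which is assumed here rather than taken from the package: rows containing no ink separate the lines, and sufficiently wide runs of empty columns separate the words. The names `runs_of_ink`, `segment_page` and `word_gap` are hypothetical, and the page is assumed to be a clean, unskewed binary image (1 for ink, 0 for background).

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background


def runs_of_ink(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for stretches where the profile is non-zero.

    Stretches separated by fewer than `min_gap` empty positions are merged,
    so that small gaps between letters do not split a word.
    """
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_page(page: Image, word_gap: int = 2) -> List[Image]:
    """Split a binary page image into word images, in reading order."""
    words: List[Image] = []
    row_profile = [sum(row) for row in page]           # ink per row
    for top, bottom in runs_of_ink(row_profile):       # one run per text line
        line = page[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]  # ink per column
        for left, right in runs_of_ink(col_profile, min_gap=word_gap):
            words.append([row[left:right] for row in line])
    return words
```

A real manuscript page would need more than this (touching lines, skew, ascenders and descenders that overlap neighbouring lines), which is one reason for letting the user review and correct the result.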
The next process is the generation of images of known words to be compared with the unknown word-images from the manuscript. For this, two things will be needed: a computer description of the hand used and a vocabulary of likely words for this type of document.
Computer Description of Hands.
For each hand, a computer description is needed of each of the letters (including alternative versions for some of the letters). In the case of cursive hands, a description of how these letters are joined into words is also needed. The computer description will produce a simulation of the handwriting process, so some discussion of human handwriting is relevant at this point.
Originally the letters were produced by moving the pen-nib across the page and different letters corresponded to different pen-paths across the page. The end of the nib is approximately a straight line of known width and it may be manipulated to produce three types of imprint on the page. The pen-path may be defined as the path of the centre of the nib in the first two cases and of the corner in contact with the page in the third case.
(a) the whole nib is in contact, giving a straight or curved path of known width.
(b) pressure is applied to the nib, causing it to splay apart slightly, which produces a slightly wider path. Too much pressure will both damage the nib and cause a gap in the centre of the path, so the increase in width is limited. This usually applies for only a short part of the path.
(c) only the corner of the nib is in contact with the page. This gives a very thin line.
A hand such as gothic uses a wide nib and each letter is made up of several strokes. The nib is assumed to be held at a given angle to the horizontal (possibly varying slightly during production of the letter). In the computer simulation, through each point along the pen-path a line is drawn at the correct angle and of the correct length. This builds up an image of the letter. Examples of gothic hands show that the letters are made up of thick, linear strokes and thin curved flourishes. The pen-path for these strokes may be recorded as the coordinates of a few points and the intervening points may be interpolated by a sequence of straight line segments. Data files containing these descriptions have been prepared and may be used with software for polyline interpolation to generate images of letters and words written using the gothic hand.
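The simulation just described can be sketched as follows. This is a minimal illustration, not the software or data format used in the project: the function names, the `step` parameter and the representation of the result as a set of inked pixel coordinates are all assumptions. It models only the case where the whole nib is in contact with the page at a fixed angle; splaying under pressure and corner-only contact are not modelled.

```python
import math
from typing import List, Set, Tuple

Point = Tuple[float, float]


def interpolate_polyline(points: List[Point], step: float = 0.5) -> List[Point]:
    """Fill in points along the straight segments joining the recorded points."""
    path: List[Point] = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        n = max(1, math.ceil(math.hypot(x1 - x0, y1 - y0) / step))
        for k in range(1, n + 1):
            t = k / n
            path.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return path


def nib_stroke(points: List[Point], nib_width: float, nib_angle_deg: float,
               step: float = 0.5) -> Set[Tuple[int, int]]:
    """Return the pixels inked by dragging a flat nib along a pen-path.

    At every point of the interpolated path, a line of length `nib_width`
    is drawn through the point at `nib_angle_deg` to the horizontal.
    """
    angle = math.radians(nib_angle_deg)
    dx, dy = math.cos(angle), math.sin(angle)
    samples = max(1, math.ceil(nib_width / step))
    inked: Set[Tuple[int, int]] = set()
    for cx, cy in interpolate_polyline(points, step):
        for k in range(samples + 1):
            s = (k / samples - 0.5) * nib_width  # offset from the nib centre
            inked.add((round(cx + s * dx), round(cy + s * dy)))
    return inked
```

With the nib held horizontally (angle 0), a vertical pen-path gives a stroke as wide as the nib, while a horizontal pen-path gives a line one pixel thick, which is the contrast between thick and thin strokes seen in the gothic hand.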
The other example which has been studied so far is copperplate, a cursive script produced with a much narrower nib. To obtain the flowing curves typical of this hand, the pen-path must be interpolated by a series of piecewise smooth cubics rather than linear segments. It also differs from the gothic hand in that a single stroke now includes several letters or parts of letters. As the words are built up, additional sections of curve must be generated to join the end of one letter to the beginning of the next. For this cursive hand, the start or end of a stroke must be defined as:
(a) the start or end of a word.
(b) a reversal of direction of the pen-path.
In addition, there will be some extra strokes, such as the cross-bar of a letter "t".
To calculate cubic interpolation, each point recorded must have the two coordinates and also the slope (or gradient) of the curve at that point. This allows a unique cubic to be calculated for each interval for which interpolation is needed and also ensures that the gradients at the joins of adjacent cubics are continuous. Further work is needed to complete the software and data sets for this example of a cursive hand and then similar analyses of other hands will need to be carried out to build up a comprehensive description of possible hands.
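The construction described here corresponds to what is usually called cubic Hermite interpolation. The following minimal sketch shows one way of computing it; it is offered as an illustration under stated assumptions, not as the project's software. The names `Knot`, `hermite_segment` and `interpolate_stroke` are hypothetical, and the curve is assumed to be expressed as y in terms of x with x changing strictly monotonically within a stroke, so that the slope is always finite.

```python
from typing import List, Tuple

Knot = Tuple[float, float, float]  # (x, y, slope dy/dx) at a recorded point


def hermite_segment(k0: Knot, k1: Knot, x: float) -> float:
    """Evaluate the unique cubic matching value and slope at both knots."""
    x0, y0, m0 = k0
    x1, y1, m1 = k1
    h = x1 - x0
    t = (x - x0) / h                      # position within the interval, 0..1
    h00 = 2 * t**3 - 3 * t**2 + 1         # weight of y0
    h10 = t**3 - 2 * t**2 + t             # weight of m0 (scaled by h)
    h01 = -2 * t**3 + 3 * t**2            # weight of y1
    h11 = t**3 - t**2                     # weight of m1 (scaled by h)
    return h00 * y0 + h10 * h * m0 + h01 * y1 + h11 * h * m1


def interpolate_stroke(knots: List[Knot],
                       samples_per_interval: int = 8) -> List[Tuple[float, float]]:
    """Return points along the piecewise cubic through the recorded knots."""
    path = [(knots[0][0], knots[0][1])]
    for k0, k1 in zip(knots, knots[1:]):
        for i in range(1, samples_per_interval + 1):
            x = k0[0] + (k1[0] - k0[0]) * i / samples_per_interval
            path.append((x, hermite_segment(k0, k1, x)))
    return path
```

Because adjacent cubics share the value and the slope recorded at their common knot, the curve and its gradient are continuous at each join. Ending a stroke at every reversal of direction helps to keep each stroke within the monotonic case assumed above.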
It would in theory be possible to take a complete dictionary for the language in which the document is written and compare every word in the language with each of the unknown word-images in the document. This is impractical because it would take far too long. To obtain a response in a reasonable length of time, it will be necessary to restrict the vocabulary to those words which are most likely to occur in a document of this type and date. The system will also allow the user to type in additional words for comparison, or to speed things up by removing words which are considered less likely. At the end of the process, a vocabulary for this particular document will exist and this may be of interest in its own right, especially if the vocabulary includes terms peculiar to a particular date or profession. It is assumed that it will be possible to find historians who are willing to suggest word-lists for different types of documents and different time-periods and that these may be used as a starting point for the processing.
There will always be problems due to non-standard spelling and abbreviations (possibly also non-standard). One solution would be to store lists of synonyms, giving the various possible spellings of each word. This would allow the storage of abbreviations as additional synonyms. This is always possible, but may require an unduly large amount of storage to contain all the data. Another solution would be to store rules for generating all possible spellings of a given word and to apply these to generate first the words and then the images, as and when they are needed. In the long term, this solution is preferred because it is more flexible. However, some provision for the storage of synonyms may always be needed. Computer systems can seldom keep up with human ingenuity, in spelling as in any other area.
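The rule-based alternative might be sketched as below. The rules shown are hypothetical examples chosen purely for illustration and are not taken from any word-list or hand studied in this work; the function name `spelling_variants` and the `limit` parameter are likewise assumptions. Each rule is a pair of letter groups, and a variant is produced by replacing one occurrence of the first group by the second; the process is repeated until no new spellings appear.

```python
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, str]  # (standard letters, alternative letters)

# Hypothetical rules, for illustration only.
EXAMPLE_RULES: List[Rule] = [("i", "y"), ("u", "v"), ("tion", "cion")]


def spelling_variants(word: str, rules: Sequence[Rule],
                      limit: int = 1000) -> List[str]:
    """Generate the spellings reachable from `word` by applying the rules.

    `limit` is a soft cap on the number of spellings, guarding against
    rules that could otherwise grow a word without end.
    """
    seen: Set[str] = {word}
    pending = [word]
    while pending and len(seen) < limit:
        current = pending.pop()
        for old, new in rules:
            pos = current.find(old)
            while pos != -1:
                candidate = current[:pos] + new + current[pos + len(old):]
                if candidate not in seen:
                    seen.add(candidate)
                    pending.append(candidate)
                pos = current.find(old, pos + 1)
    return sorted(seen)
```

For example, with the two rules `("i", "y")` and `("u", "v")`, the word "unit" yields "unit", "unyt", "vnit" and "vnyt". Abbreviations which follow no rule would still need to be stored explicitly as synonyms.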
Justification of the Approach.
Since this method of handwriting recognition differs greatly from other methods applied to this problem, some justification may be needed. In general, humans are still more successful at reading difficult handwriting than any computer system so far developed. This suggests that an approach based on the way in which humans read difficult handwriting may be more successful in some cases. The problem is to determine what the human reader does to decipher the handwritten text. The human reader very rarely attempts to break down a difficult word into strokes and then build them up again into letters and then words. It is far more natural to use his or her knowledge of the context and recognise whole words or phrases at a time. In studying the psychological texts on reading, those relating to "fluent reading" are closest to the techniques applied here. This uses a great deal of contextual information to guess ahead and predict what will come next on the page, and the actual letters on the paper either confirm or change these guesses. There is some evidence that the initial letters of a word are used to reduce the number of possibilities and guide the guesswork involved in such fast reading. All these methods can be copied in the software in the current system and may allow a faster recognition of handwritten texts than other approaches.
One vital part of the system will be the recognition or rejection of the whole words in the unknown document. Whole word comparison to determine which of the generated images is most similar to the unknown word-image is again a minority interest in the texts on handwriting recognition. To get a suitable speed of response, it will be necessary to identify to the computer those words which are obviously not the same. For example, a short word such as "man" or "may" is obviously not the same as a much longer word such as "manuscript". This is so obvious to the human that it would never occur to him or her to try and compare them. To make this equally obvious to the computer system, an explicit test will be needed. The word images will need to be normalised (scaled to the same "size" or more likely the same height). Then the length (or the ratio of length to height) will have to be calculated and if the difference in lengths exceeds a certain value (which will have to be determined by experience) then the two images cannot represent the same word.
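The explicit length test could take a form such as the following minimal sketch. The function names are hypothetical, the images are assumed to be cropped tightly to the ink, and the default threshold of 0.5 is an arbitrary placeholder, since the appropriate value has yet to be determined by experience.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background


def aspect_ratio(image: Image) -> float:
    """Ratio of length to height; unchanged by scaling to a common height."""
    return len(image[0]) / len(image)


def could_match(unknown: Image, candidate: Image,
                max_ratio_difference: float = 0.5) -> bool:
    """Cheap rejection test: False means the two cannot be the same word."""
    difference = abs(aspect_ratio(unknown) - aspect_ratio(candidate))
    return difference <= max_ratio_difference


def plausible_candidates(unknown: Image, candidates: Dict[str, Image],
                         max_ratio_difference: float = 0.5) -> List[str]:
    """Keep only the words whose generated image passes the length test."""
    return [word for word, image in candidates.items()
            if could_match(unknown, image, max_ratio_difference)]
```

Comparing the ratio of length to height avoids rescaling the images at this stage, so the test costs only a few arithmetic operations for each candidate word.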
Having normalised the images and rejected those which are of different length, the two images may be superimposed and the difference evaluated. This will allow the ordering of the various possibilities and the selection of a small number of the most similar. Having reduced the possible candidates to a very small number, more detailed matching may be applied. There are various fuzzy matching techniques, mostly using neural nets of some form, which could be applied. Another possibility is the search for recognisable features in the images (for example the number of up and down strokes and their position within the word), followed by transformations using an "elastic background" to try to superimpose these features and see whether this forces the rest of the word to match up. Much work remains to be done in comparing these various techniques and deciding which are most useful for this problem.
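The first of these steps, superimposing two normalised images and ordering the candidates by their difference, might be sketched as follows. This is an illustration under stated assumptions only: the measure used (the fraction of pixels which disagree), the nearest-neighbour scaling, the common height of 16 pixels and all the function names are choices made for the example and have not been settled by the work described. The later stages (fuzzy matching, feature search and elastic transformations) are not attempted here.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background


def scale_to_height(image: Image, height: int) -> Image:
    """Nearest-neighbour rescaling to a fixed height, keeping proportions."""
    src_h, src_w = len(image), len(image[0])
    width = max(1, round(src_w * height / src_h))
    return [[image[r * src_h // height][c * src_w // width]
             for c in range(width)]
            for r in range(height)]


def difference(a: Image, b: Image, height: int = 16) -> float:
    """Fraction of pixels that disagree when the images are superimposed.

    Both images are scaled to the same height and aligned at the left;
    the shorter one is padded on the right with background.
    """
    a, b = scale_to_height(a, height), scale_to_height(b, height)
    width = max(len(a[0]), len(b[0]))
    mismatches = 0
    for row_a, row_b in zip(a, b):
        for c in range(width):
            pixel_a = row_a[c] if c < len(row_a) else 0
            pixel_b = row_b[c] if c < len(row_b) else 0
            mismatches += pixel_a != pixel_b
    return mismatches / (height * width)


def rank_candidates(unknown: Image, candidates: Dict[str, Image],
                    keep: int = 3) -> List[str]:
    """Order the candidate words by difference and keep the most similar."""
    scored = sorted((difference(unknown, image), word)
                    for word, image in candidates.items())
    return [word for _, word in scored[:keep]]
```

A measure as simple as this is sensitive to small shifts and to variation in the width of individual letters, which is why it is used only to reduce the candidates to a small number before more detailed matching is applied.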
This is the basic outline of my current research. I shall develop an example of each section in order to produce a prototype which can be tested on some real problems and expect to get a PC-based version working within the next few years. I shall be especially interested to hear from historians willing to work with me on this project and test out portions of the prototype as they are developed.