Type:
handwriting recognition
Dataset Availability:
Free
Description of Benchmark:
The PLANET SynMS (synthetic manuscript) family is a collection of databases of images of synthetically generated cursive writings. The generation process is based on real-writer fonts modelled to capture individual handwriting styles of real persons. As a result, the writings look very realistic.
Overview
The writings are typically stored as gray-scale images whose resolution varies between databases. Because the writings are generated rather than collected, there are no ground-truth labelling ("truthing") errors and no writing-segmentation errors. In addition, creating the databases is relatively cheap and requires only a fontbase and a dictionary.
For every writing, the coordinates of every individual character and of all its non-white pixels can be obtained. Typically, additional data is also provided, representing the results of some standard writing-normalisation and feature-extraction routines.
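As an illustration of how such position information might be used, the following minimal sketch derives per-character bounding boxes and a per-pixel label map from character pixel lists. The record layout (a list of `(character, pixels)` pairs with `(x, y)` coordinates) and the function names are assumptions made for this example; they are not the file format of the databases.

```python
def bounding_box(pixels):
    """Return (x_min, y_min, x_max, y_max) for a list of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def pixel_label_map(width, height, char_pixels):
    """Build a height x width grid holding, for every non-white pixel,
    the index of the character it belongs to (-1 marks background)."""
    grid = [[-1] * width for _ in range(height)]
    for index, (_char, pixels) in enumerate(char_pixels):
        for x, y in pixels:
            grid[y][x] = index
    return grid


# Hypothetical toy writing "NY" on a 7 x 3 pixel canvas.
writing = [
    ("N", [(0, 0), (0, 1), (0, 2), (1, 1), (2, 0), (2, 1), (2, 2)]),
    ("Y", [(4, 0), (6, 0), (5, 1), (5, 2)]),
]

boxes = [(char, bounding_box(pixels)) for char, pixels in writing]
labels = pixel_label_map(7, 3, writing)
```

A label map of this kind is one possible form of the per-pixel training targets mentioned under Purpose.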
Purpose
The databases of the family are primarily intended to support the training and evaluation of the recognition part of reader systems which require training targets for individual characters or targets for their non-white pixels.
With the additional feature extraction results, training and evaluation may start directly from the features extracted from the writing.
Because of the availability of position information for characters and their non-white pixels, the databases are also likely to support the development and evaluation of distortion and deformation models as well as writing-normalisation procedures.
Moreover, the development of image degradation models may additionally benefit from the fact that the modelling can start from ideally clean, undisturbed images. The robustness of reader systems against image degradations can be tested too.
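To make the idea concrete, here is a minimal sketch of one very simple degradation model, salt-and-pepper noise, applied to a clean gray-scale image. The noise model, the function name `add_salt_and_pepper` and the parameter `flip_prob` are chosen for illustration only; the databases do not prescribe any particular degradation model.

```python
import random


def add_salt_and_pepper(image, flip_prob=0.05, seed=0):
    """Return a copy of a gray-scale image (rows of 0..255 values) in which
    each pixel is replaced by black or white with probability flip_prob."""
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    noisy = []
    for row in image:
        new_row = []
        for value in row:
            r = rng.random()
            if r < flip_prob / 2:
                new_row.append(0)      # "pepper": pixel forced to black
            elif r < flip_prob:
                new_row.append(255)    # "salt": pixel forced to white
            else:
                new_row.append(value)  # pixel left untouched
        noisy.append(new_row)
    return noisy


# Toy clean image: white background with one dark horizontal stroke.
clean = [[255] * 20 for _ in range(10)]
clean[5] = [0] * 20

degraded = add_salt_and_pepper(clean, flip_prob=0.1)
changed = sum(
    1
    for clean_row, noisy_row in zip(clean, degraded)
    for a, b in zip(clean_row, noisy_row)
    if a != b
)
```

Because the starting image is perfectly clean, every changed pixel is known to come from the degradation model, which is what makes controlled robustness tests possible.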
Examples
For random samples which illustrate the quality of the writings, see the folder "samples" on the DVD.
Composition
Per database, there is at least a test set and a training set. Every set is produced with the help of a fontbase (of real-writer fonts) and a dictionary, and possibly involves random selection from the set of possible writings. Fonts used to generate the test set will not be used to generate the training set. Hence evaluations based on the test set will measure writer generalisation.
Some additional training sets may bring in additional fonts (if available) which are less realistic. For some training sets, there may exist modified sets generated via predefined deformation routines.
If, in addition to the training sets, separate validation sets are required, they will have to be extracted from the training sets. Typically, one would want to do that in a way that fonts corresponding to different writers end up in different sets. To this end the font information is kept with the writings.
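A writer-disjoint split of this kind can be sketched as follows. The sample layout (pairs of a font identifier and a writing) and the names `split_by_font` and `val_fraction` are assumptions for this example, not part of the databases.

```python
import random
from collections import defaultdict


def split_by_font(samples, val_fraction=0.2, seed=0):
    """Split (font_id, writing) pairs into training and validation sets
    so that no font contributes to both sets."""
    by_font = defaultdict(list)
    for font_id, writing in samples:
        by_font[font_id].append(writing)

    # Shuffle the fonts, not the individual writings, then hold out whole fonts.
    fonts = sorted(by_font)
    random.Random(seed).shuffle(fonts)
    n_val = max(1, round(len(fonts) * val_fraction))
    val_fonts = set(fonts[:n_val])

    train = [(f, w) for f, w in samples if f not in val_fonts]
    val = [(f, w) for f, w in samples if f in val_fonts]
    return train, val


# Hypothetical toy data: 10 fonts, each rendering the same 3 city names.
samples = [
    (f"font{i:03d}", city)
    for i in range(10)
    for city in ("Boston", "Denver", "Austin")
]
train, val = split_by_font(samples, val_fraction=0.2)
```

Splitting at the font level rather than the writing level keeps the validation score a measure of writer generalisation, in the same way as the predefined test sets.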
Fontbases
- US: 225 real-writer fonts which capture the writing styles of writers from the United States. The character connections in the writings depend on the characters connected. In addition, the same character is often rendered in different ways.
- CS: Connected script, 46 Latin fonts, a mixture of fonts created to resemble real handwriting and stylized script fonts. Here character connections always look the same - they are guaranteed by a passe-partout letter pattern design. (Some of these fonts are based on the handwriting of real writers - but they are simplified and contain standardized letter patterns only.)
- UCS: Unconnected script & block script (& machine writing fonts close to block script), 275 Latin fonts, no character connections. (Some of these fonts are based on the block script of real writers - but they are simplified and contain standardized letter patterns only.)
Databases
SynMS.USC500 & SynMS.USC50:
These databases provide synthetic writings of cities of the United States based on the US fontbase. There are 50 fonts in the test set and 175 fonts in the basic training set.
The USC500 database provides writings of 500 different cities. The USC50 database is essentially a subset of the USC500 database restricted to 50 cities. In addition, USC50 contains a test set of real-world handwriting (images + text references).
Additional training sets are based on fontbases CS and UCS.
The intermediate data is the result of normalisation routines and a subsequent feature extraction based on Gabor wavelets. Other normalisation or feature-extraction routines may be added in future expansions of the databases.
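The exact normalisation steps and Gabor parameters used for the intermediate data are not stated here. Purely as a generic illustration of Gabor-based feature extraction, the sketch below builds a small bank of real-valued Gabor kernels at four orientations and computes their responses at one image position. All parameter values (`size`, `wavelength`, `sigma`, `gamma`, `psi`) and function names are assumptions and should not be read as the routine used to produce the databases.

```python
import math


def gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0, gamma=0.5, psi=0.0):
    """Return a size x size real-valued Gabor kernel as a list of rows.
    size is assumed to be odd so that the kernel has a centre pixel."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier along xr.
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(image, kernel, cx, cy):
    """Correlate the kernel with the image patch centred at (cx, cy).
    Positions outside the image contribute nothing."""
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            x, y = cx + kx - half, cy + ky - half
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                total += weight * image[y][x]
    return total


# Bank of four orientations: 0, 45, 90 and 135 degrees.
bank = [gabor_kernel(theta=k * math.pi / 4) for k in range(4)]

# Toy image with ink coded as 1.0 and background as 0.0: one vertical stroke.
image = [[1.0 if x == 4 else 0.0 for x in range(9)] for _ in range(9)]
features = [filter_response(image, kernel, 4, 4) for kernel in bank]
```

In this toy case the kernel whose carrier varies across the stroke (`theta = 0`) responds more strongly than the one rotated by 90 degrees, which is the kind of orientation selectivity that makes Gabor responses useful as handwriting features.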
For more details see the files 'content_500.txt' and 'content_50.txt', respectively, provided with the databases.
Note about ligatures
In some fonts, certain typical character combinations are represented by separate patterns (ligatures). The pixels of such a ligature pattern are mapped to the characters that make up the ligature by automatically dividing the pixel set along the x-coordinate. This can cause some inconsistencies in the pixel references.
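The effect can be illustrated with a minimal sketch. It assumes that the ligature is cut into equal-width vertical strips, one per character; the actual division rule used for the databases is not specified here, and the function name `split_ligature_pixels` is hypothetical.

```python
def split_ligature_pixels(pixels, chars):
    """Assign each (x, y) pixel of a ligature pattern to one of its
    characters, using equal-width vertical strips (an assumed rule)."""
    xs = [x for x, _ in pixels]
    x_min, x_max = min(xs), max(xs)
    width = x_max - x_min + 1
    assigned = [[] for _ in chars]
    for x, y in pixels:
        # Index of the strip this x-coordinate falls into.
        index = min((x - x_min) * len(chars) // width, len(chars) - 1)
        assigned[index].append((x, y))
    return list(zip(chars, assigned))


# Toy "fi" ligature: the hook of the "f" overhangs the "i".
f_stem = [(1, y) for y in range(5)]
f_hook = [(2, 0), (3, 0), (4, 0)]
i_stem = [(4, 2), (4, 3), (4, 4)]

result = split_ligature_pixels(f_stem + f_hook + i_stem, "fi")
pixels_of_i = result[1][1]
```

Here the hook pixels `(3, 0)` and `(4, 0)` belong to the "f" but end up in the pixel set of the "i", because only the x-coordinate decides the assignment. Overhanging strokes of this kind are one way the inconsistencies mentioned above can arise.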