1. Title of Database:
UJIpenchars: A Pen-Based Classification Task
for Isolated Handwritten Characters
2. Source:
D. Llorens, F. Prat, A. Marzal, J. M. Vilar
Departamento de Lenguajes y Sistemas Informáticos
Universitat Jaume I, 12071 Castellón (SPAIN)
fprat@lsi.uji.es
June 2007
3. Past Usage:
R. Ramos-Garijo, S. Martín, A. Marzal, F. Prat, J. M. Vilar, and D. Llorens:
"An Input Panel and Recognition Engine for On-Line Handwritten Text Recognition"
Artificial Intelligence Research and Development, pp. 223-232, IOS Press, 2007.
F. Prat, A. Marzal, S. Martín, and R. Ramos-Garijo:
"A Two-Stage Template-Based Recognition Engine for On-Line Handwritten Characters"
Proceeding of the Asia-Pacific Workshop 2007 on Visual Information Processing, pp. 77-82, 2007.
Best error rate: 10.85%
4. Relevant Information:
We create a character database by collecting samples from 11
writers. Each writer contributed with letters (lower and
uppercase), digits, and other characters (letters with Spanish
diacritics and punctuation marks) that we have not employed in
our experiments and are not included in this database version.
Two samples have been collected for each pair writer/character,
so the total number of samples in this database version is
1364:
11 writers x 2 repetitions x (2x26 letters + 10 digits)
The proposed task is a writer-independent one consisting of 11
leaving-one-writer-out tests, so the effective training set
size (for each one of the 1364 test samples) is 1240:
10 writers x 2 repetitions x (2x26 letters + 10 digits)
Moreover, this classification task is a 35-class one because
we have not considered a different class for each different
character: each one of the 26 letters is considered as a
case-independent class, there are 9 additional clases for
non-zero digits, and the zero is included in the same class as
oes.
This database is available in a UNIPEN-like format, trying to
mimic the original Pendigits database we found in
on May 29th, 2007, files:
pendigits-orig.names
pendigits-orig.tra.Z
pendigits-orig.tes.Z
The distibution of our database consists of 12 files:
This "uji.names"
One file "UJIpenchars-w" per writer, where = "01", "02"... "11"
The handwriting samples were collected on a Toshiba Portégé
M400 Tablet PC using its cordless stylus. Each one of the 11
writers completed 2 non-consecutive sessions. In each session,
the corresponding writer was asked to write one exemplar for
each character in a fixed set including lowercase letters,
uppercase ones, and digits, along with other characters omitted
from this database version. The acquisition program shows a set
of boxes on the screen, a different one for each required
character, and writers are told to write only inside those
boxes. If they make a mistake or are unhappy with a character
writing, they are instructed to clear the content of the
corresponding box by using an on-screen button and try
again. Subjects are monitored only when writing their first
exemplars and every sample considered OK by its writer was
accepted as such.
Only X and Y coordinate information was recorded along the
strokes by the acquisition program, without, for instance,
pressure level values or timing information. Thus, in
multi-stroke samples, no information at all was recorded
between strokes; however, in this database version we have
included a ".DT 100" line in sample files after each stroke,
following the Pendigits database criterion.
We have observed that runs of consecutive points with
identical coordinates were frequently acquired inside strokes;
such runs were preserved in this database version, so each
database user must decide whether to avoid them by an
appropriate preprocessing step or not.
5. Number of Instances
UJIpenchars-w01 Writer 1 124
UJIpenchars-w02 Writer 2 124
UJIpenchars-w03 Writer 3 124
UJIpenchars-w04 Writer 4 124
UJIpenchars-w05 Writer 5 124
UJIpenchars-w06 Writer 6 124
UJIpenchars-w07 Writer 7 124
UJIpenchars-w08 Writer 8 124
UJIpenchars-w09 Writer 9 124
UJIpenchars-w10 Writer 10 124
UJIpenchars-w11 Writer 11 124
----
1364
As said before, when using any sample for testing, only the
1240 samples from other writers (ie those found in files
different from the test sample one) are employed as training
set, giving rise to a writer-independent task.
6. Number of Attributes
For each sample, you can find:
a. The character it represents.
b. The class it belongs to.
c. The sequence of strokes it consists of.
When testing, you are only allowed to read the sequence of
strokes of a sample in order to predict its class.
7. For Each Attribute:
As said before, this database is available in a UNIPEN-like
format, trying to mimic the original Pendigits database.
A definition of UNIPEN format can be found in
Regarding the attributes of a sample, you can find them in the
file format as follows:
a. Character name: Each sample begins with a ".SEGMENT"
line. The last component of that line shows the character
name, one out of 62 possibilities. The complete set of
possibilities is shown in the first line of each file, a
".LEXICON" line. Those possibilities are repeated here:
"a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
"n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
"N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
"0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
b. Class name: The class name of a sample appears in the
".COMMENT" line that follows its ".SEGMENT" line. This
name is one out of 35 possibilities. In each file, the
complete set of possibilities is shown in ".COMMENT"
lines between the ".LEXICON" line and a ".HIERARCHY" one.
Those class definitions are repeated here:
[A] = { "a" , "A" }
[B] = { "b" , "B" }
[C] = { "c" , "C" }
[D] = { "d" , "D" }
[E] = { "e" , "E" }
[F] = { "f" , "F" }
[G] = { "g" , "G" }
[H] = { "h" , "H" }
[I] = { "i" , "I" }
[J] = { "j" , "J" }
[K] = { "k" , "K" }
[L] = { "l" , "L" }
[M] = { "m" , "M" }
[N] = { "n" , "N" }
[O] = { "o" , "O" , "0" }
[P] = { "p" , "P" }
[Q] = { "q" , "Q" }
[R] = { "r" , "R" }
[S] = { "s" , "S" }
[T] = { "t" , "T" }
[U] = { "u" , "U" }
[V] = { "v" , "V" }
[W] = { "w" , "W" }
[X] = { "x" , "X" }
[Y] = { "y" , "Y" }
[Z] = { "z" , "Z" }
[1] = { "1" }
[2] = { "2" }
[3] = { "3" }
[4] = { "4" }
[5] = { "5" }
[6] = { "6" }
[7] = { "7" }
[8] = { "8" }
[9] = { "9" }
c. Sequence of strokes: After the ".SEGMENT" and ".COMMENT"
lines of a sample, a sequence of one or more strokes
follows until the beginning of a new sample or the end of
the file. Each stroke begins whith a ".PEN_DOWN" line
and ends with a sequence ".PEN_UP", ".DT 100"; in
between, a sequence of lines, each one representing X and
Y coordinates of a point, where X grows left-to-right and
Y grows downwards. Coordinates are integer numbers.
8. Missing Attribute Values
None
9. Class Distribution
Number of samples in each file:
Class [A]: 4
Class [B]: 4
Class [C]: 4
Class [D]: 4
Class [E]: 4
Class [F]: 4
Class [G]: 4
Class [H]: 4
Class [I]: 4
Class [J]: 4
Class [K]: 4
Class [L]: 4
Class [M]: 4
Class [N]: 4
Class [O]: 6
Class [P]: 4
Class [Q]: 4
Class [R]: 4
Class [S]: 4
Class [T]: 4
Class [U]: 4
Class [V]: 4
Class [W]: 4
Class [X]: 4
Class [Y]: 4
Class [Z]: 4
Class [1]: 2
Class [2]: 2
Class [3]: 2
Class [4]: 2
Class [5]: 2
Class [6]: 2
Class [7]: 2
Class [8]: 2
Class [9]: 2
Total number of samples:
Class [A]: 44
Class [B]: 44
Class [C]: 44
Class [D]: 44
Class [E]: 44
Class [F]: 44
Class [G]: 44
Class [H]: 44
Class [I]: 44
Class [J]: 44
Class [K]: 44
Class [L]: 44
Class [M]: 44
Class [N]: 44
Class [O]: 66
Class [P]: 44
Class [Q]: 44
Class [R]: 44
Class [S]: 44
Class [T]: 44
Class [U]: 44
Class [V]: 44
Class [W]: 44
Class [X]: 44
Class [Y]: 44
Class [Z]: 44
Class [1]: 22
Class [2]: 22
Class [3]: 22
Class [4]: 22
Class [5]: 22
Class [6]: 22
Class [7]: 22
Class [8]: 22
Class [9]: 22