UJI Pen Characters

Donated on 5/31/2007

Data consists of written characters in a UNIPEN-like format

Dataset Characteristics

Multivariate, Sequential

Subject Area

Computer Science

Associated Tasks

Classification

Feature Type

Integer

# Instances

1364

# Features

Dataset Information

Additional Information

We created a character database by collecting samples from 11 writers. Each writer contributed letters (lowercase and uppercase), digits, and other characters (Spanish diacritics and punctuation marks). The other characters were not used in our experiments and are not included in this database version. Two samples were collected for each writer/character pair, so the total number of samples in this database version is:

11 writers x 2 repetitions x (2 x 26 letters + 10 digits) = 1364

The proposed task is a writer-independent one consisting of 11 leave-one-writer-out tests, so the effective training set size (for each one of the 1364 test samples) is:

10 writers x 2 repetitions x (2 x 26 letters + 10 digits) = 1240

Moreover, this is a 35-class classification task, because we have not considered a different class for each different character: each one of the 26 letters is a case-independent class, there are 9 additional classes for the non-zero digits, and the zero is included in the same class as the o's.

This database is available in a UNIPEN-like format that tries to mimic the original Pendigits database. Two versions of that database are available; see folder: http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

The distribution of our database consists of 12 files:
- uji.names
- One file "UJIpenchars-wNN" per writer, where NN = "01", "02"... "11"

The handwriting samples were collected on a Toshiba Portégé M400 Tablet PC using its cordless stylus. Each one of the 11 writers completed 2 non-consecutive sessions. In each session, the writer was asked to write one exemplar of each character in a fixed set including lowercase letters, uppercase letters, and digits, along with the other characters omitted from this database version. The acquisition program showed a set of boxes on the screen, a different one for each required character, and writers were told to write only inside those boxes. If they made a mistake or were unhappy with how a character was written, they were instructed to clear the content of the corresponding box by using an on-screen button and try again. Subjects were monitored only while writing their first exemplars, and every sample considered OK by its writer was accepted as such.

Only X and Y coordinate information was recorded along the strokes by the acquisition program, without, for instance, pressure level values or timing information. Thus, in multi-stroke samples, no information at all was recorded between strokes; however, in this database version we have included a ".DT 100" line in sample files after each stroke, following the Pendigits database criterion.

We have observed that runs of consecutive points with identical coordinates were frequently acquired inside strokes. Such runs were preserved in this database version, so each database user must decide whether or not to remove them in a preprocessing step.
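The evaluation protocol and the optional preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the creators' code: the function names (`writer_filenames`, `leave_one_writer_out`, `drop_repeated_points`) and the `samples_by_writer` dictionary are hypothetical, and the sketch assumes the samples of each writer have already been loaded into a list.

```python
from typing import Dict, Iterator, List, Tuple

Point = Tuple[int, int]

# Writer identifiers "01".."11", as used in the file names of the distribution.
WRITERS = [f"{i:02d}" for i in range(1, 12)]


def writer_filenames() -> List[str]:
    """Return the 11 per-writer file names ("UJIpenchars-w01", ...)."""
    return [f"UJIpenchars-w{w}" for w in WRITERS]


def leave_one_writer_out(
    samples_by_writer: Dict[str, list],
) -> Iterator[Tuple[str, list, list]]:
    """Yield (held_out_writer, train, test), one fold per writer.

    samples_by_writer is a hypothetical mapping from writer id to the
    list of that writer's samples.
    """
    for held_out in sorted(samples_by_writer):
        test = list(samples_by_writer[held_out])
        train = [
            sample
            for writer, samples in sorted(samples_by_writer.items())
            if writer != held_out
            for sample in samples
        ]
        yield held_out, train, test


def drop_repeated_points(stroke: List[Point]) -> List[Point]:
    """Collapse runs of consecutive identical points into a single point.

    This is one possible preprocessing choice; the database itself keeps
    such runs.
    """
    cleaned: List[Point] = []
    for point in stroke:
        if not cleaned or point != cleaned[-1]:
            cleaned.append(point)
    return cleaned
```

With the full database loaded, each of the 11 folds would hold 124 test samples (2 x 62) and 1240 training samples (10 x 2 x 62), matching the counts given above.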

Has Missing Values?

Variable Information

For each sample, you can find:
a. The character it represents.
b. The class it belongs to.
c. The sequence of strokes it consists of.

When testing, you are only allowed to read the sequence of strokes of a sample in order to predict its class.

For Each Attribute:

As said before, this database is available in a UNIPEN-like format that tries to mimic the original Pendigits database. A definition of the UNIPEN format can be found in ftp://ftp.cis.upenn.edu/pub/UNIPEN-pub/definition/unipen.def

The attributes of a sample appear in the file format as follows:

a. Character name: Each sample begins with a ".SEGMENT" line. The last component of that line shows the character name, one out of 62 possibilities. The complete set of possibilities is shown in the first line of each file, a ".LEXICON" line. Those possibilities are repeated here:

"a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
"0" "1" "2" "3" "4" "5" "6" "7" "8" "9"

b. Class name: The class name of a sample appears in the ".COMMENT" line that follows its ".SEGMENT" line. This name is one out of 35 possibilities. In each file, the complete set of possibilities is shown in ".COMMENT" lines between the ".LEXICON" line and a ".HIERARCHY" one. Those class definitions are repeated here:

[A] = { "a" , "A" }
[B] = { "b" , "B" }
[C] = { "c" , "C" }
[D] = { "d" , "D" }
[E] = { "e" , "E" }
[F] = { "f" , "F" }
[G] = { "g" , "G" }
[H] = { "h" , "H" }
[I] = { "i" , "I" }
[J] = { "j" , "J" }
[K] = { "k" , "K" }
[L] = { "l" , "L" }
[M] = { "m" , "M" }
[N] = { "n" , "N" }
[O] = { "o" , "O" , "0" }
[P] = { "p" , "P" }
[Q] = { "q" , "Q" }
[R] = { "r" , "R" }
[S] = { "s" , "S" }
[T] = { "t" , "T" }
[U] = { "u" , "U" }
[V] = { "v" , "V" }
[W] = { "w" , "W" }
[X] = { "x" , "X" }
[Y] = { "y" , "Y" }
[Z] = { "z" , "Z" }
[1] = { "1" }
[2] = { "2" }
[3] = { "3" }
[4] = { "4" }
[5] = { "5" }
[6] = { "6" }
[7] = { "7" }
[8] = { "8" }
[9] = { "9" }

c. Sequence of strokes: After the ".SEGMENT" and ".COMMENT" lines of a sample, a sequence of one or more strokes follows until the beginning of a new sample or the end of the file. Each stroke begins with a ".PEN_DOWN" line and ends with the two-line sequence ".PEN_UP", ".DT 100". In between comes a sequence of lines, each one giving the X and Y coordinates of a point, where X grows left-to-right and Y grows downwards. Coordinates are integers.
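The file layout described above can be read with a small line-oriented parser. The following Python sketch is an illustration under stated assumptions, not an official reader: `parse_samples`, `class_of`, and the `EXAMPLE` excerpt are hypothetical, the exact fields of the ".SEGMENT" line other than its last component are not specified here, and the class name is assumed to appear in square brackets on the ".COMMENT" line (otherwise the rest of that line is used as the label).

```python
import re
from typing import Dict, Iterable, List

def class_of(character: str) -> str:
    """Map one of the 62 character names to one of the 35 class names."""
    if character == "0":
        return "O"  # the zero shares a class with "o" and "O"
    return character.upper() if character.isalpha() else character

def parse_samples(lines: Iterable[str]) -> List[Dict]:
    """Parse UNIPEN-like lines into dicts with character, label, strokes."""
    samples: List[Dict] = []
    current = None  # sample being filled in
    stroke = None   # stroke being filled in, or None between strokes
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(".SEGMENT"):
            # The last component of the line is the (quoted) character name.
            name = line.split()[-1].strip('"')
            current = {"character": name, "label": None, "strokes": []}
            samples.append(current)
            stroke = None
        elif current is None:
            continue  # file header (.LEXICON, class .COMMENT lines, .HIERARCHY)
        elif line.startswith(".COMMENT"):
            if current["label"] is None:
                # Assumption: the class name is written in square brackets.
                match = re.search(r"\[(.+?)\]", line)
                rest = line[len(".COMMENT"):].strip()
                current["label"] = match.group(1) if match else rest
        elif line.startswith(".PEN_DOWN"):
            stroke = []
            current["strokes"].append(stroke)
        elif line.startswith(".PEN_UP"):
            stroke = None
        elif line.startswith("."):
            continue  # other keywords such as ".DT 100"
        elif stroke is not None:
            x, y = line.split()[:2]
            stroke.append((int(x), int(y)))
    return samples

# Hypothetical excerpt written for this example; not copied from the database.
EXAMPLE = """.LEXICON "a" "b"
.COMMENT Class [A] = { "a" , "A" }
.HIERARCHY CHARACTER
.SEGMENT CHARACTER ? ? "a"
.COMMENT Class [A]
.PEN_DOWN
10 20
11 21
.PEN_UP
.DT 100
.PEN_DOWN
5 5
.PEN_UP
.DT 100
"""

samples = parse_samples(EXAMPLE.splitlines())
```

Because only the strokes may be read at test time, `class_of` is useful mainly as a consistency check on the labels parsed from the training files, not as an input to a classifier.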

Dataset Files

File	Size
UJIpenchars-w08	114.8 KB
UJIpenchars-w11	90.1 KB
UJIpenchars-w07	86.6 KB
UJIpenchars-w03	84 KB
UJIpenchars-w09	75.2 KB

Only 5 of the 12 files are listed above. The full download is 282.3 KB.

Creators

D. Llorens

F. Prat

A. Marzal

J. Vilar

DOI

10.24432/C5731G

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.