Center for Machine Learning and Intelligent Systems

Online Handwritten Assamese Characters Dataset Data Set
Download: Data Folder, Data Set Description

Abstract: This is a dataset of 8235 online handwritten Assamese characters. “Online” means that the data is captured as the text is written on a digitizing tablet with an electronic pen.

Data Set Characteristics: Multivariate, Sequential
Number of Instances: 8235
Area: Computer
Attribute Characteristics: Integer
Number of Attributes: N/A
Date Donated: 2011-04-01
Associated Tasks: Classification
Missing Values? N/A
Number of Web Hits: 32491


Source:

Creators:
Udayan Baruah¹ ² and Shyamanta M Hazarika¹

1. Department of Computer Science and Engineering
Tezpur University
Assam, India, 784028.
udayanbaruah '@' yahoo.co.in
shyamanta '@' ieee.org

2. Department of Information Technology
Sikkim Manipal Institute of Technology
Sikkim, India, 737136.
udayanbaruah '@' yahoo.co.in

Donor:

Udayan Baruah
Department of Information Technology
Sikkim Manipal Institute of Technology
Sikkim, India, 737136.
udayanbaruah '@' yahoo.co.in


Data Set Information:

The dataset was created by collecting online handwritten Assamese character samples from 45 writers. Each writer contributed 52 basic characters, 10 numerals and 121 Assamese conjunct consonants, for a total of 183 entries per writer (= 52 characters + 10 numerals + 121 conjunct consonants). The total number of samples in the dataset is 8235 (= 45 × 183).

The handwriting samples were collected on an iball 8060U external digitizing tablet connected to a laptop, using the tablet's cordless digital stylus pen. The data acquisition program consists of a GUI that shows a writing box on the screen along with other controls. The writers were instructed to write only inside the acquisition box. The program records the handwriting as a stream of (X, Y) coordinate points from the pen position sensor, along with the pen-up/pen-down switching. No pressure level was recorded.

The distribution of the dataset consists of 45 folders (one for each writer) and a “Data_Table.pdf” file. This file lists the character ID (ID), the character name (Label) and the actual shape of the character (Char).

Each folder contains 183 text files, one for each of the 183 characters written by a single writer. Each file is named after the pair (M, N): the text file “M.N.txt” holds the character with ID “M” written by the writer with ID “N”. For instance, the file “132.10.txt” holds the character with ID “132” written by the writer with ID “10”.
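The naming scheme can be sketched in Python as follows. This is a minimal illustration, not part of the dataset distribution: the function names are hypothetical, and the ID ranges used in category_of are an assumption inferred from the order of the label list under Attribute Information (52 basic characters first, then 121 conjunct consonants, then 10 numerals), not something the description states explicitly. Folder names are not specified in the description, so the sketch deals with file names only.

```python
import re
from typing import List, Tuple

NUM_WRITERS = 45       # stated in the description
NUM_CHARACTERS = 183   # 52 basic + 10 numerals + 121 conjuncts

# "M.N.txt": M is the character ID, N is the writer ID.
_FILENAME = re.compile(r"^(\d+)\.(\d+)\.txt$")


def parse_sample_filename(filename: str) -> Tuple[int, int]:
    """Return (character_id, writer_id) for a name such as '132.10.txt'."""
    match = _FILENAME.match(filename)
    if match is None:
        raise ValueError(f"not an 'M.N.txt' file name: {filename!r}")
    char_id, writer_id = int(match.group(1)), int(match.group(2))
    if not 1 <= char_id <= NUM_CHARACTERS:
        raise ValueError(f"character ID out of range: {char_id}")
    return char_id, writer_id


def expected_filenames(writer_id: int) -> List[str]:
    """List the 183 file names expected for one writer."""
    return [f"{char_id}.{writer_id}.txt"
            for char_id in range(1, NUM_CHARACTERS + 1)]


def category_of(char_id: int) -> str:
    """Map a character ID to a coarse category.

    ASSUMPTION: the ranges below are inferred from the order of the
    label list, not documented by the dataset authors.
    """
    if 1 <= char_id <= 52:
        return "basic"
    if 53 <= char_id <= 173:
        return "conjunct"
    if 174 <= char_id <= 183:
        return "numeral"
    raise ValueError(f"character ID out of range: {char_id}")


if __name__ == "__main__":
    print(parse_sample_filename("132.10.txt"))
    print(NUM_WRITERS * NUM_CHARACTERS)
```

Comparing the output of expected_filenames against the actual contents of a writer's folder is a quick way to check that a download is complete.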


Attribute Information:

1. Character Name: The first line of each sample is “CHARACTER_NAME: Character”, where “Character” is the name of one of the 183 characters listed below.

Here “ID [i]” denotes the name of the character with ID “i”.

ID [1] = “A”
ID [2] = “AA”
ID [3] = “E”
ID [4] = “EE”
ID [5] = “U”
ID [6] = “UU”
ID [7] = “REE”
ID [8] = “AE”
ID [9] = “OI”
ID [10] = “O”
ID [11] = “OU”
ID [12] = “KA”
ID [13] = “KHA”
ID [14] = “GA”
ID [15] = “GHA”
ID [16] = “NG”
ID [17] = “CA”
ID [18] = “CCA”
ID [19] = “JA”
ID [20] = “JHA”
ID [21] = “NIYA”
ID [22] = “MTA”
ID [23] = “MTHA”
ID [24] = “MDA”
ID [25] = “MDHA”
ID [26] = “MNA”
ID [27] = “TA”
ID [28] = “THA”
ID [29] = “DA”
ID [30] = “DHA”
ID [31] = “NA”
ID [32] = “PA”
ID [33] = “PHA”
ID [34] = “BA”
ID [35] = “BHA”
ID [36] = “MA”
ID [37] = “AJA”
ID [38] = “RA”
ID [39] = “LA”
ID [40] = “WA”
ID [41] = “TXA”
ID [42] = “MXA”
ID [43] = “DXA”
ID [44] = “HA”
ID [45] = “KHYA”
ID [46] = “AYA”
ID [47] = “DRA”
ID [48] = “DHRA”
ID [49] = “KTA”
ID [50] = “ANSR”
ID [51] = “BXG”
ID [52] = “CBN”
ID [53] = “KK”
ID [54] = “KT”
ID [55] = “KTT”
ID [56] = “KS”
ID [57] = “KL”
ID [58] = “KM”
ID [59] = “GL”
ID [60] = “CC”
ID [61] = “CCC”
ID [62] = “JJ”
ID [63] = “JB”
ID [64] = “BJ”
ID [65] = “GN”
ID [66] = “TN”
ID [67] = “JJB”
ID [68] = “LG”
ID [69] = “TT”
ID [70] = “GDH”
ID [71] = “GM”
ID [72] = “GHN”
ID [73] = “MDD”
ID [74] = “NT”
ID [75] = “NN”
ID [76] = “NMM”
ID [77] = “TTT”
ID [78] = “TTB”
ID [79] = “TM”
ID [80] = “TR”
ID [81] = “NTT”
ID [82] = “RRG”
ID [83] = “NDD”
ID [84] = “NTH”
ID [85] = “NDH”
ID [86] = “NNN”
ID [87] = “NB”
ID [88] = “NS”
ID [89] = “NM”
ID [90] = “DB”
ID [91] = “QJ”
ID [92] = “PTT”
ID [93] = “PL”
ID [94] = “DV”
ID [95] = “BL”
ID [96] = “BD”
ID [97] = “TB”
ID [98] = “MM”
ID [99] = “MV”
ID [100] = “MP”
ID [101] = “MN”
ID [102] = “NTR”
ID [103] = “MB”
ID [104] = “LK”
ID [105] = “MND”
ID [106] = “FK”
ID [107] = “LD”
ID [108] = “LL”
ID [109] = “LP”
ID [110] = “LT”
ID [111] = “SN”
ID [112] = “SC”
ID [113] = “SM”
ID [114] = “SB”
ID [115] = “FN”
ID [116] = “FT”
ID [117] = “SK”
ID [118] = “SSTH”
ID [119] = “SSM”
ID [120] = “SSN”
ID [121] = “SSB”
ID [122] = “ST”
ID [123] = “SP”
ID [124] = “SPH”
ID [125] = “STH”
ID [126] = “SKH”
ID [127] = “NGG”
ID [128] = “NGC”
ID [129] = “FP”
ID [130] = “NGN”
ID [131] = “XM”
ID [132] = “NGJ”
ID [133] = “MNTH”
ID [134] = “NGK”
ID [135] = “KR”
ID [136] = “TRU”
ID [137] = “BHR”
ID [138] = “THB”
ID [139] = “DG”
ID [140] = “DGH”
ID [141] = “DD”
ID [142] = “DDH”
ID [143] = “HR”
ID [144] = “GGU”
ID [145] = “GGN”
ID [146] = “NKH”
ID [147] = “NGH”
ID [148] = “NGKH”
ID [149] = “TTH”
ID [150] = “PN”
ID [151] = “HN”
ID [152] = “XN”
ID [153] = “MF”
ID [154] = “BB”
ID [155] = “LB”
ID [156] = “LM”
ID [157] = “BHM”
ID [158] = “ML”
ID [159] = “SL”
ID [160] = “PS”
ID [161] = “KHR”
ID [162] = “GR”
ID [163] = “GHR”
ID [164] = “JR”
ID [165] = “TRR”
ID [166] = “DRR”
ID [167] = “DHRR”
ID [168] = “PRR”
ID [169] = “BRR”
ID [170] = “MRR”
ID [171] = “TSR”
ID [172] = “DSR”
ID [173] = “HRR”
ID [174] = “SUNYA”
ID [175] = “EK”
ID [176] = “DUI”
ID [177] = “TINI”
ID [178] = “CARI”
ID [179] = “PAC”
ID [180] = “CAY”
ID [181] = “XAT”
ID [182] = “ATH”
ID [183] = “NAA”

2. The total number of strokes in the sample: The total number of strokes used to write a character is represented by the line “STROKE_COUNT: Number”, where “Number” is an integer value.

3. Sequence of Strokes: Each stroke begins with a “PEN_DOWN” entry and ends with a “PEN_UP” entry, so two consecutive strokes are separated by a “PEN_UP” entry followed by a “PEN_DOWN” entry. The end of a sample is marked by the final “PEN_UP” entry followed by the line “END_CHARACTER: Character”. Each stroke consists of a sequence of points, with the X and Y coordinates given in the first and second columns respectively. The third and fourth columns hold the “STYLUS_STATE” and “STROKE” values for each point. “STYLUS_STATE” is 1 for each recorded (X, Y) point, 0 for a “PEN_UP” entry, and left blank for a “PEN_DOWN” entry. “STROKE” is the serial number of the stroke within the sample. X increases from left to right and Y increases downwards. Coordinates are integers ranging from 0 to 4392 for X and from 0 to 4868 for Y.
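The three attributes above suggest a simple line-oriented reader. The Python sketch below is one possible way to load a sample, under stated assumptions: the exact whitespace layout of the files is not given in this description, so the parser treats any line containing “PEN_DOWN” as the start of a stroke, any line containing “PEN_UP” as its end, and any line whose first two whitespace-separated tokens are integers as an (X, Y) point. The names Sample, parse_sample and normalize are hypothetical, and the parser should be checked against a real file before use.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]

# Coordinate ranges stated in the description.
X_MAX, Y_MAX = 4392, 4868


@dataclass
class Sample:
    name: Optional[str] = None
    declared_stroke_count: Optional[int] = None
    strokes: List[List[Point]] = field(default_factory=list)


def parse_sample(text: str) -> Sample:
    """Parse one sample file's text (layout details are assumed)."""
    sample = Sample()
    current: Optional[List[Point]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("CHARACTER_NAME:"):
            sample.name = line.split(":", 1)[1].strip()
        elif line.startswith("STROKE_COUNT:"):
            sample.declared_stroke_count = int(line.split(":", 1)[1])
        elif line.startswith("END_CHARACTER"):
            break
        elif "PEN_DOWN" in line:
            current = []                    # a new stroke starts
            sample.strokes.append(current)
        elif "PEN_UP" in line:
            current = None                  # the stroke is finished
        else:
            tokens = line.split()
            is_point = (len(tokens) >= 2
                        and tokens[0].isdigit() and tokens[1].isdigit())
            if current is not None and is_point:
                current.append((int(tokens[0]), int(tokens[1])))
    return sample


def normalize(stroke: List[Point], flip_y: bool = True):
    """Scale points to [0, 1]; optionally flip Y so that it grows upwards."""
    return [(x / X_MAX, (1 - y / Y_MAX) if flip_y else y / Y_MAX)
            for x, y in stroke]


if __name__ == "__main__":
    # Synthetic text in the assumed layout, not real dataset content.
    demo = "\n".join([
        "CHARACTER_NAME: A", "STROKE_COUNT: 1",
        "PEN_DOWN", "10 20 1 1", "11 22 1 1", "PEN_UP 0 1",
        "END_CHARACTER: A",
    ])
    print(parse_sample(demo))
```

Comparing len(sample.strokes) with sample.declared_stroke_count gives a cheap consistency check on each file. The Y flip in normalize is there because Y grows downwards in the recordings, whereas most plotting libraries draw Y upwards by default.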



Citation Request:

Please refer to the Machine Learning Repository's citation policy.

