Connectionist Bench (Nettalk Corpus)

Donated on 10/10/1954

The file "nettalk.data" contains a list of 20,008 English words, along with a phonetic transcription for each word. The task is to train a network to produce the proper phonemes, given a string of letters as input.

Dataset Characteristics

Multivariate

Subject Area

Other

Associated Tasks

Feature Type

Categorical

# Instances

20008

# Features

Dataset Information

Additional Information

This is an updated and corrected version of the data set used by Sejnowski and Rosenberg in their influential study of speech generation using a neural network [1]. The file "nettalk.data" contains a list of 20,008 English words, along with a phonetic transcription for each word. The task is to train a network to produce the proper phonemes, given a string of letters as input. This is an example of an input/output mapping task that exhibits strong global regularities, but also a large number of more specialized rules and exceptional cases. Please see the original readme file for more information.
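One way to frame the letters-to-phonemes mapping as supervised examples is sketched below. It assumes that the letter string and the phoneme string of a word have the same length, so that each letter has exactly one target symbol, and that the input for each letter is a fixed-width window of surrounding letters. The function name `windowed_examples`, the default window width, the padding character, and the sample strings are all illustrative choices, not part of the dataset documentation.

```python
from typing import List, Tuple


def windowed_examples(
    letters: str, phonemes: str, width: int = 7, pad: str = "_"
) -> List[Tuple[str, str]]:
    """Turn one aligned word into (letter window, target phoneme) pairs.

    Assumptions (not taken from the dataset page): the window is centred
    on the letter being pronounced, `width` is odd, and `pad` fills the
    positions that fall outside the word.
    """
    if len(letters) != len(phonemes):
        raise ValueError("letters and phonemes must be aligned one-to-one")
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")

    half = width // 2
    padded = pad * half + letters + pad * half
    # The window starting at index i of the padded string is centred on letters[i].
    return [(padded[i:i + width], phonemes[i]) for i in range(len(letters))]


# Illustrative strings only; they are placeholders, not rows copied from the file.
for window, target in windowed_examples("cat", "k@t", width=3):
    print(window, "->", target)
```

Each pair can then be encoded (for example one-hot per letter position) and fed to whatever classifier is being studied; the window width is a free parameter to tune.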

Has Missing Values?

Variables Table

Variable Name	Role	Type	Description	Units	Missing Values
					no
					no
					no
					no

Additional Variable Information

The pronouncing dictionary was created to study the translation process between written English, using graphemes or letters as units, and spoken English, using phonemes as units. The dictionary includes 20,008 aligned letter and phonetic representations with stresses. It contains four tab-separated fields of information for each word: 1) a letter representation; 2) a phonemic representation; 3) stress and syllabic structure; 4) an integer indicating foreign and irregular words. Please see the original readme file for more information.
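A minimal sketch of reading the four tab-separated fields might look like the following. The names `Entry` and `parse_entries` are hypothetical, the sample line is illustrative rather than copied from the file, and the choice to skip any line that does not have exactly four fields with an integer in the last one is an assumption about how to handle header or comment lines; the original readme remains the authority on the exact layout.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    letters: str    # field 1: letter representation
    phonemes: str   # field 2: phonemic representation
    stress: str     # field 3: stress and syllabic structure
    flag: int       # field 4: integer marking foreign and irregular words


def parse_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed line.

    Assumption: lines that do not split into exactly four tab-separated
    fields, or whose last field is not an integer, are not data rows and
    are skipped silently.
    """
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 4:
            continue
        letters, phonemes, stress, flag = parts
        if not flag.strip().isdigit():
            continue
        yield Entry(letters, phonemes, stress, int(flag))


# Illustrative input; with the real file one would pass an open file handle,
# e.g. `with open("nettalk.data", encoding="ascii") as fh: ...` (encoding assumed).
sample = ["some header text\n", "aardvark\ta-rdvark\t1<<<>2<<\t0\n"]
for entry in parse_entries(sample):
    print(entry.letters, entry.phonemes, entry.stress, entry.flag)
```

Because the representations are described as aligned, a quick sanity check after parsing is to compare `len(entry.letters)`, `len(entry.phonemes)`, and `len(entry.stress)` for each row.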

Dataset Files

File	Size
nettalk.data	528.6 KB
nettalk.names	13.4 KB
Index	114 Bytes

Creators

Terry Sejnowski

Charles Rosenberg

DOI

10.24432/C5VP6T

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.