Connectionist Bench (Nettalk Corpus) Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.
networks for this purpose and found that they have a much slower learning rate than MLR. For example, MLR only took 2.9 seconds as compared to 4790 seconds for the neural network in the nettalk dataset, while both have the same error rate. Other possible candidates are the multinomial logit model (Jordan & Jacobs, 1994), which is a special case of generalized linear models (McCullagh & Nelder,
Steven Salzberg. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Min. Knowl. Discov., 1. 1997.
major new results using well-studied and widely shared data. For example, Fisher's iris data has been around for 60 years and has been used in hundreds (maybe thousands) of studies. The NetTalk dataset of English pronunciation data (introduced by Sejnowski and Rosenberg) has been used in numerous experiments, as has the protein secondary structure data (introduced by Qian and Sejnowski),
Kai Ming Ting and Boon Toh Low. Model Combination in the Multiple-Data-Batches Scenario. ECML. 1997.
settings are the same as those used in IB1 3 in all experiments. No parameter settings are required for NB*. Our studies employ two artificial domains (i.e., waveform and LED24) and four real-world datasets (i.e., euthyroid, nettalk (stress), splice junction and protein coding) obtained from the UCI repository of machine learning databases (Merz & Murphy, 1996). The two noisy artificial domains are
Thomas G. Dietterich and Ghulum Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. CoRR, cs.AI/9501101. 1995.
employed in the study. The glass, vowel, soybean, audiologyS, ISOLET, letter, and NETtalk data sets are available from the Irvine Repository of machine learning databases (Murphy & Aha, 1994). The POS (part of speech) data set was provided by C. Cardie (personal communication); an earlier
Dietrich Wettschereck and David W. Aha. Weighting Features. ICCBR. 1995.
contains some redundant features, while the NETtalk dataset has no irrelevant or redundant features (Wettschereck, 1994). Each dataset was randomly partitioned 25 times into disjoint training and test sets. Table 4 lists the algorithms' average test set
Włodzisław Duch and Jerzy J. Korczak. Optimization and global minimization methods suitable for neural networks. Department of Computer Methods, Nicholas Copernicus University.
by NOVEL and SIMANN. Genetic algorithms achieved the worst results, below 60% in all cases, being unable to find good solutions. NOVEL has also been tried on Sonar, Vowel, 10-parity and NetTalk datasets from the UCI repository, using different numbers of hidden units, achieving very good results on the test sets, and falling behind TN-MS only in one case. From these few comparisons scattered
Rayid Ghani. KDD Project Report Using Error-Correcting Codes for Efficient Text Classification with a Large Number of Categories. Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University.
After removing tokens that occur only once, the corpus contains 1.2 million words with a vocabulary size of 29964. 4.1.2 HOOVERS DATASET This corpus of company web pages was assembled using the Hoovers Online Web resource (www.hoovers.com) by obtaining a list of the names and home-page URLs for 4285 companies on the web and using a
Kai Ming Ting and Boon Toh Low. Theory Combination: an alternative to Data Combination. University of Waikato.
on or off, plus seventeen irrelevant binary attributes. Each attribute value is inverted with a probability of 0.1. The task is to classify the input as one of the ten digits. The four real-world datasets are the euthyroid, nettalk (stress), splice junction and protein coding. The selection criteria are that the datasets must have a large number of instances and each class must be supported by large
Sherrie L. W and Zijian Zheng. A BENCHMARK FOR CLASSIFIER LEARNING. Basser Department of Computer Science The University of Sydney.
e.g. Promoter (106) and Lymphography (148) • Medium (between 210 and 3170), e.g. Diabetes (768) and Thyroid (3163) • Large (more than 3170), e.g. NetTalk (Phoneme) (5438) and Mushroom (8124) 8. Dataset density (3 values): Usually a classifier learning algorithm can learn a more accurate theory from a larger number of training examples than from fewer examples. However, because different domains
Steve Whittaker and Loren G. Terveen and Bonnie A. Nardi. Let's stop pushing the envelope and start addressing it: a reference task agenda for HCI. a Senior Research Scientist in the Human Computer Interaction Department of AT&T Labs-Research.
(Marcus, 1992, Price, 1991, Stern, 1990, Wayne, 1989). A dataset consists of a publicly available corpus of spoken sentences, divided into training and test sentences. The initial task was to recognise the individual sentences in the corpus. There was no
Rong Jin and Yan Liu and Luo Si and Jaime Carbonell and Alexander G. Hauptmann. A New Boosting Algorithm Using Input-Dependent Regularizer. School of Computer Science, Carnegie Mellon University.
from the UCI repository (Blake & Merz, 1998) and a benchmark of text categorization evaluation -- the ApteMod version of the Reuters-21578 corpus are used as testbeds. All of the UCI data sets are binary classification problems and the detailed information is listed in Table 1. The Reuters-21578 corpus consists of a training set of 7,769 documents and a test set of 3,019 documents with 90