Center for Machine Learning and Intelligent Systems


Molecular Biology (Protein Secondary Structure) Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with

Return to Molecular Biology (Protein Secondary Structure) data set page.

Jianbin Tan and David L. Dowe. MML Inference of Decision Graphs with Multi-way Joins and Dynamic Attributes. Australian Conference on Artificial Intelligence. 2003.

decision graphs using MML [16, 19, 17]. The machine-learning technique of decision graphs was successfully applied to the inference of a theory of protein secondary structure from a particular dataset by Dowe et al. [4] (see Section 4.4). The resulting decision graphs provided both an explanation and a prediction method for the problem. However, the Oliver-Wallace coding scheme [12, 11] only

Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. PAKDD. 2002.

contains a total of 3,370 sequences, which is equal to the number of coding and non-coding regions in the e-coli genome. The Protein Structure dataset is made of amino acid sequences and addresses the problem of assigning a secondary structure to a protein sequence. The dataset was created by processing the SWISS-PROT [BA99] database to obtain

Steven Eschrich and Nitesh V. Chawla and Lawrence O. Hall. Generalization Methods in Bioinformatics. BIOKDD. 2002.

We also investigate the ability of over-generalization in each classifier of an ensemble to more accurately predict the non-homologous structures seen within the protein secondary structure prediction dataset. Several key decisions must be made with regard to the ensemble of subsamples algorithm. Random subsampling of the dataset can be done with or without replacement. Subsampling with replacement is

Andreas L. Prodromidis. On the Management of Distributed Learning Agents. Ph.D. Thesis Proposal CUCS-032-97. Department of Computer Science, Columbia University. 1998.

of real credit card transactions and two molecular biology sequence analysis data sets, were used in our experiments. The credit card data sets were provided by the Chase and First Union Banks, members of FSTC (Financial Services Technology Consortium) and the molecular biology

Kamal Ali and Michael J. Pazzani. Error Reduction through Learning Multiple Descriptions. Machine Learning, 24. 1996.

(representing an increase in accuracy from 93.3% to 98.9%!) and by large (around 3 or 4) factors for LED and Tic-tac-toe. The molecular biology data sets also experienced significant reduction with the error being halved (for DNA this represented an increase in accuracy from 67.9% to 86.8%!). The error reduction is least for the noisy KRK and LED

Kuan-ming Lin and Chih-Jen Lin. A Study on Reduced Support Vector Machines. Department of Computer Science and Information Engineering, National Taiwan University.

for handwritten digit recognition. The problem ijcnn1 is from the first problem of IJCNN challenge 2001 [21]. Note that we use the winner's transformation of raw data [2]. Problem protein, is a data set for protein secondary structure prediction [26]. Finally, the problem adult, from the UCI "adult" data set [1] and compiled by Platt [20], is also included. For the adult dataset, there are several

