Center for Machine Learning and Intelligent Systems

Protein Data Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.



Qingping Tao. Making Efficient Learning Algorithms with Exponentially Many Features. Ph.D. dissertation, The Graduate College, University of Nebraska. 2004.

all sequences to 8-dimensional profiles based on the numeric properties of Kim et al. [24] and used them as inputs to the multiple-instance learning algorithm. GMIL-1 is not suitable for the protein data set. As an example of why this is a problem, consider the case where 5 clusters are used to build the grid on the 9-dimensional space (8-dimensional profiles and an additional dimension for identifying the


Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in a Large Sparse Graph. SDM. 2004.

"5". Finally, because some of the vertices in the resulting graph had a very high degree (i.e., authorities and hubs), we kept only the vertices whose degree was less than or equal to 15. The Contact Map dataset is made of 170 proteins from the Protein Data Bank [5] with pairwise sequence identity lower than 25%. The vertices in these graphs correspond to the different amino acids and the edges connect two


Mikhail Bilenko and Sugato Basu and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. ICML. 2004.

from the UCI repository: Iris, Wine, and Ionosphere (Blake & Merz, 1998); the Protein dataset used by Xing et al. (2003) and Bar-Hillel et al. (2003), and randomly sampled subsets from the Digits and Letters handwritten character recognition datasets, also from the UCI repository. For Digits


Qingping Tao and Stephen Scott and N. V. Vinodchandran and Thomas T. Osugi. SVM-based generalized multiple-instance learning via approximate box counting. ICML. 2004.

results of our new kernel on applications such as content-based image retrieval, prediction of drug affinity to bind to multiple sites simultaneously, protein sequence identification, and the Musk data sets. Finally, we conclude in Section 7. 2. Notation and Definitions Let $X$ denote $\{0, \ldots, s\}^d$ (though our results trivially generalize to $X = \prod_{i=1}^{d} \{0, \ldots, s_i\}$). Let $B_X$ denote the set of


Jianbin Tan and David L. Dowe. MML Inference of Decision Graphs with Multi-way Joins and Dynamic Attributes. Australian Conference on Artificial Intelligence. 2003.

decision graphs using MML [16, 19, 17]. The machine-learning technique of decision graphs was successfully applied to the inference of a theory of protein secondary structure from a particular dataset by Dowe et al. [4] (see Section 4.4). The resulting decision graphs provided both an explanation and a prediction method for the problem. However, the Oliver-Wallace coding scheme [12, 11] only


Aik Choon Tan and David Gilbert. An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics. APBC. 2003.

The task of the classifier is to predict whether a DNA sequence from E.coli is either a promoter or not (Towell et al., 1990). The input data is a 57-nucleotide sequence (A, C, T or G). HIV data set -- The data set contains 362 octamer protein sequences each of which needs to be classified as an HIV protease cleavable site or uncleavable site (Cai and Chou, 1998). Data set E.coli Yeast


Michael L. Raymer and Travis E. Doom and Leslie A. Kuhn and William F. Punch. Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33. 2003.

computer-based archival file for macromolecular structures," J. Mol. Biol., vol. 112, pp. 535--542, 1977. [46] U. Hobohm, M. Scharf, R. Schneider, and C. Sander, "Selection of representative Protein data sets," Protein Sci., vol. 1, pp. 409--417, 1992. [47] L. A. Kuhn, C. A. Swanson, M. E. Pique, J. A. Tainer, and E. D. Getzoff, "Atomic and residue hydrophilicity in the context of folded protein


Steven Eschrich and Nitesh V. Chawla and Lawrence O. Hall. Generalization Methods in Bioinformatics. BIOKDD. 2002.

We also investigate the ability of over-generalization in each classifier of an ensemble to more accurately predict the non-homologous structures seen within the protein secondary structure prediction dataset. Several key decisions must be made with regard to the ensemble of subsamples algorithm. Random subsampling of the dataset can be done with or without replacement. Subsampling with replacement is


Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. PAKDD. 2002.

contains a total of 3,370 sequences, which is equal to the number of coding and non-coding regions in the e-coli genome. The Protein Structure dataset is made of amino acid sequences and addresses the problem of assigning a secondary structure to a protein sequence. The dataset was created by processing the SWISS-PROT [BA99] database to obtain


Andreas L. Prodromidis. On the Management of Distributed Learning Agents. Ph.D. Thesis Proposal CUCS-032-97, Department of Computer Science, Columbia University. 1998.

class label (fraud/legitimate transaction). Some of the fields are arithmetic and the rest categorical, i.e. numbers were used to represent a few discrete categories. The secondary protein structure data set (SS) [36], courtesy of Qian and Sejnowski, contains 21,625 sequences of amino acids and secondary structures at the corresponding positions. There are three structures (classes) and 20 amino acids


Kai Ming Ting and Boon Toh Low. Model Combination in the Multiple-Data-Batches Scenario. ECML. 1997.

DNA nucleotide positions and each position can have one of the four base values. The task is to recognize, given a DNA sequence, two types of the splice junction or neither. The protein coding dataset, introduced by Craven and Shavlik (1993), contains DNA nucleotide sequences and its classification task is to differentiate the coding sequences from the non-coding ones. Each sequence has fifteen


Kai Ming Ting and Boon Toh Low. Theory Combination: an alternative to Data Combination. University of Waikato.

3190 sequences where a small number of them contains some combination values (i.e., values combined from four base values). These sequences are eliminated in our experiments. 6 The protein coding dataset, introduced by Craven and Shavlik (1993), contains DNA nucleotide sequences and its classification task is to differentiate the coding sequences from the non-coding ones. Each sequence has fifteen


Zoran Obradovic and Slobodan Vucetic. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Center for Information Science and Technology Temple University.

is compared to the accuracy of the global classifier. The algorithm proceeds by gradually partitioning disordered proteins into more subsets in an attempt to further improve the accuracy. Data sets with 145 nonredundant disordered protein regions and about equal number of ordered proteins of similar length were used in the experiment. For each position in a protein sequence we extracted 18


Daichi Mochihashi and Gen-ichiro Kikui and Kenji Kita. Learning Nonstructural Distance Metric by Minimum Cluster Distortions. ATR Spoken Language Translation research laboratories.

[Figure residue: panels (b) protein dataset, (c) "iris" dataset, (d) "soybean" dataset; each panel plots Precision against Dimension.] Figure 4:


Mehmet Dalkilic and Arijit Sengupta. A Logic-theoretic classifier called Circle. School of Informatics Center for Genomics and BioInformatics Indiana University.

8 attributes per iteration took less than 5 minutes, with approximately 90% test accuracy with 20% cross-trained data. Table 1 reports some of these results. Table 1 includes experiments run on a dataset generated from the PROSITE Protein family dataset that included nine families, which includes information on 1446 different proteins, with over 5000 subsequences. Since 1000 is a typical limit on the


Kuan-ming Lin and Chih-Jen Lin. A Study on Reduced Support Vector Machines. Department of Computer Science and Information Engineering National Taiwan University.

for handwritten digit recognition. The problem ijcnn1 is from the first problem of IJCNN challenge 2001 [21]. Note that we use the winner's transformation of raw data [2]. Problem protein is a data set for protein secondary structure prediction [26]. Finally, the problem adult, from the UCI "adult" data set [1] and compiled by Platt [20], is also included. For the adult dataset, there are several

