Center for Machine Learning and Intelligent Systems
About  Citation Policy  Donate a Data Set  Contact



Molecular Biology (Splice-junction Gene Sequences) Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Molecular Biology (Splice-junction Gene Sequences) data set page.


Jinyan Li and Limsoon Wong. Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL. WAIM. 2003.

(hypothyroid); and Boosting won the best accuracy on 4 data sets (i.e., hepatitis, lymph, sick and splice). -- Comparing PCL and C4.5, PCL won on 8 data sets, while C4.5 won on the remaining 2 data sets. -- Comparing PCL and Bagging, PCL won on 6 data


Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. CoRR, csLG/0211003. 2002.

(based on a paired T-test at the 99% confidence level) and they outperform the other algorithms, they are both highlighted in bold. For example, K2 and MBBC are both best on the DNA Splice dataset and all four are equally good on the Breast Cancer dataset. Naïve TAN K2 MBBC Chess 87.63 ± 1.61 91.68 ± 1.09 94.03 ± 0.87 97.03 ± 0.54 WBCD 97.81 ± 0.51 97.47 ± 0.68 97.17 ± 1.05 97.30 ± 1.01 LED-24 73.28 ±


Xiaojin Zhu. Label Propagation for Eukaryotic Splice Junction Identification. 2002.

biology problem of eukaryotic splice junction identification. We compare the classification accuracy of using unlabeled data versus not using them, under the same amount of labeled data. 2 The dataset The dataset is the Primate splice-junction gene sequences dataset, available at the UCI machine learning data repository [2]. There are 3190 examples in this dataset. Each example is a sequence of
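The graph-based label propagation idea referenced in this excerpt can be sketched roughly as follows. This is a minimal, generic illustration, not Zhu's exact algorithm: the fully connected Gaussian-weighted graph, the `sigma` value, and the fixed iteration count are all assumptions made for the sketch.

```python
import numpy as np

def label_propagation(X, y, n_iter=200, sigma=1.0):
    """Minimal label propagation sketch. `y` holds class ids for labeled
    points and -1 for unlabeled ones. Labels diffuse over a fully
    connected graph with Gaussian edge weights, while labeled points
    stay clamped to their given classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y >= 0])

    # Gaussian affinity matrix with a zero diagonal (no self-loops).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)  # row-normalised transition matrix

    # One-hot label matrix; unlabeled rows start at zero.
    F = np.zeros((len(X), len(classes)))
    labeled = y >= 0
    for j, c in enumerate(classes):
        F[y == c, j] = 1.0
    Y0 = F.copy()

    for _ in range(n_iter):
        F = P @ F              # propagate labels to neighbours
        F[labeled] = Y0[labeled]  # clamp the labeled points
    return classes[F.argmax(axis=1)]
```

For example, with two labeled points in two well-separated clusters, the unlabeled point in each cluster inherits its neighbour's class: `label_propagation([[0.0], [0.1], [5.0], [5.1]], [0, -1, 1, -1])` yields one label per point, with the two left points in class 0 and the two right points in class 1.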


Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. PAKDD. 2002.

# Sessions Avg. Length Dataset # Sessions Avg. Length Splice (S-EI) 1,527 60.0 Peptidias (P-) 1,584 511.3 Exon 762 60.0 cysteine 416 854.3 Intron 765 60.0 metallo 580 512.6 Mouse Genome (MG-GN) 10,918 361.6 serine 775 500.5 exon


Susanne Hoche and Stefan Wrobel. Scaling Boosting by Margin-Based Inclusion of Features and Relations. ECML. 2002.

in four domains, and without a significant deterioration of predictive accuracy in the one domain where only a few features are present. C²RIB shows poor performance on the splice junction dataset, most likely due to the great number of features. However, C²RIB-D clearly outperforms C²RIB both in accuracy and learning time. Table 3. Accuracy, standard deviation and learning time in


S. Sathiya Keerthi and Kaibo Duan and Shirish Krishnaj Shevade and Aun Neow Poo. A Fast Dual Algorithm for Kernel Logistic Regression. ICML. 2002.

software at the site http://www.ece.nwu.edu/~nocedal/lbfgs.html was used. The Gaussian kernel K(x, x̄) = exp(−‖x − x̄‖² / (2σ²)) was used. In all the experiments, τ was set to 10⁻⁶. Five benchmark datasets were used: Banana, Image, Splice, Waveform and Tree. The Tree dataset was originally used in (Bailey, Pettit, Borochoff, Manry, and Jiang, 1993). Detailed information about the remaining datasets


Jinyan Li and Kotagiri Ramamohanarao and Guozhu Dong. Combining the Strength of Pattern Frequency and Distance for Classification. PAKDD. 2001.

containing pure categorical attributes. The accuracy is sometimes better (e.g., on the tic-tac-toe data set), but sometimes worse (e.g., on the splice data set). 4.2 Accuracy Variation among Folds The next set of experimental results are used to demonstrate the accuracy variations among the ten folds. We


Jinyan Li and Guozhu Dong and Kotagiri Ramamohanarao and Limsoon Wong. DeEPs: A New Instance-based Discovery and Classification System. Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases. 2001.

(This will be explained in Section 8.3). Note that for data sets such as chess, flare, splice, mushroom, voting, soybean-l, t-t-t, and zoo which do not contain any continuous attributes, DeEPs does not require an α. The accuracies of k-nearest neighbor and C5.0


Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40. 2000.

the effect of classification noise, we added random class noise to nine domains (audiology, hypo, king-rook-vs-king-pawn (krkp), satimage, sick, splice, segment, vehicle, and waveform). These data sets were chosen because at least one pair of the ensemble methods gave statistically significantly different performance on these domains. We did not perform noise experiments with letter-recognition


Marina Meila and Michael I. Jordan. Learning with Mixtures of Trees. Journal of Machine Learning Research, 1. 2000.

Figure 15: Comparison of classification performance of the MT and other models on the SPLICE data set when N_train = 2000, N_test = 1175. Tree represents a mixture of trees with m = 1, MT is a mixture of trees with m = 3. KBNN is the knowledge-based neural net, NN is a neural net. 5.3.5 The SPLICE


Jinyan Li and Guozhu Dong and Kotagiri Ramamohanarao. Instance-Based Classification by Emerging Patterns. PKDD. 2000.

(as explained in [10]). Note that for the datasets such as chess, flare, nursery, splice, mushroom, voting, soybean-l, t-t-t, and zoo which do not contain any continuous attributes, DeEPs does not need α. Columns 5, 6, 7, 8, and 9 give the


Lorne Mason and Jonathan Baxter and Peter L. Bartlett and Marcus Frean. Boosting Algorithms as Gradient Descent. NIPS. 1999.

Figure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label noise for the breast-cancer and splice data sets. Given that AdaBoost suffers from overfitting and minimizes an exponential cost function of the margins, this cost function certainly does not relate to test error. How does the value of our proposed


Kagan Tumer and Nikunj C. Oza. Decimated Input Ensembles for Improved Generalization. NASA Ames Research Center. 1999.

improvement in the classification accuracy through ensembles. For the Gene data, the average combiner was significantly more accurate than the single MLP, while for the Satellite Image and Splice data sets, the combiner was only marginally more accurate. TABLE I Average Accuracy of Original Network and Combiners Single Average Corr. Gene 83.417 ± .796 86.418 ± .342 .7910 Splice 84.722


Blaz Zupan and Marko Bohanec and Janez Dem#sar and Ivan Bratko. Learning by Discovering Concept Hierarchies. Artif. Intell, 109. 1999.

do not necessarily have such characteristics, which may be the reason why for these domains HINT's performance is worse. For example, the domain theory given with the SPLICE dataset [31] mentions several potentially useful intermediate concepts that share attributes. Thus these concepts form a concept lattice rather than a concept tree, and therefore cannot be discovered by


Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.

can easily be interpreted. Examples of the combination weights it derives (for the probability-based model M̃₀) appear in Table 5 for the Horse, Credit, Splice, Abalone, Waveform, Led24 and Vowel datasets. The weights indicate the relative importance of the level-0 generalizers for each prediction class. For example, in the Splice dataset (in Table 5(b)), NB is the dominant generalizer for


Yoav Freund and Lorne Mason. The Alternating Decision Tree Learning Algorithm. ICML. 1999.

Figure 7: Calibration graphs for the splice and sick-euthyroid data sets for train and test after 100 rounds of ADTree. In addition to justifying our interpretation, the calibration graphs can potentially be used to improve our performance in situations where our


Manoranjan Dash and Huan Liu. Hybrid Search of Feature Subsets. PRICAI. 1998.

having a large N and a small M value, such as the Lung Cancer, Promoters, Soybean and Splice datasets, ABB takes a very long time (a number of hours) to terminate. For datasets having a large N value and a substantially big M value, such as the Splice dataset, FocusM takes many hours to terminate. The


Adam J. Grove and Dale Schuurmans. Boosting in the Limit: Maximizing the Margin of Learned Ensembles. AAAI/IAAI. 1998.

that this depends crucially on the base learner always being able to find a sufficiently good hypothesis if one exists; see Section 5 for further discussion of this issue. 8 However, for some large data sets, chess and splice, we inverted the train/test proportions. FindAttrTest Adaboost LP-Adaboost DualLPboost Data set error% win% error% margin error% win% margin error% win% margin Audiology 52.30


Foster J. Provost and Tom Fawcett and Ron Kohavi. The Case against Accuracy Estimation for Comparing Induction Algorithms. ICML. 1998.

has exactly 50 instances of each class. The splice junction data set (DNA) has 50% donor sites, 25% acceptor sites and 25% nonboundary sites, even though the natural class distribution is very skewed: no more than 6% of DNA actually codes for human genes (Saitta and


Andreas L. Prodromidis. On the Management of Distributed Learning Agents Ph.D. Thesis Proposal CUCS-032-97. Department of Computer Science Columbia University. 1998.

of real credit card transactions and two molecular biology sequence analysis data sets, were used in our experiments. The credit card data sets were provided by the Chase and First Union Banks, members of FSTC (Financial Services Technology Consortium) and the molecular biology


Kai Ming Ting and Boon Toh Low. Model Combination in the Multiple-Data-Batches Scenario. ECML. 1997.

settings are the same as those used in IB1 3 in all experiments. No parameter settings are required for NB*. Our studies employ two artificial domains (i.e., waveform and LED24) and four real-world datasets (i.e., euthyroid, nettalk(stress), splice junction and protein coding) obtained from the UCI repository of machine learning databases (Merz & Murphy, 1996). The two noisy artificial domains are


Kamal Ali and Michael J. Pazzani. Error Reduction through Learning Multiple Descriptions. Machine Learning, 24. 1996.

(representing an increase in accuracy from 93.3% to 98.9%!) and by large (around 3 or 4) factors for LED and Tic-tac-toe. The molecular biology data sets also experienced significant reduction with the error being halved (for DNA this represented an increase in accuracy from 67.9% to 86.8%!). The error reduction is least for the noisy KRK and LED


Gustavo E. A. P. A. Batista and Ronaldo C. Prati and Maria Carolina Monard. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. Instituto de Ciências Matemáticas e de Computação.

having more than two classes, we chose the class with fewer examples as the positive class, and collapsed the remainder as the negative class. As the Letter and Splice data sets have a similar number of examples in the minority classes, we created two data sets with each of them: Letter-a and Letter-vowel, Splice-ie and Splice-ei. In our experiments, we used release 8 of


Kai Ming Ting and Boon Toh Low. Theory Combination: an alternative to Data Combination. University of Waikato.

on or off, plus seventeen irrelevant binary attributes. Each attribute value is inverted with a probability of 0.1. The task is to classify the input as one of the ten digits. The four real-world datasets are the euthyroid, nettalk(stress), splice junction and protein coding. The selection criteria are that the datasets must have a large number of instances and each class must be supported by a large


Rudy Setiono. Extracting M-of-N Rules from Trained Neural Networks. School of Computing National University of Singapore.

used in the experiments are publicly available via anonymous ftp from ics.uci.edu [14]. These datasets are: 1. The splice junction dataset. The characteristics of the patterns in this dataset have been described in the previous section. The dataset that consists of 1006 patterns was used. 2. The 3


Rong-En Fan and P. -H Chen and C. -J Lin. Working Set Selection Using the Second Order Information for Training SVM. Department of Computer Science and Information Engineering National Taiwan University.

was originally used in (Bailey et al., 1993). The problem mg is a Mackey-Glass time series. The data sets cpusmall and splice are from the Delve archive (http://www.cs.toronto.edu/~delve). Problem fourclass is from (Ho and Kleinberg, 1996) and we further transform it to a two-class set. The problem


M. A. Galway and Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. Technical Report NUIG-IT-011002, Department of Information Technology, National University of Ireland, Galway.

(based on a paired T-test at the 99% confidence level) and they outperform the other algorithms, they are both highlighted in bold. For example, K2 and MBBC are both best on the DNA Splice dataset and all four are equally good on the Breast Cancer dataset. Naïve TAN K2 MBBC Chess 87.63 ± 1.61 91.68 ± 1.09 94.03 ± 0.87 97.03 ± 0.54 WBCD 97.81 ± 0.51 97.47 ± 0.68 97.17 ± 1.05 97.30 ± 1.01 LED-24 73.28 ±


Pedro Domingos. Using Partitioning to Speed Up Specific-to-General Rule Induction. Department of Information and Computer Science University of California, Irvine.

rules on different partitions, to the increase in accuracy that can result from combining multiple models (Wolpert 1992; Breiman in press), and possibly to other factors. On the splice junctions dataset, the success of applying partitioning to RISE using a simple combination scheme contrasts with the results obtained by Chan and Stolfo for general-to-specific learners (Chan & Stolfo 1995a). In


Kai Ming Ting and Ian H. Witten. Stacked Generalization: when does it work. Department of Computer Science University of Waikato.

other three level-1 generalizers in that its model can easily be interpreted. Examples of the combination weights it derives (for the probability-based model M̃₀) appear in Table 5 for the Splice dataset. The weights indicate the relative importance of the level-0 generalizers for each prediction class. In this dataset, NB is the dominant generalizer for predicting class 2, NB and IB1 are both good


Cesar Guerra-Salcedo and Stephen Chen and Darrell Whitley and Sarah Smith. Fast and Accurate Feature Selection Using Hybrid Genetic Strategies. Department of Computer Science Colorado State University.

is a DNA dataset. The dataset represents Primate splice junction gene sequences (DNA). There are 2000 training cases, 1186 test cases, and 180 binary features for each case. Three different classes exist in this


