Center for Machine Learning and Intelligent Systems

Molecular Biology (Promoter Gene Sequences) Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Molecular Biology (Promoter Gene Sequences) data set page.


Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin. Linear dimensionality reduction using relevance weighted LDA. School of Electrical and Electronic Engineering Nanyang Technological University. 2005.

to compare LDA, aPAC, WLDR, EWLDR. The six data sets are landsat, optdigits, vehicle, DNA, thyroid disease and vowel data sets. Landsat. The Landsat data set is generated from landsat multi-spectral scanner image data. It has 36 dimensions, 4435


Wei-Chun Kao and Kai-Min Chung and Lucas Assun and Chih-Jen Lin. Decomposition Methods for Linear Support Vector Machines. Neural Computation, 16. 2004.

in (Keerthi and Lin 2003), due to the difficulty on solving linear SVMs, Algorithm 1 is only tested on small two-class problems. Here, we would like to evaluate this algorithm on large multi-class data sets. We consider problems dna, satimage, letter, and shuttle, which were originally from the statlog collection (Michie, Spiegelhalter, and Taylor 1994) and were used in (Hsu and Lin 2002a). Except


Aik Choon Tan and David Gilbert. An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics. APBC. 2003.

with the addition of nuclear localisation information. Promoter data set. The task of the classifier is to predict whether a DNA sequence from E. coli is a promoter or not (Towell et al., 1990). The input data is a 57-nucleotide sequence (A, C, T or G). HIV data


Giorgio Valentini. Ensemble methods based on bias-variance analysis. Theses Series DISI-TH-2003. Dipartimento di Informatica e Scienze dell'Informazione. 2003.

of MLP, as well as ensemble methods based on resampling techniques, such as bagging and boosting, have been applied to the analysis of DNA microarray data [192, 158, 54, 178, 185]. 6.5.1 Data set and experimental set-up. We used DNA microarray data available on-line. In particular we used the GCM data set obtained from the Whitehead Institute, Massachusetts Institute of Technology Center for


Zoubin Ghahramani and Hyun-Chul Kim. Bayesian Classifier Combination. Gatsby Computational Neuroscience Unit University College London. 2003.

and using different component classifiers. We used Satellite and DNA data sets from the Statlog project ([8]) and the UCI digit data set ([1]). Our goal was not to obtain the best classifier performance; for this we would have paid very careful attention to the component


Jinyan Li and Limsoon Wong. Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL. WAIM. 2003.

(i.e., breast-w, cleve, heart, HIV, and promoter); Bagging won on 1 data set (hypothyroid); and Boosting won the best accuracy on 4 data sets (i.e., hepatitis, lymph, sick and splice). Comparing between PCL and C4.5, PCL won on 8 data sets, while C4.5 won on the rest 2


Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. PAKDD. 2002.

to the alphabet, which is assumed to be present at the beginning of each sequence [DEKG98]. Equation 3 contains these extra states. Figure 1 illustrates an example of sequence classification on a dataset of DNA sequences. Figure 1(a) shows the training sequences with their respective class labels; these sequences are first split into two parts for computing the TPM associated with each class label.


Takashi Matsuda and Hiroshi Motoda and Tetsuya Yoshida and Takashi Washio. Mining Patterns from Structured Data by Beam-Wise Graph-Based Induction. Discovery Science. 2002.

canonical labeling to enumerate identical patterns accurately. This new algorithm is implemented and now called Beam-wise GBI, B-GBI for short. Second, we report on an experiment using the promoter dataset (a small DNA dataset) from the UCI repository and show the improvements work as intended. The effects of beam width on the number of discovered patterns and on predictive accuracy were evaluated. The best


Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. CoRR, csLG/0211003. 2002.

(based on a paired T-test at the 99% confidence level) and they outperform the other algorithms, they are both highlighted in bold. For example, K2 and MBBC are both best on the DNA Splice dataset and all four are equally good on the Breast Cancer dataset.

          Naïve          TAN            K2             MBBC
Chess     87.63 ± 1.61   91.68 ± 1.09   94.03 ± 0.87   97.03 ± 0.54
WBCD      97.81 ± 0.51   97.47 ± 0.68   97.17 ± 1.05   97.30 ± 1.01
LED-24    73.28 ±


Marina Meila and Michael I. Jordan. Learning with Mixtures of Trees. Journal of Machine Learning Research, 1. 2000.

exon, or coding section, begins. Hence, the class variable can take 3 values (EI, IE or no junction) and the other variables take 4 values corresponding to the 4 possible DNA bases (C, A, G, T). The dataset consists of 3,175 labeled examples. We ran two series of experiments comparing the MT model with competing models. In the first series of experiments, we compared to the results of [41], who used


Mark A. Hall and Lloyd A. Smith. Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper. FLAIRS Conference. 1999.

numeric value. Future work will aim at extending CFS to handle problems where the class is numeric.
[Figure 2: Average change in the size of the trees induced by C4.5 when features are selected by the wrapper (left) and CFS (right). Axes: dataset vs. tree size difference. Datasets shown: mu, vo, v1, cr, ly, pt, bc, dna, au, sb, hc, kr.]


Mark A. Hall. Correlation-based Feature Selection for Machine Learning. Doctor of Philosophy at The University of Waikato. Department of Computer Science Hamilton, New Zealand. 1999.

is to predict whether cancer will recur in patients. There are 9 nominal attributes describing characteristics such as tumour size and location. There are 286 instances. dna promoter (dna) A small dataset containing 53 positive examples of E. coli promoter gene sequences and 53 negative examples. There are 55 nominal attributes representing the gene sequence. Each attribute is a DNA nucleotide


Jie Cheng and Russell Greiner. Comparing Bayesian Network Classifiers. UAI. 1999.

the value "?" in our experiments. Chess: Chess end-game result classification based on board-descriptions. DNA: Recognizing the boundaries between exons and introns given a sequence of DNA. Car dataset: Car evaluation based on the six features of a car. Flare: Classifying the number of times of occurrence of certain type of solar flare. Vote: Using voting records to classify Congressmen as


Ismail Taha and Joydeep Ghosh. Symbolic Interpretation of Artificial Neural Networks. IEEE Trans. Knowl. Data Eng, 11. 1999.

that have been used as benchmarks for rule extraction approaches are the Monk [43], Mushroom [32] and the DNA promoter [47] data sets. The inputs of all three of these data sets are symbolic/discrete by nature. Since we want to test more general problems that may include continuous valued variables, Iris and Breast-Cancer were preferred


Cesar Guerra-Salcedo and L. Darrell Whitley. Genetic Approach to Feature Selection for Ensemble Creation. GECCO. 1999.

belonging to a particular class depend on the number of elements for that class and e. This modification allows us to avoid elements from different classes being mixed in one cluster. Table 1: Dataset employed for the experiments. In the DNA dataset the attribute values are 0 or 1. In the Segment dataset the attribute values are floats. In the LandSat dataset the attribute values are integers.


Foster J. Provost and Tom Fawcett and Ron Kohavi. The Case against Accuracy Estimation for Comparing Induction Algorithms. ICML. 1998.

has exactly 50 instances of each class. The splice junction data set DNA has 50% donor sites, 25% acceptor sites and 25% nonboundary sites, even though the natural class distribution is very skewed: no more than 6% of DNA actually codes for human genes (Saitta and


Andreas L. Prodromidis. On the Management of Distributed Learning Agents. Ph.D. Thesis Proposal CUCS-032-97. Department of Computer Science Columbia University. 1998.

of real credit card transactions and two molecular biology sequence analysis data sets, were used in our experiments. The credit card data sets were provided by the Chase and First Union Banks, members of FSTC (Financial Services Technology Consortium) and the molecular biology


Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.

Misclassification; "f", Composite Fitness; and "s", Composite Fitness-Feature Selection.
4.12 Relationships between component accuracy and diversity for the Promoter data set for the four boosting algorithms. "c" represents the Coarse Reclassification algorithm; "d", Deliberate Misclassification; "f", Composite Fitness; and "s", Composite Fitness-Feature Selection.


Kamal Ali and Michael J. Pazzani. Error Reduction through Learning Multiple Descriptions. Machine Learning, 24. 1996.

(representing an increase in accuracy from 93.3% to 98.9%!) and by large (around 3 or 4) factors for LED and Tic-tac-toe. The molecular biology data sets also experienced significant reduction with the error being halved (for DNA this represented an increase in accuracy from 67.9% to 86.8%!). The error reduction is least for the noisy KRK and LED
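The relation between the quoted accuracies and the error reduction can be checked with a little arithmetic. The sketch below is only an illustration of that calculation; the function name `error_reduction_factor` is hypothetical and does not come from the cited paper.

```python
def error_reduction_factor(acc_before: float, acc_after: float) -> float:
    """Return the ratio of error rates, given accuracies in percent."""
    err_before = 100.0 - acc_before  # error rate before, in percent
    err_after = 100.0 - acc_after    # error rate after, in percent
    return err_before / err_after


# Accuracies quoted in the excerpt above.
dna_factor = error_reduction_factor(67.9, 86.8)    # DNA: 32.1% error down to 13.2%
first_factor = error_reduction_factor(93.3, 98.9)  # 6.7% error down to 1.1%

print(f"DNA error reduced by a factor of about {dna_factor:.2f}")
print(f"First quoted case reduced by a factor of about {first_factor:.2f}")
```

For DNA the error falls from 32.1% to 13.2%, a factor of roughly 2.4, which is consistent with the excerpt's description of the error being halved.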


Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. ICML. 1996.

include: the Corral data which was artificially constructed by John et al (1994) specifically for research in feature selection; the LED24, Vote, and DNA datasets from the UCI repository (Murphy & Aha 1995); and two datasets which are a subset of the Reuters document collection (Reuters 1995). These datasets are detailed in Table 1. We selected these


Ron Kohavi and Dan Sommerfield. Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology. KDD. 1995.

in error. The execution time on a Sparc20 for feature subset selection using ID3 ranged from under five minutes for breast-cancer (Wisconsin), cleve, heart, and vote to about an hour for most datasets. DNA took 29 hours, followed by chess at four hours. The DNA run took so long because of ever increasing estimates that did not really improve the test-set accuracy. 7 Conclusions We reviewed the


Ron Kohavi. The Power of Decision Tables. ECML. 1995.

Breiman et al. (1984), Devijver & Kittler (1982)). The results demonstrate that IDTM can achieve high accuracy in discrete domains using the simple hypothesis space of DTMs. In corral, dna, the Monk

Dataset     Features  Sizes    Accuracy     Accuracy     Accuracy     Accuracy
australian  14        690 CV   55.5 ± 2.3   85.4 ± 1.1   84.9 ± 1.7   89.4 ± 1.3
breast      10        699 CV   65.5 ± 1.7   95.4 ± 0.7   90.6 ± 0.9


Rudy Setiono. Extracting M-of-N Rules from Trained Neural Networks. School of Computing National University of Singapore.

M-of-N rule extraction algorithm. This problem has been used by Towell and Shavlik [3] to test their algorithms which extract symbolic rules from knowledge based neural networks. Each sample in the dataset is described by a 60-nucleotide-long DNA sequence. A nucleotide may assume one of the 4 possible values: G (Guanine), T (Thymine), C (Cytosine), or A (Adenine). These values are binary coded as
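The excerpt is cut off before it states the actual binary coding, so the following is only a sketch of one common way to binary-code nucleotides, namely one-hot encoding with four bits per position. The bit layout, the `NUCLEOTIDES` ordering and the function name `one_hot_encode` are assumptions made for illustration and are not taken from the cited paper.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering; the paper's own coding is not shown in the excerpt


def one_hot_encode(sequence: str) -> list:
    """Encode a DNA string as a flat list of 0/1 values, four bits per nucleotide."""
    bits = []
    for base in sequence.upper():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        # Exactly one of the four bits is set for each position.
        bits.extend(1 if base == symbol else 0 for symbol in NUCLEOTIDES)
    return bits


encoded = one_hot_encode("GATTACA")
print(len(encoded))  # 7 positions * 4 bits each
print(encoded[:4])   # bits for the first nucleotide, G
```

Under this assumed scheme a 60-nucleotide sequence becomes 240 binary inputs; other codings (for example, three bits per nucleotide) are equally possible.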


Ron Kohavi and Dan Sommerfield. Targeting Business Users with Decision Table Classifiers. To Appear in KDD-98. Data Mining and Visualization Silicon Graphics, Inc.

In fact, it is quite surprising to see how well decision tables perform with five or fewer attributes. Out of 16 natural datasets, only DNA and german used more than five attributes (chess is semi-natural) with DTMaj, and only DNA and letter used more than seven attributes with DTLoc. Such relatively small classifiers can be


Warodom Geamsakul and Takashi Matsuda and Tetsuya Yoshida and Hiroshi Motoda and Takashi Washio. Constructing a Decision Tree for Graph Structured Data. Institute of Scientific and Industrial Research, Osaka University.

is employed to extract good enough discriminative patterns within the greedy search framework. Pessimistic pruning is incorporated to avoid overfitting to the training data. Experiments using a DNA dataset were conducted to see the effect of the beam width, the number of chunking at each node of a decision tree, and the pruning. The results indicate that DT-GBI that does not use any prior domain


Ivor W. Tsang and James T. Kwok. Distance Metric Learning with Kernels. Department of Computer Science Hong Kong University of Science and Technology Clear Water Bay Hong Kong.

the data distributions using the two distance metrics. As can be seen, similar patterns become more clustered while dissimilar patterns are more separated. Table II reports the classification accuracies, averaged over
[Footnote 2: ionosphere, sonar and wine are from the UCI repository [2], and microarray (a DNA microarray dataset for colon cancer) is from http://www.kyb.tuebingen.mpg.de/bs/people/weston/l0.]


Norbert Jankowski. Survey of Neural Transfer Functions. Department of Computer Methods, Nicholas Copernicus University.

results of RBF are not similar to the results of MLP. In some cases crossvalidation errors were twice as large using MLP than RBF (for example, for the DNA dataset RBF gave 4.1% error and was the best of all methods while MLP gave 8.8% error and was in 12th position, while for the Belgian Power data the situation is reversed, with MLP giving 1.7% of


Ron Kohavi and George H. John. Automatic Parameter Selection by Minimizing Estimated Error. Computer Science Dept. Stanford University.

so if we evaluate n nodes in the search, the total time is O(ntkA). In our experiments, t was limited to three, k was 10, and the number of nodes expanded was about 60. For example, on the dna dataset (which took the most time) C4.5 took just under 2 minutes, while C4.5-AP took 6.8 hours. 6 Related Work In statistics, model selection refers to the general problem of selecting a learning algorithm


Ron Kohavi and Barry G. Becker and Dan Sommerfield. Improving Simple Bayes. Data Mining and Visualization Group Silicon Graphics, Inc.

was large or artificial, indicating that a single test set would yield accurate estimates, we used a training-set/test-set as defined in the source for the dataset (e.g., Statlog defined the splits for DNA, letter, satimage; CART defined the training size for waveform and led24) or a 2/3, 1/3 split, and ran the inducer once; otherwise, we performed 10-fold


Vikas Sindhwani and P. Bhattacharya and Subrata Rakshit. Information Theoretic Feature Crediting in Multiclass Support Vector Machines.

include: a synthetic dataset that we constructed, LED-24, Waveform-40, DNA, Vehicle, and Satellite Images (SAT) drawn from the UCI repository [19]; and three datasets which are a subset of the Reuters document collection [20].


Cesar Guerra-Salcedo and Darrell Whitley. Feature Selection Mechanisms for Ensemble Creation: A Genetic Search Perspective. Department of Computer Science Colorado State University.

Dataset   Features  Classes  Train Size  Test Size
LandSat   36        6        4435        2000
DNA       180       39       2000        1186
Segment   19        7        210         2100
Cloud     204       10       1000        633

Table 1: Dataset employed for the experiments. In the DNA dataset the attribute values are 0 or 1. In the Segment and the Cloud dataset the attribute values are floats. In the LandSat dataset the attribute values


Alain Rakotomamonjy. Analysis of SVM regression bounds for variable ranking. P.S.I CNRS FRE 2645, INSA de Rouen Avenue de l'Universite.

We have tested our algorithms on some real-world problems such as QSAR data and DNA microarray problems. Information on these datasets, including references where they can be found, is given in Table (5). As described, some of these datasets deal with classification problems. In such cases, we have still addressed such problems by


Cesar Guerra-Salcedo and Stephen Chen and Darrell Whitley and Sarah Smith. Fast and Accurate Feature Selection Using Hybrid Genetic Strategies. Department of Computer Science Colorado State University.

were employed and one artificially generated classification problem. The real-world classification problems are: a satellite classification dataset (LandSat), a DNA classification dataset and a Cloud classification dataset. On the other hand, the artificially generated classification problem relies on a LED identification problem. LED cases are


Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California.

attributes, 2 classes. Attributes selected by SBC = 5.
[Plot "Promoter Gene Sequences": Accuracy (%) vs. Training Data (%) for NBC, SBC and C4.5.] Figure 8. Gene Promoter dataset. 106 instances, 57 attributes, 2 classes. Attributes selected by SBC = 5.
[Plot "Soybean": Accuracy (%) vs. Training Data (%) for NBC, SBC and C4.5.] Figure 9. Soybean-large


M. A. Galway and Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. Technical report NUIG-IT-011002. Department of Information Technology National University of Ireland, Galway.

(based on a paired T-test at the 99% confidence level) and they outperform the other algorithms, they are both highlighted in bold. For example, K2 and MBBC are both best on the DNA Splice dataset and all four are equally good on the Breast Cancer dataset.

          Naïve          TAN            K2             MBBC
Chess     87.63 ± 1.61   91.68 ± 1.09   94.03 ± 0.87   97.03 ± 0.54
WBCD      97.81 ± 0.51   97.47 ± 0.68   97.17 ± 1.05   97.30 ± 1.01
LED-24    73.28 ±


Kuan-ming Lin and Chih-Jen Lin. A Study on Reduced Support Vector Machines. Department of Computer Science and Information Engineering National Taiwan University.

votes. The implementation of all methods mentioned above is available upon request. V. EXPERIMENTS In this section we conduct experiments on some commonly used problems. We choose large multiclass datasets from the Statlog collection: dna, satimage, letter, and shuttle [16]. We also consider mnist [9], an important benchmark for handwritten digit recognition. The problem ijcnn1 is from the first


Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multi-class Support Vector Machines. Department of Computer Science and Information Engineering National Taiwan University.

iris, wine, glass, and vowel. Those problems had already been tested in [27]. From the Statlog collection we choose all multi-class datasets: vehicle, segment, dna, satimage, letter, and shuttle. Note that, except for problem dna, we scale all training data to be in [-1, 1]. Then test data are adjusted using the same linear transformation.
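The scaling step described in this excerpt can be sketched as follows: the per-feature linear map is fitted on the training data only and then reused unchanged on the test data. This is a minimal illustration under stated assumptions, not the authors' code; the function names and the handling of constant features are hypothetical choices made for this sketch.

```python
def fit_linear_scaling(train_rows, lo=-1.0, hi=1.0):
    """Return per-feature (scale, offset) so each training feature spans [lo, hi]."""
    params = []
    for column in zip(*train_rows):
        col_min, col_max = min(column), max(column)
        span = col_max - col_min
        if span == 0:
            # Constant feature: map it to the midpoint (arbitrary choice for this sketch).
            params.append((0.0, (lo + hi) / 2.0))
        else:
            scale = (hi - lo) / span
            params.append((scale, lo - col_min * scale))
    return params


def apply_linear_scaling(rows, params):
    """Apply a previously fitted per-feature linear transformation."""
    return [[x * scale + offset for x, (scale, offset) in zip(row, params)]
            for row in rows]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]

params = fit_linear_scaling(train)                # fitted on training data only
scaled_train = apply_linear_scaling(train, params)
scaled_test = apply_linear_scaling(test, params)  # same transformation reused
print(scaled_train)
print(scaled_test)
```

Because the transformation is fixed by the training data, scaled test values can fall outside [-1, 1], as the second feature of the test row does here.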

