Waveform Database Generator (Version 2) Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Waveform Database Generator (Version 2) data set page.


Giorgio Valentini. Random Aggregated and Bagged Ensembles of SVMs: An Empirical Bias-Variance Analysis. Multiple Classifier Systems. 2004.

software library [13] and the SVMlight applications [9]. 4.2 Results In particular we analyzed the relationships of the components of the error with the kernels and kernel parameters, using data sets from UCI [14] (Waveform, Grey-Landsat, Letter-Two, Letter-Two with added noise, Spam, Musk) and the P2 synthetic data set. We achieved a characterization of the bias--variance decomposition of


Zhi-Hua Zhou and W-D Wei and Gang Li and Honghua Dai. On the Size of Training Set and the Benefit from Ensemble. PAKDD. 2004.

3,772 22 7 2 kr-vs-kp 3,196 36 0 2 led7 2,000 7 0 10 led24 2,000 24 0 10 sat 6,435 0 36 6 segment 2,310 0 19 7 sick 3,772 22 7 2 sick-euthyroid 3,156 22 7 2 waveform 5,000 0 21 3 Each original data set is partitioned into ten subsets with similar distributions. At the first time, only one subset is used; at the second time, two subsets are used; and so on. The earlier generated data sets are
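
The incremental-training-set protocol described in this excerpt (ten subsets with similar class distributions, then training on the union of the first 1, 2, ..., 10 of them) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; the function names and the choice of class-stratified splitting are assumptions.

```python
# Sketch: split a data set into ten class-stratified subsets, then build
# nested training sets that use 1, 2, ..., 10 of those subsets.
# Names are illustrative, not from the cited paper.
import numpy as np

def stratified_subsets(y, n_subsets=10, seed=0):
    """Return index arrays, one per subset, with similar class distributions."""
    rng = np.random.default_rng(seed)
    subsets = [[] for _ in range(n_subsets)]
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for k, chunk in enumerate(np.array_split(idx, n_subsets)):
            subsets[k].extend(chunk.tolist())
    return [np.array(s) for s in subsets]

def cumulative_training_sets(X, y, n_subsets=10, seed=0):
    """Yield (X_train, y_train) using 1 subset, then 2, and so on."""
    parts = stratified_subsets(y, n_subsets, seed)
    for k in range(1, n_subsets + 1):
        idx = np.concatenate(parts[:k])
        yield X[idx], y[idx]
```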


Eibe Frank and Mark Hall and Bernhard Pfahringer. Locally Weighted Naive Bayes. UAI. 2003.

3772 6.0 23 6 2 sonar 208 0.0 60 0 2 soybean 683 9.8 0 35 19 splice 3190 0.0 0 61 3 vehicle 846 0.0 18 0 4 vote 435 5.6 0 16 2 vowel 990 0.0 10 3 11 waveform 5000 0.0 40 0 3 zoo 101 0.0 1 15 7 19 datasets for k = 5 and k = 10 respectively. When distance weighting is used with k-nearest neighbours, our method is significantly more accurate on 13 and 17 datasets for k = 5 and k = 10 respectively.


Giorgio Valentini and Thomas G. Dietterich. Low Bias Bagged Support Vector Machines. ICML. 2003.

P2 Polyn. 0.1687 0.1863 0.1892 4-1-0 4-1-0 1-4-0 Gauss. 0.1429 0.1534 0.1605 4-1-0 5-0-0 3-2-0 Data set Waveform Linear 0.0811 0.0821 0.0955 2-3-0 5-0-0 5-0-0 Polyn. 0.0625 0.0677 0.0698 2-3-0 2-3-0 3-2-0 Gauss. 0.0574 0.0653 0.0666 4-1-0 4-1-0 2-3-0 Data set Grey-Landsat Linear 0.0508 0.0510 0.0601


Joao Gama and Ricardo Rocha and Pedro Medas. Accurate decision trees for mining high-speed data streams. KDD. 2003.

when classifying test examples: classifying using the majority class (VFDTcMC) and classifying using naive Bayes (VFDTcNB) at leaves. The experimental work has been done using the Waveform and LED datasets. These are well known artificial datasets. We have used the two versions of the Waveform dataset available at the UCI repository [1]. Both versions are problems with three classes. The first
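
The two leaf-prediction strategies contrasted in this excerpt, majority class versus naive Bayes at the leaves, can be illustrated with a toy sketch of the statistics a leaf might keep. This is not the VFDTc implementation; the Gaussian per-attribute model and all names are illustrative assumptions.

```python
# Toy sketch of two leaf-prediction strategies: majority class vs. a naive
# Bayes model built from per-class statistics stored at the leaf.
import numpy as np

class LeafStats:
    def __init__(self, X, y):
        self.classes, self.counts = np.unique(y, return_counts=True)
        # per-class mean / std of each attribute, for the naive Bayes option
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.stds = np.array([X[y == c].std(axis=0) + 1e-9 for c in self.classes])

    def predict_majority(self, x):
        # ignores x: always the most frequent class seen at this leaf
        return self.classes[np.argmax(self.counts)]

    def predict_naive_bayes(self, x):
        # Gaussian naive Bayes using the stored per-class statistics
        log_prior = np.log(self.counts / self.counts.sum())
        log_lik = -0.5 * np.sum(
            ((x - self.means) / self.stds) ** 2 + np.log(2 * np.pi * self.stds ** 2),
            axis=1,
        )
        return self.classes[np.argmax(log_prior + log_lik)]
```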


Giorgio Valentini. Ensemble methods based on bias--variance analysis Theses Series DISI-TH-2003. Dipartimento di Informatica e Scienze dell'Informazione . 2003.

[List-of-figures fragment] polynomial degrees: (a) degree = 2, (b) degree = 3, (c) degree = 5, (d) degree = 10. 4.17 Bias in polynomial SVMs with (a) Waveform and (b) Spam data sets, varying both C and polynomial degree. 4.18 Bias-variance decomposition of error in bias, net variance, unbiased and biased variance in polynomial


S. Sathiya Keerthi and Kaibo Duan and Shirish Krishnaj Shevade and Aun Neow Poo. A Fast Dual Algorithm for Kernel Logistic Regression. ICML. 2002.

software at the site http://www.ece.nwu.edu/~nocedal/lbfgs.html was used. The Gaussian kernel K(x, x') = exp(−‖x − x'‖² / (2σ²)) was used. In all the experiments, # was set to 10^6. Five benchmark datasets were used: Banana, Image, Splice, Waveform and Tree. The Tree dataset was originally used in (Bailey, Pettit, Borocho#, Manry, and Jiang, 1993). Detailed information about the remaining datasets
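
The Gaussian kernel quoted above, K(x, x') = exp(−‖x − x'‖² / (2σ²)), can be computed for whole data matrices as in the following minimal NumPy sketch; the function name and vectorized form are mine, not the paper's.

```python
# Minimal sketch of the Gaussian (RBF) kernel:
# K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma):
    """Kernel matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))
```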


James Bailey and Thomas Manoukian and Kotagiri Ramamohanarao. Fast Algorithms for Mining Emerging Patterns. PKDD. 2002.

using thresholds. We see that mining with a threshold value of 4 is substantially faster than mining the complete set of JEPs using a ratio tree. Classification accuracy is degraded for three of the datasets (Vehicle, Waveform and Letter-recognition) though. Analysis of the vehicle and chess datasets aid in explaining this outcome (supporting figures have been excluded due to lack of space). It is


Juan J. Rodríguez and Carlos J. Alonso and Henrik Boström. Boosting Interval Based Literals. 2000.

iterations, and a comparison of the results for settings 2-5 (combinations of interval literals) against the results for setting 1 (point based literals), using McNemar's test. 4.1 Waveform This data set was introduced by [BFOS93]. The purpose is to distinguish between three classes, defined by the evaluation, for i = 1, 2, ..., 21, of the following functions: x_1(i) = u·h_1(i) + (1 − u)·h_2(i) + ε(i)
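
A minimal sketch of the waveform generator referenced in this excerpt, assuming the standard Breiman et al. definition of the base waves h_1(i) = max(6 − |i − 11|, 0), h_2(i) = h_1(i − 4), h_3(i) = h_1(i + 4) and the usual class-to-wave-pair assignment; those details are not spelled out in the snippet and are assumptions here.

```python
# Sketch of the classic waveform generator, assuming the standard definition:
# each example is a 21-point series x(i) = u*h_a(i) + (1 - u)*h_b(i) + eps(i),
# where the pair (a, b) depends on the class, u ~ Uniform(0, 1), eps ~ N(0, 1).
import numpy as np

def waveform_example(cls, rng):
    i = np.arange(1, 22)
    h1 = np.maximum(6 - np.abs(i - 11), 0)   # triangular wave, peak at i = 11
    h2 = np.maximum(6 - np.abs(i - 15), 0)   # h1 shifted right by 4
    h3 = np.maximum(6 - np.abs(i - 7), 0)    # h1 shifted left by 4
    pairs = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}  # assumed class assignment
    a, b = pairs[cls]
    u = rng.uniform()
    return u * a + (1 - u) * b + rng.standard_normal(21)

rng = np.random.default_rng(0)
X = np.stack([waveform_example(c, rng) for c in (1, 2, 3)])
```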


Bing Liu and Minqing Hu and Wynne Hsu. Multi-level organization and summarization of the discovered rules. KDD. 2000.

646 91 0.05 14 kdd 5914 68 0.22 15 mushroom 2398 57 0.08 16 pima 44 12 0.01 17 satimage 7515 191 0.30 18 splice 4302 100 0.15 19 tic-tac 266 11 0.02 20 waveform 1480 106 0.07 Average 1487.7 50 0.07 (Table 1: Experiment results with decision trees; averages: no. of decision tree leaves 105.2, no. of GSE tree leaves 38.1.) doctors said that they could not obtain an overall picture of the domain from the


Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40. 2000.

the effect of classification noise, we added random class noise to nine domains (audiology, hypo, king-rook-vs-king-pawn (krkp), satimage, sick, splice, segment, vehicle, and waveform). These data sets were chosen because at least one pair of the ensemble methods gave statistically significantly different performance on these domains. We did not perform noise experiments with letter-recognition
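
A rough sketch of adding random class noise at a given rate, in the spirit of the noise experiments described above; the exact procedure used in the cited paper may differ, and the function name is illustrative.

```python
# Sketch: each selected example's label is replaced by a different class
# chosen uniformly at random.  Illustrative only, not the paper's procedure.
import numpy as np

def add_class_noise(y, noise_rate, seed=0):
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    classes = np.unique(y)
    flip = rng.random(len(y)) < noise_rate
    for i in np.flatnonzero(flip):
        others = classes[classes != y[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy
```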


Juan J. Rodríguez and Carlos J. Alonso. Applying Boosting to Similarity Literals for Time Series Classification. Department of Informatics, University of Valladolid, Spain. 2000.

are summarised in table 2. The main criterion for selecting them was that the number of examples available was large enough to ensure that the results were reliable. Waveform This dataset was introduced by [Breiman et al., 1993]. The purpose is to distinguish between three classes, defined by the evaluation, for i = 1, 2, ..., 21, of the following functions: x_1(i) = u·h_1(i) + (1 − u)·h_2


Juan J Rodríguez Diez and Carlos Alonso González and Henrik Boström. Learning First Order Logic Time Series Classifiers: Rules and Boosting. PKDD. 2000.

x_k(t). Figure 2.b shows two examples of each class. The data used were obtained from the UCI KDD Archive [4]. It contains 100 examples of each class, with 60 points in each example. Waveform This dataset was introduced by [9]. The purpose is to distinguish between three classes, defined by the evaluation, for i = 1, 2, ..., 21, of the following functions: x_1(i) = u·h_1(i) + (1 − u)·h_2(i) + ε(i), x_2


Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.

from the UCI Repository of machine learning databases (Blake, Keogh & Merz, 1998). Details of these are given in Table 1. For the artificial datasets---Led24 and Waveform---each training dataset L of size 200 and 300, respectively, is generated using a different seed. The algorithms used for the experiments are then tested on a separate dataset


Khaled A. Alsabti and Sanjay Ranka and Vineet Singh. CLOUDS: A Decision Tree Classifier for Large Datasets. KDD. 1998.

are taken from the STATLOG project, which has been a widely used benchmark in classification. 3 The "Abalone", "Waveform" and "Isolet" datasets can be found in [13]. The "Synth1" and "Synth2" datasets have been used in [15, 17] for evaluating SLIQ and SPRINT; they have been referred to as the "Function2" dataset. The main parameter of our


Kai Ming Ting and Boon Toh Low. Model Combination in the Multiple-Data-Batches Scenario. ECML. 1997.

settings are the same as those used in IB1 3 in all experiments. No parameter settings are required for NB*. Our studies employ two artificial domains (i.e., waveform and LED24) and four real-world datasets (i.e., euthyroid, nettalk(stress), splice junction and protein coding) obtained from the UCI repository of machine learning databases (Merz & Murphy, 1996). The two noisy artificial domains are


Ron Kohavi. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. KDD. 1996.

segment 19 2,310 CV-10 shuttle 9 43,500 14,500 soybean-large 35 562 CV-10 tic-tac-toe 9 958 CV-10 vehicle 18 846 CV-10 vote 16 435 CV-10 vote1 15 435 CV-10 waveform-40 40 300 4,700 Table 1: The datasets used, the number of attributes, and the training/test-set sizes (CV-10 denotes that 10-fold cross-validation was used). [Figure residue: NBTree - C4.5, NBTree - NB; tic-tac-toe, chess, letter, vehicle, vote, monk1, segment]


Tapio Elomaa and Juho Rousu. Finding Optimal Multi-Splits for Numerical Attributes in Decision Tree Learning. ESPRIT Working Group in Neural and Computational Learning. 1996.

preprocessing time dominates the total running time, with the exception of the Waveform data set, which consists of truly continuous-valued attributes: each attribute has over 500 different values in the data and almost as many boundary points. In comparison, the Shuttle domain has on average
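
The "boundary points" mentioned in this excerpt are candidate cut points between adjacent distinct attribute values whose class composition differs. The following sketch computes them in a rough, illustrative way; it is not the cited paper's preprocessing code.

```python
# Rough sketch of boundary points for a numeric attribute: a cut between two
# adjacent distinct values is a candidate if the class label sets on either
# side differ or either side is mixed.  Illustrative only.
import numpy as np

def boundary_points(values, labels):
    values = np.asarray(values)
    labels = np.asarray(labels)
    distinct = np.unique(values)
    label_sets = [set(labels[values == v]) for v in distinct]
    cuts = []
    for k in range(len(distinct) - 1):
        if (label_sets[k] != label_sets[k + 1]
                or len(label_sets[k]) > 1 or len(label_sets[k + 1]) > 1):
            cuts.append((distinct[k] + distinct[k + 1]) / 2.0)
    return cuts
```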


Nir Friedman and Moisés Goldszmidt. Discretizing Continuous Attributes While Learning Bayesian Networks. ICML. 1996.

from the Irvine repository [15]. We estimated the accuracy of the learned classifiers using 5-fold cross-validation, except for the "shuttle-small" and "waveform-21" datasets, where we used the hold-out method. We report the mean of the prediction accuracies over all cross-validation folds. We also report the standard deviation of the accuracies found in each fold. These
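
The two evaluation protocols mentioned above (k-fold cross-validation with mean and standard deviation over folds, versus a single hold-out split) look roughly like the following sketch; scikit-learn and the GaussianNB stand-in classifier are assumptions for illustration only.

```python
# Sketch: accuracy via k-fold cross-validation (mean and std over folds)
# versus a single hold-out split.  GaussianNB is only a stand-in classifier.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB

def cv_accuracy(X, y, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = GaussianNB().fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)

def holdout_accuracy(X, y, test_size=0.3, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    return GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
```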


Dietrich Wettschereck and David W. Aha. Weighting Features. ICCBR. 1995.

[Figure: Feature Weights Computed by MI in the Waveform+19 Task] Table 5. Average Accuracies on the Waveform Tasks Relative to k-NN (feature weight learning algorithms: feedback methods Relief-F and k-NN(VSM); ignorant method MI). Waveform, training size 100: k-NN 77.0±1.0, Relief-F 79.1, k-NN(VSM) 77.2, MI 78.0; size 300: 82.1±0.9, 82.4, 81.6, 82.6. Waveform+19, size 100: 73.4±1.0, 78.4, 76.7, 78.6; size 300: 81.3±0.9, 83.0, 82.5, 82.3. features to


Carlos J. Alonso González and Juan J. Rodríguez Diez. Time Series Classification by Boosting Interval Based Literals. Grupo de Sistemas Inteligentes, Departamento de Informática, Universidad de Valladolid.

after this time. t_3 is in [n/3, 2n/3]. 6. Downward: y(t) = m + rs − kx. Figure 3.b shows two examples of three of the classes. The data used was obtained from the UCI KDD Archive [4]. Waveform This dataset was introduced by [7]. The purpose is to distinguish between three classes, defined by the evaluation, for i = 1, 2, ..., 21, of the following functions: x_1(i) = u·h_1(i) + (1 − u)·h_2(i) + ε(i), x_2(i) = u·h_1(i) + (1 − u)·h_3(i) + ε(i), x_3


Juan J. Rodríguez and Carlos J. Alonso and Henrik Boström. Learning First Order Logic Time Series Classifiers: Rules and Boosting. Grupo de Sistemas Inteligentes, Departamento de Informática, Universidad de Valladolid, Spain.

[1]. Figure 2.b shows some examples of three of the classes. The data used were obtained from the UCI KDD Archive [4]. The number of examples is 600, with 60 points in each series. - Waveform: This dataset was introduced by [9]. We used the version from the UCI ML Repository [7]. The number of examples is 900 and the number of points in each series is 21. - Wave + Noise: This dataset was generated in


Kai Ming Ting and Ian H. Witten. Stacked Generalization: when does it work. Department of Computer Science University of Waikato.

from the UCI Repository of machine learning databases [Merz and Murphy, 1996]. Details of these are given in Table 1. For the artificial datasets---Led24 and Waveform---each training dataset L is generated using a different seed. Table 1: Details of the datasets used in the experiment (Datasets, # Samples, # Classes, # Attr & Type): Led24 200--5000 10


Amund Tveit. Empirical Comparison of Accuracy and Performance for the MIPSVM classifier with Existing Classifiers. Division of Intelligent Systems Department of Computer and Information Science, Norwegian University of Science and Technology.

As we can see from the results in figure 1, MIPSVM performs comparably well in terms of classification accuracy on the Waveform and Image Segment datasets. For the Letter Recognition dataset it performs considerably worse than the other classifiers. This is likely caused by the fact that MIPSVM does not have any balancing mechanisms for one-against-the-rest


Vikas Sindhwani and P. Bhattacharya and Subrata Rakshit. Information Theoretic Feature Crediting in Multiclass Support Vector Machines.

include: a synthetic dataset that we constructed, LED-24, Waveform 40, DNA, Vehicle, and Satellite Images (SAT) drawn from the UCI repository [19]; and three datasets which are a subset of the Reuters document collection [20].


Mohammed Waleed Kadous. Expanding the Scope of Concept Learning Using Metafeatures. School of Computer Science and Engineering, University of New South Wales.

are: arrhythmia, audiology, bach chorales, echocardiogram, isolet, mobile robots, waveform unable to process large data sets in a reasonable time frame, and/or require the user to set limits on the search such as refinement rules (Cohen, 1995). Furthermore, their most powerful feature -- the use of relations -- is rarely


Thomas T. Osugi. Exploration-Based Active Machine Learning. M.S. thesis, Graduate College, University of Nebraska.

[Table-of-contents fragment from the thesis: 5.2 Artificial Exploration Benchmark; 6 Conclusions and Future Work; Bibliography; A Dataset Descriptions (A.1 Waveform, A.2 SAT, A.3 Image)]


Pierre Geurts. Extremely randomized trees. Technical report, June 2003, University of Liège, Department of Electrical Engineering and Computer Science, Institut Montefiore.

Dataset, Nb. attributes, Nb. classes, #LS, #TS: Waveform 21 3 4000 1000; Two-norm 20 2 8000 2000; Satellite 36 6 4435 2000; Pendigits 16 10 7494 3498; Dig44 16 10 14000 4000; Letter 16 26 16000 4000; Isolet 617 26 6238


Iñaki Inza and Pedro Larrañaga and Ramon Etxeberria and Basilio Sierra. Feature Subset Selection by Bayesian networks based optimization. Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.

principal reason for 'overfitting' was the small number of training instances. To study this issue for FSS-EBNA, we have carried out a set of experiments with different training sizes of the Waveform-40 dataset [15] with the Naive-Bayes classification algorithm [19]: training sizes of 100, 200, 400, 800 and 1,600 samples, tested over a fixed test set with 3,200 instances. Figure 7 summarizes the set of


Kai Ming Ting and Boon Toh Low. Theory Combination: an alternative to Data Combination. University of Waikato.

why theory combination cannot outperform data combination in this region. Note that the behaviour of the oracle is different from the usual learning curve in the waveform and protein coding datasets when NB* is employed. This indicates that some learning algorithms can specialise in different regions of the description space when the sizes of the data batches are relatively small. This can


Matthias Scherf and W. Brauer. Feature Selection by Means of a Feature Weighting Approach. GSF - National Research Center for Environment and Health.

and compare the results of EUBAFES and RELIEF-F. Waveform-40 The Waveform-40 data set was introduced in [3] and applied in [25] to examine how well a feature selection algorithm works in the presence of a high number of irrelevant features. The data set comprises 300 instances with


Zhi-Hua Zhou and Xu-Ying Liu. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem.

[Table of cost matrices: types (a), (b), (c), each a 3×3 matrix indexed by true class i and predicted class j.] Under each type of cost matrix, 10 times 10-fold cross-validation is performed on each data set, except on waveform, where randomly generated training data of size 300 and test data of size 5,000 are used in 100 trials, which is the way this data set has been used in some other cost-sensitive
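
In cost-sensitive experiments of this kind, a cost matrix enters evaluation by weighting the confusion matrix. The sketch below shows that computation in general form; the actual cost values and protocol of the cited paper are not recoverable from the snippet, and all names are illustrative.

```python
# Sketch: total misclassification cost as the element-wise product of the
# confusion matrix and the cost matrix.  Illustrative names only.
import numpy as np

def total_cost(y_true, y_pred, cost, classes):
    """cost[i][j] = cost of predicting class j when the true class is i."""
    class_index = {c: k for k, c in enumerate(classes)}
    conf = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        conf[class_index[t], class_index[p]] += 1
    return float(np.sum(conf * np.asarray(cost)))
```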


Giorgio Valentini. An experimental bias--variance analysis of SVM ensembles based on resampling techniques.

each region is delimited by one or more of four simple polynomial and trigonometric functions 2 . The synthetic data set Waveform is generated from a combination of 2 of 3 "base" waves; we reduced the original three classes of Waveform to two, deleting all samples pertaining to class 0. The other data sets are all


Juan J. Rodríguez Diez and Carlos J. Alonso. Learning Classification RBF Networks by Boosting. Lenguajes y Sistemas Informáticos.

are summarized in table 1. The data sets waveform, waveform with noise [5, 6], CBF (cylinder, bell and funnel) [19] and control charts [1, 3] were already used in our work on boosting distance literals [18]. Auslan is the Australian sign


Zoran Obradovic and Slobodan Vucetic. Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples. Center for Information Science and Technology Temple University.

algorithm on synthetic data of various statistical properties showed that it accurately estimates the class probabilities on unlabeled data [73]. In another experiment on a 3-class benchmark dataset called Waveform [7], the S_L was constructed with balanced classes, while we experimented with different class distributions p_j on S_U. It was shown that for balanced classes p = [1/3, 1/3, 1/3] the
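
Constructing an evaluation set with a chosen class distribution p, as in the experiment described above, can be sketched as follows; the function name and the resampling-with-replacement choice are assumptions, not the authors' procedure.

```python
# Sketch: draw roughly n examples so that class k appears with probability p[k];
# p must be ordered to match np.unique(y).  Illustrative only.
import numpy as np

def sample_with_class_distribution(X, y, p, n, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    counts = np.round(np.asarray(p) * n).astype(int)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=m, replace=True)
        for c, m in zip(classes, counts)
    ])
    return X[idx], y[idx]
```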


