Center for Machine Learning and Intelligent Systems

Musk (Version 2) Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set.

Return to Musk (Version 2) data set page.

Qingping Tao. Making Efficient Learning Algorithms with Exponentially Many Features. Ph.D. dissertation, The Graduate College, University of Nebraska. 2004.

4.5.1 Content-Based Image Retrieval, 67; 4.5.2 Identifying Trx-fold Proteins, 69; 4.5.3 Multi-Site Drug Binding Affinity, 69; 4.5.4 Musk Data Sets, 70; 4.6 Conclusions, 71; 5 Extended Kernels for Generalized Multiple-Instance Learning, 73; 5.1 A Count-based Kernel for GMIL ...

Qingping Tao and Stephen Scott and N. V. Vinodchandran and Thomas T. Osugi. SVM-based generalized multiple-instance learning via approximate box counting. ICML. 2004.

results of our new kernel on applications such as content-based image retrieval, prediction of drug affinity to bind to multiple sites simultaneously, protein sequence identification, and the Musk data sets. Finally, we conclude in Section 7. 2. Notation and Definitions Let X denote {0, ..., s}^d (though our results trivially generalize to X = prod_{i=1}^{d} {0, ..., s_i}). Let B_X denote the set of
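The count-based kernel excerpted above operates over axis-parallel boxes in X = {0, ..., s}^d: each box is a product of d integer intervals [a, b] with 0 <= a <= b <= s, so each dimension contributes (s+1)(s+2)/2 choices. A quick sketch of why exact enumeration is infeasible and approximate counting is needed (the function name is ours, for illustration only):

```python
def num_boxes(s: int, d: int) -> int:
    """Count axis-parallel boxes in {0, ..., s}^d.

    Each dimension contributes one integer interval [a, b] with
    0 <= a <= b <= s, of which there are (s+1)(s+2)/2.
    """
    intervals = (s + 1) * (s + 2) // 2
    return intervals ** d

print(num_boxes(1, 2))     # 9 boxes in the 2x2 grid {0,1}^2
print(num_boxes(15, 166))  # astronomically large: exact counting is hopeless
```

For s = 1 there are three intervals per axis ([0,0], [0,1], [1,1]), hence 3^2 = 9 boxes in two dimensions; at Musk-like dimensionality (d = 166) the count is astronomically large.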

Giorgio Valentini. Random Aggregated and Bagged Ensembles of SVMs: An Empirical Bias-Variance Analysis. Multiple Classifier Systems. 2004.

from UCI [14] (Waveform, Grey-Landsat, Letter-Two, Letter-Two with added noise, Spam, Musk) and the P2 synthetic data set. We achieved a characterization of the bias-variance decomposition of the error in bagged and random aggregated ensembles that resembles the one obtained for single SVMs [5] (Fig. 1). For more
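The bias-variance decomposition for 0/1 loss discussed in this line of work can be estimated empirically from the predictions of models trained on bootstrap replicates. A minimal sketch in the style of a Domingos-type decomposition (the function name and interface are ours, not the paper's):

```python
from collections import Counter

def bias_variance_01(predictions, y_true):
    """Estimate average bias and variance under 0/1 loss.

    predictions: list of prediction lists, one per model trained on a
    different bootstrap replicate; predictions[m][i] is model m's label
    for test point i.  y_true: the true labels.
    """
    n_models = len(predictions)
    n_points = len(y_true)
    bias = variance = 0.0
    for i in range(n_points):
        votes = Counter(p[i] for p in predictions)
        main_pred, _ = votes.most_common(1)[0]        # modal ("main") prediction
        bias += main_pred != y_true[i]                # 1 if main prediction is wrong
        variance += 1 - votes[main_pred] / n_models   # disagreement with main
    return bias / n_points, variance / n_points

# toy check: three models, four test points
preds = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 1]]
b, v = bias_variance_01(preds, [1, 0, 0, 1])
print(b, round(v, 4))  # 0.0 0.1667
```

Here the main prediction agrees with every true label (zero bias), while two of the four test points see one dissenting model each, giving variance (1/3 + 1/3) / 4 = 1/6.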

Zhi-Hua Zhou and Min-Ling Zhang. Ensembles of Multi-instance Learners. ECML. 2003.

bags, and the number of instances contained in each bag ranges from 1 to 1,044. Detailed information on the Musk data is tabulated in Table 2. Ten-fold cross validation is performed on each Musk data set. In each fold, Bagging is employed to build an ensemble for each of the four base multi-instance learners, i.e. Iterated-discrim APR, Diverse Density, Citation-kNN, and EM-DD. Each ensemble
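Bagging for multi-instance learners, as described above, resamples whole bags (not individual instances) to train each ensemble member, and the members then vote on a bag's label. A minimal sketch; the `base_learner` factory and its `fit`/`predict_bag` interface are our assumptions for illustration, not the paper's API:

```python
import random

def bag_level_bagging(bags, labels, base_learner, n_members=5, seed=0):
    """Build an ensemble by bootstrap-resampling whole bags.

    bags: list of bags, each a list of instance feature vectors.
    base_learner: zero-argument factory returning an object with
    fit(bags, labels) and predict_bag(bag) -> 0/1.
    """
    rng = random.Random(seed)
    ensemble = []
    n = len(bags)
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap over bag indices
        learner = base_learner()
        learner.fit([bags[i] for i in idx], [labels[i] for i in idx])
        ensemble.append(learner)
    return ensemble

def ensemble_predict(ensemble, bag):
    """Majority vote of the ensemble members on one bag."""
    votes = sum(m.predict_bag(bag) for m in ensemble)
    return int(votes * 2 > len(ensemble))
```

Any multi-instance learner exposing this interface (an APR learner, Diverse Density, Citation-kNN, EM-DD, or otherwise) could be plugged in as the base learner.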

Giorgio Valentini and Thomas G. Dietterich. Low Bias Bagged Support Vector Machines. ICML. 2003.

[Table excerpt: error rates of three methods and win-tie-loss comparisons, per data set and kernel]

Spam  Linear  0.1356  0.1340  0.1627   0-4-1  5-0-0  5-0-0
      Polyn.  0.1309  0.1338  0.1388   1-4-0  2-3-0  2-2-1
      Gauss.  0.1239  0.1349  0.1407   3-2-0  3-2-0  2-3-0
Musk  Linear  0.1244  0.1247  0.1415   0-5-0  4-1-0  4-1-0
      Polyn.  0.1039  0.1193  0.1192   4-1-0  4-0-1  2-2-1
      Gauss.  0.0872  0.0972  0.0920   4-1-0  2-2-1  1-0-4

6.1. Experimental setup We employed two synthetic data

Giorgio Valentini. Ensemble Methods Based on Bias-Variance Analysis. Theses Series DISI-TH-2003, Dipartimento di Informatica e Scienze dell'Informazione. 2003.

with 4601 instances and 57 continuous attributes. Musk: The dataset (available from UCI) describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be nonmusks. The 166 features that describe

Stephen D. Bay. Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets. ICML. 1998.

maintain consistency with reported results (Quinlan, 1996). For Satimage, we used the original division into a training and test set, so the results represent one run of each algorithm. For the Musk dataset, which has 166 features, FSS and BSS took too long to run (over 24 hours for a single trial) and no results were obtained. 3.2 ACCURACY The accuracy and parameter selection results (average k or

Hendrik Blockeel and Luc De Raedt. Lookahead and Discretization in ILP. ILP. 1997.

to an interval test in the discrete domain. The three approaches have been used and compared in our experiments. 3.2 Experimental Evaluation We evaluate the effect of discretization on two datasets: the Musk dataset (available at the UCI repository [11]) and the Diterpene dataset, generously provided to us by Steffen Schulze-Kremer and Saso Dzeroski. Both datasets contain nondeterminate

Zhi-Hua Zhou and Min-Ling Zhang. Neural Networks for Multi-Instance Learning. National Laboratory for Novel Software Technology, Nanjing University.

BP-MIP network is used in prediction, a bag is positively labeled if and only if the output of the network on at least one of its instances is not less than 0.5. 5. Experiments 5.1 Real-world data sets The Musk data is the only real-world benchmark test data for multi-instance learning at present. The data was generated by Dietterich et al. in the way described in Section 2. There are two data
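The bag-labeling rule quoted above (a bag is positive iff the network's output on at least one instance is at least 0.5) is the standard multi-instance max rule, and can be written generically as follows; `score` stands in for the trained network's per-instance output and is our illustrative placeholder:

```python
def label_bag(bag, score, threshold=0.5):
    """Multi-instance prediction with the max rule: a bag is positive
    iff at least one instance scores at or above the threshold."""
    return int(max(score(x) for x in bag) >= threshold)

# toy scorer standing in for the trained network's per-instance output
score = lambda x: x[0]
print(label_bag([[0.1], [0.7], [0.2]], score))  # 1: one instance clears 0.5
print(label_bag([[0.1], [0.3]], score))         # 0: no instance does
```

The same rule makes the bag prediction depend only on the highest-scoring instance, which matches the "at least one positive instance" semantics of the multi-instance setting.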

Giorgio Valentini. An experimental bias-variance analysis of SVM ensembles based on resampling techniques.

[Table excerpt: per data set and kernel, two error rates followed by four percentage figures]

Spam  RBF-SVM     0.1292  0.1290    0.14   -0.48    1.57    2.22
      Poly-SVM    0.1323  0.1318    0.35    2.11   -5.83   -1.19
      D-prod SVM  0.1495  0.1389    7.15   -3.16   19.87   16.38
Musk  RBF-SVM     0.0898  0.0920   -2.36   -6.72   22.91   13.67
      Poly-SVM    0.1225  0.1128    7.92  -10.49   38.17   37.26
      D-prod SVM  0.1501  0.1261   15.97   -2.41   34.56   29.38

V. DISCUSSION A. Bias-Variance characteristics of

Zhi-Hua Zhou and Min-Ling Zhang. Solving Multi-Instance Problems with Classifier Ensemble Based on Constructive Clustering. National Laboratory for Novel Software Technology.

c2 are present in the bag. Therefore, it is obvious that Cce can be applied to generalized multi-instance problems without any modification, which is a prominent advantage, while

Table 2. The Musk data (72 molecules are shared in both data sets)

Data set  Dim.  Bags (Total / Musk / Non-musk)  Instances  Instances per bag (Min / Max / Ave.)
Musk1     166   92 / 47 / 45                    476        2 / 40 / 5.17
Musk2     166   102 / 39 / 63                   6,598      1 / 1,044 / 64.69
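The per-bag averages in Table 2 follow directly from the instance and bag totals; a quick check (numbers taken from the table):

```python
# totals from Table 2 of the excerpt above
musk1_instances, musk1_bags = 476, 92
musk2_instances, musk2_bags = 6598, 102

print(round(musk1_instances / musk1_bags, 2))  # 5.17
print(round(musk2_instances / musk2_bags, 2))  # 64.69
```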

Hendrik Blockeel and Luc De Raedt. Top-down Induction of Logical Decision Trees. Katholieke Universiteit Leuven Department of Computer Science.

Dougherty's [DKS95] work, and as such is capable of handling numerical data. For more details see [VLDDR96, BDR97b]. 5 Experimental Evaluation Experiments have been performed on several benchmark datasets: Mutagenesis [SMSK96], Musk [DLLP97, MM96], and Diterpenes [DSKH+96]. For all the experiments, Tilde's default parameters were used; only the choice of the number of thresholds for discretization

Zhi-Hua Zhou and Hua Zhou. Multi-Instance Learning: A Survey. National Laboratory for Novel Software Technology.

and relational learning. 7 Discussion The most serious problem encumbering the advance of multi-instance learning is that there is only one popularly used real-world benchmark, i.e. the Musk data sets. Although some application data have been used in some works, they can hardly act as benchmarks for several reasons. For example, the COREL image database has been used by Maron and Ratan [18], Yang
