Center for Machine Learning and Intelligent Systems

URL Reputation Data Set

Abstract: Anonymized 120-day subset of the ICML-09 URL data containing 2.4 million examples and 3.2 million features.

Data Set Characteristics: Multivariate, Time-Series
Attribute Characteristics: Integer, Real
Number of Instances: about 2.4 million
Number of Attributes: about 3.2 million

'Identifying Suspicious URLs: An Application of Large-Scale Online Learning' (ICML-09)
Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker

Please visit [] for more information.

Data Set Information:

Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:
* FeatureTypes --- A text file listing the indices of the features that are real-valued.
* DayX.svm (where X is an integer from 0 to 120) --- The data for day X in SVM-light format. A label of +1 corresponds to a malicious URL and -1 corresponds to a benign URL.
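
The file layout above can be illustrated with a minimal parsing sketch. It assumes the usual SVM-light line layout (a label followed by index:value pairs) and assumes that FeatureTypes holds whitespace-separated integer indices; neither detail is spelled out on this page. The helper names and the sample values are hypothetical and are not taken from the archive.

```python
from typing import Dict, Set, Tuple


def parse_svmlight_line(line: str) -> Tuple[int, Dict[int, float]]:
    """Parse one SVM-light line into (label, {feature_index: value})."""
    # Drop an optional trailing comment introduced by '#'.
    body = line.split("#", 1)[0].strip()
    if not body:
        raise ValueError("empty line")
    tokens = body.split()
    # "+1" -> 1 (malicious URL), "-1" -> -1 (benign URL).
    label = int(float(tokens[0]))
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def parse_feature_types(text: str) -> Set[int]:
    """Collect real-valued feature indices.

    Assumption: the file contains whitespace-separated integers.
    """
    return {int(tok) for tok in text.split()}


# Hypothetical sample data, not copied from the real files.
sample_line = "+1 4:0.0788 5:0.1241 11:1 16:1"
sample_feature_types = "4\n5\n6\n"

label, features = parse_svmlight_line(sample_line)
real_valued = parse_feature_types(sample_feature_types)

# Split the example's features into real-valued ones and all others.
real_part = {i: v for i, v in features.items() if i in real_valued}
other_part = sorted(i for i in features if i not in real_valued)
```

For whole files, a sparse-matrix loader is likely a better fit than line-by-line dictionaries. For instance, scikit-learn's load_svmlight_file can read one DayX.svm at a time, and passing the same n_features for every day should keep the column count consistent across days.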

Attribute Information:

Attributes are anonymized, but correspond to lexical and host-based features gathered for each URL.

Citation Request:

If you use this data set in published work, please cite the ICML-09 paper in which it was first introduced and described:

Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker,
Identifying Suspicious URLs: An Application of Large-Scale Online Learning
Proceedings of the International Conference on Machine Learning (ICML), pages 681-688, Montreal, Quebec, June 2009.
