Center for Machine Learning and Intelligent Systems

Detect Malacious Executable(AntiVirus) Data Set

Abstract: I extracted features from malicious and non-malicious executables and created a training dataset to teach an SVM classifier. The dataset is meant for deciding whether an unknown executable is a virus or a normal, safe executable.

Data Set Characteristics:  


Number of Instances:




Attribute Characteristics:


Number of Attributes:


Date Donated


Associated Tasks:


Missing Values?


Number of Web Hits:



Piyush Anasta Rumao,
Undergraduate Computer Engineer from Fr. Conceicao Rodrigues College of Engineering, Bandra, Mumbai,
University of Mumbai,
email:- piyushrumao '@'
phone- +91-7387712196 / +91-8554041806
You can call or mail me at any time with any query, or to learn more about the working antivirus model that I have created.

Data Set Information:

TRAINING file: I have created a training file with 100+ non-malicious examples and 250+ malicious samples. NON-MALICIOUS samples are labeled +1 and MALICIOUS samples are labeled -1. Based on comparison and analysis, I selected the 500 most commonly occurring features in MALICIOUS and NON-MALICIOUS files and compared the features extracted from each file against these best features. The file is saved with the .train extension.
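The selection step described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names `select_top_features` and `feature_indices`, the toy feature strings, and the tie-breaking by name are all hypothetical and are not taken from the author's code. "Most commonly occurring" is interpreted here as "present in the largest number of files".

```python
from collections import Counter

def select_top_features(files, k):
    """Return the k features present in the most files, most frequent first."""
    counts = Counter()
    for features in files:
        counts.update(set(features))  # count each feature once per file
    # Sort by descending count, then by name so ties have a fixed order
    # (the tie-breaking rule is an assumption, not from the description).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in ranked[:k]]

def feature_indices(file_features, best_features):
    """1-based positions of the selected features that occur in one file."""
    present = set(file_features)
    return [i for i, f in enumerate(best_features, start=1) if f in present]

# Toy data: hypothetical feature strings, not taken from the dataset.
samples = [
    ["4d5a", "9000", "kernel32.dll"],
    ["4d5a", "ffff", "kernel32.dll"],
    ["4d5a", "9000"],
]
best = select_top_features(samples, k=3)
print(best)
print(feature_indices(["9000", "kernel32.dll", "abcd"], best))
```

With the real data, `k` would be 500 and each entry of `samples` would hold the features extracted from one executable.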

TESTING file: We select an unknown malicious executable and carry out the same procedure on it. It can be placed in either class (+1 / -1), because svmpredict determines the class for us regardless of the label given. We save this testing file with the .test extension.

Attribute Information:

For best results I used hybrid features (hexdump and DLL) from each executable. After extracting these features, I identified the top 500 hex features and the top 13 DLL features that occur most commonly and prepared a file of these best features. Each of these features that is found in an individual file is listed in the dataset with a value of 1, the rest are ignored, and the feature set ends with -1, e.g. ( +1 2:1 5:1 45:1 .............. -1)
Here +1 denotes a NON-MALICIOUS file and 2:1 states that the 2nd feature exists, and similarly for 5 and 45; features that do not occur are simply omitted.
A MALICIOUS executable is written as ( -1 6:1 56:1 ............ -1)
So each attribute that exists is followed by a colon and a 1 (:1).
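
As a minimal sketch of the line format described above, the following Python builds and reads back one record. The helper names `encode_line` and `parse_line` are hypothetical, and the code simply follows the description (label first, index:1 pairs, trailing -1). It has not been checked against the actual .train and .test files.

```python
def encode_line(label, indices):
    """Build one record: label, sorted index:1 pairs, then the trailing -1."""
    if label not in (1, -1):
        raise ValueError("label must be +1 (non-malicious) or -1 (malicious)")
    pairs = [f"{i}:1" for i in sorted(set(indices))]
    return " ".join([f"{label:+d}", *pairs, "-1"])

def parse_line(line):
    """Return (label, indices of present features) for one record."""
    tokens = line.split()
    label = int(tokens[0])
    # Only index:1 tokens carry features; the closing -1 has no colon.
    indices = [int(tok.split(":")[0]) for tok in tokens[1:] if ":" in tok]
    return label, indices

line = encode_line(+1, [45, 2, 5])
print(line)
print(parse_line("-1 6:1 56:1 -1"))
```

Features that are absent never appear in the line, so a record is only as long as the number of features present in that file.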

Relevant Papers:

My project is based on the paper:
A Hybrid Model to Detect Malicious Executables (using data mining and machine learning concepts), by M. M. Masud, Latifur Khan, and Bhavani Thuraisingham

Citation Request:

I found no data set on antivirus techniques, which is the need of the hour.
So I hope you will encourage this work.
Looking forward to a positive response.
