Center for Machine Learning and Intelligent Systems

Detect Malacious Executable(AntiVirus) Data Set

Abstract: I extracted features from malicious and non-malicious executables and created a training dataset to teach an SVM classifier. The dataset is meant for detecting whether an unknown executable is a virus or a normal, safe executable.

Data Set Characteristics: Multivariate
Number of Instances: 373
Area: Computer
Attribute Characteristics: Real
Number of Attributes: 513
Date Donated: 2016-03-03
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 207290


Source:

Piyush Anasta Rumao,
Undergraduate Computer Engineer, Fr. Conceicao Rodrigues College of Engineering, Bandra, Mumbai,
University of Mumbai
Email: piyushrumao '@' gmail.com
Phone: +91-7387712196 / +91-8554041806
You can call or email me anytime with any query, or to learn more about the working antivirus model that I have created.


Data Set Information:

TRAINING file: I created a training file with 100+ non-malicious examples and 250+ malicious samples. NON-MALICIOUS files are labeled +1 and MALICIOUS files are labeled -1. Based on comparison and analysis, I selected the 500 most commonly occurring features across the MALICIOUS and NON-MALICIOUS files, then compared the features extracted from each file against this set of best features. The file is saved with the .train extension.
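The selection of the most commonly occurring features can be sketched as follows. This is a minimal illustration, not the author's actual code: the function name `select_top_features`, the choice to count each feature once per file, and the alphabetical tie-breaking are all assumptions, and the toy feature strings are made up for the example.

```python
from collections import Counter

def select_top_features(files, k):
    """Return the k features that occur in the most files.

    `files` is a list of feature collections, one per executable.
    Counting each feature once per file is an assumption; the dataset
    description only says "most commonly occurring".
    """
    counts = Counter()
    for features in files:
        counts.update(set(features))  # count each feature once per file
    # Sort by descending count, then by name so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

# Toy input: hypothetical hex strings and DLL names, not taken from the dataset.
files = [
    {"4d5a", "9000", "kernel32.dll"},
    {"4d5a", "ffff", "kernel32.dll"},
    {"4d5a", "9000", "user32.dll"},
]
top = select_top_features(files, k=3)
print(top)
```

With the numbers given in this description, `k` would be 500 for the hex features and 13 for the DLL features.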

TESTING file: We select an unknown executable and carry out the same procedure on it. It can be given either class label (+1 or -1), because svmpredict will predict its class for us regardless of the label written in the file. We save this testing file with the .test extension.


Attribute Information:

For best results I used hybrid features (hexdump and DLL) from each executable. After extracting these features, I found the top 500 hex features and the top 13 DLL features that occur most commonly, and prepared a file of these best features. Each feature from this set that is found in an individual file is listed in the dataset with the value 1, while the rest are omitted, and the feature list ends with -1, e.g. ( +1 2:1 5:1 45:1 .............. -1)
Here +1 marks a NON-MALICIOUS file, 2:1 states that the 2nd feature is present, and likewise for features 5 and 45; features that do not occur are simply omitted.
A MALICIOUS executable is written as ( -1 6:1 56:1 ............ -1)
So each attribute that is present is followed by a colon and a 1 (:1).
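The line format described above can be written and read back with a few lines of Python. This is a hedged sketch: `encode_line` and `parse_line` are hypothetical helper names, and the trailing -1 terminator is taken from the description here rather than checked against the data files. LIBSVM's usual sparse format has no such terminator, so it may need to be stripped before the file is passed to other tools.

```python
def encode_line(label, present, terminator="-1"):
    """Build one line: label, then index:1 per present feature, then terminator."""
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    pairs = [f"{i}:1" for i in sorted(set(present))]
    return " ".join([f"{label:+d}"] + pairs + [terminator])

def parse_line(line, terminator="-1"):
    """Return (label, list of present feature indices) for one line."""
    tokens = line.split()
    label = int(tokens[0])
    body = tokens[1:]
    # Drop the trailing terminator if the line carries one.
    if body and body[-1] == terminator:
        body = body[:-1]
    indices = []
    for token in body:
        index, value = token.split(":")
        if float(value) != 0:  # absent features are normally omitted entirely
            indices.append(int(index))
    return label, indices

line = encode_line(+1, [5, 2, 45])
print(line)
print(parse_line(line))
print(parse_line("-1 6:1 56:1 -1"))
```

Feature indices are treated as 1-based here, matching the "2nd feature" wording in the description.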


Relevant Papers:

My project is based on the paper
"A Hybrid Model to Detect Malicious Executables" (using data mining and machine learning concepts) by M. M. Masud, Latifur Khan, and Bhavani Thuraisingham.



Citation Request:

I found no dataset on antivirus techniques, which is the need of the hour.
So I hope you will encourage this work.
Looking forward to a positive response.

