Center for Machine Learning and Intelligent Systems

YouTube Spam Collection Data Set
Download: Data Folder, Data Set Description

Abstract: A public set of comments collected for spam research. It comprises five datasets totaling 1,956 real messages extracted from five videos that were among the 10 most viewed during the collection period.

Data Set Characteristics: Text
Number of Instances: 1956
Area: Computer
Attribute Characteristics: N/A
Number of Attributes: 5
Date Donated: 2017-03-26
Associated Tasks: Classification
Missing Values: N/A
Number of Web Hits: 24704


Source:

This corpus has been collected using the YouTube Data API v3.


Data Set Information:

The table below lists each dataset, the ID of the YouTube video it was collected from, the number of samples in each class, and the total number of samples per dataset.

Dataset     YouTube ID     # Spam   # Ham   Total
Psy         9bZkp7q19f0       175     175     350
KatyPerry   CevxZvSJLk8       175     175     350
LMFAO       KQ6zr6kCPj8       236     202     438
Eminem      uelHwf8o7_U       245     203     448
Shakira     pRpeEdMmmQ0       174     196     370

Note: the chronological order of the comments was kept.
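
The per-dataset counts in the table can be cross-checked against the 1,956 instances reported for the whole collection. The following is a minimal sketch that only re-adds the numbers listed above; the names COUNTS and summarize are illustrative and not part of the dataset.

```python
# Per-dataset class counts copied from the table above: (spam, ham).
COUNTS = {
    "Psy": (175, 175),
    "KatyPerry": (175, 175),
    "LMFAO": (236, 202),
    "Eminem": (245, 203),
    "Shakira": (174, 196),
}


def summarize(counts):
    """Return (total spam, total ham, grand total) over all datasets."""
    total_spam = sum(spam for spam, _ in counts.values())
    total_ham = sum(ham for _, ham in counts.values())
    return total_spam, total_ham, total_spam + total_ham


total_spam, total_ham, total = summarize(COUNTS)
print(f"spam={total_spam} ham={total_ham} total={total}")
print(f"spam share: {total_spam / total:.1%}")
```

The grand total matches the reported number of instances, and the two classes are close to balanced across the collection.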


Attribute Information:

The collection consists of one CSV file per dataset, where each line has the following attributes:

COMMENT_ID,AUTHOR,DATE,CONTENT,TAG

One example line is shown below:

z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,2013-11-28T19:52:35,please like :D [Web Link],1
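
As an illustration of the record layout, the example line above can be parsed with Python's standard csv module. This is a sketch under stated assumptions rather than an official loader: the field names are taken from the attribute list on this page, the text passed in is assumed to contain no header row, a TAG value of 1 is assumed to mean spam (this page does not state the encoding), and the helper name parse_comments is hypothetical.

```python
import csv
import io
from datetime import datetime

# Field names as listed on this page; the header row in the actual files may differ.
FIELDS = ["COMMENT_ID", "AUTHOR", "DATE", "CONTENT", "TAG"]

# The example record given on this page.
SAMPLE = (
    "z12oglnpoq3gjh4om04cfdlbgp2uepyytpw0k,Francisco Nora,"
    "2013-11-28T19:52:35,please like :D [Web Link],1\n"
)


def parse_comments(text):
    """Parse CSV text (without a header row) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    rows = []
    for row in reader:
        rows.append({
            "comment_id": row["COMMENT_ID"],
            "author": row["AUTHOR"],
            # Tolerate an empty DATE field instead of failing on it.
            "date": datetime.fromisoformat(row["DATE"]) if row["DATE"] else None,
            "content": row["CONTENT"],
            # Assumption: TAG == "1" marks spam; any other value is treated as ham.
            "is_spam": row["TAG"] == "1",
        })
    return rows


rows = parse_comments(SAMPLE)
print(rows[0]["author"], rows[0]["date"], rows[0]["is_spam"])
```

Using the csv module rather than splitting on commas keeps the parse correct when a comment's CONTENT field is quoted and itself contains commas.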


Relevant Papers:

Alberto, T.C., Lochter, J.V., Almeida, T.A. TubeSpam: Comment Spam Filtering on YouTube. Proceedings of the 14th IEEE International Conference on Machine Learning and Applications (ICMLA'15), 1-6, Miami, FL, USA, December, 2015.

Almeida, T.A., Silva, T.P., Santos, I., Gomez Hidalgo, J.M. Text Normalization and Semantic Indexing to Enhance Instant Messaging and SMS Spam Filtering. Knowledge-Based Systems, Elsevier, 108, 25-32, 2016.



Citation Request:

If you find this collection useful, we would appreciate it if you:

1. Cite the TubeSpam paper listed under Relevant Papers and the web page: [Web Link].
2. Send a message to either talmeida < AT > ufscar.br or tuliocasagrande < AT > acm.org if you make use of the corpus.

