Spambase

Donated on 6/30/1999

Classifying Email as Spam or Non-Spam

Dataset Characteristics

Multivariate

Subject Area

Computer Science

Associated Tasks

Classification

Feature Type

Integer, Real

# Instances

4601

# Features

57

Dataset Information

What do the instances in this dataset represent?

Emails

Additional Information

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... The classification task for this dataset is to determine whether a given email is spam or not.

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam!, Communications of the ACM, 41(8):74-83, 1998.

Typical performance is around 7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter.

See also Hewlett-Packard Internal-only Technical Report. External version forthcoming.
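The trade-off between zero false positives and missed spam can be illustrated with a minimal sketch. It assumes a hypothetical classifier that outputs a spam score per message (higher means more spam-like). The function names and the scores below are made up for illustration and are not taken from the dataset or its documentation.

```python
def zero_fp_threshold(scores, labels):
    """Return the smallest threshold that flags no non-spam message.

    scores: spam scores from a hypothetical classifier (higher = more spam-like).
    labels: 1 for spam, 0 for non-spam.
    A message is flagged as spam when its score is strictly above the threshold.
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    # Flagging only scores above the highest non-spam score
    # guarantees zero false positives on this data.
    return max(ham_scores)


def spam_pass_rate(scores, labels, threshold):
    """Fraction of spam messages that are NOT flagged at the given threshold."""
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    missed = sum(1 for s in spam_scores if s <= threshold)
    return missed / len(spam_scores)


# Made-up scores for illustration only; not derived from spambase.data.
scores = [0.05, 0.10, 0.40, 0.70, 0.30, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = zero_fp_threshold(scores, labels)
print(threshold, spam_pass_rate(scores, labels, threshold))
```

On this toy data one of the four spam messages scores below the highest non-spam score, so a quarter of the spam passes through once false positives are forced to zero.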

Has Missing Values?

No

Variables Table

Variable Name	Role	Type	Missing Values
word_freq_make	Feature	Continuous	no
word_freq_address	Feature	Continuous	no
word_freq_all	Feature	Continuous	no
word_freq_3d	Feature	Continuous	no
word_freq_our	Feature	Continuous	no
word_freq_over	Feature	Continuous	no
word_freq_remove	Feature	Continuous	no
word_freq_internet	Feature	Continuous	no
word_freq_order	Feature	Continuous	no
word_freq_mail	Feature	Continuous	no

Showing 10 of 58 variables.

Additional Variable Information

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail

1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters

1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
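The attribute definitions above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the creators' extraction code. The function names are hypothetical, and two details are assumptions the text does not settle: matching words case-insensitively, and returning zeros when a message has no capital letters.

```python
import re


def word_freq(text, word):
    """100 * (occurrences of word) / (total words in text).

    A word is any run of alphanumeric characters.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    # Assumption: case-insensitive matching (the source does not specify).
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)


def char_freq(text, char):
    """100 * (occurrences of char) / (total characters in text)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_lengths(text):
    """Return (average, longest, total) length of runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: zeros for text with no capitals; the documented
        # ranges start at 1, so this case is not covered by the source.
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


# Made-up message for illustration.
sample = "FREE money!! Order NOW, George"
print(word_freq(sample, "order"))    # 1 of 5 words
print(char_freq(sample, "!"))        # 2 of 30 characters
print(capital_run_lengths(sample))   # runs: FREE, O, NOW, G
```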

Dataset Files

File	Size
spambase.data	686.5 KB
spambase.DOCUMENTATION	6.3 KB
spambase.names	3.5 KB
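A minimal loading sketch for 'spambase.data' follows. It assumes the file is comma-separated with no header row, and that each line holds the 57 feature values followed by the 0/1 class label in the last column, as described above. The function name load_spambase and the path argument are hypothetical.

```python
import csv


def load_spambase(path):
    """Read a spambase.data-style file into (features, labels).

    Assumes comma-separated rows without a header: 57 feature values
    followed by the class label (1 = spam, 0 = non-spam).
    """
    features, labels = [], []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            values = [float(v) for v in row]
            if len(values) != 58:
                raise ValueError(f"expected 58 columns, got {len(values)}")
            features.append(values[:-1])
            labels.append(int(values[-1]))
    return features, labels
```

A call such as load_spambase("spambase.data") would return two parallel lists. With the 4601 instances listed above, each list should have 4601 entries.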

Papers Citing this Dataset

Setting decision thresholds when operating conditions are uncertain

By Cèsar Ferri, José Hernández-Orallo, Peter Flach. 2019

Published in Data Mining and Knowledge Discovery.

Communication-Efficient Accurate Statistical Estimation

By Jianqing Fan, Yongyi Guo, Kaizheng Wang. 2019

Published in ArXiv.

Explaining Vulnerabilities to Adversarial Machine Learning through Visual Analytics

By Yuxin Ma, Tiankai Xie, Jundong Li, Ross Maciejewski. 2019

Published in ArXiv.

Minimax Optimal Online Stochastic Learning for Sequences of Convex Functions under Sub-Gradient Observation Failures

By Hakan Gokcesu, Suleyman Kozat. 2019

Published in ArXiv.

Recombinator-k-means: Enhancing k-means++ by seeding from pools of previous runs

By Carlo Baldassi. 2019

Published in ArXiv.

44 citations

Creators

Mark Hopkins

Erik Reeber

George Forman

Jaap Suermondt

DOI

10.24432/C53G6X

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
