Spambase

Donated on 6/30/1999

Classifying Email as Spam or Non-Spam

Dataset Characteristics

Multivariate

Subject Area

Computer Science

Associated Tasks

Classification

Feature Type

Integer, Real

# Instances

4601

# Features

Dataset Information

What do the instances in this dataset represent?

Emails

Additional Information

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... The classification task for this dataset is to determine whether a given email is spam or not. Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter. For background on spam: Cranor, Lorrie F., LaMacchia, Brian A. Spam!, Communications of the ACM, 41(8):74-83, 1998. Typical performance is around ~7% misclassification error. False positives (marking good mail as spam) are very undesirable.If we insist on zero false positives in the training/testing set, 20-25% of the spam passed through the filter. See also Hewlett-Packard Internal-only Technical Report. External version forthcoming.

Has Missing Values?

Yes

Variables Table

Variable Name	Role	Type	Missing Values
word_freq_make	Feature	Continuous	no
word_freq_address	Feature	Continuous	no
word_freq_all	Feature	Continuous	no
word_freq_3d	Feature	Continuous	no
word_freq_our	Feature	Continuous	no
word_freq_over	Feature	Continuous	no
word_freq_remove	Feature	Continuous	no
word_freq_internet	Feature	Continuous	no
word_freq_order	Feature	Continuous	no
word_freq_mail	Feature	Continuous	no

Rows per page

0 to 10 of 58

Additional Variable Information

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character was frequently occuring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes: 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string. 6 continuous real [0,100] attributes of type char_freq_CHAR] = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurences) / total characters in e-mail 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters 1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters 1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.