Farm Ads Data Set
Download: Data Folder, Data Set Description
Abstract: This data was collected from text ads found on twelve websites that deal with various farm animal related topics. The binary labels are based on whether or not the content owner approves of the ad.
|
|
Data Set Characteristics: |
Text |
Number of Instances: |
4143 |
Area: |
Business |
Attribute Characteristics: |
N/A |
Number of Attributes: |
54877 |
Date Donated |
2011-10-18 |
Associated Tasks: |
Classification |
Missing Values? |
N/A |
Number of Web Hits: |
94800 |
Source:
Chris Mesterharm and Michael J. Pazzani
Rutgers, The State University of New Jersey
mesterha '@' cs.rutgers.edu
Data Set Information:
This data was collected from text ads found on twelve websites that deal with various farm animal related topics. Information from the ad creative and the ad landing page is included. The binary labels are based on whether or not the content owner approves of the ad.
For each ad, we include the words on the ad creative and the words from the landing page. Each word from the creative is given a prefix
of 'ad-'. Title and header HTML markups are noted in a similar way in the text of the landing page. We have already performed stemming and
stop word removal. Each ad is on a single line. The first word in the line is the label of the instance. It is 1 for accepted ads and -1 for rejected ads.
We have also included a straightforward bag-of-words representation of our data. We use the SVMlight sparse vector format. The first value
is the label followed by every nonzero attribute. Each of these attributes is encoded as index:value. This is the representation used for the relevant paper cited below.
Attribute Information:
Text words in file farm-ads. SVMlight format sparse vectors in file farm-ads-vect.
Relevant Papers:
Active Learning using On-line Algorithms. Chris Mesterharm, Michael J. Pazzani. In KDD 2011.
Citation Request:
Please refer to the Machine Learning
Repository's citation policy
|