BlogFeedback
Donated on 5/28/2014
Instances in this dataset contain features extracted from blog posts. The task associated with the data is to predict how many comments the post will receive.
Dataset Characteristics
Multivariate
Subject Area
Social Science
Associated Tasks
Regression
Feature Type
Integer, Real
# Instances
60021
# Features
-
Dataset Information
Additional Information
This data originates from blog posts. The raw HTML documents of the blog posts were crawled and processed. The prediction task associated with the data is to predict the number of comments a post receives in the upcoming 24 hours. To simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then we calculate all the features of the selected blog posts from the information that was available at the basetime, so each instance corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime. In the train data, the basetimes were in the years 2010 and 2011. In the test data, the basetimes were in February and March 2012. This simulates the real-world situation in which training data from the past is available to predict events in the future. The train data was generated from different basetimes whose time intervals may overlap. Consequently, if you simply split the train data into disjoint partitions, the underlying time intervals of those partitions may still overlap. You should therefore use the provided, temporally disjoint train and test splits to ensure that the evaluation is fair.
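The basetime procedure described above can be sketched as follows. This is only an illustration on synthetic data, not the code used to build the dataset: the function name `build_instances`, the dictionary keys, the inclusive/exclusive handling of window boundaries, and the use of hours as the time unit are all assumptions.

```python
from datetime import datetime, timedelta


def build_instances(posts, basetime):
    """Turn raw posts into (features, target) records for one basetime.

    `posts` is a list of dicts with hypothetical keys:
    'id', 'published' (datetime), 'comments' (list of datetimes).
    """
    window_start = basetime - timedelta(hours=72)
    target_end = basetime + timedelta(hours=24)
    instances = []
    for post in posts:
        # Keep only posts published at most 72 hours before the basetime.
        if not (window_start <= post["published"] <= basetime):
            continue
        # Features may only use information available at the basetime.
        before = [t for t in post["comments"] if t <= basetime]
        # The target counts comments in the 24 hours after the basetime.
        after = [t for t in post["comments"] if basetime < t <= target_end]
        age = basetime - post["published"]
        instances.append({
            "id": post["id"],
            "comments_before_basetime": len(before),
            "hours_since_publication": age.total_seconds() / 3600,
            "target": len(after),
        })
    return instances


# Synthetic example: post "b" is too old and is dropped.
basetime = datetime(2011, 6, 1, 12, 0)
posts = [
    {
        "id": "a",
        "published": basetime - timedelta(hours=10),
        "comments": [
            basetime - timedelta(hours=5),   # known at basetime
            basetime + timedelta(hours=3),   # counted in the target
            basetime + timedelta(hours=30),  # outside the 24-hour window
        ],
    },
    {
        "id": "b",
        "published": basetime - timedelta(hours=100),
        "comments": [basetime + timedelta(hours=1)],
    },
]
instances = build_instances(posts, basetime)
```

Because the same post can be selected for several nearby basetimes, records built this way are not independent across basetimes, which is why a random split of the train data can leak information.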
Has Missing Values?
No
Variables Table
The variables table lists 281 variables without names, roles, types, descriptions, or units. The rows shown (0 to 10 of 281) report no missing values.
Additional Variable Information
1...50: Average, standard deviation, min, max and median of the attributes 51...60 for the source of the current blog post. With source we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10
51: Total number of comments before basetime
52: Number of comments in the last 24 hours before the basetime
53: Let T1 denote the datetime 48 hours before basetime, and let T2 denote the datetime 24 hours before basetime. This attribute is the number of comments in the time period between T1 and T2
54: Number of comments in the first 24 hours after the publication of the blog post, but before basetime
55: The difference of attribute 52 and attribute 53
56...60: The same features as the attributes 51...55, but features 56...60 refer to the number of links (trackbacks), while features 51...55 refer to the number of comments
61: The length of time between the publication of the blog post and basetime
62: The length of the blog post
63...262: The 200 bag-of-words features for 200 frequent words of the text of the blog post
263...269: Binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the basetime
270...276: Binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the date of publication of the blog post
277: Number of parent pages: we consider a blog post P as a parent of blog post B if B is a reply (trackback) to blog post P
278...280: Minimum, maximum, average number of comments that the parents received
281: The target: the number of comments in the next 24 hours (relative to basetime)
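Since the variables carry no names, it can help to generate labels from the attribute ranges above. The sketch below is one possible mapping; every name in it is hypothetical. The description does not say in which order the 50 source statistics are stored, so attributes 1...50 get generic labels rather than guessed ones. Attribute numbers are 1-based, while the returned list is 0-based, so attribute k sits at index k - 1.

```python
def blog_feedback_column_names():
    """Return 281 hypothetical column labels, one per attribute."""
    weekdays = ["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"]
    counts = ["total_before", "last_24h", "between_48h_and_24h",
              "first_24h_after_publication", "delta_last_24h"]

    names = []
    # 1...50: source-level statistics of attributes 51...60 (order unknown).
    names += [f"source_stat_{i:02d}" for i in range(1, 51)]
    # 51...55: comment counts; 56...60: the same counts for links.
    names += [f"comments_{c}" for c in counts]
    names += [f"links_{c}" for c in counts]
    # 61 and 62: post age at basetime and post length.
    names += ["time_since_publication", "post_length"]
    # 63...262: 200 bag-of-words features.
    names += [f"word_{i:03d}" for i in range(1, 201)]
    # 263...269 and 270...276: weekday indicators.
    names += [f"basetime_{d}" for d in weekdays]
    names += [f"published_{d}" for d in weekdays]
    # 277...280: parent page features.
    names += ["n_parents", "parent_comments_min",
              "parent_comments_max", "parent_comments_avg"]
    # 281: the target.
    names.append("target")
    return names


names = blog_feedback_column_names()
```

With these labels, attribute 55 (`comments_delta_last_24h`) should equal attribute 52 minus attribute 53, which gives a quick sanity check after loading the data.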
Dataset Files
File | Size |
---|---|
blogData_train.csv | 62.1 MB |
blogData_test-2012.03.31.01_00.csv | 255.9 KB |
blogData_test-2012.03.22.00_00.csv | 249.7 KB |
blogData_test-2012.03.21.00_00.csv | 244.4 KB |
blogData_test-2012.03.30.01_00.csv | 229.5 KB |
0 to 5 of 61
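A minimal sketch of loading the provided train and test splits with the standard library is shown below. It assumes the CSV files have no header row, contain only numeric values, and store the target in the last column; check these assumptions against the downloaded files. The helper names and the tiny synthetic files written in the demo are illustrative only.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path):
    """Read one headerless numeric CSV file into a list of float rows."""
    with open(path, newline="") as handle:
        return [[float(v) for v in row] for row in csv.reader(handle) if row]


def load_split(paths):
    """Concatenate files and separate features from the last-column target."""
    features, targets = [], []
    for path in paths:
        for row in read_rows(path):
            features.append(row[:-1])
            targets.append(row[-1])
    return features, targets


def load_blogfeedback(data_dir):
    """Load the train file and all test files found in `data_dir`."""
    data_dir = Path(data_dir)
    train = load_split([data_dir / "blogData_train.csv"])
    test = load_split(sorted(data_dir.glob("blogData_test-*.csv")))
    return train, test


def write_rows(path, target_values):
    """Write synthetic rows: 280 zero features plus the given target."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for target in target_values:
            writer.writerow([0.0] * 280 + [target])


# Demo on synthetic stand-in files (not the real data).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    write_rows(tmp / "blogData_train.csv", [1.0, 2.0])
    write_rows(tmp / "blogData_test-2012.03.21.00_00.csv", [3.0])
    write_rows(tmp / "blogData_test-2012.03.22.00_00.csv", [4.0])
    (train_X, train_y), (test_X, test_y) = load_blogfeedback(tmp)
```

Keeping the train file and the test files in separate splits, rather than pooling and reshuffling them, preserves the temporal separation that the dataset description recommends.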
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    blogfeedback = fetch_ucirepo(id=304)

    # data (as pandas dataframes)
    X = blogfeedback.data.features
    y = blogfeedback.data.targets

    # metadata
    print(blogfeedback.metadata)

    # variable information
    print(blogfeedback.variables)
Buza, K. (2014). BlogFeedback [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C58S3F.
Creators
Krisztian Buza
DOI
10.24432/C58S3F
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.