News Aggregator
Donated on 2/27/2016
References to news pages collected from a web aggregator in the period from 10 March 2014 to 10 August 2014. The resources are grouped into clusters that represent pages discussing the same story.
Dataset Characteristics
Multivariate
Subject Area
Other
Associated Tasks
Classification, Clustering
Feature Type
-
# Instances
422937
# Features
-
Dataset Information
Additional Information
News pages are grouped into clusters, each representing pages that discuss the same news story. The dataset also includes references to web pages that, at access time, linked to one of the news pages in the collection.

The collection contains 422937 news pages, divided up into:
- 152746 news pages in the entertainment category
- 108465 news pages in the science and technology category
- 115920 news pages in the business category
- 45615 news pages in the health category

The news pages are grouped into clusters of similar news:
- 2076 clusters for the entertainment category
- 1789 clusters for the science and technology category
- 2019 clusters for the business category
- 1347 clusters for the health category

References to web pages containing a link to a news page in the collection are represented as pairs of URLs corresponding to 2-page browsing sessions. The collection includes 15516 2-page browsing sessions covering 946 distinct clusters, divided up into:
- 6091 2-page sessions for the business category
- 9425 2-page sessions for the entertainment category
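The counts in this description can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the figures quoted here, not something shipped with the dataset. The category labels follow the order used for the cluster counts (entertainment, science and technology, business, health); the label on the 152746 figure is an assumption based on that order.

```python
# Figures copied from the dataset description; nothing is read from the files.
TOTAL_NEWS = 422937
news_by_category = {
    "entertainment": 152746,  # assumed label, inferred from the cluster ordering
    "science and technology": 108465,
    "business": 115920,
    "health": 45615,
}
clusters_by_category = {
    "entertainment": 2076,
    "science and technology": 1789,
    "business": 2019,
    "health": 1347,
}
TOTAL_SESSIONS = 15516
sessions_by_category = {"business": 6091, "entertainment": 9425}


def gap(stated_total, parts):
    """Return how many items the per-category counts leave unaccounted for."""
    return stated_total - sum(parts.values())


print("news pages unaccounted for:", gap(TOTAL_NEWS, news_by_category))
print("total clusters:", sum(clusters_by_category.values()))
print("sessions unaccounted for:", gap(TOTAL_SESSIONS, sessions_by_category))
```

With the figures as listed, the session counts add up exactly to 15516, while the per-category news counts sum to 422746, which is 191 fewer than the stated total of 422937.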
Has Missing Values?
No
Variables Table
Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
Additional Variable Information
FILENAME #1: newsCorpora.csv (102,297,000 bytes)
DESCRIPTION: News pages
FORMAT: ID TITLE URL PUBLISHER CATEGORY STORY HOSTNAME TIMESTAMP
where:
- ID: Numeric ID
- TITLE: News title
- URL: URL
- PUBLISHER: Publisher name
- CATEGORY: News category (b = business, t = science and technology, e = entertainment, m = health)
- STORY: Alphanumeric ID of the cluster that includes news about the same story
- HOSTNAME: URL hostname
- TIMESTAMP: Approximate time the news was published, as the number of milliseconds since the epoch 00:00:00 GMT, January 1, 1970

FILENAME #2: 2pageSessions.csv (3,049,986 bytes)
DESCRIPTION: 2-page sessions
FORMAT: STORY HOSTNAME CATEGORY URL
where:
- STORY: Alphanumeric ID of the cluster that includes news about the same story
- HOSTNAME: URL hostname
- CATEGORY: News category (b = business, t = science and technology, e = entertainment, m = health)
- URL: Two space-delimited URLs representing a browsing session
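The two file layouts above can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader. It assumes the fields are tab-delimited and that the files have no header row; neither detail is stated in this description, so adjust `delimiter` if the files differ. The helper names (`read_news`, `read_sessions`) and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

NEWS_FIELDS = ["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY",
               "STORY", "HOSTNAME", "TIMESTAMP"]
SESSION_FIELDS = ["STORY", "HOSTNAME", "CATEGORY", "URL"]
CATEGORY_NAMES = {"b": "business", "t": "science and technology",
                  "e": "entertainment", "m": "health"}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def read_news(handle, delimiter="\t"):
    """Yield one dict per news page (delimiter is an assumption)."""
    reader = csv.DictReader(handle, fieldnames=NEWS_FIELDS,
                            delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        row["ID"] = int(row["ID"])
        row["CATEGORY_NAME"] = CATEGORY_NAMES.get(row["CATEGORY"], "unknown")
        # TIMESTAMP is in milliseconds since the Unix epoch.
        row["PUBLISHED"] = EPOCH + timedelta(milliseconds=int(row["TIMESTAMP"]))
        yield row


def read_sessions(handle, delimiter="\t"):
    """Yield one dict per 2-page session, splitting URL on whitespace."""
    reader = csv.DictReader(handle, fieldnames=SESSION_FIELDS,
                            delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        row["URLS"] = tuple(row["URL"].split())
        yield row


# Made-up rows in the documented field order, not real dataset records.
SAMPLE_NEWS = ("1\tExample headline\thttp://example.com/a\tExample Publisher"
               "\tb\tstoryA\texample.com\t1394470370698\n")
SAMPLE_SESSIONS = ("storyA\texample.com\tb"
                   "\thttp://example.org/ref http://example.com/a\n")

for news in read_news(io.StringIO(SAMPLE_NEWS)):
    print(news["ID"], news["CATEGORY_NAME"], news["PUBLISHED"].isoformat())
for session in read_sessions(io.StringIO(SAMPLE_SESSIONS)):
    print(session["STORY"], session["URLS"])
```

To read the real files, pass an open file handle instead of `io.StringIO`, for example `open("newsCorpora.csv", encoding="utf-8", newline="")`. Quoting is disabled because news titles may contain quotation marks that are not CSV quoting.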
Dataset Files
File | Size |
---|---|
newsCorpora.csv | 97.6 MB |
2pageSessions.csv | 2.9 MB |
readme.txt | 2.5 KB |
__MACOSX/._readme.txt | 510 Bytes |
__MACOSX/._2pageSessions.csv | 239 Bytes |
    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    news_aggregator = fetch_ucirepo(id=359)

    # data (as pandas dataframes)
    X = news_aggregator.data.features
    y = news_aggregator.data.targets

    # metadata
    print(news_aggregator.metadata)

    # variable information
    print(news_aggregator.variables)
Gasparetti, F. (2017). News Aggregator [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5F61C.
Creators
Fabio Gasparetti
DOI
10.24432/C5F61C
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.