Page Blocks Classification
Donated on 6/30/1995
The task is to classify the blocks of a document's page layout that have been detected by a segmentation process.
Dataset Characteristics
Multivariate
Subject Area
Computer Science
Associated Tasks
Classification
Feature Type
Integer, Real
# Instances
5473
# Features
10
Dataset Information
Additional Information
The 5473 examples come from 54 distinct documents. Each observation concerns one block. All attributes are numeric. Data are in a format readable by C4.5.
Has Missing Values?
No
Variables Table
Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
---|---|---|---|---|---|---|
height | Feature | Integer | | Height of the block | | no |
length | Feature | Integer | | Length of the block | | no |
area | Feature | Integer | | Area of the block (height * length) | | no |
eccen | Feature | Continuous | | Eccentricity of the block (length / height) | | no |
p_black | Feature | Continuous | | Percentage of black pixels within the block (blackpix / area) | | no |
p_and | Feature | Continuous | | Percentage of black pixels after the application of the Run Length Smoothing Algorithm (RLSA) (blackand / area) | | no |
mean_tr | Feature | Continuous | | Mean number of white-black transitions (blackpix / wb_trans) | | no |
blackpix | Feature | Integer | | Total number of black pixels in the original bitmap of the block | | no |
blackand | Feature | Integer | | Total number of black pixels in the bitmap of the block after the RLSA | | no |
wb_trans | Feature | Integer | | Number of white-black transitions in the original bitmap of the block | | no |
Additional Variable Information
height: integer. Height of the block.
length: integer. Length of the block.
area: integer. Area of the block (height * length).
eccen: continuous. Eccentricity of the block (length / height).
p_black: continuous. Percentage of black pixels within the block (blackpix / area).
p_and: continuous. Percentage of black pixels after the application of the Run Length Smoothing Algorithm (RLSA) (blackand / area).
mean_tr: continuous. Mean number of white-black transitions (blackpix / wb_trans).
blackpix: integer. Total number of black pixels in the original bitmap of the block.
blackand: integer. Total number of black pixels in the bitmap of the block after the RLSA.
wb_trans: integer. Number of white-black transitions in the original bitmap of the block.
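Five of the ten attributes are defined by formulas over the other five. The sketch below shows how those formulas can be applied to one block. The `RawBlock` container, the `derived_features` helper, and the example values are illustrative assumptions. They are not part of the dataset or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawBlock:
    """Directly measured attributes of one block (hypothetical container)."""
    height: int
    length: int
    blackpix: int
    blackand: int
    wb_trans: int


def derived_features(block: RawBlock) -> dict[str, float]:
    """Apply the formulas given in the variable descriptions."""
    area = block.height * block.length
    return {
        "area": area,
        "eccen": block.length / block.height,
        "p_black": block.blackpix / area,
        "p_and": block.blackand / area,
        "mean_tr": block.blackpix / block.wb_trans,
    }


# Made-up values for illustration only; not claimed to be a dataset row.
example = RawBlock(height=5, length=7, blackpix=14, blackand=23, wb_trans=6)
features = derived_features(example)
print(features)
```

This helper raises `ZeroDivisionError` if `height` or `wb_trans` is zero. Whether such rows occur in the data is not stated here, so a real pipeline may want to guard against that case.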
Class Labels
text, horiz. line, graphic, vert. line, picture
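If the target column holds integer codes rather than names, a small lookup table can make predictions easier to read. The sketch below assumes the codes run from 1 to 5 in the order the labels are listed above. This page does not state that coding, so check it against the loaded data before relying on it. `CODE_TO_LABEL` and `decode` are hypothetical names.

```python
# Label names exactly as listed on this page.
CLASS_LABELS = ["text", "horiz. line", "graphic", "vert. line", "picture"]

# ASSUMPTION: integer codes start at 1 and follow the listing order above.
CODE_TO_LABEL = {code: label for code, label in enumerate(CLASS_LABELS, start=1)}


def decode(codes):
    """Translate a sequence of integer class codes into label names."""
    return [CODE_TO_LABEL[code] for code in codes]


decoded = decode([1, 5, 2])
print(decoded)
```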
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    page_blocks_classification = fetch_ucirepo(id=78)

    # data (as pandas dataframes)
    X = page_blocks_classification.data.features
    y = page_blocks_classification.data.targets

    # metadata
    print(page_blocks_classification.metadata)

    # variable information
    print(page_blocks_classification.variables)
Malerba, Donato. (1995). Page Blocks Classification. UCI Machine Learning Repository. https://doi.org/10.24432/C5J590.
@misc{misc_page_blocks_classification_78,
  author       = {Malerba, Donato},
  title        = {{Page Blocks Classification}},
  year         = {1995},
  howpublished = {UCI Machine Learning Repository},
  note         = {{DOI}: https://doi.org/10.24432/C5J590}
}
Creators
Donato Malerba
DOI
10.24432/C5J590
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.