Page Blocks Classification
Donated on 6/30/1995
The problem consists of classifying all the blocks of the page layout of a document that have been detected by a segmentation process.
Dataset Characteristics
Multivariate
Subject Area
Computer Science
Associated Tasks
Classification
Feature Type
Integer, Real
# Instances
5473
# Features
10
Dataset Information
Additional Information
The 5473 examples come from 54 distinct documents. Each observation concerns one block. All attributes are numeric. Data are in a format readable by C4.5.
Has Missing Values?
No
Variables Table
Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
height | Feature | Integer | Height of the block | | no |
length | Feature | Integer | Length of the block | | no |
area | Feature | Integer | Area of the block (height * length) | | no |
eccen | Feature | Continuous | Eccentricity of the block (length / height) | | no |
p_black | Feature | Continuous | Percentage of black pixels within the block (blackpix / area) | | no |
p_and | Feature | Continuous | Percentage of black pixels after the application of the Run Length Smoothing Algorithm (RLSA) (blackand / area) | | no |
mean_tr | Feature | Continuous | Mean number of white-black transitions (blackpix / wb_trans) | | no |
blackpix | Feature | Integer | Total number of black pixels in the original bitmap of the block | | no |
blackand | Feature | Integer | Total number of black pixels in the bitmap of the block after the RLSA | | no |
wb_trans | Feature | Integer | Number of white-black transitions in the original bitmap of the block | | no |
Additional Variable Information
height: integer. Height of the block.
length: integer. Length of the block.
area: integer. Area of the block (height * length).
eccen: continuous. Eccentricity of the block (length / height).
p_black: continuous. Percentage of black pixels within the block (blackpix / area).
p_and: continuous. Percentage of black pixels after the application of the Run Length Smoothing Algorithm (RLSA) (blackand / area).
mean_tr: continuous. Mean number of white-black transitions (blackpix / wb_trans).
blackpix: integer. Total number of black pixels in the original bitmap of the block.
blackand: integer. Total number of black pixels in the bitmap of the block after the RLSA.
wb_trans: integer. Number of white-black transitions in the original bitmap of the block.
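Five of the ten attributes are defined as simple functions of the other five. The sketch below illustrates those formulas on a made-up block; the `RawBlock` container and the `derive_features` helper are hypothetical names introduced here for illustration and are not part of the dataset or of any library. It assumes `height` and `wb_trans` are nonzero. Note that the ratios `blackpix / area` and `blackand / area`, as written, yield a fraction between 0 and 1 rather than a value between 0 and 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawBlock:
    """Raw measurements of one page block (hypothetical container)."""
    height: int
    length: int
    blackpix: int   # black pixels in the original bitmap
    blackand: int   # black pixels after the RLSA
    wb_trans: int   # white-black transitions in the original bitmap


def derive_features(block: RawBlock) -> dict:
    """Compute the five derived attributes from the raw measurements.

    Assumes block.height and block.wb_trans are nonzero.
    """
    area = block.height * block.length
    return {
        "area": area,
        "eccen": block.length / block.height,
        "p_black": block.blackpix / area,
        "p_and": block.blackand / area,
        "mean_tr": block.blackpix / block.wb_trans,
    }


# Made-up values, not taken from the dataset.
example = RawBlock(height=10, length=20, blackpix=50, blackand=100, wb_trans=25)
print(derive_features(example))
# {'area': 200, 'eccen': 2.0, 'p_black': 0.25, 'p_and': 0.5, 'mean_tr': 2.0}
```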
Class Labels
text, horiz. line, graphic, vert. line, picture
Dataset Files
File | Size |
---|---|
page-blocks.data.Z | 102.1 KB |
page-blocks.names | 3.8 KB |
Index | 128 Bytes |
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    page_blocks_classification = fetch_ucirepo(id=78)

    # data (as pandas dataframes)
    X = page_blocks_classification.data.features
    y = page_blocks_classification.data.targets

    # metadata
    print(page_blocks_classification.metadata)

    # variable information
    print(page_blocks_classification.variables)
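Once the features are loaded, one possible sanity check is to confirm that the stored derived columns agree with the raw ones. The following is a minimal sketch, not part of `ucimlrepo`: `find_inconsistent_rows` is a hypothetical helper, the column names are assumed to match the variables table, and the tolerance `tol` for the rounded `eccen` values is a guess that may need adjusting.

```python
def find_inconsistent_rows(rows, tol=0.01):
    """Return indices of rows whose derived columns disagree with the raw ones.

    rows: iterable of mappings with keys "height", "length", "area", "eccen"
          (column names assumed to match the variables table).
    tol:  allowed absolute difference for the eccen ratio (assumed value).
    """
    bad = []
    for i, row in enumerate(rows):
        area_ok = row["area"] == row["height"] * row["length"]
        eccen_ok = abs(row["eccen"] - row["length"] / row["height"]) <= tol
        if not (area_ok and eccen_ok):
            bad.append(i)
    return bad


# Made-up rows for illustration; the second one has a wrong area on purpose.
sample = [
    {"height": 5, "length": 7, "area": 35, "eccen": 1.4},
    {"height": 6, "length": 7, "area": 40, "eccen": 1.167},
]
print(find_inconsistent_rows(sample))  # [1]
```

With the pandas DataFrame `X` from the snippet above, the rows could be passed as `X.to_dict("records")`, assuming the fetched column names are the ones listed in the variables table.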
Malerba, D. (1994). Page Blocks Classification [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5J590.
Creators
Donato Malerba
DOI
10.24432/C5J590
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.