USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder Problem: Patent Labeling

Donated on 10/12/2013

Data used for the USPTO Algorithm Competition. Contains drawing pages from US patents with manually labeled figure labels and part labels.

Dataset Characteristics

Domain-Theory

Subject Area

Other

Associated Tasks

Classification

Feature Type

Integer

# Instances

306

# Features

Dataset Information

Additional Information

USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder Problem: Patent Labeling

Has Missing Values?

Variable Information

Dataset Information:
-- This folder contains 4 groups of USPTO patent images, including ground truth information.
-- The 4 groups are 'train1', 'train2', 'test', and 'evaluation'.
-- 'train1', 'test', and 'evaluation' contain the data from the original 'USPTO Algorithm Challenge' for training, testing, and final evaluation, respectively.
-- 'train2' contains additional data that was used in the 'USPTO Algorithm Followup Challenge'. Note that 'train2' includes some cover page images of patent documents, which are not included in the other groups.
-- In each group, there are two folders containing the original images and the corresponding ground truth information.
-- The original images are in 'jpeg' format.
-- There are two types of ground truth: figure label ground truth and part label ground truth.
-- The ground truth files are text files with the '.ans' extension.
-- The structure of the ground truth files is described below:
-- The first line is a single number indicating how many instances exist in the corresponding image.
-- Each following line gives the polygon coordinates and label content of one figure label or part label, in the form 'N x1 y1 x2 y2 … xN yN x1 y1 content'.
-- In each of those lines, the first number N indicates how many polygon vertices are recorded for the current instance.
-- The following numbers are the x, y coordinates of those vertices.
-- The final word in each line is the content of the figure label or part label. (Note that for figure labels, the words 'Figure', 'Fig', etc. are omitted.)
-- Numbers and words are separated by white space.
-- For group 'train2', only part label ground truth is available.
-- We also release the source code of the top 5 winning solutions. See the additional archive file.
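
The '.ans' layout described above can be read with a few lines of Python. The following is only a minimal sketch, not code from the released solutions: the names `LabelInstance` and `parse_ans` are hypothetical, and the sample text in the usage note is made up for illustration. Because the description does not say whether N counts the repeated closing vertex 'x1 y1', the sketch does not rely on N to find the coordinates. It treats the last token as the label content, pairs up everything between N and that token, and drops a trailing vertex that repeats the first one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabelInstance:
    """One figure label or part label (hypothetical helper type)."""
    declared_vertices: int               # the leading N as written in the file
    polygon: List[Tuple[float, float]]   # vertices, closing repeat removed
    content: str                         # label text, e.g. a part number


def parse_ans(text: str) -> List[LabelInstance]:
    """Parse the text of one '.ans' ground truth file (assumed layout)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    expected = int(lines[0].split()[0])  # first line: number of instances
    instances = []
    for line in lines[1:]:
        tokens = line.split()
        declared = int(tokens[0])
        content = tokens[-1]             # final word is the label content
        numbers = [float(t) for t in tokens[1:-1]]
        if len(numbers) % 2 != 0:
            raise ValueError(f"odd number of coordinates in line: {line!r}")
        points = list(zip(numbers[0::2], numbers[1::2]))
        # Drop the repeated first vertex that closes the polygon, if present.
        if len(points) > 1 and points[-1] == points[0]:
            points.pop()
        instances.append(LabelInstance(declared, points, content))
    if len(instances) != expected:
        raise ValueError(
            f"header declares {expected} instances, found {len(instances)}"
        )
    return instances
```

A call such as `parse_ans("1\n4 10 10 50 10 50 30 10 30 10 10 1A\n")` yields one instance with four vertices and the content '1A'. To read a real file, pass in the result of `open(path).read()`; the file's encoding is not stated in the description, so that is left to the caller.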

Dataset Files

File	Size
Data.zip	135.5 MB
SourceCode.zip	766.7 KB
README.txt	2.7 KB

Creators

Christoph Riedl

Richard Zanibbi

Marti Hearst

Siyu Zhu

Michael Minetti

Jason Crusan

DOI

10.24432/C5BP5S

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.