Center for Machine Learning and Intelligent Systems

USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder, Problem: Patent Labeling Data Set

Abstract: Data used for USPTO Algorithm Competition. Contains drawing pages from US patents with manually labeled figure and part labels.

Data Set Characteristics: Domain-Theory
Number of Instances: 306
Area: N/A
Attribute Characteristics: Integer
Number of Attributes: 5
Date Donated: 2013-10-13
Associated Tasks: Classification
Missing Values?: N/A
Number of Web Hits: 19597


Source:

-- Creator: TopCoder, Inc.
-- Released under Apache License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0.html


Data Set Information:

USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder
Problem: Patent Labeling


Attribute Information:

Dataset Information:
-- This folder contains 4 groups of USPTO patent images including ground truth information.
-- The 4 groups are 'train1', 'train2', 'test', 'evaluation'.
-- 'train1', 'test', and 'evaluation' contain the data used in the original 'USPTO Algorithm Challenge' for training, testing, and final evaluation, respectively.
-- 'train2' contains additional data that was used in the 'USPTO Algorithm Followup Challenge.'
Note that 'train2' includes some cover page images of patent documents, which are not included in the other groups.

-- In each group, there are two folders containing the original images and the corresponding ground truth information.
-- The original images are in 'jpeg' format.
-- There are two types of ground truth: figure label ground truth and part label ground truth.
-- The ground truth files are text files with '.ans' extension.

-- The structure of the ground truth files is described below:
-- The first line is a single number indicating how many instances exist in the corresponding image.
-- Each following line gives the polygon coordinates and label content of one figure label or part label, in the form 'N x1 y1 x2 y2 … xN yN x1 y1 content'.
-- In each of those lines, the first number N indicates how many polygon vertices are recorded for the current instance.
-- The following numbers are the x, y coordinates of those vertices.
-- The final word in each line is the content of the figure label or part label.
-- Numbers and words are separated by white space.
-- For group 'train2', only part label ground truth is available.
-- We also release the source code of the top 5 winning solutions. See the additional archive file.
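
The ground truth format described above can be read with a few lines of Python. The following is a minimal sketch, not code from the challenge or from the released winning solutions. The names `Label` and `parse_ans` are hypothetical, and the sample text used with it is made up for illustration. It assumes that the last whitespace-separated token of each line is the label content and that everything between N and that token is a sequence of x, y pairs. Because the format string repeats 'x1 y1' before the content, the sketch drops a trailing point that equals the first one. Whether real files always repeat that point is an assumption, not something stated here.

```python
from typing import List, NamedTuple, Tuple


class Label(NamedTuple):
    """One figure label or part label read from a '.ans' file (hypothetical type)."""
    declared_vertices: int                 # the leading number N on the line
    polygon: List[Tuple[float, float]]     # vertices, closing duplicate removed
    content: str                           # final word on the line


def parse_ans(text: str) -> List[Label]:
    """Parse the text of one ground truth file into a list of labels."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # First line: how many instances exist in the corresponding image.
    count = int(lines[0].split()[0])
    labels = []
    for line in lines[1:1 + count]:
        tokens = line.split()              # splits on any run of white space
        n = int(tokens[0])
        content = tokens[-1]
        # Parsed as float to be tolerant; the page lists the attributes as integers.
        values = [float(t) for t in tokens[1:-1]]
        if len(values) % 2 != 0:
            raise ValueError(f"odd number of coordinates in line: {line!r}")
        points = list(zip(values[0::2], values[1::2]))
        # Assumption: a trailing point equal to the first one only closes the polygon.
        if len(points) > 1 and points[0] == points[-1]:
            points = points[:-1]
        labels.append(Label(n, points, content))
    return labels
```

A call such as `parse_ans(open(path).read())`, where `path` points at one of the '.ans' files, returns one `Label` per line after the first. The declared vertex count is kept alongside the parsed polygon so that a caller can check the two against each other.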


Relevant Papers:

Christoph Riedl, Richard Zanibbi, Marti A. Hearst, Siyu Zhu, Michael Minetti, Jason Crusan, Ivan Metelsky, and Karim R. Lakhani, 'Detecting Figures and Part Labels in Patents: A Competition-Based Development of Image Processing Algorithms', working paper.



Citation Request:

Christoph Riedl, Richard Zanibbi, Marti A. Hearst, Siyu Zhu, Michael Minetti, Jason Crusan, Ivan Metelsky, and Karim R. Lakhani, 'Detecting Figures and Part Labels in Patents: A Competition-Based Development of Image Processing Algorithms,' International Journal on Document Analysis and Recognition, 1-18, DOI 10.1007/s10032-016-0260-8

