Center for Machine Learning and Intelligent Systems

USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder. Problem: Patent Labeling Data Set

Abstract: Data used for USPTO Algorithm Competition. Contains drawing pages from US patents with manually labeled figure and part labels.

Data Set Characteristics: Domain-Theory
Number of Instances: 306
Area: N/A
Attribute Characteristics: Integer
Number of Attributes: 5
Date Donated: 2013-10-13
Associated Tasks: Classification
Missing Values? N/A
Number of Web Hits: 36555


Source:

-- Creator: TopCoder, Inc.
-- Released under Apache License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0.html


Data Set Information:

USPTO Algorithm Challenge, run by NASA-Harvard Tournament Lab and TopCoder
Problem: Patent Labeling


Attribute Information:

Dataset Information:
-- This folder contains 4 groups of USPTO patent images, each with ground truth information.
-- The 4 groups are 'train1', 'train2', 'test', 'evaluation'.
-- 'train1', 'test', and 'evaluation' contain the data from the original 'USPTO Algorithm Challenge' for training, testing, and final evaluation, respectively.
-- 'train2' contains additional data that was used in the 'USPTO Algorithm Followup Challenge'.
Note that 'train2' includes some cover page images of patent documents, which are not included in the other groups.

-- Each group has two folders: one contains the original images and the other contains the corresponding ground truth information.
-- The original images are in 'jpeg' format.
-- There are two types of ground truth: figure label ground truth and part label ground truth.
-- The ground truth files are text files with the '.ans' extension.

-- The structure of the ground truth files is as follows:
-- The first line is a single number giving how many instances exist in the corresponding image.
-- Each following line gives the polygon coordinates and label content of one figure label or part label, in the form 'N x1 y1 x2 y2 … xN yN x1 y1 content'.
-- In each of those lines, the first number N is how many polygon vertices are recorded for that instance.
-- The numbers that follow are the x, y coordinates of those vertices.
-- The final word in each line is the content of the figure label or part label.
-- Numbers and words are separated by white space.
-- For group 'train2', only part label ground truth is available.
-- We also release the source code of the top 5 winning solutions. See the additional archive file.
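
The '.ans' layout described above can be read with a few lines of Python. The following is a minimal sketch, not the organizers' own loader. The names `Label` and `parse_ans` are hypothetical, and the code makes three assumptions that should be checked against the actual files: coordinates are integers (as suggested by the 'Integer' attribute characteristic), the first vertex may be repeated after the N vertices to close the polygon (as the form 'N x1 y1 … xN yN x1 y1 content' suggests), and everything after the coordinates is the label content.

```python
from typing import List, NamedTuple, Tuple


class Label(NamedTuple):
    """One figure label or part label (hypothetical container type)."""
    vertices: List[Tuple[int, int]]  # polygon vertices as (x, y) pairs
    content: str                     # label text, e.g. a figure or part number


def parse_ans(text: str) -> List[Label]:
    """Parse the text of one '.ans' ground truth file into labels."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []

    # First line: how many instances the file declares.
    expected = int(lines[0].split()[0])

    labels = []
    for ln in lines[1:]:
        tokens = ln.split()
        n = int(tokens[0])  # number of polygon vertices

        # Assumption: coordinates are integers.
        coords = [int(t) for t in tokens[1:1 + 2 * n]]
        vertices = list(zip(coords[0::2], coords[1::2]))

        rest = tokens[1 + 2 * n:]
        # Assumption: the first vertex may be repeated to close the
        # polygon; skip that repeat when it is present.
        if len(rest) >= 3 and rest[:2] == tokens[1:3]:
            rest = rest[2:]

        labels.append(Label(vertices, " ".join(rest)))

    if len(labels) != expected:
        raise ValueError(
            f"header declares {expected} instances, found {len(labels)}"
        )
    return labels


# Made-up sample in the documented form, for illustration only.
sample = "1\n4 10 20 30 20 30 40 10 40 10 20 12a\n"
for label in parse_ans(sample):
    print(label.content, label.vertices)
```

The count check against the first line is a cheap way to catch truncated files or lines that do not match the assumed layout.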


Relevant Papers:

Christoph Riedl, Richard Zanibbi, Marti A. Hearst, Siyu Zhu, Michael Minetti, Jason Crusan, Ivan Metelsky, and Karim R. Lakhani, 'Detecting Figures and Part Labels in Patents: A Competition-Based Development of Image Processing Algorithms', working paper.



Citation Request:

Christoph Riedl, Richard Zanibbi, Marti A. Hearst, Siyu Zhu, Michael Minetti, Jason Crusan, Ivan Metelsky, and Karim R. Lakhani, 'Detecting Figures and Part Labels in Patents: A Competition-Based Development of Image Processing Algorithms,' International Journal on Document Analysis and Recognition, 1-18, DOI 10.1007/s10032-016-0260-8

