Dataset for OCR Post-Correction

Multi-Input Attention for Unsupervised OCR Correction

Abstract:

In this paper, we propose a novel approach to post-correcting OCR output by exploiting duplication in digital corpora. In particular, our approach takes advantage of repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. An attention-based sequence-to-sequence model is applied for single-input correction, and strategies for multi-input attention combination are explored to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation: training to match noisily observed textual variants, or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates in half, and can rival supervised methods with multi-input decoding.

Contact:

Rui Dong: dongrui@ccs.neu.edu

David A. Smith: dasmith@ccs.neu.edu

Data Download:

Richmond Daily Dispatch

Text Creation Partnership

Data Description:

The Richmond (Virginia) Daily Dispatch (RDD) dataset includes 1,384 issues from 1860-1865, and the Text Creation Partnership (TCP) dataset includes 934 books from 1500-1800. Both of these manually transcribed collections, which were produced independently of the current authors, are in the public domain and in English. All files are stored in JSON format, with each line a dictionary for one OCR'd line: the key "id" stores the line's metadata, and the key "lines" stores a list of the witnesses of that OCR'd line.
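The JSON-lines layout above can be parsed as sketched below. This is a minimal illustration, not the project's own loading code, and the sample record (its "id" value and witness strings) is a made-up stand-in for real dataset contents.

```python
import json

def parse_record(raw_line):
    """Parse one JSON line into (id, witnesses).

    Each line of a dataset file is a JSON dictionary with an "id" key
    (line metadata) and a "lines" key (list of witnesses of the OCR'd
    line), as described above.
    """
    record = json.loads(raw_line)
    return record["id"], record["lines"]

# Illustrative record, not taken from the actual dataset:
sample_record = '{"id": "example_line_0042", "lines": ["Tbe quick brown fox", "The quick brown fox"]}'
line_id, witnesses = parse_record(sample_record)
```

In a real script one would apply `parse_record` to every line of an extracted dataset file opened in text mode.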

Data Processing:

Here is the script for processing the data after extracting the downloaded archives. The output consists of the following files:

1. pair.x: each line corresponds to an OCR'd text line

2. pair.y: each line corresponds to the witnesses of the same line in file "pair.x", separated by tabs ('\t')

3. pair.z: each line corresponds to the manual transcriptions of the same line in file "pair.x", separated by tabs ('\t')

4. pair.x.info: each line corresponds to the information about the same line in "pair.x". It contains the following fields, separated by '\t':

(group no., line no., file_id, begin index in file, end index in file, number of witnesses, number of manual transcriptions)

5. pair.y.info: each line corresponds to the information about the same line in "pair.y". It contains the following fields, separated by '\t':

(line no., file_id, begin index in file). If "line no." = 100 for the 10th line of "pair.y.info", then the 10th line of "pair.y" contains the witnesses of the 101st line of file "pair.x".

6. pair.z.info: each line corresponds to the information about the same line in "pair.z". It contains the following fields, separated by '\t':

(line no., file_id, begin index in file). If "line no." = 100 for the 10th line of "pair.z.info", then the 10th line of "pair.z" contains the manual transcriptions of the 101st line of file "pair.x".
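The "line no." field described above is zero-based (value 100 points at the 101st line of "pair.x"), so aligning rows of "pair.y" or "pair.z" with "pair.x" can be sketched as below. The info rows in the example are illustrative stand-ins, not real file contents.

```python
def build_alignment(info_rows):
    """Map each row index of pair.y (or pair.z) to the zero-based
    pair.x line it annotates, using the tab-separated
    (line no., file_id, begin index in file) fields."""
    alignment = {}
    for row_index, row in enumerate(info_rows):
        line_no, file_id, begin_index = row.split("\t")
        alignment[row_index] = int(line_no)
    return alignment

# Illustrative pair.y.info contents:
info_rows = ["0\tfileA\t15", "3\tfileA\t120", "100\tfileB\t7"]
alignment = build_alignment(info_rows)
# alignment[2] == 100: the 3rd row of pair.y holds witnesses
# for the 101st line of pair.x.
```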

Generate Training Data:

Here is the script for generating the training, development, and test data. The output consists of the following files:

1. train.x.txt

2. train.y.txt

3. dev.x.txt

4. dev.y.txt

5. test.x.txt

6. test.y.txt
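Assuming each `*.x.txt` file holds one OCR'd line per row and the `*.y.txt` file at the same row holds the corresponding target, the parallel files can be loaded as sketched below. The loader is an assumption for illustration, not the authors' script; the throwaway files only stand in for the real data.

```python
import os
import tempfile

def load_parallel(src_path, tgt_path):
    """Load line-aligned source/target files into (source, target) pairs."""
    with open(src_path, encoding="utf-8") as fx, open(tgt_path, encoding="utf-8") as fy:
        src = [line.rstrip("\n") for line in fx]
        tgt = [line.rstrip("\n") for line in fy]
    assert len(src) == len(tgt), "source/target files must align line by line"
    return list(zip(src, tgt))

# Tiny usage example with throwaway files standing in for the real data:
with tempfile.TemporaryDirectory() as d:
    x_path = os.path.join(d, "train.x.txt")
    y_path = os.path.join(d, "train.y.txt")
    with open(x_path, "w", encoding="utf-8") as f:
        f.write("Tbe quick brown fox\n")
    with open(y_path, "w", encoding="utf-8") as f:
        f.write("The quick brown fox\n")
    pairs = load_parallel(x_path, y_path)
```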

Multi-Input Attention Code:

Here is the code for the multi-input attention method, implemented with TensorFlow 1.9.0.
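To illustrate the general idea behind multi-input attention combination, the NumPy sketch below averages attention contexts computed independently over each input sequence. This is only one possible combination strategy, shown for intuition; it is not the paper's TensorFlow implementation, and the dot-product scoring function is an assumption.

```python
import numpy as np

def attention_context(query, keys):
    """Dot-product attention: softmax-weighted average of the key vectors."""
    scores = keys @ query                      # (T,) one score per time step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over time steps
    return weights @ keys                      # (d,) context vector

def multi_input_context(query, inputs):
    """Combine several input sequences by averaging their per-input
    attention contexts (one simple combination strategy)."""
    contexts = [attention_context(query, keys) for keys in inputs]
    return np.mean(contexts, axis=0)
```

At each decoding step the combined context would replace the single-input context, letting the decoder draw on consensus among the repeated witnesses of a line.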