distributed proofreading
after a brief wikipedia ramble i found myself at the distributed proofreaders europe site, almost as if wikipedia was expelling me to go do something more constructive instead.
DP is a feeder project for the free ‘etexts’ at project gutenberg; organisers scan out-of-copyright texts and submit each batch of page images along with best-effort OCR output. The pages are farmed out via the web to be proofed and corrected by volunteers, then reassembled into gutenberg-compatible texts.
I was gripped, noticing a scanned reprint of an Elizabethan melodrama; those seminars on 16th-century handwriting and typography i was obliged to sit through ten years ago finally yield a practical use! I was impressed by the quality of the OCR and its post-processing; despite a lot of common errors introduced by the old typography - ‘f’ for long ‘s’, often joined to other characters by a ligature - the type recognition is generally very good. There are a lot of cyrillic texts on the european Distributed Proofreading site, and i wonder how the state of the art compares for other character sets.
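The ‘f’ for long ‘s’ confusion is regular enough that a post-processing pass could plausibly catch much of it before a human sees the page. What follows is only a sketch of the idea, not how DP or any OCR package actually does it: the word list and both function names are made up for illustration, and a real pass would need a proper period lexicon.

```python
from itertools import product

# Hypothetical toy word list; a real pass would load a full period lexicon.
KNOWN_WORDS = {"son", "sins", "false", "fast", "self", "blessed", "of"}

def long_s_candidates(token):
    """Yield every spelling obtained by turning some 'f's back into 's'."""
    choices = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*choices):
        yield "".join(combo)

def fix_long_s(token, known=KNOWN_WORDS):
    """Return a lowercased correction, or the token unchanged if unsure."""
    lower = token.lower()
    if lower in known:
        # Already a real word (e.g. "fast"), so leave it alone.
        return token
    matches = [c for c in long_s_candidates(lower) if c in known]
    # Only correct when exactly one known word fits; otherwise defer to a human.
    return matches[0] if len(matches) == 1 else token

print(fix_long_s("falfe"), fix_long_s("bleffed"), fix_long_s("fast"))
```

Anything ambiguous or unknown is left untouched, which is the point: the machine takes the easy cases and the volunteer proofreader keeps the rest.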
As a site, the DP ‘community’ has a lot going for it; a personal
pages-proofed counter which shows your position in a ‘ranking’ is particularly
sticky; profiles, and ‘teams’, all with lots of activity and volume
statistics; the “item-collection” aspect of the friendster clones, without the
social sickliness.
Such a project is massively conducive to partwork micro-remuneration; the
retired or semiretired, home carers, students, could ‘work’ as much as they
liked, with no personal time-constraints, on a task that needs human
correction and circumspection because a machine can only take it so far.
Distributed call centres are the obvious parallel application; about 450,000 people work in call centres in the UK, apparently.
Will a structure like DP become part of google’s much-publicised library initiative to scan, among others, a million public domain books from the Bodleian’s pre-1920 catalogue? The scanning operation will lead to the creation of two digital copies of each book: one for Google, and one for Oxford. Will each copy have a separate digital watermark?
A good-enough text search, returning scanned pages (perhaps with a highlighted mask), seems plausible given reasonably accurate OCR, with likely matches for obvious non-words included in the search results. No good for text-to-speech, and perhaps toxic to funky semantic analysis techniques. I wonder whether the library initiative's plans include making the extracted text available at all, for use in neat online concordance projects like pepysdiary, or through channels like DP into the bona-fide public domain…
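To make the ‘good-enough search’ idea concrete, here is a rough sketch using Python's standard `difflib` to match a query against uncorrected OCR tokens, so that near-miss non-words still bring up their page. The page data, the `search` function and the 0.8 cutoff are all assumptions for illustration; nothing here reflects how Google or anyone else actually indexes scans.

```python
import difflib

# Hypothetical uncorrected OCR text keyed by page number;
# "ftreet" and "londcn" are misreads deliberately left in.
PAGES = {
    1: "the kinge rode to london",
    2: "a ftreet in londcn by night",
    3: "enter the duke and his men",
}

def search(query, pages=PAGES, cutoff=0.8):
    """Return {page number: [tokens close to the query]}."""
    hits = {}
    for number, text in pages.items():
        tokens = text.split()
        # get_close_matches keeps tokens whose similarity ratio >= cutoff.
        close = difflib.get_close_matches(query.lower(), tokens, n=5, cutoff=cutoff)
        if close:
            hits[number] = close
    return hits

print(search("london"))
print(search("street"))
```

The matched tokens are what you would highlight on the scanned page image; the reader sees the original type, so the OCR only has to be good enough to find the page.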