Things are not always what they seem...

September 19, 2015 / P-j Translations Services Co.Ltd

Using OCR (OCR standing for "Optical Character Recognition") for translation processes is a bit like trying to make people realize that no plane allegedly crashed into the Twin Towers, in Shanksville, or into the Pentagon (!!¡¡11) some 14 years ago, as seen on your beloved teLIEvision. You're never quite sure how long it will take until things start to get a little "nasty": "I can't see what you're trying to say, WHAT'S THE POINT???", eventually and endemically leading to "WHAT THE HECK IS WRONG WITH YOU, MISTER OCR???". Indeed, the chances that you keep decent track of some sort of Optimal Cartoonesque Reality are actually quite slim, to say the least.

Wait, we ALL thought that "post-modern artsy mosaics" were solely part of the adult movie world... :(

When one of our "local" clients first introduced us to the effective use of OCR for translation purposes a few years ago, it was relatively rare in the world of commercial translation and too often incompetently performed, even though most professional translation agencies had already been using the technology in one way or another for a decade. In many cases, though, "standard" procedures for optical character recognition are simply not well suited to our purposes, so we had a lot to learn.

Well, the basic idea is to avoid using OCR for anything OTHER THAN estimating word counts or the like. Unfortunately, few do even that efficiently or usefully. Depending on the source, the text counts estimated for a quotation may be very accurate or only rough (if there are serious contrast problems that can't be compensated for, for example). Most professional translators do this not only with PDFs and bitmap files such as JPEG or TIFF, but also with large, complex documents in other formats.
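To make the estimating idea a bit more concrete, here is a minimal sketch of how a rough count and a price range could be derived from the plain text an OCR program produces. The function names, the per-word rate and the 10% error margin are all assumptions made up for illustration, not anything prescribed above; the margin simply reflects that an OCR-based count is sometimes only approximate.

```python
import re


def estimate_word_count(ocr_text: str) -> int:
    """Rough word count for quoting purposes (hypothetical helper)."""
    # Rejoin words that the OCR output split with a hyphen at a line break.
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", ocr_text)
    # Count runs of word characters, keeping forms like "isn't" as one word.
    return len(re.findall(r"\w+(?:['’-]\w+)*", joined))


def estimate_quote(word_count: int, rate_per_word: float, margin: float = 0.10):
    """Return a (low, high) price range; rate and margin are assumed values."""
    base = word_count * rate_per_word
    return round(base * (1 - margin), 2), round(base * (1 + margin), 2)


if __name__ == "__main__":
    sample = "Opti-\ncal character recog-\nnition isn't perfect."
    words = estimate_word_count(sample)
    print(words, estimate_quote(words, rate_per_word=0.10))
```

Quoting a range rather than a single figure is one way to live with poor scans: the worse the contrast, the wider the margin you might choose.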

One cannot always rely on the counts from Microsoft Word itself or from various translation tools. Embedded objects, even editable ones, are generally not included in the counts. We ALL know this, but from time to time we fail to admit it (!!).
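As a quick sanity check before trusting a count, one could at least see whether a .docx file carries embedded objects at all. The sketch below assumes the file is a standard Office Open XML package (a ZIP archive) and that embedded objects sit under `word/embeddings/`, which is where Word normally places them; the function name is hypothetical, and the check says nothing about how much text those objects contain.

```python
import zipfile


def list_embedded_objects(docx_path):
    """List package parts that look like embedded objects (hypothetical helper).

    Assumes Word's usual layout, where embedded objects are stored
    under the word/embeddings/ folder of the .docx ZIP package.
    """
    with zipfile.ZipFile(docx_path) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("word/embeddings/") and not name.endswith("/")
        )
```

If the list is not empty, the word count reported for the document is probably too low, and the embedded material needs to be counted separately.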

Using OCR to prepare translations is often straightforward, but there are a number of traps that people commonly fall into. Do not, under any circumstances, be seduced by the automatic conversion settings of any commercial OCR program nor by options to save with the "original formatting". This is nearly always a disaster when working with translation tools. Problems may include bizarre text changes, disappearing chunks of text due to text box sizing problems, a plague of tags and more.

OCR obsession is d-a-n-g-e-r-o-u-s!!! Be safe!
