Imagine that you have printed documents, a book, or maybe a bunch of scanned documents that you would like to translate. How should you proceed? You will probably start by scanning the printed document. But the images and text will, of course, not be editable, which makes them an impractical source for translation. This is where Optical Character Recognition (OCR) comes in.
OCR takes your non-editable documents (also called outlined documents), such as scanned documents, PDFs, or images, and extracts the text you’d like translated into an editable format. Of course, there are some crucial steps to consider when putting an OCR workflow into action if you want the best possible output as well as a fresh product localized for your target locale. In this blog, we will outline OCR best practices, how they fit into an optimized workflow designed to save time, and the linguistic as well as cosmetic steps that should be included in the process.
The OCR Process Deconstructed
So, let’s look at a PDF eBook as an example. First things first, there is something we need to keep in mind: a scanned PDF file is not the same as a PDF generated as output from an application. In the case of the latter, the PDF might be protected, usually for reasons related to intellectual property (IP). This means that before the translation process can start, the client may need to obtain the necessary permissions for the content to be translated and published, depending on the goals and objectives of the translation. In fact, the two file types follow different procedures, even during the translation and Desktop Publishing stages. But more on that later!
So, back to the eBook! Even though the text itself is locked and can’t be edited, the source application could still be known. If you go to File > Properties, you will see the Application name mentioned in the data. This name tells you which application the source content was created in. One of the roadblocks we often encounter with our clients is that they work with third-party experts to create their eBooks, and the source file (if shared at all) was only available through a temporary link they have since lost access to. This is why OCR services are often requested.
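For readers who prefer to check this programmatically, here is a minimal Python sketch. It assumes the PDF stores its document information as plain, unencrypted text with the standard /Creator and /Producer keys written as literal strings. Many real files compress or encrypt this data, in which case a dedicated PDF library is the better route. The function name read_info_entry and the sample bytes are hypothetical, made up for illustration.

```python
from typing import Optional


def read_info_entry(pdf_bytes: bytes, key: bytes) -> Optional[str]:
    """Return the literal string stored under `key`, or None if not found.

    Simplified on purpose: hex strings, indirect references, compressed
    object streams and encrypted files are not handled.
    """
    start = pdf_bytes.find(key)
    if start == -1:
        return None
    pos = start + len(key)
    # Skip whitespace between the key and its value.
    while pdf_bytes[pos:pos + 1].isspace():
        pos += 1
    if pdf_bytes[pos:pos + 1] != b"(":
        return None
    depth = 0
    value = bytearray()
    while pos < len(pdf_bytes):
        ch = pdf_bytes[pos:pos + 1]
        if ch == b"\\":
            # Simplified escape handling: keep the escaped byte literally.
            value += pdf_bytes[pos + 1:pos + 2]
            pos += 2
            continue
        if ch == b"(":
            depth += 1
            if depth > 1:  # nested parenthesis belongs to the value
                value += ch
        elif ch == b")":
            depth -= 1
            if depth == 0:  # closing parenthesis of the whole string
                return value.decode("latin-1")
            value += ch
        else:
            value += ch
        pos += 1
    return None


# Illustrative bytes only, not taken from a real file.
sample = (b"1 0 obj << /Creator (Adobe InDesign 16.0 (Macintosh)) "
          b"/Producer (Adobe PDF Library 15.0) >> endobj")
print(read_info_entry(sample, b"/Creator"))
print(read_info_entry(sample, b"/Producer"))
```

If the Creator entry names a layout application, that is a hint the file was exported from that application rather than scanned.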
OCR simplifies the process by running the PDF through a tool that extracts the content and exports it to an editable format. But it is not as simple as it sounds. For one, OCR is still far from perfect, and textual as well as cosmetic errors are bound to slip in.
Here is what this looks like step by step!
Pre-Processing: Optimizing The Image For OCR Application
Pre-processing is an essential step when it comes to scanned documents. After the image has been imported, it may go through a pre-processing stage that increases recognition accuracy. Since resolution, font type and size, scan quality, and other factors can greatly affect the output quality, applying this stage effectively can save a significant amount of time.
More specifically, this refers to image optimization involving:
- De-skewing: the image may need to be rotated slightly one way or the other to ensure correct alignment
- Binarization: if there is any color, the image or document will be converted to black and white to prepare the file for an optimal reading
- Despeckling: spots will be removed, and edges smoothed out
- Removal of non-glyph lines and the detection of important lines and shapes
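- Script recognition: identifying the script in use, down to the word level in multilingual documents, so the right OCR settings can be applied
- Character segmentation: characters that are joined together in the image are separated, so each one can be recognized on its own
- Normalization of the aspect ratio and scale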
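To make two of these steps more concrete, here is a minimal Python sketch of binarization and despeckling on a tiny grayscale grid. It is an illustration under stated assumptions, not how any particular OCR tool works internally: the threshold is picked with Otsu's method (one common choice), a "speck" is assumed to be a single dark pixel with no dark neighbours, and the page data is invented.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        variance = w_dark * w_light * (mean_dark - mean_light) ** 2
        if variance > best_var:
            best_var, best_t = variance, t
    return best_t


def binarize(pixels, threshold):
    """Return 1 for ink (dark) pixels and 0 for paper (light) pixels."""
    return [[1 if v <= threshold else 0 for v in row] for row in pixels]


def despeckle(binary):
    """Remove ink pixels that have no ink pixel among their 8 neighbours."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned


# Invented 4x6 "scan": a 2x2 block of ink plus one stray dark pixel.
page = [
    [220, 220, 220, 220, 220, 220],
    [220,  30,  40, 220, 220, 220],
    [220,  30,  40, 220, 220,  35],
    [220, 220, 220, 220, 220, 220],
]
threshold = otsu_threshold(page)
binary = binarize(page, threshold)
cleaned = despeckle(binary)
for row in cleaned:
    print("".join("#" if v else "." for v in row))
```

On real scans the speck size and neighbourhood would need tuning, since small but meaningful marks such as diacritics or the dot on an "i" must survive this step.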
Processing The OCR Method
When the image/scanned page has been optimized, OCR can be applied, and the text can be extracted and exported.
Firstly, Document Analysis will be performed, where the image or document is scanned to identify the page structure and its paragraphs, lines, images and, if applicable, barcodes. This step is of particular importance if the same layout needs to be preserved in the output. If the layout needs to be altered according to a new strategy or medium, or even if the analysis wasn’t 100% successful, no worries! We will expand on the possibilities under the Desktop Publishing (DTP) section.
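As a rough illustration of one small part of this analysis, the Python sketch below groups word boxes into text lines based on their vertical position. It assumes a detector has already produced word boxes with pixel coordinates, that the text runs left to right, and that the page is not skewed. The WordBox type, the tolerance value, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    height: int  # box height, in pixels


def group_into_lines(words, tolerance=0.5):
    """Group word boxes whose vertical centres are close into the same line.

    `tolerance` is a fraction of the height of the first word in a line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        centre = word.y + word.height / 2
        for line in lines:
            ref = line[0]
            ref_centre = ref.y + ref.height / 2
            if abs(centre - ref_centre) <= tolerance * ref.height:
                line.append(word)
                break
        else:
            # No existing line is close enough: start a new one.
            lines.append([word])
    # Within each line, read the words from left to right.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


# Invented word boxes, deliberately listed out of reading order.
words = [
    WordBox("Character", x=120, y=12, height=20),
    WordBox("Optical", x=10, y=10, height=20),
    WordBox("Recognition", x=10, y=42, height=20),
    WordBox("(OCR)", x=150, y=40, height=20),
]
for line in group_into_lines(words):
    print(line)
```

Multi-column pages, tables, and right-to-left scripts need more than this, which is part of why layout analysis is rarely 100% successful.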
After the analysis has isolated the areas, the actual recognition will be performed and the words and characters will be “recognized”. The parameters that play an essential role here are fonts, print types, and languages.
When it comes to the language parameters, it is worth remembering that most quality tools have language settings included, so you only need to select your preferred one. Even so, specific OCR tools will benefit certain languages more than others. For example, while ABBYY FineReader is an all-around powerful tool, right-to-left (RTL) languages such as Arabic could benefit more from the IRIS ReadIRIS Pro series.
Adding The Human Touch: Linguist Assessment Of OCR And Translation
Once the document has been analyzed and the textual areas defined, the OCR tool will still give you the option to select some predefined settings to determine the outcome of your document. During this Text Verification process, errors such as corrupted characters or missing words that couldn’t be extracted will be weeded out by native linguists/proofreaders, who compare the original text against the OCR outcome to make sure it is free of any issues and identical to the source text.
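Part of this comparison can be supported with a simple script. The Python sketch below uses the standard difflib module to list where OCR output differs from a trusted reference. It assumes such a reference exists, for instance a sample page keyed in by hand, which will not always be the case. The function name ocr_discrepancies and the sample sentences are invented for illustration.

```python
import difflib


def ocr_discrepancies(reference, ocr_output):
    """List word-level differences between a trusted text and OCR output.

    Returns tuples of (kind, reference words, OCR words), where kind is
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    issues = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            issues.append(
                (kind, " ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2]))
            )
    return issues


reference = "The modern clinic opened in 2010"
ocr_text = "The rnodern clinic opened 2O10"  # 'm' read as 'rn', zero read as 'O'
for kind, expected, found in ocr_discrepancies(reference, ocr_text):
    print(f"{kind}: expected '{expected}', found '{found}'")
```

A list like this does not replace the linguist's review. It only points the reviewer to the spots that most likely need attention.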
Despite having received expert attention for decades, OCR is still not perfect. As a result, the extracted content can contain errors that need to pass the test of the human eye. These errors typically relate to character exchange, where one character is mistaken for a similar-looking one, and to hyphenation: hyphenated terms, or words split by a hyphen at the end of a line to keep the page aligned, can be read as separate tokens.
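The hyphenation problem in particular lends itself to a small clean-up step. Here is a minimal Python sketch that rejoins words split by a hyphen at the end of a line. It assumes the OCR output keeps the original line breaks, and it relies on a hypothetical list of known compounds (known_compounds) to decide when the hyphen should stay, since the script cannot know that by itself.

```python
import re

# A word fragment, a hyphen at the very end of a line, then another fragment.
_HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def rejoin_hyphenated(text, known_compounds=frozenset()):
    """Rejoin words split across lines; keep the hyphen for known compounds."""
    def _join(match):
        left, right = match.group(1), match.group(2)
        if f"{left}-{right}".lower() in known_compounds:
            return f"{left}-{right}"
        return left + right
    return _HYPHEN_BREAK.sub(_join, text)


ocr_text = "Optical character recog-\nnition is a well-\nknown technique."
print(rejoin_hyphenated(ocr_text, known_compounds={"well-known"}))
```

Any compound missing from the list would lose its hyphen, so the result still belongs in front of a proofreader.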
The linguist’s expertise, combined with the OCR tool’s features, will allow you to:
- Replace unrecognized characters with the correct variant.
- Edit the text as needed for any errors that sneaked in.
- Specify the extent to which the layout is to be reconstructed and the text formatting maintained.
- Export the files in the desired format.
Speaking of formats, most OCR tools let you export the content in a format that suits the content or indeed the translation strategy. One of these formats is XML. XML is the most frequently used format in the translation & localization industry due to its compatibility with Computer-Assisted Translation (CAT) Tools. These tools simplify and facilitate the translation process from a technical perspective and, in doing so, boost efficiency.
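To give an idea of what such an export can look like, here is a minimal Python sketch that writes verified text segments to XML with the standard library. The element and attribute names (document, segment, source-lang, id) are a made-up schema for illustration only. In a real project, the structure is dictated by the OCR tool's export or by whatever the CAT tool expects to import.

```python
import xml.etree.ElementTree as ET


def segments_to_xml(segments, source_lang):
    """Serialize text segments into a simple, hypothetical XML structure."""
    root = ET.Element("document", {"source-lang": source_lang})
    for number, text in enumerate(segments, start=1):
        segment = ET.SubElement(root, "segment", {"id": str(number)})
        segment.text = text  # special characters such as & are escaped on output
    return ET.tostring(root, encoding="unicode")


segments = ["Terms & Conditions", "Chapter 1"]
xml_text = segments_to_xml(segments, source_lang="en")
print(xml_text)
```

Keeping each segment in its own element, with a stable identifier, is what lets a translation tool hand the translated text back in the same structure.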
When the text has been edited to satisfaction, the TEP (Translation, Editing, Proofreading) process can be implemented, followed by human QA. CAT Tools are ideally enriched with a Glossary detailing technical and product-related terminology, a Style Guide with any brand-specific instructions, and a Translation Memory (TM), so any inconsistencies can be weeded out and the process can be automated in part.
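The Python sketch below illustrates, in a very simplified way, the two kinds of help mentioned here: a glossary check that flags a translation missing an approved term, and a Translation Memory lookup that finds a similar, previously translated segment. Commercial CAT Tools use their own matching logic. The similarity measure (difflib's ratio), the 0.75 threshold, the function names, and the sample entries are all assumptions made for illustration.

```python
import difflib


def glossary_violations(source, target, glossary):
    """Return (source term, approved translation) pairs where the term occurs
    in the source segment but the approved translation is missing."""
    return [
        (term, approved)
        for term, approved in glossary.items()
        if term.lower() in source.lower() and approved.lower() not in target.lower()
    ]


def best_tm_match(segment, memory, threshold=0.75):
    """Return (stored source, stored translation, score) for the most similar
    TM entry, or None if nothing reaches the threshold."""
    best, best_score = None, threshold
    for stored_source, stored_target in memory.items():
        score = difflib.SequenceMatcher(
            None, segment.lower(), stored_source.lower()
        ).ratio()
        if score >= best_score:
            best, best_score = (stored_source, stored_target, score), score
    return best


# Invented sample data.
glossary = {"power button": "bouton d'alimentation"}
memory = {"Press the power button.": "Appuyez sur le bouton d'alimentation."}

print(glossary_violations(
    "Press the power button.", "Appuyez sur l'interrupteur.", glossary
))
print(best_tm_match("Press the power button twice.", memory))
```

A fuzzy match like this is only a suggestion for the translator to adapt, which is why the human QA step stays in place.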
Unlocking OCR Potential: Are We On Our Way To A New And Improved OCR?
We’ve said it before: OCR is still far from perfect. But its great potential and usefulness in the translation industry have sparked interest in developing OCR into a more efficient tool.
A September 2020 paper [1] by two Argentinian researchers details the creation of a purpose-built dataset of annotated images taken from manga, the popular Japanese comics. Existing methods and tools, so the duo says, have not yet reached the point where they can effectively binarize the content of text balloons. So, they decided to develop their own approach, one that goes as deep as the pixel level, to make OCR a viable tool for manga.
You may think, why manga? The unusual, vibrant, and dynamic style of comic books makes text detection all the more challenging. If OCR can prove successful here, you can see how it can add value for the types of documents generally received from clients, which tend to be simpler.
What’s more, creating a dataset specifically to analyze and extract text in manga would be a great step towards the commercialization of OCR, which is, according to a recent review [2], exactly what we need to develop more cost-effective, higher-quality OCR tools.
Desktop Publishing (DTP): Get That Design Just Right
An attractive design is more than just one that looks nice. It should support both the content and the brand identity. A poorly designed image, document, or eBook will not motivate your audience to pay attention to the content. And if it is not true to your brand identity, it won’t help you promote brand awareness.
This is why, next to linguists, DTP experts should also be included in the review and, if needed, the formatting process. Highly trained in respecting the detail and visual potency of a product, DTP professionals also know how to address cultural variances in a way that upholds the visual integrity of your materials in the global market. After all, different cultures have different ideas of what is visually appealing.
The DTP-Expert’s job includes:
- Selecting the right font for the right language. There is no universal font that supports all languages. It is important to consider diacritical marks, style, height, but also local preferences when determining which font type would suit which language.
- Offering recommendations for best practices on the impact of your graphical layout in the context of country-specific standards. While the source design can certainly be maintained, a new design may be advisable depending on your target locale. And this is easy enough to achieve for the DTP team since the OCR conversion has provided an editable format.
- Being particularly helpful when it comes to formatting complex languages and characters.
- Fixing any visual errors that may have been overlooked by the OCR tool.
Now, here is another tricky part. When it comes to DTP, the procedure for scanned documents is distinctly different from the procedure for documents exported from an application. In the case of scanned documents, DTP experts try to replicate the page layout as closely to the original as possible. But what if the PDF is the output of an application like InDesign or Illustrator? In this case, you are dealing with high-resolution files that can be re-created using the same source application. In other words, the experts can create their own source/working files and, in doing so, have a great chance of maintaining the exact format and layout of the original documents.
While imperfect, OCR can be a valuable tool in the LSP’s toolbox when handling specific client requests. With the combination of quality tools, native linguists, and DTP experts, any scanned document or image with embedded text can be translated and localized effectively.
At Laoret, we develop our own tools and technologies, dynamically designed to meet any client’s needs and industry standards. Our tools are leveraged by native, in-country professionals experienced in our clients’ subject matter and specifically trained in applying technologies in a way that saves both time and money.