How To Use Regular Expressions To Identify Translatable Content
Imagine you are a translator and you are given a file that is full of content that looks like this:
You will probably be sending an angry email to your Project Manager, or you might just feel inclined to walk away altogether. If you are unlucky enough to receive this kind of document, your challenge is threefold.
Firstly, you are being exposed to a piece of content that is visually bewildering and takes focus away from the context. And as you know, context is very important as it allows you to make informed translation decisions.
Secondly, the code in an HTML file (for example) can be edited just like the text, and it might not be clear to everyone what should be translated and what should not. This means you could end up with a pretty odd-looking translation. Even worse, if this translation actually makes it to the stage where it is imported back into its original format, there will be a lot of bugs, especially if you not only translated but also inadvertently deleted parts of the code.
And last but certainly not least, you will lose a lot of time trying to figure out how to go about it. And these days, as I’m sure you are aware, a fast turnaround is about as important as the quality of the translation.
Regular Expressions Are Here To Save The Day
You can avoid this triple threat by making use of Regular Expressions. A Regular Expression, or Regex for short, is a sequence of characters that forms a search pattern. Does this sound a little vague? Well, here are a couple of examples:
REGULAR EXPRESSION | MATCHES STRINGS THAT … |
localization | contain {localization} |
g[aio]ggle | contain {gaggle, giggle, goggle} |
\d | contain {0,1,2,3,4,5,6,7,8,9} |
\d{5}(-\d{4})? | contain US Zip Codes |
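To make the table a little more concrete, here is a minimal sketch that tries the four patterns with Python's built-in `re` module. The sample strings and the `contains` helper are made up for illustration.

```python
import re

# The four patterns from the table above, written as raw strings.
PATTERNS = {
    "literal": r"localization",
    "character class": r"g[aio]ggle",
    "digit": r"\d",
    "US ZIP code": r"\d{5}(-\d{4})?",
}

def contains(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

print(contains(PATTERNS["literal"], "software localization"))      # True
print(contains(PATTERNS["character class"], "a giggle"))           # True
print(contains(PATTERNS["character class"], "google"))             # False
print(contains(PATTERNS["digit"], "no digits here"))               # False
print(contains(PATTERNS["US ZIP code"], "Austin, TX 78701-1234"))  # True
```

Note that these patterns match anywhere inside a string, which is why the table says "contain" rather than "equal".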
The application of Regex identifies the non-translatable content and locks it so that only the translatable content is editable. Let’s see what this actually looks like when implemented in your Translation (CAT) Tools. The non-translatable content is displayed as a tag (see the purple boxes in the screenshot below). This content is locked, meaning the translators can’t edit or delete the code. The translation can securely and easily be imported, with any possible bugs kept to an absolute minimum.
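The locking step can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular CAT tool implements it: the rule set (HTML-style tags and `{curly}` placeholders), the numbered `[1]` tag format and the `lock`/`unlock` helpers are all assumptions.

```python
import re

# Assumed rule set: HTML-style tags and {curly} placeholders are non-translatable.
NON_TRANSLATABLE = re.compile(r"<[^>]+>|\{[^{}]+\}")

def lock(segment):
    """Replace each non-translatable match with a numbered tag and keep the original."""
    locked = []

    def _swap(match):
        locked.append(match.group(0))
        return f"[{len(locked)}]"

    return NON_TRANSLATABLE.sub(_swap, segment), locked

def unlock(translated, locked):
    """Put the original code back in place of the numbered tags."""
    return re.sub(r"\[(\d+)\]", lambda m: locked[int(m.group(1)) - 1], translated)

source = "Hello <b>{user_name}</b>, welcome back!"
text, tags = lock(source)
print(text)                      # Hello [1][2][3], welcome back!

translated = "Bonjour [1][2][3], bon retour !"
print(unlock(translated, tags))  # Bonjour <b>{user_name}</b>, bon retour !
```

The translator only ever sees the numbered tags, so the code behind them cannot be altered by accident. A real tool would also need to cope with source text that already contains something like `[1]`, which this sketch ignores.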
Perhaps you are reading about all this searching, locking and securing content and you think: hey, this sounds like it could be useful for more than just translations. And you are right. Mobile Apps, for example, should ideally undergo automated QA on a regular basis. While doing so, it is highly advisable to lock crucial data, such as prices and links, so no valuable info will be lost. Regular Expressions can be used for this as well.
If this seems a bit too arduous to start with, know that you are not alone. There are a number of online resources and tutorials with standardized reference guides to pull from, as well as tools that can help you with the testing.
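As a rough sketch of such a QA check, the snippet below flags prices and links that appear in the source text but not, character for character, in the translation. The `PROTECTED` pattern and the `missing_protected` helper are assumptions for illustration; a real project would tune the pattern to its own currencies and URL styles.

```python
import re

# Assumed patterns for "crucial data": http(s) links and simple prices.
PROTECTED = re.compile(r"https?://\S+|[$€£]\d+(?:[.,]\d{2})?")

def missing_protected(source, target):
    """Return protected items from the source that are not found verbatim in the target."""
    return [item for item in PROTECTED.findall(source) if item not in target]

source = "Upgrade for $4.99 at https://example.com/pro today."
good = "Mejora por $4.99 en https://example.com/pro hoy."
bad = "Mejora por $499 en https://example.com/pro hoy."

print(missing_protected(source, good))  # []
print(missing_protected(source, bad))   # ['$4.99']
```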
Not Every File Format Or Tool Is Treated Equally
Depending on the translation (CAT) Tool you use, there may or may not be a RegEx Template provided. Some tools come with ready-made templates that include the most commonly used Expressions for specific file formats (e.g. HTML, XML, JSON), which automates a large part of the process. With other Tools, the Regular Expressions will need to be inserted manually and built from the ground up. Even when a template is provided, some Expressions may need to be added over time.
In order to fully understand this, think about the database used in translation (CAT) Tools known as the Translation Memory (TM). The TM stores segments from previously translated content, making any subsequent translations run more smoothly and efficiently. Similarly, the Regular Expressions that are added are saved in a database that can be applied automatically as the projects build up.
Note that while an extensive database can be built, each file type will require a different set of Expressions because they consist of a different set of variables. The following file formats support tags, and each format will require its own implementation of Regular Expressions:
INC JSON MQXLIFF PHP PO RESX SDLXLIFF |
STRINGS TJSON TTX XLF XLIFF XLSX XML |
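One way to picture "a different set of Expressions per file type" is a simple lookup from format to patterns. The sketch below is hypothetical: the patterns are illustrative examples of placeholders commonly seen in these formats, not a rule set taken from any CAT tool, and `RULES`, `compile_rules` and `find_locked` are invented names.

```python
import re

# Hypothetical per-format rule sets; each format gets its own list of patterns.
RULES = {
    "JSON": [r"\{\{\s*\w+\s*\}\}"],          # {{ placeholder }}
    "PO": [r"%\(\w+\)[sd]", r"%[sd]"],       # printf-style placeholders
    "STRINGS": [r"%(?:\d+\$)?[@dfs]"],       # format specifiers such as %@ and %1$@
    "XML": [r"<[^>]+>", r"&\w+;"],           # tags and character entities
}

def compile_rules(file_format):
    """Combine the format's patterns into a single alternation."""
    return re.compile("|".join(RULES[file_format.upper()]))

def find_locked(file_format, segment):
    """List the non-translatable pieces of a segment for the given format."""
    return [m.group(0) for m in compile_rules(file_format).finditer(segment)]

print(find_locked("po", "Hello %(name)s, you have %d new messages"))
# ['%(name)s', '%d']
print(find_locked("xml", "Save &amp; <b>close</b>"))
# ['&amp;', '<b>', '</b>']
```

Keeping the rules in one table like this mirrors the idea of a growing database: adding a format, or a new Expression to an existing format, is a one-line change.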
Perfection Is Desired, But Not Always Guaranteed
As you know, not every project is a simple and straightforward one. Some translation and localization projects involve different layers of information and content, resulting in more complex lines of code. Let’s have a look at two such cases and what can be done to spare you large amounts of manual correction:
- Authoring tools come with a wide variety of files and media, from written text to images and multimedia content, so when you are localizing eLearning materials, you can expect to be dealing with more code-rich file extractions. To avoid a huge amount of coding, make sure that prior to exporting the content, you adjust the settings so you will be left with the text and a minimum amount of code.
- Software Localization is another one of those instances where file preparation can prove to be an arduous task. The translatable information tends to be stored in resource files (in Visual C++ environments, these usually have the .RC extension). Now, here is the challenge. Unless a software product is well-internationalized, the translatable text can be scattered throughout the build environment.
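For the resource-file case, a first pass at pulling out translatable text might look like the sketch below. It rests on a simplifying assumption: every translatable entry sits on one line as `IDENTIFIER "text"`, as in a basic string table. The sample content and the `extract_strings` helper are invented, and real .RC files contain menus, dialogs and escaped quotes that this does not handle.

```python
import re

# Simplified assumption: each translatable entry is `IDENTIFIER "text"` on one line.
ENTRY = re.compile(r'^\s*(\w+)\s*,?\s*"(.*)"\s*$')

SAMPLE_RC = '''
STRINGTABLE
BEGIN
    IDS_APP_TITLE   "Photo Editor"
    IDS_SAVE_PROMPT "Save changes to %s?"
END
'''

def extract_strings(rc_text):
    """Map each string identifier to its translatable text."""
    entries = {}
    for line in rc_text.splitlines():
        match = ENTRY.match(line)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries

print(extract_strings(SAMPLE_RC))
# {'IDS_APP_TITLE': 'Photo Editor', 'IDS_SAVE_PROMPT': 'Save changes to %s?'}
```

Even in this tiny sample, the `%s` in the second string is exactly the kind of content a Regular Expression should lock before the text reaches a translator.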
Conclusion:
Not everyone is aware of the damage that can be done when you lose track of which content is translatable and which should not be touched. And while Regular Expressions are usually handled on the Localization Engineer’s end, we hope this post has shown that understanding the importance of RegEx can vastly improve the quality and efficiency of the output on every level. Laoret’s Translation & Localization Services are provided by Localization Engineers highly skilled in the use of Regular Expressions, and linguists whose tech-savviness helps them respect the importance of adhering to a streamlined workflow. We take pride in offering high-quality services with the shortest possible turnaround and are ready to deliver a quote within minutes, and a translation within hours.