Five Tools to Build Your Basic Machine Translation Toolkit
Topic: Machine Translation
If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. Having a strong toolkit, and knowing how to use it, will save you loads of time and headaches and help you work more efficiently.
As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read this article.
Advanced Text Editors
The text editor that comes with your laptop won’t cut it, trust me. You need an advanced text editor that can provide at least these capabilities:
- Deal with different encodings (UTF, ANSI, etc.)
- Open big files, sometimes with unusual formats or extensions
- Do global search and replace operations with regular-expression support
- Highlight syntax (display different programming, scripting, or markup languages — such as XML and HTML — with color codes)
- Have multiple files open at the same time (that is, support tabs)
This is a list of my favorite text editors, but there are a lot of good ones out there.
Notepad++ is my editor of choice. You can open virtually any file with it, it’s really fast, and it will keep your files in the editor even if you quit it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \t). It’s really easy to convert to and from different encodings and to save all open files at once. You can also download different plug-ins, like spellcheckers, comparators, etc. It’s free, and you can download it from the Notepad++ web site.
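To give an idea of what a regular-expression search and replace or an encoding conversion actually does, here is a minimal Python sketch. It is only an illustration, not how Notepad++ works internally: the function names, the GB/MB pattern, and the cp1252-to-UTF-8 default are assumptions made up for this example.

```python
import re


def fix_unit_spacing(text: str) -> str:
    """Regex search and replace: put exactly one space between a number and GB/MB."""
    # (\d+) captures the number, \s* swallows any existing spacing,
    # (GB|MB) captures the unit; \1 and \2 reinsert both captured groups.
    return re.sub(r"(\d+)\s*(GB|MB)\b", r"\1 \2", text)


def convert_encoding(data: bytes, source: str = "cp1252", target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    print(fix_unit_spacing("iPhone 6s 16GB\tunlocked"))
    print(convert_encoding("café".encode("cp1252")))
```

The capture groups are what make the replacement flexible: one pattern fixes every number and unit combination, instead of one search per value.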
Sublime Text is another amazing editor, and it’s a developers’ favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, splitting a selection of words into different lines, etc. It supports regular expressions and tabs as well. It has a distraction-free mode if you really need to focus. It comes with a free trial period, and you can get it at the Sublime Text web site.
Syntax highlighting, document comparison, regular expressions, handling of huge files, encoding conversion: EmEditor is complete. My favorite feature, however, is the scriptable macros. This means that you can create, record, and run macros within EmEditor, and you can use these macros to automate repetitive tasks, like making changes in several files or saving them with different extensions. You can download it from the EmEditor web site.
Language Quality Assurance Tools
Quality Assurance tools assist you in automatically finding different types of errors in translated content. They all basically work in a similar way:
- You load files with your translated content (source + target).
- You optionally load reference content, like glossaries, translation memories, previously translated files, or blacklists.
- The tool checks your content and provides a report listing potential errors.
You can find lots of errors using a QA tool:
- Terminology: where term A in the source is not translated as B in the target
- Blacklisted terms: terms you don’t want to see in the target
- Inconsistencies: same source segment with different translations
- Differences in numbers: when the source and target numbers don’t match
- Punctuation: missing or extra periods, duplicate commas, and so on
- Patterns: certain user-defined patterns of words, numbers and signs (which may contain regular expressions to make them more flexible) expected to occur in a file.
- Grammar and spelling errors
- Duplicate words, tripled letters, and more
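To make a few of these checks more concrete, here is a minimal Python sketch of three of them: number mismatches, blacklisted terms, and duplicate words. This is not how any particular QA tool implements them. The function names, the number pattern, and the single-word blacklist are simplifying assumptions for illustration.

```python
import re
from collections import Counter

# Assumption: a "number" is a run of digits, optionally with . or , separators.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_mismatch(source: str, target: str) -> bool:
    """True when source and target do not contain the same numbers."""
    # Locale differences (1,000 vs 1.000) are flagged too, so treat
    # hits as potential errors to review, not as confirmed ones.
    return Counter(NUMBER.findall(source)) != Counter(NUMBER.findall(target))


def blacklisted_terms(target: str, blacklist) -> list:
    """Return the single-word blacklisted terms found in the target."""
    words = set(re.findall(r"\w+", target.lower()))
    return sorted(words & {term.lower() for term in blacklist})


def repeated_words(target: str) -> list:
    """Return words that appear twice in a row, such as 'para para'."""
    pattern = re.compile(r"\b(\w+)\s+\1\b", flags=re.IGNORECASE)
    return [match.group(1) for match in pattern.finditer(target)]
```

Each function takes plain strings, so the same checks could be run over every source and target pair in a bilingual file and collected into a report.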
Xbench allows you to run the following QA Checks:
- Untranslated segments
- Segments with the same source text and different target text
- Segments with the same target text and different source text
- Segments whose target text matches the source text (potentially untranslated text)
- Tag mismatches
- Number mismatches
- Double blanks
- Repeated words
- Terminology mismatches against a list of key terms
- Spell-check translations.
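Two of the checks in this list are easy to picture in code. The following Python sketch is only an illustration under assumptions, not Xbench's implementation: it assumes segments are available as (source, target) pairs, and the function names are hypothetical.

```python
from collections import defaultdict


def inconsistent_translations(segments):
    """Map each source text to its targets when it has more than one distinct target."""
    targets_by_source = defaultdict(set)
    for source, target in segments:
        targets_by_source[source].add(target)
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


def possibly_untranslated(segments):
    """Return segments whose target is identical to the source."""
    return [(s, t) for s, t in segments if s.strip() == t.strip()]


if __name__ == "__main__":
    demo = [
        ("Leather case", "Funda de cuero"),
        ("Leather case", "Funda de piel"),
        ("Free shipping", "Free shipping"),
    ]
    print(inconsistent_translations(demo))
    print(possibly_untranslated(demo))
```

Swapping source and target in the first function would give the reverse check: the same target text used for different source texts.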
Some linguists like to load all their reference materials (translation memories, glossaries, termbases, and other reference files) into Xbench, as the tool lets you look up a term with a keyboard shortcut while working in any other running application.
Xbench also has an Internet Search tab to run searches on Google. The list is pretty limited, but there are ways to expand it.
Checkmate is the QA tool of the Okapi Framework, an open-source suite of applications that supports the localization process. The framework includes other tools, but Checkmate is the one you want for running quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc.
The patterns section is especially interesting; I will come back to it in the future. Checkmate also produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open-source spelling and grammar checker.
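As an example of the length check mentioned above, here is a minimal Python sketch of how a significant difference in length between source and target could be flagged. It is an assumption for illustration only, not Checkmate's actual logic: the function name, the character-based comparison, and the default ratio of 2.0 are all hypothetical.

```python
def length_ratio_flag(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """True when one side is more than max_ratio times longer than the other."""
    shorter, longer = sorted((len(source.strip()), len(target.strip())))
    if shorter == 0:
        # An empty side is suspicious unless both sides are empty.
        return longer > 0
    return longer / shorter > max_ratio
```

A sensible ratio depends on the language pair, since some languages expand or contract noticeably in translation, so the threshold would need tuning on real content.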
Comparison Tools
Why do you need a comparison tool? Comparing files is a very practical way to see in detail what changes were introduced, for example, which words were replaced, which segments contain changes, or whether there is any content added or missing. Comparing different versions of a file (for example, before and after post-editing) is essential for processes that involve multiple people or steps. Beyond Compare is, by far, the best and most complete comparison tool, in my opinion.
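To show what comparing a segment before and after post-editing boils down to, here is a minimal Python sketch based on the standard difflib module. It is only an illustration, not how Beyond Compare works; the function name and the word-level tokenization by whitespace are assumptions.

```python
import difflib


def word_changes(before: str, after: str):
    """Return (operation, old_words, new_words) for every changed span."""
    old, new = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, " ".join(old[i1:i2]), " ".join(new[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, and deletions
    ]


if __name__ == "__main__":
    raw_mt = "funda de piel para iPhone"
    post_edited = "funda de cuero para iPhone"
    print(word_changes(raw_mt, post_edited))
```

Collecting these changes over a whole file gives a quick picture of which words post-editors replace most often.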
With Beyond Compare, you can compare entire folders, too. If you work with many files, comparing two folders is an effective way to determine if you are missing any files or if a file does not belong in a folder. You can also see if the contents of the files are different or not.
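The same kind of folder comparison can be sketched in Python with the standard filecmp module. This is a simplified illustration, not a replacement for a dedicated tool; the function name and the keys of the returned dictionary are assumptions.

```python
import filecmp


def compare_folders(left: str, right: str) -> dict:
    """Summarize which files are missing, different, or identical in two folders."""
    # Note: dircmp compares file type, size, and modification time first,
    # and only reads the contents when those signatures differ.
    result = filecmp.dircmp(left, right)
    return {
        "only_in_left": sorted(result.left_only),
        "only_in_right": sorted(result.right_only),
        "different": sorted(result.diff_files),
        "identical": sorted(result.same_files),
    }
```

Calling compare_folders on a folder of raw MT output and a folder of post-edited files, for example, would list any file that is missing on either side and any file whose contents changed. It only looks at the top level of each folder, not at subfolders.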
Text Analysis Tools
As defined by its website, AntConc is a “freeware corpus analysis toolkit for concordancing and text analysis.” This is, in my opinion, one of the most helpful tools you can find out there when you want to analyze your content, regardless of the language. AntConc will let you easily find n‑grams and sort them by the number of occurrences. This is extremely important when you want to find patterns in your content. Remember: with MT, you want to fix patterns, not specific occurrences of errors. It may sound obvious, but finding and fixing patterns is a more efficient way to get rid of an issue than trying to fix each particular instance of an error.
AntConc will also create a list of each word in your content, preceded by the number of hits. This is extremely helpful for your terminology work. There are so many things you can use this tool for that it deserves its own article.
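To illustrate what an n-gram list and a word list are, here is a minimal Python sketch. It is not AntConc's implementation, and the function names and the simple word tokenization are assumptions for this example.

```python
import re
from collections import Counter


def ngram_counts(lines, n: int = 2) -> Counter:
    """Count every sequence of n consecutive words, line by line."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        # zip over n shifted copies of the token list to build the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def print_frequency_list(counts: Counter, top: int = 10) -> None:
    """Print the most frequent entries, each preceded by its number of hits."""
    for ngram, hits in counts.most_common(top):
        print(hits, " ".join(ngram))


if __name__ == "__main__":
    titles = [
        "funda para iPhone 6",
        "funda para Samsung",
        "cargador para iPhone 6",
    ]
    print_frequency_list(ngram_counts(titles, n=2))
    print_frequency_list(ngram_counts(titles, n=1))  # n=1 gives a plain word list
```

With n=1 the result is a word list with hit counts; with larger values of n, recurring word sequences float to the top, which is where error patterns in MT output tend to show up.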
CAT Tools
In most cases, using a translation memory (TM) is a good idea. If you are not familiar with computer-assisted translation (CAT) tools, they provide a translation environment that combines an editor, a translation memory, and a terminology database. The key part here is the TM, which is essentially a database that stores translations as translation units (that is, a source segment plus a target segment). If you come across the same or a similar source segment later, the TM will “remember” the translation you previously stored there.
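The idea of a TM that "remembers" identical or similar segments can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any CAT tool is implemented: the class name, the character-based similarity score from difflib, and the 0.75 threshold are all hypothetical, and real tools compute fuzzy matches in their own ways.

```python
import difflib


class TranslationMemory:
    """Toy TM: stores translation units and returns the best fuzzy match."""

    def __init__(self):
        self.units = {}  # source segment -> target segment

    def add(self, source: str, target: str) -> None:
        self.units[source] = target

    def lookup(self, source: str, threshold: float = 0.75):
        """Return (score, stored_source, target) for the best match, or None."""
        best = None
        for stored_source, target in self.units.items():
            score = difflib.SequenceMatcher(
                None, source.lower(), stored_source.lower()
            ).ratio()
            if score >= threshold and (best is None or score > best[0]):
                best = (score, stored_source, target)
        return best


if __name__ == "__main__":
    tm = TranslationMemory()
    tm.add("Red leather case for iPhone 6", "Funda de cuero roja para iPhone 6")
    print(tm.lookup("Red leather case for iPhone 7"))
```

A score of 1.0 corresponds to an exact match; anything between the threshold and 1.0 is a fuzzy match that still needs editing.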
CAT tools make a great post-editing environment. Most modern tools can be connected to different machine translation systems, so you get suggestions both from a TM and from an MT system. And you can use the TM to save your post-edited segments and reuse them in the future. If you have to use glossaries or term bases, CAT tools are ideal, as they can also display terminology suggestions.
When post-editing with a CAT tool, there are usually two approaches: you can get MT matches from a TM (assuming, of course, that the matches were added to it beforehand) or from a connected MT system, or you can work on bilingual, pre-translated files and store only post-edited segments in your TM.
If you have never tried it, I totally recommend Matecat. It’s a free, open-source, web-based CAT tool, with a nice and simple editor that is easy to use. You don’t have to install a single file. Its developers claim you will always get up to 20% more matches than with any other CAT tool. Considering that some tools out there cost around 800 dollars, what Matecat has to offer for free can’t be ignored. It can process 50+ file types; you can get statistics on your files (like word counts or even how much time you spent on each segment), split them, save them on the cloud, and download your work. Even if you have never used a CAT tool before, you will feel comfortable post-editing in Matecat in just a few minutes.
Another interesting free, open-source option is OmegaT. It’s not as user-friendly as Matecat, so you will need some time to get used to it, even if you are an experienced TM user. It has pretty much all the same main features commercial CAT tools have, like fuzzy matching, propagation, and support for around 40 different file formats, and it boasts an interface with Google Translate. If you have never used it, you should give it a try.
If you are looking into investing some money and getting a commercial tool, my personal favorite is memoQ. It has tons of cool features and overall is a solid translation environment. It probably deserves a more detailed review, but that is outside the scope of this post.
If you enjoyed this article, please check other posts from the eBay MT Language Specialists series.