Types of Content at eBay: Titles
Topic: Machine Translation
All eBay-generated content is currently translated by our talented localization team, whereas eBay’s user-generated (UG) content is handled by our Machine Translation (MT) engine. It is common knowledge that UG text can get pretty noisy due to typos, non-dictionary terms, etc. At eBay, however, MT deals with more than that. We work with multiple types of UG content — search queries, item titles, and item descriptions — and each presents its own challenges. In the previous post we talked about search queries. This post discusses item titles.
Translating item titles provides our buyers from Russia, Brazil, Spanish-speaking Latin America, France, Italy, Germany, and Spain with an option to view eBay search results in their own language. This allows customers to look through pages of results and make an informed decision on which listings to open, because an image alone does not contain enough information. Being able to read and understand item titles is essential to a positive customer experience, which is why we invest a lot of effort into improving the MT engine for titles.
This type of content is very specific and presents a number of challenges for MT.
A title is a summarized item description composed of keywords. The eBay Help article on writing effective titles encourages sellers to omit punctuation and avoid trying to create a grammatically correct sentence. Following these and other tips is supposed to help sellers create a clear picture of an item and a good first impression, so it is important that the machine translation meets the same expectations. However, the lack of syntax and punctuation presents a problem for an MT engine that is normally trained on sentences. If it tries to translate a sequence of nouns, adjectives, and numbers as a sentence, errors in meaning are unavoidable. It may start looking for a subject and a predicate, and in general for a sentence structure, thus translating adjectives as verbs, moving words around, and so on.
As an example, let’s take a title for a can of paint: “20g Glow in the Dark Acrylic Luminous Paint Bright Pigment Graffiti Party DIY”.
What might go wrong here?
“Glow” may get translated as an imperative form of the verb (as in “Stay in the shaded area!”), and “dark acrylic” as a noun phrase with “acrylic” being a noun, and that is just part of the title. A similar transformation may happen with polysemous words or those that can belong to different parts of speech: “can”, “paint”, “party”, etc. The result of such a translation may be a completely different item.
Segmentation is closely related to the previous issue. Segmenting a title and correctly identifying its semantic units is of utmost importance for machine translation. Take, for example, “Gucci fake snake leather purse”: with incorrect segmentation, we may get a translation that reads as a “Gucci fake” instead of the intended “fake snake leather”. Such translations are the most dangerous because they sound correct and believable yet present misleading information, which in the end may leave both a buyer and a seller unhappy with the experience.
To address these major issues, the science team created an engine just for item titles; it is trained on separate data sets. In addition, they have been working on a named entity recognition (NER) algorithm that identifies semantic units in a title before it goes into the MT engine for translation.
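To make the idea of identifying semantic units before translation more concrete, here is a minimal sketch. It is not the NER algorithm mentioned above; it swaps in a much simpler technique, greedy longest-match lookup against a phrase lexicon. The lexicon contents and the names `KNOWN_UNITS` and `segment_title` are assumptions made up for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical phrase lexicon (an assumption for this sketch); a real system
# would learn such units from data rather than list them by hand.
KNOWN_UNITS: Set[Tuple[str, ...]] = {
    ("glow", "in", "the", "dark"),
    ("snake", "leather"),
    ("fake", "snake", "leather"),
    ("shoulder", "bag"),
}
MAX_UNIT_LEN = max(len(unit) for unit in KNOWN_UNITS)


def segment_title(title: str) -> List[str]:
    """Split a title into units, preferring the longest known phrase at each position."""
    tokens = title.split()
    units: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink down to a single token.
        for n in range(min(MAX_UNIT_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if n == 1 or candidate in KNOWN_UNITS:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
    return units


print(segment_title("Gucci fake snake leather purse"))
```

With this toy lexicon, “fake snake leather” is kept together as one unit, so a downstream translation step would not be tempted to attach “fake” to “Gucci”. A lexicon lookup like this cannot handle unseen phrases, which is one reason a learned NER model is the more realistic choice.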
Sellers tend to use multiple synonyms in a title assuming this will increase the chances of matching search queries and coming up high in search results (which is a common misconception). For MT this means several things:
A chain of adjacent nouns or adjectives that are in no relation to each other
The machine needs to learn to translate them independently of each other. This is similar to the first issue described above, because the engine may try to create agreement where there should be none.
Example: “Baby Toddler Kids Child Mini Cartoon Animal Backpack Schoolbag Shoulder Bag”
We see four synonyms for the age reference and three synonyms for the item itself. The age reference terms are not all adjectives, nor can all of them be translated as adjectives. Even a human translator would have to get creative and produce something like “for a baby/toddler, kids’, child’s”, because we could not simply leave all four of them as nouns; it would sound too abrupt and possibly confusing. The task is much more challenging for a machine. Not only should it avoid creating noun phrases (“Kids Child” may turn into “a kid’s child”), it also needs to rephrase or insert prepositions where necessary (“baby toddler child” -> “for baby, toddler, child”; “kids” -> “kids’”). The best way to approach this varies depending on the target language.
Agreement with the head noun
In our example, there are three synonyms for a head noun: Backpack, Schoolbag, Shoulder Bag. What if they are of different genders in the target language? Which one should the adjectives agree with? A human translator would probably pick the first one, but MT may not think the same way. Here is a bigger challenge: the head noun does not immediately follow the adjectives describing it. In our example there are two other nouns between the attributes “Kids Child” and the head noun “Backpack”. The machine is supposed to figure out that “kids” describes “backpack”, not “cartoon” or “animal”. As you can imagine, however, the most logical decision for a machine would be to connect “kids” with “cartoon”.
Agreement plays a very important role in translating item titles, because it provides a customer with a description of features and qualities of the item. If you connect an attribute with the wrong noun, it will modify an incorrect object and produce an overall misleading translation. In our example, with the incorrect agreement, a user will read: “backpack with a kids’ cartoon animal”, which is in essence a different item than a “kids’ backpack with a cartoon animal”. One may argue that an image would be a clear indication that the item is a kids’ backpack. Unfortunately, a picture is not always a reliable source of information. In our case, there are similar backpacks for adults, which is why an accurate translation will make a difference.
Sellers use multiple acronyms to save space and fit as much information in a title as possible. For MT this presents several challenges.
- Rare, unknown acronyms or acronyms that sellers made up on the spot. Gathering more training data and compiling additional lists of expanded out-of-vocabulary (OOV) acronyms is helping address that.
- Polysemous acronyms that have different translations in different categories. The most challenging acronyms are the ones that have more than one meaning within the same category. For example, “RN” appears in Clothing, Shoes and Accessories as “registered nurse”, “Rapa Nui”, “Rusty Neal”, and as part of model names for Nike, Hugo Boss, A&F and other brands.
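The two acronym problems above can be sketched as a category-aware lookup. This is only an illustration under stated assumptions, not a description of the actual system: the table `EXPANSIONS`, the function `expand_acronym`, and the “NWT” entry are hypothetical. Only the “RN” readings come from the example above.

```python
from typing import Dict, List, Tuple

# Hypothetical expansion table keyed by (category, acronym).
# An acronym with several entries in the same category stays ambiguous.
EXPANSIONS: Dict[Tuple[str, str], List[str]] = {
    ("Clothing, Shoes and Accessories", "RN"): [
        "registered nurse",
        "Rapa Nui",
        "Rusty Neal",
    ],
    # Illustrative entry, not taken from the article.
    ("Clothing, Shoes and Accessories", "NWT"): ["new with tags"],
}


def expand_acronym(category: str, token: str) -> Tuple[str, str]:
    """Return (status, text), where status is 'expanded', 'ambiguous', or 'unknown'."""
    candidates = EXPANSIONS.get((category, token.upper()), [])
    if len(candidates) == 1:
        # Exactly one reading in this category: safe to expand before translation.
        return "expanded", candidates[0]
    if candidates:
        # Several readings in the same category: leave the acronym untouched.
        return "ambiguous", token
    # Out-of-vocabulary acronym: also left untouched.
    return "unknown", token


print(expand_acronym("Clothing, Shoes and Accessories", "RN"))
```

Keeping ambiguous and unknown acronyms as they are is a deliberately conservative choice in this sketch; picking the wrong expansion would produce exactly the kind of believable but misleading translation described earlier.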
Writings and names of songs/music bands/movies/video games
This is common content for certain categories. Singling a movie or song title out from the rest of the string may be difficult because there is often no contextual information pointing to the fact that it is a movie or a song. It is not much of a problem in the DVD or Music category, but quite often you will find a reference to a movie title or a music band name in other categories such as Collectables or Clothing. It is also common for sellers to quote a writing on the item they are selling. Ideally, we would want the writing to be left as is so that the customer would know exactly what the item depicts. As you can imagine, however, literally anything can be written on a t-shirt or a poster, which is why it is very difficult for a machine to differentiate a writing from the actual item description. In such cases a user would have to rely on the quality and size of an image, which may not be the best on the search results page.
In one poster listing, for example, “New York Vermont Quebec” is part of the poster design, but it is barely visible in the image. In the text of the item title, however, it may be interpreted as the poster’s location, the places it originally came from, etc. Identifying this as verbatim writing, thus keeping it in English, would be a very difficult task for an MT engine, but it would clearly benefit an eBay customer.
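Once a verbatim span has been identified (which, as noted above, is the hard part), keeping it in English is comparatively simple. The sketch below assumes the spans are already known and uses a common placeholder-masking approach; the function names, the `__KEEP0__` placeholder format, and the sample title are hypothetical, and `translate` stands in for any function that maps a string to its translation.

```python
from typing import Callable, Dict, List, Tuple


def mask_spans(title: str, verbatim: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each verbatim span with a placeholder the MT step should leave alone."""
    mapping: Dict[str, str] = {}
    masked = title
    for index, span in enumerate(verbatim):
        placeholder = f"__KEEP{index}__"
        if span in masked:
            masked = masked.replace(span, placeholder, 1)
            mapping[placeholder] = span
    return masked, mapping


def unmask(translated: str, mapping: Dict[str, str]) -> str:
    """Put the original English spans back in place of their placeholders."""
    for placeholder, span in mapping.items():
        translated = translated.replace(placeholder, span)
    return translated


def translate_keeping_verbatim(
    title: str, verbatim: List[str], translate: Callable[[str], str]
) -> str:
    """Translate a title while carrying the given spans through unchanged."""
    masked, mapping = mask_spans(title, verbatim)
    return unmask(translate(masked), mapping)


# str.upper is a stand-in for a real translation function in this demo.
print(translate_keeping_verbatim(
    "Vintage Poster New York Vermont Quebec",
    ["New York Vermont Quebec"],
    str.upper,
))
```

This only works if the translation step passes placeholders through untouched, which is an assumption about the engine rather than something the approach guarantees.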
With so many aspects to keep in mind, training the engine to translate eBay item titles is certainly a challenge. Our teams of scientists and linguists are actively and successfully working on ways to improve the quality of the training data and the MT output.