The Challenges of Translating Chinese Using Natural Language Processing

Natural language processing is coming along in leaps and bounds, helped by rapid progress in neural networks, which let computers learn patterns from data rather than follow hand-written rules. But AI researchers keep bumping up against problems in getting AI to really understand language. One of these is that human language is just really, really hard.

Machine translation is complex because it’s not as simple as translating from a single standard expression in one language into its equivalent in another. People use many different ways to express the same thing, they innovate with their expressions and they use odd metaphors to describe things.

It’s hard for machines both to understand this and to choose which version to serve back to the humans.

From the start, the biggest problem has always been getting machines to construct sentences. Machines that generate their own sentences often end up with a garbled mess. If you’ve ever used a machine translation service, you’ll understand exactly how bad it can be.

There are some language rules that can be taught. Most English speakers aren’t aware that there are rules governing the way words, particularly adjectives, are used in sentences.

For adjectives, the order goes: opinion-size-age-shape-color-origin-material-purpose. Most English speakers couldn’t recite this rule, but they’d instantly recoil if you used the phrase ‘cotton green nice coat’ rather than the accepted ‘nice green cotton coat’.
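As a rough illustration of how a rule like this could be written down for a machine, here is a minimal Python sketch. The category list comes from the rule above; the tiny `CATEGORY` lexicon and the function names are made up for this example, and a real system would need a far larger vocabulary and a way to handle words it has never seen.

```python
# Adjective categories in the order given by the rule in the text.
ORDER = ["opinion", "size", "age", "shape", "color", "origin", "material", "purpose"]

# Hypothetical toy lexicon (an assumption for illustration only).
CATEGORY = {
    "nice": "opinion",
    "big": "size",
    "old": "age",
    "round": "shape",
    "green": "color",
    "french": "origin",
    "cotton": "material",
    "sleeping": "purpose",
}


def order_adjectives(adjectives):
    """Sort adjectives by the rank of their category in ORDER."""
    return sorted(adjectives, key=lambda word: ORDER.index(CATEGORY[word.lower()]))


def fix_phrase(phrase):
    """Treat the last word as the noun and reorder the adjectives before it."""
    *adjectives, noun = phrase.split()
    return " ".join(order_adjectives(adjectives) + [noun])


print(fix_phrase("cotton green nice coat"))  # prints: nice green cotton coat
```

The whole rule fits in a lookup table and a sort, which is why this kind of rule is the easy part.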

It’s easy for AI to be taught rules such as this one. But word order in standard sentences – ones that aren’t just strings of adjectives – varies a great deal in both Chinese and English. Translation programs really struggle with how to render sentences that they’ve translated, even if they understand all the words in them.

A common test of a computer’s performance is to translate from Chinese into English, then back again into Chinese, and see how much of the original survives. A related training method, called ‘dual learning’, uses this round trip as a feedback signal, and is one of the ways AI researchers test and learn with present systems in order to improve how they handle language.
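The Chinese-to-English-and-back check can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two dictionaries stand in for real translation systems, all names here are hypothetical, and an exact string match is a much cruder score than anything a research system would use.

```python
def round_trip(text, forward, backward):
    """Translate text forward, then back, and report whether it survived intact."""
    translated = forward(text)
    restored = backward(translated)
    return translated, restored, restored == text


# Hypothetical toy "translators" (assumptions for illustration only).
# Both the neutral and the polite greeting collapse into one English word.
ZH_TO_EN = {"你好": "hello", "您好": "hello", "谢谢": "thank you"}
EN_TO_ZH = {"hello": "你好", "thank you": "谢谢"}


def zh_to_en(text):
    return ZH_TO_EN[text]


def en_to_zh(text):
    return EN_TO_ZH[text]


for phrase in ["谢谢", "您好"]:
    print(phrase, round_trip(phrase, zh_to_en, en_to_zh))
```

Here 谢谢 survives the round trip, but the polite greeting 您好 comes back as the plain 你好, because the English ‘hello’ has no way of carrying the distinction. A mismatch like that is the kind of signal a system can learn from.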

Performance is also refined by giving human feedback on translated passages until the system improves through repetition and correction. Part of the problem is that human translators can disagree about what the best translation is, as there can be multiple ways to render the same idea. Fundamentally the human translator’s personality has to be factored in.

The case for Chinese

In some ways Chinese has the advantage of simplicity. Put into writing, it’s relatively straightforward – there’s no need to put spacing between characters. Unlike Romance languages, it isn’t gendered, and unlike many European languages, it doesn’t use cases.

It also doesn’t make you tread the same honorific minefield as Japanese, with its impenetrable social gradations. These features could help any speaker, whether human or artificial in nature.

The Chinese language has a colossal number of characters – so many, in fact, that it’s nigh on impossible for any human to master them all in a lifetime. For computers though, this kind of information storing is more feasible.

In fact, computer memory is probably better able to cope with Chinese character learning than humans are. But Chinese sentence structure is just as hard to understand, just as human and quirky, as that of any other language humans have created.

The problem lies not in getting machines to memorize the vast array of characters – they’re actually far better at this than humans are – but in understanding and conveying how these symbols interact with one another. Essentially it’s a case of teaching computers to move beyond word-for-word translation into the mysteries and subtleties of sentence structure.
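To see why word-for-word translation falls short, consider this small Python sketch. The sentence, the `GLOSS` dictionary and the function name are illustrative assumptions, and the sentence is supplied already split into words, which is itself a step that software has to perform for Chinese text.

```python
# Hypothetical toy dictionary (an assumption for illustration only).
GLOSS = {
    "我": "I",
    "昨天": "yesterday",
    "在": "in",
    "北京": "Beijing",
    "买": "buy",
    "了": "[completed]",
    "一": "one",
    "本": "[measure word]",
    "书": "book",
}


def word_for_word(tokens):
    """Replace each token with its dictionary gloss, keeping the original order."""
    return " ".join(GLOSS.get(token, f"<{token}>") for token in tokens)


# 我昨天在北京买了一本书, pre-split into words by hand.
tokens = ["我", "昨天", "在", "北京", "买", "了", "一", "本", "书"]
print(word_for_word(tokens))
# prints: I yesterday in Beijing buy [completed] one [measure word] book
```

Every word has been looked up correctly, yet the result is not English. Getting to ‘I bought a book in Beijing yesterday’ means moving the time and place, folding the particle into the verb’s tense and dropping the measure word, none of which a dictionary can do.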


In order to provide users with the best results when translating other languages, natural language processing needs to go beyond mere word-for-word translation.

There are many rules of language that we’ve identified, and some that we’re not even always aware of – such as the aforementioned adjective word order.

Then there are some rules that only work some of the time (like the ‘i before e except after c’ rule, which has many, many exceptions).

Much of the initial work in AI was conducted in the English language, but that is now changing. China is putting a tremendous amount of effort into this area of research, and that effort will naturally take Chinese language structure into account.

This concentration of resources is likely to lead to significant leaps forward, not just for AI’s understanding of the Chinese language but for AI as a whole. The only thing holding the research back at present seems to be a shortage of skilled people in this new and fast-growing field.

Educational institutions may be rapidly adding to the number of courses they offer in AI and natural language processing, but new graduates aren’t really any substitute for experienced AI workers, who are at a real premium.

Finding workers in this area who also understand language is another challenge. China is actively recruiting for talent in Silicon Valley, as well as relaxing visa rules for foreign workers in this area. AI is a major area of international cooperation – as well as competition.