How Computers Are Teaching Themselves to Talk

Recently we covered the rather alarming news that Google’s translation system had made a big leap forward in artificial intelligence terms – seemingly all by itself. Using machine learning, Google Translate ‘taught’ itself a better way to translate between unfamiliar language pairs, effectively by inventing its own intermediate language to act as a go-between.

Anyone who watched the original Terminator film knows how quickly things went pear-shaped for humanity once the Skynet artificial intelligence system became self-aware, progressing rapidly from self-awareness to triggering a nuclear holocaust.

So you might be forgiven for feeling slightly alarmed to find out that machines are quietly starting to take creative decisions by themselves.

The era of machines thinking and learning for themselves is now upon us. At its Silicon Valley research lab, Chinese internet giant Baidu has made a breakthrough in speech synthesis that enables an AI to learn to produce spoken language very quickly. Effectively, the machines can now teach themselves to talk in a matter of hours.

Text-to-speech systems were traditionally created by recording an individual (usually an actor) reading a large body of words and common phrases aloud.

These recordings were then served up in various combinations for uses such as the speaking clock, satnav guidance systems or automated call-answering systems.

This approach posed a few challenges, such as what happens if the system needs extending and the original actor is no longer available. Navigation systems often circumvent this problem by having the original actor record some common word parts, which are then cobbled together to form unusual street names.

The actor will record common place names such as ‘High Street’, but for highly unusual place names such as Torquay’s Hellevoetsluis Way (named for the town’s twin city in the Netherlands), the system will stitch together the actor’s recordings of the individual syllables to produce a passable attempt at pronouncing the word.
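
To picture how this unit-stitching works, here is a minimal sketch in Python. The clip filenames, the phrase and syllable tables and the units_for helper are all illustrative assumptions rather than any real navigation vendor’s code: the idea is simply to use a whole recorded phrase where one exists and fall back to concatenating syllable recordings where it doesn’t.

```python
# Minimal sketch of concatenative speech synthesis. All names and paths
# are illustrative assumptions, not a real vendor's data or API.

# A library of pre-recorded units: whole phrases where available,
# individual syllables as a fallback for unusual names.
RECORDED_PHRASES = {
    "high street": "clips/high_street.wav",
}
RECORDED_SYLLABLES = {
    "hel": "clips/hel.wav",
    "le": "clips/le.wav",
    "voet": "clips/voet.wav",
    "sluis": "clips/sluis.wav",
    "way": "clips/way.wav",
}


def units_for(name: str) -> list[str]:
    """Return the list of audio clips to play, in order, for a place name."""
    key = name.lower()
    if key in RECORDED_PHRASES:
        return [RECORDED_PHRASES[key]]          # one natural-sounding recording
    # Otherwise cobble together syllable recordings (assumes the name has
    # already been split into syllables, e.g. "hel le voet sluis way").
    return [RECORDED_SYLLABLES.get(s, f"clips/{s}.wav") for s in key.split()]


print(units_for("High Street"))
print(units_for("hel le voet sluis way"))  # a passable, stitched-together effort
```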

Recent research by Google proposed a way to overcome this problem, using a system that instead learns from the sound waves of recorded speech and uses them to vocalise any text it is given.

The neural network uses deep learning but still needs human training, and it also has some computational challenges to overcome before it can be used in real-world situations.

Part of the problem is that speech unfolds so quickly in real life that the computation can’t keep up: Google’s solution hasn’t yet been able to generate audio fast enough to hold a conversation with a person.
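
To see why speed is such an obstacle, here is a toy sketch – not Google’s actual model – of an autoregressive waveform generator. The predict_next_sample placeholder and the 16 kHz sample rate are assumptions for illustration; the point is that each audio sample depends on the ones before it, so one second of speech demands thousands of strictly sequential model evaluations.

```python
# Toy illustration of why sample-by-sample waveform generation is slow.
# The model itself is a placeholder; only the sequential structure matters.

SAMPLE_RATE = 16_000  # samples per second, a common rate for speech audio


def predict_next_sample(previous_samples: list[float]) -> float:
    """Stand-in for a neural network predicting the next waveform value."""
    # A real model would run a deep network here; the placeholder just
    # shows that every new sample depends on all the samples before it.
    return 0.0


def synthesise(duration_seconds: float) -> list[float]:
    waveform: list[float] = []
    for _ in range(int(duration_seconds * SAMPLE_RATE)):
        waveform.append(predict_next_sample(waveform))  # strictly sequential
    return waveform


# One second of speech = 16,000 dependent steps, which is why such a model
# can lag behind a live conversation unless each step is extremely cheap.
print(len(synthesise(1.0)))  # 16000
```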

The latest development

That’s where Baidu has just stepped in, after developing a speech synthesis project of its own in Silicon Valley based on self-training, deep-learning algorithms.

This new development breaks speech down into its smallest possible component parts – phonemes – and Baidu’s AI can adjust the tones of these to add emotion to the speech it produces.

Baidu’s system doesn’t require human training and can pick up new data fast. This suggests that it may be able to adapt to new languages. It may also be able to learn different voice types within one language: so AI could, for example, read a talking book and do each character’s voice differently. This introduces new possibilities for more realistic, emotionally-adept conversations between humans and machines.
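
The sketch below illustrates the general idea of phoneme-level control; the Phoneme and Utterance structures, their field names and the with_emotion helper are hypothetical, not Baidu’s implementation. Attaching a duration, a pitch and a speaker label to every phoneme is what lets a synthesiser raise the pitch for excitement, slow down for gravity, or switch voices between characters.

```python
from dataclasses import dataclass

# Hypothetical data model for phoneme-level control (not Baidu's code).

@dataclass
class Phoneme:
    symbol: str        # e.g. "HH", "AH", "L" (ARPAbet-style labels)
    duration_ms: int   # how long the sound is held
    pitch_hz: float    # fundamental frequency, raised or lowered for emotion


@dataclass
class Utterance:
    speaker: str             # lets one system voice different characters
    phonemes: list[Phoneme]


def with_emotion(utterance: Utterance, pitch_scale: float, tempo: float) -> Utterance:
    """Return a copy with pitch raised/lowered and timing sped up/slowed down."""
    adjusted = [
        Phoneme(p.symbol, int(p.duration_ms / tempo), p.pitch_hz * pitch_scale)
        for p in utterance.phonemes
    ]
    return Utterance(utterance.speaker, adjusted)


hello = Utterance("narrator", [
    Phoneme("HH", 80, 120.0),
    Phoneme("AH", 120, 125.0),
    Phoneme("L", 90, 118.0),
    Phoneme("OW", 150, 115.0),
])
excited = with_emotion(hello, pitch_scale=1.2, tempo=1.1)  # brighter, faster delivery
print(excited.speaker, [round(p.pitch_hz) for p in excited.phonemes])
```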

This has obvious implications in sensitive fields such as healthcare, where patients may be more accepting of AI if the interaction is more emotionally nuanced.

Most significantly, Baidu’s team claims to have overcome the computational problems that Google encountered. It’s estimated that the new system is around 400 times faster than Google’s last iteration. This means the system can work quickly enough to function in real-life situations, such as interacting with a human being over an unpredictable transaction.
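
As a rough illustration of what ‘quickly enough’ means: a synthesiser can hold a conversation once its real-time factor – computation time divided by the duration of audio produced – drops below 1. The figures in this sketch are invented for the example rather than measured results; only the 400x ratio comes from the claim above.

```python
# Illustrative arithmetic only: the 40-second figure is invented for the
# example; the 400x speed-up is the ratio quoted in the text.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """< 1 means audio is produced faster than it plays back."""
    return synthesis_seconds / audio_seconds


slow = real_time_factor(40.0, 1.0)        # e.g. 40 s of computing per 1 s of audio
fast = real_time_factor(40.0 / 400, 1.0)  # a 400x speed-up brings it to 0.1

print(slow, fast)  # 40.0 (far too slow to converse) vs 0.1 (comfortably real time)
```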

The future of talking robots

With the two internet giants turning their attention to speech synthesis, it’s almost inevitable that this field of study will move ahead quickly. There are many implications for commerce, technology and society once humanity cracks the artificial conversation problem.

It’ll make it easier to move ahead with technologies such as self-driving cars, and automated checkouts could become considerably less annoying and repetitive.

Self-teaching systems are really the key to tackling unpredictable conversational situations. Presently, AI interactions are limited to a fairly narrow sphere, such as automated voicemail systems that can take a phone number from you or handle simple yes/no responses.

Speech synthesis systems that can adapt to new situations open up new possibilities, such as negotiating a route with your self-driving car or describing your symptoms to an AI medic.

Improved speech synthesis could also better represent us. People who are losing their voices due to conditions such as motor neurone disease could have their identity better reflected in computer-based communication if they record samples of their speaking voice before they lose it.

Alternatively, the computer could combine a number of voices matching their age, gender and region to create a fair approximation.

Of course, there are also negatives associated with the advent of new technologies like this. There’s the potential for massive job losses if AI interactions can replace human ones.

There would be far less need for costly and fallible human workers to handle transactions, particularly in the service industry.

Things we take for granted, such as having human waiting staff in restaurants, may become rarer. Social changes and economic disruptions will inevitably accompany the arrival of machines that can freely converse with us.
