The Neural Network Was Taught To Copy The Human Voice Almost Perfectly

Last year, the artificial intelligence company DeepMind shared details of its new project WaveNet, a deep learning neural network used to synthesize realistic human speech. An improved version of this technology has recently been released and will serve as the voice of Google Assistant, the company's digital mobile assistant.

A voice synthesis system (also known as a text-to-speech, or TTS, system) is usually built around one of two basic methods. The concatenative (or compilation) method constructs phrases by stitching together fragments of words and syllables previously recorded by a voice actor. The main disadvantage of this method is that the sound library has to be re-recorded whenever anything is updated or changed.
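
As a rough illustration of the idea, here is a toy sketch of concatenative synthesis (everything in it, including the `UNIT_LIBRARY` dictionary and its placeholder clips, is invented for the example; a real system would use studio recordings and much smarter unit selection):

```python
import numpy as np

SAMPLE_RATE = 16_000

# Hypothetical library of prerecorded units (word -> waveform).
# Random noise stands in for actual voice-actor recordings.
UNIT_LIBRARY = {
    "hello": np.random.randn(SAMPLE_RATE // 2),  # 0.5 s placeholder clip
    "world": np.random.randn(SAMPLE_RATE // 2),
}

def concatenative_tts(text: str) -> np.ndarray:
    """Splice prerecorded units into one waveform."""
    pieces = []
    for word in text.lower().split():
        if word not in UNIT_LIBRARY:
            # The method's weak spot: any new word means going back
            # to the studio and rebuilding the library.
            raise KeyError(f"no recording for {word!r}")
        pieces.append(UNIT_LIBRARY[word])
    return np.concatenate(pieces)

print(concatenative_tts("hello world").shape)  # (16000,) -> one second of audio
```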

The other method is called parametric TTS, and its distinguishing feature is that the computer generates the desired phrase from a set of numeric parameters rather than from recordings. The disadvantage of this method is that the result usually sounds unnatural, with the characteristic "robotic" quality.
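
For contrast, the sketch below renders audio purely from parameters, using a bare sine-wave oscillator in place of a real vocoder (actual parametric systems use far richer source-filter models; the parameter triples here are made up):

```python
import numpy as np

SAMPLE_RATE = 16_000

def parametric_tts(segments):
    """Render audio from (pitch_hz, duration_s, amplitude) triples."""
    pieces = []
    for pitch_hz, duration_s, amplitude in segments:
        t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
        pieces.append(amplitude * np.sin(2 * np.pi * pitch_hz * t))
    return np.concatenate(pieces)

# No recordings at all -- just numbers. That compactness is the method's
# strength; the flat, robotic sound is its weakness.
waveform = parametric_tts([(220, 0.10, 0.8), (180, 0.15, 0.6), (200, 0.10, 0.7)])
print(waveform.shape)  # (5600,)
```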

WaveNet, on the other hand, produces the sound wave from scratch, one sample at a time, using a convolutional neural network built up in layers. To train the platform to synthesize "live" speech, it is first fed a huge number of speech samples, with feedback on which generated signals sound realistic and which do not. This gives the voice synthesizer the ability to reproduce naturalistic intonation and even details such as lip smacks. And because the system picks up the characteristics of whatever speech samples are run through it, it develops a distinctive "accent", which in the long term can be used to create many different voices.
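
The architectural idea DeepMind describes is a stack of dilated causal convolutions: each output sample may depend only on past samples, and doubling the dilation at every layer makes the audible context grow exponentially. The PyTorch sketch below is a minimal toy version under those assumptions (the channel counts, the plain tanh residual blocks, and the `ToyWaveNet` name are illustrative, not DeepMind's actual configuration, which uses gated activations and skip connections):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only sees past samples (pads on the left)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation  # (kernel_size - 1) * dilation, kernel_size = 2
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        x = nn.functional.pad(x, (self.pad, 0))  # left padding -> causality
        return self.conv(x)

class ToyWaveNet(nn.Module):
    """Stack of dilated causal convolutions with residual connections."""
    def __init__(self, channels=32, num_layers=8, num_classes=256):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, kernel_size=1)
        # Dilations 1, 2, 4, ... double at each layer, so the context
        # a sample can "hear" grows exponentially with depth.
        self.layers = nn.ModuleList(
            CausalConv1d(channels, dilation=2 ** i) for i in range(num_layers)
        )
        self.output_proj = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        h = self.input_proj(x)
        for layer in self.layers:
            h = h + torch.tanh(layer(h))  # residual connection
        return self.output_proj(h)

model = ToyWaveNet()
audio = torch.randn(1, 1, 16_000)  # one second of audio at 16 kHz
logits = model(audio)              # (1, 256, 16000): next-sample distributions
print(logits.shape)
```

The 256-way output mirrors the original paper's 8-bit mu-law quantization: predicting the next audio sample becomes a 256-class classification problem.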

Quick off the tongue

Perhaps the biggest limitation of the WaveNet system was the huge amount of computing power it required to run, and even when that condition was met, it was slow: generating just 0.02 seconds of sound took about 1 second of computation.

After a year of work, DeepMind's engineers found a way to improve and optimize the system so that it can now produce one second of raw audio in only 50 milliseconds, 1,000 times faster than its original capabilities. The specialists also managed to increase the resolution of the audio from 8-bit to 16-bit samples, which had a positive effect in tests with listeners. These successes paved the way for integrating WaveNet into consumer products such as Google Assistant.
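
The quoted figures are easy to sanity-check. The snippet below simply restates the article's numbers as real-time factors and confirms the 1,000x claim (and, as an aside, shows how many amplitude levels 8-bit versus 16-bit samples can represent):

```python
# Real-time factor (RTF): seconds of audio produced per second of compute.
old_rtf = 0.02 / 1.0   # 0.02 s of audio took about 1 s to generate
new_rtf = 1.0 / 0.05   # 1 s of audio now takes about 50 ms

print(f"speed-up: {new_rtf / old_rtf:.0f}x")  # -> 1000x, matching the claim

# Going from 8-bit to 16-bit samples raises the number of representable
# amplitude levels from 256 to 65,536 -- finer-grained, cleaner audio.
print(2 ** 8, 2 ** 16)  # 256 65536
```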

WaveNet can currently generate English and Japanese voices through Google Assistant and all platforms that use this digital assistant. And since the system produces a distinct type of voice depending on the set of samples it was trained on, Google will most likely soon add WaveNet support for synthesizing realistic speech in other languages as well, including their local dialects.

Speech interfaces are becoming more and more common across a wide variety of platforms, but their distinctly unnatural sound puts off many potential users. DeepMind's efforts to improve this technology will certainly help such voice systems achieve wider adoption and improve the experience of using them.

Examples of English and Japanese speech synthesized by the WaveNet neural network can be found by following this link.

Nikolay Khizhnyak