Russian publishers are already experimenting with machine-recorded audiobooks; in the future, artificial intelligence may be entrusted with translating TV series and dubbing them in the voices of viewers' favorite actors. Here is how such technologies work and how long they will take to build.
Spoken speech becomes written text
On YouTube, automatic subtitles are generated by software that recognizes speech and converts it to text. It is based on self-learning neural networks. The feature is more than ten years old, but the result is still far from ideal: more often than not, you can catch only the general meaning of what was said. Where does the difficulty lie?
Suppose we are building a speech recognition algorithm, explains Andrey Filchenkov, head of the Machine Learning Laboratory at ITMO University. That requires training a neural network on a large dataset.
It takes hundreds or thousands of hours of speech recordings, correctly aligned with their transcripts, including marks for where phrases begin and end, where the speaker changes, and so on. Such a dataset is called a corpus. The larger it is, the better the neural network trains. Really large corpora have been created for English, so recognition there is much better. But for Russian or, say, Spanish, there is much less data, and for many other languages there is no data at all.
“And the results reflect that,” the scientist concludes.
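A speech corpus of the kind described above can be pictured as a list of time-stamped, speaker-labeled segments. The Python sketch below is only an illustration: the `Segment` fields and the helper functions are assumptions made for this example, not the format of any real corpus.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated phrase in a hypothetical speech corpus."""
    start_s: float  # where the phrase begins, in seconds
    end_s: float    # where the phrase ends, in seconds
    speaker: str    # label of the person speaking
    text: str       # the transcript aligned with this audio span


def total_hours(segments: list[Segment]) -> float:
    """Total amount of annotated speech, in hours."""
    return sum(s.end_s - s.start_s for s in segments) / 3600


def speaker_changes(segments: list[Segment]) -> int:
    """Number of places where one speaker hands over to another."""
    return sum(
        1 for prev, cur in zip(segments, segments[1:])
        if prev.speaker != cur.speaker
    )
```

Summing `total_hours` over a whole collection gives the "hours of recordings" figure by which such corpora are usually compared.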
“In addition, we judge the meaning of a word or phrase in a film not only by its sound; the actor's intonation and facial expressions matter too. How do you interpret that?” adds Sergey Aksenov, associate professor in the Information Technology Department at Tomsk Polytechnic University.
“How do you handle the features of fluent speech? Fuzzy articulation, fragmentary phrasing, interjections, pauses? After all, the meaning changes depending on them, as in the classic phrase ‘execute cannot pardon’, where everything hinges on the comma. How do you teach a machine to work out where the speaker puts the comma? And in poetry?” asks Marina Bolsunovskaya, head of the Industrial Streaming Data Processing Systems laboratory at the NTI SPbPU Center.
The most successful projects, according to the expert, are in narrow domains. For example, a system developed by the RTC group of companies recognizes the professional speech of doctors, medical terms included, and helps them keep patients' medical histories.
“Here you can clearly outline the subject area and pick out key words in speech. The doctor deliberately emphasizes certain sections with intonation: patient complaints, diagnosis,” Bolsunovskaya clarifies.
Mikhail Burtsev, head of the Laboratory of Neural Systems and Deep Learning at MIPT, points to another problem: so far, machines recognize speech more successfully when one person is talking than when several are, as in films.
Translation with context
Let's take an English-language video, for example a clip from the TV series "Game of Thrones", and turn on automatic Russian subtitles. What we see will probably make us laugh.
Still from Game of Thrones.
That said, machine translation itself has made impressive progress. Google Translate, for instance, handles texts in widely spoken languages quite tolerably; often only minimal editing is needed.
The fact is that a translation neural network is also trained on a large set of correctly labeled source data: a parallel corpus, which shows how each phrase in the original language should read in Russian.
“Building such corpora is very laborious, expensive and time-consuming; it takes months or years. To train a neural network, we need texts on the scale of the Library of Alexandria. The models are universal, but much depends on the language. Provide a lot of data, in Avar for example, and the translation will be of high quality, but for Avar there is simply not that much data,” says Andrey Filchenkov.
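As a rough illustration of what a parallel corpus looks like and why data volume per language matters, here is a minimal Python sketch. The `SentencePair` structure, the language codes and the counting helper are assumptions made for this example; real corpora use their own formats and are vastly larger.

```python
from collections import Counter
from typing import NamedTuple


class SentencePair(NamedTuple):
    """One aligned entry in a hypothetical parallel corpus."""
    source_lang: str
    target_lang: str
    source: str  # phrase in the original language
    target: str  # how that phrase should read in the target language


def pairs_per_language_pair(corpus: list[SentencePair]) -> Counter:
    """Count the aligned pairs available for each (source, target) pair."""
    return Counter((p.source_lang, p.target_lang) for p in corpus)


# Toy data for illustration only.
corpus = [
    SentencePair("en", "ru", "Winter is coming.", "Зима близко."),
    SentencePair("en", "ru", "Good morning.", "Доброе утро."),
]

counts = pairs_per_language_pair(corpus)
print(counts[("en", "ru")])  # pairs available for English to Russian
print(counts[("av", "ru")])  # a language pair with no entries counts as zero
```

A count like this is the first thing to check before training: a language pair with few or no aligned phrases gives the model nothing to learn from.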
“Translation is a separate product that is related to the original but not equal to it,” says Ilya Mirin, director of the School of Digital Economy at the Far Eastern Federal University. “A typical example is Dmitry Puchkov's (Goblin's) translations of foreign films in the 90s. Only after his work did it become clear what was actually happening in them. We could not learn anything adequate from the VHS versions. Or try translating something from The Master and Margarita into a language you know well. For example, ‘in a black cloak with a blood-red lining’. The machine cannot do that.”
Neural networks learn well from many typical examples, but films are full of complex meanings, connotations and jokes that are beyond the machine: it cannot recognize them.
“In every episode of the animated series Futurama there is a reference to classic American cinema: Casablanca, Roman Holiday and so on. At such moments, to catch the meaning and repackage it for those who have not seen these films, the translator has to come up with a close analogue from the Russian context. An incorrect machine translation can be very disconcerting for the viewer,” continues Mirin.
In his opinion, the quality of machine translation is close to 80 percent; the rest is specifics that have to be added by hand, with experts involved. "And if 20-30 percent of phrases require manual correction, then what is the use of machine translation?" asks the researcher.
“Translation is the most problematic stage,” agrees Sergey Aksenov. “Everything depends on semantics and context. The available tools can be used to translate and machine-dub, for example, children's cartoons with simple vocabulary. But idioms, proper names and words that refer viewers to particular cultural realities cause difficulties.”
In films and videos, the context is always visual and is often accompanied by music and noise. We infer from the picture what the character is talking about. Speech converted to text loses this information, which makes translation difficult. Translators who work from text subtitles without seeing the film are in the same position, and they often get things wrong. Machine translation is the same story.
AI voices speech
To dub a series translated into Russian, you need an algorithm that generates natural speech from text: a synthesizer. Many IT companies build them, including Microsoft, Amazon and Yandex, and they do it quite well.
According to Andrey Filchenkov, a couple of years ago it took a speech synthesizer several hours to produce one minute of voice-over; processing is now much faster. For areas that call for neutral dialogue, the task of speech synthesis is solved quite well.
Many people already take for granted a phone conversation with a robot, following instructions from a car navigator, or a dialogue with Alice in a Yandex.Drive car. But for dubbing TV series, these technologies are not yet good enough.
“The problem is emotion and acting. We have learned to make a machine voice sound human, but making it fit the context and inspire trust is still a long way off. Poor voice acting can easily ruin a film,” Filchenkov said.
According to Mikhail Burtsev, speech synthesis of that kind is quite feasible. However, it is computationally intensive and cannot yet be done in real time at a reasonable price.
“There are algorithms that synthesize speech resembling that of a particular actor. That covers timbre, manner of speaking and much more. So any foreign actor will effectively speak Russian,” predicts Burtsev. He expects noticeable progress in the coming years.
Sergey Aksenov estimates that it will take five to ten years to develop tools for translating and dubbing complex works from the most widely spoken languages, such as English. The scientist cites the example of Skype, which several years ago demonstrated that online lessons could be organized for schoolchildren who speak different languages. But even then the system will not be ideal; it will have to keep learning: building vocabulary and taking cultural context into account.