Specialists from the Samsung AI Center in Moscow, in collaboration with engineers from the Skolkovo Institute of Science and Technology, have developed a system capable of creating realistic animated images of human faces from just a few static frames. Tasks like this usually require large databases of images, but in the example presented by the developers, the system learned to animate a human face from only eight static frames, and in some cases a single frame was enough. The work is described in an article published on the arXiv.org online repository.
As a rule, reproducing a photorealistic personalized model of a human face is difficult because of the high photometric, geometric and kinematic complexity of the human head. This stems not only from the difficulty of modeling the face as a whole (for which many approaches exist), but also from the difficulty of modeling particular features such as the mouth cavity and hair. A second complicating factor is our tendency to notice even minor flaws in a rendered human head. This low tolerance for modeling errors explains the current prevalence of non-photorealistic avatars in teleconferencing.
According to the authors, the system, built around few-shot learning, is capable of creating highly realistic models of talking human heads and even of portrait paintings. The algorithms synthesize an image of the same person's head using facial landmarks taken from another fragment of video, or even the facial landmarks of a different person. As training material, the developers used an extensive database of celebrity videos. To produce the most accurate talking head possible, the system needs more than 32 images.
To create more realistic animated facial images, the developers drew on previous work in generative adversarial networks (GANs, in which a neural network invents the details of an image, in effect becoming an artist), as well as on a meta-learning approach, in which each element of the system is trained to solve a specific task.
Meta-learning schema.
Three neural networks are used to turn static images of people's heads into animated ones: the Embedder (embedding network), the Generator (generation network) and the Discriminator (discriminator network). The first maps head images (together with approximate facial landmarks) to embedding vectors that contain pose-independent information. The second network takes the facial landmarks and the embedding produced by the embedding network and generates new images from them through a set of convolutional layers, which provide robustness to changes in scale, shifts, rotations, changes of angle and other distortions of the original face image. The discriminator network assesses the quality and authenticity of the output of the other two networks. As a result, the system transforms a person's facial landmarks into realistic-looking personalized photos.
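The data flow between the three networks can be sketched roughly as follows. This is a minimal toy illustration, not the authors' actual architecture: the real networks are deep convolutional models, while here random linear maps stand in for them, and all names and dimensions are illustrative assumptions.

```python
# Toy sketch of the Embedder -> Generator -> Discriminator data flow.
# Random linear layers stand in for the real convolutional networks;
# shapes and names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 64 * 64   # flattened toy "image"
LMK_DIM = 68 * 2    # 68 facial landmarks, (x, y) each
EMB_DIM = 128       # embedding vector length

# Stand-in weights for each network.
W_embed = rng.normal(size=(IMG_DIM + LMK_DIM, EMB_DIM)) * 0.01
W_gen = rng.normal(size=(LMK_DIM + EMB_DIM, IMG_DIM)) * 0.01
W_disc = rng.normal(size=(IMG_DIM + LMK_DIM + EMB_DIM, 1)) * 0.01

def embedder(frames, landmarks):
    """Map K (frame, landmarks) pairs to one pose-independent embedding
    by averaging the per-frame embedding vectors."""
    x = np.concatenate([frames, landmarks], axis=1)  # (K, IMG+LMK)
    return (x @ W_embed).mean(axis=0)                # (EMB_DIM,)

def generator(landmarks, embedding):
    """Synthesize an image from target landmarks, conditioned on the
    identity embedding."""
    x = np.concatenate([landmarks, embedding])
    return np.tanh(x @ W_gen)                        # (IMG_DIM,)

def discriminator(image, landmarks, embedding):
    """Score how realistic the image looks for this pose and identity."""
    x = np.concatenate([image, landmarks, embedding])
    return float(x @ W_disc)

# Few-shot setting: eight source frames of one person, one new target pose.
frames = rng.normal(size=(8, IMG_DIM))
landmarks = rng.normal(size=(8, LMK_DIM))
target_lmk = rng.normal(size=(LMK_DIM,))

e = embedder(frames, landmarks)
fake = generator(target_lmk, e)
score = discriminator(fake, target_lmk, e)
print(fake.shape, e.shape)
```

In training, the discriminator's score would drive the adversarial objective: the generator is pushed to raise the score of its synthesized frames, while the discriminator learns to separate them from real video frames with the same landmarks.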
The developers emphasize that their system can initialize the parameters of both the generator network and the discriminator network individually for each person in the picture, so training can proceed from just a few images, which speeds it up despite the tens of millions of parameters that have to be tuned.
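The idea of person-specific initialization can be illustrated with a toy example: meta-learned base parameters shared across many people are adapted to one person by projecting that person's embedding into a parameter offset, after which a few gradient steps on the available frames fine-tune the result. The projection, the linear toy model, and the squared-error objective below are all illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of person-specific initialization plus few-shot fine-tuning.
# Dimensions, the linear model and the loss are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
EMB_DIM, PARAM_DIM = 16, 32

base_params = rng.normal(size=PARAM_DIM) * 0.1           # meta-learned across many people
projection = rng.normal(size=(EMB_DIM, PARAM_DIM)) * 0.1  # maps embedding -> parameter offset

def init_for_person(embedding):
    """Initialize network parameters for one person from their embedding."""
    return base_params + embedding @ projection

embedding = rng.normal(size=EMB_DIM)
params = init_for_person(embedding)

# Fine-tune on just eight (input, target) pairs with plain gradient descent
# on a toy element-wise linear model: pred = x * params.
xs = rng.normal(size=(8, PARAM_DIM))
ys = rng.normal(size=(8, PARAM_DIM))

init_loss = ((xs * params - ys) ** 2).mean()
for _ in range(100):
    pred = xs * params
    grad = (2 * xs * (pred - ys)).mean(axis=0)  # gradient of the mean squared error
    params -= 0.05 * grad
final_loss = ((xs * params - ys) ** 2).mean()
print(f"loss: {init_loss:.3f} -> {final_loss:.3f}")
```

The point of the meta-learned initialization is that `params` starts close to a good solution for the new person, so only a handful of fine-tuning steps on a handful of frames are needed, rather than training all parameters from scratch.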
Nikolay Khizhnyak