Speech-driven Facial Animation using Cascaded GANs for Learning of Motion and Texture
Speech-driven facial animation methods should produce accurate and realistic lip motions with natural expressions and realistic texture portraying target-specific facial characteristics. Moreover, the methods should also be adaptable to any unknown faces and speech quickly during inference. Current state-of-the-art methods fail to generate realistic animation from any speech on unknown faces due to their poor gen-eralization over different facial characteristics, languages, and accents. Some of these failures can be attributed to the end-to-end learning of the complex relationship between the multiple modalities of speech and the video. In this paper, we propose a novel strategy where we partition the problem and learn the motion and texture separately. Firstly, we train a GAN network to learn the lip motion in a canonical landmark using DeepSpeech features and induce eye-blinks before transferring the motion to the person-specific face. Next, we use another GAN based texture generator network to generate high fidelity face corresponding to the motion on person-specific landmark. We use meta-learning to make the texture generator GAN more flexible to adapt to the unknown subject’s traits of the face during inference. Our method gives significantly improved facial animation than the state-of-the-art methods and generalizes well across the different datasets, different languages, and accents,and also works reliably well in presence of noises in the speech."