EmoTalker: Audio Driven Emotion Aware Talking Head Generation

Anonymous Submission

Abstract

Talking head synthesis aims to create videos of a person speaking with accurately synchronized lip movements and natural facial expressions that correspond to the driving audio. However, previous approaches use reference frames or extra labels to control emotions and facial expressions, which disentangles utterance from expression and ignores the impact of audio fluctuations on face motions such as head pose, facial expressions, and emotions. In this work, we present EmoTalker, which generates talking heads of arbitrary identities with diverse and natural facial expressions from audio, without relying on driving frames or emotion labels as input. To achieve this, we represent frames as a sequence of 3D motion coefficients of a 3DMM representation and separate them into lip-related coefficients and the remaining coefficients (head pose, expressions), which we treat as facial motions. To model lip movement, we start from a pre-trained audio encoder and map its output to the corresponding lip representation. For facial motions, we employ a two-stage training strategy: 1) we first project facial motions into the finite space of a codebook embedded with emotion-aware facial expression priors; 2) we then devise a cross-modal Transformer to explicitly model the correlations between audio and different types of facial motions. Experimental results and user studies show that our model achieves state-of-the-art performance on the emotional audio-visual dataset and produces more realistic talking head videos with synchronized lip movement and vivid facial expressions.
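
As an illustration of the first training stage, projecting facial motions into a finite codebook space can be sketched as a nearest-neighbour lookup. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name quantize_motion, the array shapes, the Euclidean distance metric, and the toy values are all hypothetical, and in the actual model the codebook would be learned rather than hand-written.

```python
import numpy as np


def quantize_motion(motion, codebook):
    """Map each frame's motion vector to its nearest codebook entry.

    motion:   (T, D) array of per-frame facial-motion features (assumed shape)
    codebook: (K, D) array of code vectors (assumed to be learned elsewhere)
    Returns (indices, quantized) with shapes (T,) and (T, D).
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    diffs = motion[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each frame
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Toy example: 3 codes and 3 frames in a 2-D feature space (illustrative values)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
motion = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, quantized = quantize_motion(motion, codebook)
print(indices)
```

Restricting facial motions to a finite set of codes in this way means that the second stage only has to predict code indices from audio, rather than regress continuous coefficients directly.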

SoTA Comparison


Neutral identity reference + Same speech content ("The revolution now under way in materials handling makes this much easier") + Different audio cadence


Compared methods: Wav2Lip, MakeItTalk, Audio2Head, SadTalker, EmoTalker (ours)

Emotional Videos

Enhanced with GFPGAN from 256×256 to 512×512

Different emotions and intensities

Angry: level 1, level 2, level 3

Contempt: level 1, level 2, level 3

Fear: level 1, level 2, level 3

Sad: level 1, level 2, level 3

Happy: level 1, level 2, level 3

In the wild generation

Panels for each of the four examples: Identity reference, Original video, SadTalker, EAMM, Ours

3D Morphable Faces

Different identity texture


Free Control by Driving Frames

Lip Movement Control

Reference Lip



Facial Expression Control

Reference Expression



Head Pose Control

Reference Pose