EmoTalker

Abstract

Talking head synthesis aims to create videos of a person speaking with accurately synchronized lip movements and natural facial expressions that correspond to the driving audio. However, previous approaches rely on reference frames or extra labels to control emotions and facial expressions; this disentangles utterance from expression and ignores the impact of audio fluctuations on facial motions such as head pose, facial expressions, and emotions. In this work, we present EmoTalker, which generates talking heads of arbitrary identities with diverse and natural facial expressions from audio, without relying on driving frames or emotion labels as input. To achieve this, we represent frames as a sequence of 3D motion coefficients of a 3DMM representation and separate them into lip-related coefficients and the remaining coefficients (head pose, expressions), which we treat as facial motions. To model lip movement, we start from a pre-trained audio encoder and map its features to the corresponding lip representation. For facial motions, we employ a two-stage training strategy: 1) we first project facial motions into the finite space of a codebook embedded with emotion-aware facial expression priors; 2) we then devise a cross-modal Transformer to explicitly model the correlations between audio and different types of facial motions. Experimental results and user studies show that our model achieves state-of-the-art performance on the emotional audio-visual dataset and produces more realistic talking head videos with synchronized lip movement and vivid facial expressions.
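The first training stage, projecting facial motions into the finite space of a codebook, can be illustrated with a minimal sketch. It assumes a plain nearest-neighbour lookup under squared L2 distance, as in standard vector quantization; the abstract does not specify the lookup rule, and the function name `quantize`, the toy codebook, and all dimensions are hypothetical.

```python
import numpy as np

def quantize(motions, codebook):
    """Map each motion vector to its nearest codebook entry (squared L2).

    motions:  (T, D) array, one facial-motion vector per frame (hypothetical layout)
    codebook: (K, D) array of K learned entries (toy values here)
    """
    # Pairwise differences between every frame and every entry: (T, K, D)
    diffs = motions[:, None, :] - codebook[None, :, :]
    # Squared distances: (T, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest entry for each frame: (T,)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Toy codebook with 4 entries of dimension 4 (assumption, for illustration only)
codebook = np.eye(4)
# Three frames lying close to entries 2, 0, and 2
motions = codebook[[2, 0, 2]] + 0.1
indices, quantized = quantize(motions, codebook)
print(indices)
```

Each frame is replaced by its closest codebook entry, so the sequence of continuous motion vectors becomes a sequence of discrete indices into a finite set of entries.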

SoTA Comparison

Neutral identity reference + same speech content ("The revolution now under way in materials handling makes this much easier") + different audio cadence

Wav2Lip	MakeItTalk	Audio2Head	SadTalker	EmoTalker (ours)

Emotional Videos

Enhanced with GFPGAN from 256² to 512²

Different emotions and intensities

Angry level 1	Angry level 2	Angry level 3
Contempt level 1	Contempt level 2	Contempt level 3
Fear level 1	Fear level 2	Fear level 3
Sad level 1	Sad level 2	Sad level 3
Happy level 1	Happy level 2	Happy level 3

In-the-wild generation

Identity reference	Original video	SadTalker	EAMM	Ours