PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation

Junchuan Zhao*,  Qifan Liang*,  Ye Wang

School of Computing, National University of Singapore  ·  *Equal contribution

01 · Abstract
Overview

Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE-based co-speech gesture generation methods improve generation quality but neither encode semantic structure into the motion representation nor explicitly disentangle content from style, which limits both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework that addresses both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure: a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics, and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and in perceptual user studies, with strong style consistency with the reference prompt.


Motivation

Prior methods encode gesture style as a global attribute, failing to separate what gesture to make from how to make it. PersonaGest explicitly disentangles content and style within a hierarchical motion representation, enabling both semantic coherence and personalization fidelity.

02 · Generated Results
Free-form Co-speech Generation

Comparison with baselines on free-form co-speech gesture generation · zero-shot unseen speaker

Samples 1 to 3 · each compares GestureLSM, PyraMotion, SemTalk, and PersonaGest (Ours)
Style-conditioned Generation

Comparison with style-conditioned baselines · zero-shot unseen speaker

Samples 1 to 3 · each shows the Style Prompt, followed by ZeroEGGS, SynTalker, and PersonaGest (Ours)
03 · Showcase
Independent Part-wise Style Controllability

Each sample is conditioned on two complementary style references, with upper body, hands, and lower body styles independently sourced from different speakers.

Each sample shows Style Prompt A, Style Prompt B, and the PersonaGest (Ours) result that combines them.

Sample 1 · A: lower body · B: upper body + hands
Sample 2 · A: lower body · B: upper body + hands
Sample 3 · A: hands · B: upper body + lower body
Sample 4 · A: hands · B: upper body + lower body
Sample 5 · A: lower body · B: upper body + hands
Sample 6 · A: upper body · B: hands + lower body
Sample 7 · A: lower body · B: upper body + hands
Sample 8 · A: upper body · B: hands + lower body
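
The A/B pairings above can be read as a mapping from body part to style reference. The following is a minimal sketch of that bookkeeping only, not the model's interface: the function name `assign_style_sources` and the part labels are hypothetical, chosen to match the captions on this page.

```python
# Hypothetical part labels, taken from the captions in this showcase.
BODY_PARTS = ("upper", "hands", "lower")


def assign_style_sources(prompt_parts):
    """Map each body part to the style prompt that controls it.

    prompt_parts: dict such as {"A": ["lower"], "B": ["upper", "hands"]}.
    Raises ValueError if the prompts overlap or leave a part uncovered,
    i.e. if they are not complementary.
    """
    assignment = {}
    for prompt, parts in prompt_parts.items():
        for part in parts:
            if part not in BODY_PARTS:
                raise ValueError(f"unknown body part: {part}")
            if part in assignment:
                raise ValueError(
                    f"{part} is claimed by both {assignment[part]} and {prompt}"
                )
            assignment[part] = prompt
    missing = [p for p in BODY_PARTS if p not in assignment]
    if missing:
        raise ValueError(f"no style prompt for: {missing}")
    return assignment


# Sample 1 from the list above: A drives the lower body, B drives the rest.
sample_1 = assign_style_sources({"A": ["lower"], "B": ["upper", "hands"]})
```

In a real pipeline each body part would then take its style tokens from the reference motion of its assigned prompt; how PersonaGest does this internally is not described on this page.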
04 · Method
Method Overview

Overview of PersonaGest. Stage 1: A semantic-aware RVQ-VAE encodes motion into disentangled content and style latent codes. Stage 2: A Content Masked Transformer generates content tokens conditioned on speech and speaker identity, followed by a Style Residual Transformer that generates style tokens conditioned on a reference motion prompt.

Stage 1

Semantic-Aware RVQ-VAE

The first stage disentangles motion content and gestural style within the residual quantization structure. A Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics, and contrastive learning further enforces content-style separation.
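
To make the residual quantization structure concrete, here is a minimal NumPy sketch of plain residual vector quantization. It is an illustration under stated assumptions rather than the authors' implementation: the codebooks are random instead of learned, SMoC and the contrastive loss are omitted, and treating the base layer as content and the remaining layers as style is an assumption about how the split maps onto the layers.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook row."""
    # Squared Euclidean distance between every input and every code.
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


def residual_quantize(latents, codebooks):
    """Quantize latents with a stack of codebooks, one per residual layer."""
    residual = latents.copy()
    quantized = np.zeros_like(latents)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        chosen = codebook[idx]
        quantized += chosen   # running reconstruction
        residual -= chosen    # what the next layer still has to explain
        indices.append(idx)
    return np.stack(indices), quantized


rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 4))  # 8 frames of 4-dim encoder output (toy sizes)
codebooks = [rng.normal(size=(16, 4)) * s for s in (1.0, 0.5, 0.25)]
tokens, recon = residual_quantize(latents, codebooks)

content_tokens = tokens[0]   # assumption: base layer carries content
style_tokens = tokens[1:]    # assumption: residual layers carry style
```

Each layer quantizes only what the previous layers left unexplained, which is what allows different layers to take on different roles.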


Stage 2

Gesture Generation

The second stage generates content tokens via a Masked Generative Transformer with a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control.
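
The page does not spell out the semantic-aware re-masking criterion. As a stand-in, the sketch below uses the generic confidence-based re-masking of masked generative decoding with a cosine schedule: predict all masked tokens, keep the confident ones, and mask the least confident ones again. `MASK`, `dummy_predictor`, and the vocabulary size are hypothetical, and the random predictor replaces the speech-conditioned transformer.

```python
import math

import numpy as np

MASK = -1    # hypothetical id for a masked position
VOCAB = 32   # hypothetical content codebook size


def iterative_decode(predict_logits, seq_len, steps=4):
    """Fill a fully masked sequence over several predict/re-mask rounds."""
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = predict_logits(tokens)  # shape (seq_len, VOCAB)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)

        masked = tokens == MASK
        # Fill every masked slot with its current prediction.
        tokens = np.where(masked, pred, tokens)

        # Cosine schedule: how many positions to mask again after this step.
        n_remask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_remask == 0:
            break
        # Tokens kept in earlier steps are never re-masked.
        conf = np.where(masked, conf, np.inf)
        tokens[np.argsort(conf)[:n_remask]] = MASK
    return tokens


rng = np.random.default_rng(0)


def dummy_predictor(tokens):
    # Stand-in for the transformer: random logits for every position.
    return rng.normal(size=(len(tokens), VOCAB))


content_tokens = iterative_decode(dummy_predictor, seq_len=16)
```

A semantic-aware variant would presumably change how positions are chosen for re-masking; the loop structure would stay the same.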

05 · Evaluation
Quantitative Comparison
Co-speech gesture generation · zero-shot unseen speaker

Model              | FGD ↓ | FGDsk | BC ↑  | Diversity ↑
EMAGE              | 3.667 | 4.080 | 0.812 | 12.116
MambaTalk          | 3.706 | 3.752 | 0.807 | 10.743
EchoMask           | 3.172 | 5.268 | 0.800 | 13.881
SemTalk            | 3.578 | 4.296 | 0.807 | 11.788
PyraMotion         | 2.411 | 3.761 | 0.678 | 8.637
GestureLSM         | 2.949 | 3.928 | 0.725 | 7.600
PersonaGest (Ours) | 2.311 | 2.660 | 0.826 | 11.970

Style-conditioned co-speech generation · zero-shot unseen speaker

Model              | FGD ↓ | FGDsk | BC ↑  | Diversity ↑
SynTalker          | 3.268 | 3.242 | 0.687 | 10.233
ZeroEGGS           | 2.875 | 3.779 | 0.717 | 3.274
PersonaGest (Ours) | 2.617 | 2.726 | 0.815 | 8.985

Bold = best · underline = second best · scaled as in paper
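
For readers unfamiliar with FGD, it is conventionally a Fréchet distance between Gaussian fits of real and generated gesture features. The sketch below shows that distance in NumPy under two assumptions: the features come from some pretrained motion encoder (not specified on this page, so random vectors stand in), and no paper-specific scaling is applied.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))     # placeholder features
same = frechet_distance(real, real)  # identical sets give a distance near zero
shifted = frechet_distance(real, real + 3.0)  # pure mean shift of 3 per dim
```

A pure mean shift leaves the covariance terms cancelling, so only the squared mean difference remains, which is a convenient sanity check for any implementation.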

Perceptual User Study

21 participants rated on a 5-point Likert scale (higher is better)

Co-speech generation benchmark

Criterion                    | GestureLSM | SemTalk | PyraMotion | Ours
Human-likeness               | 3.45       | 3.40    | 3.37       | 3.76
Semantic consistency         | 3.60       | 3.68    | 3.49       | 3.75
Motion–speech synchronization | 3.89       | 3.63    | 3.35       | 3.94
Diversity                    | 3.50       | 3.97    | 3.29       | 3.59

Style-conditioned generation

Criterion                    | ZeroEGGS | SynTalker | Ours
Human-likeness               | 3.35     | 3.63      | 3.52
Semantic consistency         | 3.41     | 3.57      | 3.67
Motion–speech synchronization | 3.49     | 3.71      | 3.81
Diversity                    | 3.51     | 3.71      | 3.76
Style consistency            | 3.48     | 3.48      | 3.83
Motion Tokenizer Comparison
VQ-based motion representation · zero-shot unseen speaker

Model              | JRMSE ↓ | MSE ↓ | LVD ↓
VQ-VAE             | 1.637   | 3.900 | 3.850
APVQ-VAE           | 0.933   | 3.110 | 3.520
RVQ-VAE (S)        | 0.345   | 0.814 | 1.890
RVQ-VAE (B)        | 0.421   | 1.200 | 2.310
PersonaGest (Ours) | 0.326   | 0.809 | 1.880

Bold = best · underline = second best · scaled as in paper

Inference Speed

Runtime is reported as seconds of compute per second of generated motion (s/s) on a single NVIDIA A100 GPU. PersonaGest runs at ~26× real-time while reaching the best overall generation quality among the style-conditioned methods compared.

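Model              | Runtime (s/s) | ± std
SynTalker          | 0.770         | 0.003
ZeroEGGS           | 0.014         | 0.000
PersonaGest (Ours) | 0.038         | 0.000 (~26× real-time)

PersonaGest module breakdown · s/s

Module                  | Component | Runtime (s/s)
Audio Encoder           | Whisper   | 0.00179
RVQVAE Encoder          | Upper     | 0.00027
RVQVAE Encoder          | Hands     | 0.00023
RVQVAE Encoder          | Lower     | 0.00023
RVQVAE Encoder          | Face      | 0.00022
RVQVAE Decoder          | Upper     | 0.00029
RVQVAE Decoder          | Hands     | 0.00022
RVQVAE Decoder          | Lower     | 0.00023
RVQVAE Decoder          | Face      | 0.00022
Generative Transformers | CMT       | 0.02643
Generative Transformers | SRT       | 0.00932

Total Time · PersonaGest 0.038 ± 0.000 s/s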

Mean shown in cards · ± std as in paper · single NVIDIA A100 GPU
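
The "~26× real-time" figure follows directly from the s/s numbers: the real-time factor is the reciprocal of the runtime per second of motion. The short sketch below reproduces that arithmetic from the values reported above; the variable names are illustrative only.

```python
# Mean runtime per second of generated motion (s/s), as reported above.
seconds_per_second = {"SynTalker": 0.770, "ZeroEGGS": 0.014, "PersonaGest": 0.038}

# Real-time factor: seconds of motion produced per second of compute.
real_time_factor = {name: 1.0 / s for name, s in seconds_per_second.items()}
# PersonaGest: 1 / 0.038 ≈ 26.3, i.e. the "~26× real-time" quoted above.

# Per-module means from the PersonaGest breakdown (s/s).
modules = {
    "Whisper": 0.00179,
    "Enc/Upper": 0.00027, "Enc/Hands": 0.00023,
    "Enc/Lower": 0.00023, "Enc/Face": 0.00022,
    "Dec/Upper": 0.00029, "Dec/Hands": 0.00022,
    "Dec/Lower": 0.00023, "Dec/Face": 0.00022,
    "CMT": 0.02643, "SRT": 0.00932,
}
module_sum = sum(modules.values())
cmt_share = modules["CMT"] / module_sum  # fraction of time spent in the CMT
```

The per-module means add up to about 0.039 s/s, close to the reported 0.038 s/s total; the small gap may come from rounding or from the total being measured separately, which the page does not state. Either way, the CMT accounts for roughly two thirds of the runtime.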

Content–Style Disentanglement · t-SNE

Colors denote different speakers. Content representations interleave across speakers; style representations form well-separated speaker clusters.

Content embeddings
Style embeddings