Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE-based co-speech gesture generation methods improve generation quality but neither encode semantic structure into the motion representation nor explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework that addresses both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure: a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics, and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and perceptual user studies, along with strong style consistency with the reference prompt.
Overview of PersonaGest. Stage 1: a semantic-aware RVQ-VAE encodes motion into disentangled content and style latent codes. Stage 2: a Content Masked Transformer generates content tokens conditioned on speech and speaker identity, followed by a Style Residual Transformer that generates style tokens conditioned on a reference motion prompt.
The first stage disentangles motion content and gestural style within the residual quantization structure. A Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics, and contrastive learning further enforces content-style separation.
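For a concrete picture of this split, the sketch below shows one way content and style could be separated inside a residual VQ: the first quantization level draws from a content codebook (organized by gesture semantics via SMoC in the paper), and the remaining residual levels capture style. This is a minimal illustration under our own assumptions, not the released implementation; module names, codebook sizes, and the omission of the straight-through estimator and commitment losses are all simplifications.

```python
# Minimal sketch (not the authors' code) of a residual VQ whose first level is
# a content codebook and whose residual levels are style codebooks.
import torch
import torch.nn as nn


def quantize(z, codebook):
    """Nearest-neighbor lookup: returns quantized vectors and code indices."""
    flat = z.reshape(-1, z.size(-1))                              # (B*T, D)
    idx = torch.cdist(flat, codebook).argmin(dim=-1).view(z.shape[:-1])
    return codebook[idx], idx                                     # (B, T, D), (B, T)


class ResidualVQ(nn.Module):
    def __init__(self, dim=256, n_content=512, n_style=512, n_style_levels=3):
        super().__init__()
        # Level 0: content codebook (semantically organized via SMoC in the paper).
        self.content_codebook = nn.Parameter(torch.randn(n_content, dim))
        # Levels 1..R: residual "style" codebooks.
        self.style_codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(n_style, dim)) for _ in range(n_style_levels)]
        )

    def forward(self, z):
        content_q, content_idx = quantize(z, self.content_codebook)
        residual = z - content_q
        quantized, style_idx = content_q, []
        for cb in self.style_codebooks:
            q, idx = quantize(residual, cb)
            quantized = quantized + q
            residual = residual - q
            style_idx.append(idx)
        return quantized, content_idx, style_idx


if __name__ == "__main__":
    rvq = ResidualVQ()
    z = torch.randn(2, 64, 256)                   # (batch, frames, latent dim)
    q, c_idx, s_idx = rvq(z)
    print(q.shape, c_idx.shape, len(s_idx))       # torch.Size([2, 64, 256]) ...
```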
The second stage generates content tokens via a Masked Generative Transformer with a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control.
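As a rough illustration of the second stage's decoding loop, the sketch below implements plain confidence-based iterative re-masking (MaskGIT-style) over content tokens; the paper's semantic-aware re-masking would additionally weight the re-masking scores by gesture semantics. The `model` callable, mask-token id, and cosine schedule are assumptions for the example, not the authors' code.

```python
# Sketch of iterative masked decoding for content tokens (assumptions only).
import math
import torch


@torch.no_grad()
def masked_decode(model, audio_feats, seq_len, mask_id, steps=10):
    """Iteratively fill a fully masked content-token sequence."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, audio_feats)              # (1, T, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-token confidence
        is_masked = tokens.eq(mask_id)
        tokens = torch.where(is_masked, pred, tokens)    # commit all predictions
        # Cosine schedule for how many tokens to re-mask next iteration.
        n_remask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_remask == 0:
            break
        # Re-mask the least-confident freshly predicted tokens; previously
        # committed tokens are protected. A semantic-aware strategy would also
        # weight these scores by gesture semantics.
        conf = conf.masked_fill(~is_masked, float("inf"))
        remask = conf.topk(n_remask, largest=False).indices
        tokens.scatter_(1, remask, mask_id)
    return tokens


if __name__ == "__main__":
    T, V = 64, 512
    dummy_model = lambda tok, aud: torch.randn(1, T, V)  # stand-in transformer
    out = masked_decode(dummy_model, audio_feats=None, seq_len=T, mask_id=V)
    print(out.shape)                                     # torch.Size([1, 64])
```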
Quantitative Comparison
Bold = best · underline = second best · metrics scaled as in the paper
Perceptual User Study
21 participants rated each method on a 5-point Likert scale (higher is better)
Motion Tokenizer Comparison
Bold = best · underline = second best · metrics scaled as in the paper
Content–Style Disentanglement · t-SNE
Colors denote different speakers. Content representations interleave across speakers; style representations form well-separated speaker clusters.
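A figure like this can be reproduced with an off-the-shelf t-SNE over pooled per-clip content and style latents, colored by speaker. The snippet below is a hypothetical recipe using synthetic stand-in features, not the paper's plotting script.

```python
# Hypothetical t-SNE recipe for visualizing content vs. style disentanglement.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_speakers, clips_per_speaker, dim = 4, 50, 256
speaker_ids = np.repeat(np.arange(n_speakers), clips_per_speaker)

# Stand-ins for pooled per-clip latents; in practice these would come from the
# trained RVQ-VAE's content and style branches.
content = rng.normal(size=(n_speakers * clips_per_speaker, dim))
style = rng.normal(size=(n_speakers * clips_per_speaker, dim)) + speaker_ids[:, None] * 3.0

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feats, title in [(axes[0], content, "Content"), (axes[1], style, "Style")]:
    xy = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)
    ax.scatter(xy[:, 0], xy[:, 1], c=speaker_ids, s=8, cmap="tab10")
    ax.set_title(title)
plt.savefig("tsne_disentanglement.png", dpi=200)
```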
Comparison with baselines on free-form co-speech gesture generation
Video panels (three examples): GestureLSM · PyraMotion · SemTalk · PersonaGest (Ours)
Comparison with style-conditioned co-speech baselines
Video panels (three examples): Style Prompt · ZeroEGGS · SynTalker · PersonaGest (Ours)
Part-wise style controllability: each sample is conditioned on two complementary style references, with upper body, hands, and lower body styles independently sourced from different speakers.
Video panels (eight examples), each showing Style Prompt A, Style Prompt B, and the PersonaGest result:
Style Prompt A (lower body) + Style Prompt B (upper body + hands) → PersonaGest (Ours)
Style Prompt A (lower body) + Style Prompt B (upper body + hands) → PersonaGest (Ours)
Style Prompt A (hands) + Style Prompt B (upper body + lower body) → PersonaGest (Ours)
Style Prompt A (hands) + Style Prompt B (upper body + lower body) → PersonaGest (Ours)
Style Prompt A (lower body) + Style Prompt B (upper body + hands) → PersonaGest (Ours)
Style Prompt A (upper body) + Style Prompt B (hands + lower body) → PersonaGest (Ours)
Style Prompt A (lower body) + Style Prompt B (upper body + hands) → PersonaGest (Ours)
Style Prompt A (upper body) + Style Prompt B (hands + lower body) → PersonaGest (Ours)
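One way to realize the part-wise control shown above is to assemble the style conditioning per joint group, taking some groups from reference A and the rest from reference B. The sketch below is purely illustrative: the joint groupings, tensor shapes, and the idea of mixing at the feature level are assumptions, not details from the paper.

```python
# Illustrative sketch of per-part style mixing from two reference prompts.
import torch

# Hypothetical joint groups for an example skeleton with 55 joints.
PART_JOINTS = {
    "upper_body": list(range(0, 25)),
    "hands": list(range(25, 45)),
    "lower_body": list(range(45, 55)),
}


def mix_part_styles(style_a, style_b, parts_from_a):
    """Take per-joint style features from A for `parts_from_a`, else from B.

    style_a, style_b: (T, J, D) per-joint style features from two references.
    """
    mixed = style_b.clone()
    for part in parts_from_a:
        mixed[:, PART_JOINTS[part]] = style_a[:, PART_JOINTS[part]]
    return mixed


if __name__ == "__main__":
    T, J, D = 64, 55, 128
    a, b = torch.randn(T, J, D), torch.randn(T, J, D)
    mixed = mix_part_styles(a, b, parts_from_a=["lower_body"])  # A lower body + B rest
    print(mixed.shape)                                          # torch.Size([64, 55, 128])
```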