SPSinger: Multi-Singer Singing Voice Synthesis with Short Reference Prompt
Abstract of the paper
Current singing voice synthesis systems often struggle in multi-singer scenarios because their training data covers only a few singers. Existing zero-shot multi-singer singing voice synthesis systems are criticized for their reliance on global timbre embeddings from a single reference audio clip, which fail to capture sufficient timbre detail. This paper introduces SPSinger, a multi-singer singing voice synthesizer that generates singer-specific voices from brief reference audio (around 5 seconds) without prior training on the singer's voice.
SPSinger builds on the StableDiffusion framework by adding two encoders: a global encoder that captures consistent timbre features from short reference prompts, and an attention-based local encoder that captures detailed variations from long prompts. The long prompts are used only during training.
To overcome the challenge of requiring long audio prompts during inference, we introduce the Latent Prompt Adaptation Model (LPAM), a Transformer-based module that derives timbre features from global embeddings. This approach eliminates the need for long reference prompts. Additionally, we propose a novel pitch shift algorithm that uses LPAM to predict the pitch shift values.
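The abstract does not describe the pitch shift algorithm itself, so the following is only a minimal sketch of what applying a predicted shift value to a music score could look like. It assumes the shift is expressed in semitones and that pitch is given either as MIDI note numbers or as an F0 contour in Hz; the function names are hypothetical and are not taken from SPSinger. The only fact it relies on is the standard equal-temperament relation, in which a shift of n semitones multiplies frequency by 2^(n/12).

```python
def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def shift_midi_notes(midi_notes, shift_semitones):
    """Transpose score notes by a number of semitones.

    Rests are represented as None (an assumption of this sketch) and are left unchanged.
    """
    return [None if n is None else n + shift_semitones for n in midi_notes]


def shift_f0(f0_hz, shift_semitones):
    """Scale an F0 contour by the frequency ratio of the given semitone shift.

    Unvoiced frames are assumed to be encoded as 0.0 and therefore stay 0.0.
    """
    ratio = 2.0 ** (shift_semitones / 12.0)
    return [f * ratio for f in f0_hz]


if __name__ == "__main__":
    score = [60, None, 64, 67]          # C4, rest, E4, G4
    shifted = shift_midi_notes(score, -2)
    print(shifted)
    print([None if n is None else round(midi_to_hz(n), 2) for n in shifted])
```

In a system like the one described, the shift value would come from a learned predictor rather than being set by hand; the sketch only covers the final transposition step.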
Our experiments show that SPSinger achieves high-quality singing voice synthesis that preserves the identity of the target singer, even when using only short reference audio inputs in zero-shot scenarios.
Model Architecture
Fig.1 Overall Architecture of SPSinger.
Synthesis Results on Seen Singers
Synthesis on Seen Singers with Short Music Scores
Singer Identity | GT mel + Vocoder | Reference | SPSinger
Female 0
Female 1
Female 2
Male 0
Male 1
Male 2
Synthesis on Seen Singers with Long Music Scores
Singer Identity | GT mel + Vocoder | Reference | SPSinger
Female 0
Female 1
Female 2
Male 0
Male 1
Male 2
Synthesis Results on Unseen Singers
Synthesis on Unseen Singers with Short Music Scores
Singer Identity | GT mel + Vocoder | Reference | SPSinger
Female 0
Female 1
Female 2
Male 0
Male 1
Male 2
Synthesis on Unseen Singers with Long Music Scores