Singing Voice Synthesis · Zero-Shot

IEEE Transactions on Audio, Speech and Language Processing · 2026

CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

In short: CoMelSinger keeps pitch and timbre on separate threads, so a short reference clip lends its voice — not its melody — to any lyric and pitch sequence you give it.

Pitch sequence — from the demo set
"刻在我心底的名字" (ke zai wo xin di de ming zi, "the name engraved deep in my heart")
I. Introduction

A short reference clip should lend its voice — not its melody.

Zero-shot singing synthesis conditions on a short acoustic prompt to copy a singer's timbre. But pitch and timing ride along in that same prompt, so the model quietly borrows melody from it too — even when an explicit pitch sequence says otherwise.

Comparison of representative SVS system architectures
SVS architectures. Token-based SVS among feature, end-to-end, and codec-guided systems.
Prosody leakage illustration in prompt-based SVS
Prosody leakage. Prompt acoustic tokens can carry pitch and timing cues.
II. Methodology

Two stages, one frozen, one fine-tuned with LoRA.

A Text-to-Semantic stage turns lyrics into semantic tokens. A Semantic-to-Acoustic stage — the only part that's fine-tuned — turns those tokens into singing, steered by an explicit pitch sequence and supervised by a frozen pitch transcriber.
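To make the "frozen plus LoRA" idea concrete, here is a minimal sketch of a generic LoRA-adapted linear layer. It is not the paper's code: the class name `LoRALinear`, the layer size, the rank `r`, and `alpha` are all hypothetical, and which layers of the Semantic-to-Acoustic stage actually receive adapters is not stated on this page.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""

    def __init__(self, d_out, d_in, r, alpha, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pretrained weight matrix; frozen during fine-tuning.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Low-rank factors: A gets a small random init, B starts at zero,
        # so the adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_fraction(self):
        frozen = len(self.W) * len(self.W[0])
        lora = len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
        return lora / (frozen + lora)

# Hypothetical sizes, chosen only for illustration.
layer = LoRALinear(d_out=64, d_in=64, r=2, alpha=4)
x = [1.0] * 64
print(layer.forward(x) == matvec(layer.W, x))
print(round(layer.trainable_fraction(), 4))
```

Because `B` is zero-initialized, the first forward pass reproduces the frozen layer, and the trainable share depends only on the rank and the matrix shape.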

Figure from the paper: overview of CoMelSinger (left) and the coarse-to-fine contrastive learning strategy (right)
Overview. CoMelSinger overview from the paper. The full figure is kept smaller here so the section can foreground the two contrastive learning views below.
II-B. Coarse-to-Fine Contrastive Learning

Two contrastive objectives, two time scales

One operates over a whole phrase, one frame by frame — together they pull melody and timbre apart.

Sequence-level contrastive learning diagram
Global contrastive learning. Phrase-level pairs separate melody from prompt timbre.
Local contrastive learning diagram
Local contrastive learning. Frame-aware perturbation sharpens local melody control.
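The two time scales can be sketched with a generic InfoNCE-style loss applied once to pooled phrase embeddings and once frame by frame. This is an illustration under assumptions, not the paper's objective: the loss form, the temperature, mean pooling, and the choice of positives (same target melody) and negatives (a different or perturbed melody) are all hypothetical here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log softmax of the positive's similarity against all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

def mean_pool(frames):
    """Average frame embeddings into one phrase-level embedding."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def sequence_level_loss(anchor_seq, positive_seq, negative_seqs, temperature=0.1):
    """Coarse view: one contrast per phrase, on pooled embeddings."""
    return info_nce(mean_pool(anchor_seq), mean_pool(positive_seq),
                    [mean_pool(s) for s in negative_seqs], temperature)

def frame_level_loss(anchor_seq, positive_seq, perturbed_seq, temperature=0.1):
    """Fine view: one contrast per frame, against a perturbed copy of that frame."""
    losses = [info_nce(a, p, [n], temperature)
              for a, p, n in zip(anchor_seq, positive_seq, perturbed_seq)]
    return sum(losses) / len(losses)

# Toy 2-D embeddings, two frames per phrase (hypothetical values).
anchor = [[1.0, 0.0], [1.0, 0.2]]
positive = [[0.9, 0.1], [1.0, 0.1]]
negative = [[0.0, 1.0], [0.1, 1.0]]
print(sequence_level_loss(anchor, positive, [negative]))
print(frame_level_loss(anchor, positive, negative))
```

Both losses are small when the positive sits closer to the anchor than the negatives do, and grow when the roles are swapped.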
III. Experimental Results

What got measured, and what the paper found.

Four objective metrics and three subjective ones, on both the seen-singer and zero-shot test sets. Numbers are taken directly from the paper, where bold marks the best result among non-GT systems and underlining marks the second best.

F0-RMSE ↓

RMSE between generated and reference pitch contours — the direct test of melody control.
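As a rough illustration of the metric, the sketch below computes RMSE over frames that are voiced in both tracks. The log2-Hz scale, the convention that 0 marks an unvoiced frame, and the function name `f0_rmse` are assumptions for this example; the page does not state which F0 extractor or pitch scale the paper uses.

```python
import math

def f0_rmse(ref_hz, gen_hz):
    """RMSE between two frame-aligned F0 tracks, in octaves (log2 of Hz).

    Frames where either track is unvoiced (F0 <= 0) are skipped.
    """
    if len(ref_hz) != len(gen_hz):
        raise ValueError("tracks must be frame-aligned and equally long")
    squared = [(math.log2(g) - math.log2(r)) ** 2
               for r, g in zip(ref_hz, gen_hz) if r > 0 and g > 0]
    if not squared:
        raise ValueError("no frames are voiced in both tracks")
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical tracks: the middle frame is unvoiced in the reference.
reference = [220.0, 0.0, 440.0]
generated = [220.0, 100.0, 440.0]
print(f0_rmse(reference, generated))
```

Skipping frames that are unvoiced in either track keeps voicing errors from dominating the pitch error; other conventions exist and would give different numbers.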

SECS ↑

WavLM speaker-embedding cosine similarity — how closely timbre matches the prompt.
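The similarity itself is a plain cosine between two embedding vectors, averaged over test pairs. In the sketch below the embeddings are hypothetical toy vectors; extracting real speaker embeddings from a WavLM-based model is outside the scope of this example, and the function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_secs(pairs):
    """Average cosine similarity over (prompt_embedding, generated_embedding) pairs."""
    scores = [cosine_similarity(p, g) for p, g in pairs]
    return sum(scores) / len(scores)

# Toy embeddings: one identical pair and one orthogonal pair.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
]
print(mean_secs(pairs))
```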

MCD ↓

Mel-cepstral distortion between synthesized and reference audio — overall spectral fidelity.
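A common textbook form of MCD is sketched below: a per-frame distance of (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The cepstral order, whether the energy coefficient c0 is excluded, and how frames are time-aligned are assumptions here; the page does not give the paper's exact configuration.

```python
import math

def mcd_db(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a list of mel-cepstral coefficients, assumed to have
    c0 (energy) already removed and to be aligned one-to-one.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames in both inputs")
    k = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        k * math.sqrt(sum((r - s) ** 2 for r, s in zip(rf, sf)))
        for rf, sf in zip(ref_frames, syn_frames)
    ]
    return sum(per_frame) / len(per_frame)

# Hypothetical two-frame example with three coefficients per frame.
ref = [[0.5, -0.2, 0.1], [0.4, -0.1, 0.0]]
syn = [[0.5, -0.2, 0.1], [0.4, -0.1, 0.0]]
print(mcd_db(ref, syn))
```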

SingMOS ↑

A learned, reference-free model of perceived singing quality.

MOS-Q ↑: audio quality
MOS-N ↑: naturalness
SMOS ↑: timbre similarity
Subjective scores come from 20 listeners with formal vocal training and are reported with 95% confidence intervals.
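For readers unfamiliar with how a mean opinion score and its 95% interval are usually derived, here is a minimal sketch. The ratings are made up for illustration, the one-rating-per-listener setup is an assumption, and the Student-t interval is a standard choice rather than something this page confirms the paper used.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 19 degrees of freedom (20 ratings).
T_CRIT_DF19 = 2.093

def mos_with_ci(ratings, t_crit=T_CRIT_DF19):
    """Return (mean, half_width) so the interval is mean +/- half_width."""
    mean = statistics.mean(ratings)
    half_width = t_crit * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical 1-5 ratings from 20 listeners; not data from the paper.
ratings = [4] * 10 + [5] * 10
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

With a different number of ratings the critical value changes, so `t_crit` should be looked up for the matching degrees of freedom.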

Seen-singer synthesis

50 utterances each from Opencpop and M4Singer · vocoder fixed to HiFi-GAN across all systems

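| Model | MOS-Q | MOS-N | SMOS | MCD | F0-RMSE | SingMOS | SECS |
|---|---|---|---|---|---|---|---|
| GT | 4.17 | 4.38 | 4.41 | - | - | 4.37 | 0.925 |
| GT (acoustic codec) | 4.01 | 4.19 | 4.48 | 0.93 | 0.012 | 4.31 | 0.906 |
| DiffSinger | 3.68 | 3.79 | 3.86 | 4.59 | 0.084 | 4.13 | 0.769 |
| VISinger2 | 3.59 | 3.86 | 3.91 | 5.36 | 0.061 | 4.15 | 0.792 |
| StyleSinger | 3.67 | 3.92 | 4.11 | 4.95 | 0.112 | 4.19 | 0.833 |
| SPSinger | 3.81 | 4.10 | 4.06 | 4.28 | 0.054 | 4.28 | 0.860 |
| Vevo 1.5 | 3.85 | 3.96 | 4.17 | 4.18 | 0.051 | 4.39 | 0.907 |
| CoMelSinger (ours) | 3.90 | 4.02 | 4.22 | 4.17 | 0.042 | 4.32 | 0.912 |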

CoMelSinger gets the lowest F0-RMSE and highest SMOS/SECS of any system in the comparison — strongest melody accuracy and timbre consistency together, approaching the GT-through-codec upper bound.

Zero-shot synthesis

10 male + 10 female unseen singers from OpenSinger, paired with M4Singer score sequences · MCD omitted — no ground-truth alignment exists in this setting

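| Model | MOS-Q | MOS-N | SMOS | F0-RMSE | SingMOS | SECS |
|---|---|---|---|---|---|---|
| GT | 4.20 | 4.35 | 4.55 | - | 4.41 | 0.932 |
| GT (acoustic codec) | 4.07 | 4.22 | 4.32 | 0.015 | 4.66 | 0.921 |
| DiffSinger | 3.75 | 3.72 | 3.25 | 0.098 | 4.11 | 0.658 |
| VISinger2 | 3.72 | 3.74 | 3.31 | 0.074 | 4.08 | 0.704 |
| StyleSinger | 3.48 | 3.82 | 3.85 | 0.125 | 4.22 | 0.853 |
| SPSinger | 3.92 | 4.03 | 3.76 | 0.065 | 4.29 | 0.844 |
| Vevo 1.5 | 3.72 | 3.81 | 4.02 | 0.094 | 4.16 | 0.870 |
| CoMelSinger (ours) | 3.87 | 4.11 | 4.14 | 0.048 | 4.25 | 0.897 |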

Baselines drop noticeably on SMOS, SECS, and F0-RMSE for unseen singers; CoMelSinger degrades only slightly from the seen-singer setting (SMOS 4.22 → 4.14, SECS 0.912 → 0.897, F0-RMSE 0.042 → 0.048).
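The size of the seen-to-zero-shot change can be checked directly from the two result tables. The short script below does only that arithmetic; the values are copied from this page, and the helper name is illustrative.

```python
# (SMOS, F0-RMSE, SECS) copied from the two result tables on this page.
SEEN = {
    "DiffSinger": (3.86, 0.084, 0.769),
    "VISinger2": (3.91, 0.061, 0.792),
    "StyleSinger": (4.11, 0.112, 0.833),
    "SPSinger": (4.06, 0.054, 0.860),
    "Vevo 1.5": (4.17, 0.051, 0.907),
    "CoMelSinger": (4.22, 0.042, 0.912),
}
ZERO_SHOT = {
    "DiffSinger": (3.25, 0.098, 0.658),
    "VISinger2": (3.31, 0.074, 0.704),
    "StyleSinger": (3.85, 0.125, 0.853),
    "SPSinger": (3.76, 0.065, 0.844),
    "Vevo 1.5": (4.02, 0.094, 0.870),
    "CoMelSinger": (4.14, 0.048, 0.897),
}

def seen_to_zero_shot_change(seen, zero_shot):
    """Per-system difference (zero-shot minus seen) for each metric."""
    return {
        name: tuple(round(z - s, 3) for s, z in zip(seen[name], zero_shot[name]))
        for name in seen
    }

changes = seen_to_zero_shot_change(SEEN, ZERO_SHOT)
for name, (d_smos, d_f0, d_secs) in changes.items():
    print(f"{name:12s} SMOS {d_smos:+.2f}  F0-RMSE {d_f0:+.3f}  SECS {d_secs:+.3f}")
```

Positive F0-RMSE changes and negative SMOS or SECS changes both indicate degradation, so the three columns have to be read with their arrow directions in mind.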

Component ablation

Seen-singer test set · CL = coarse-to-fine contrastive learning (sequence-level SCL + frame-level FCL) · SVT = pitch-guidance module · "− CL + SVT" removes both, leaving the plain MaskGCT-based SVS baseline

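| Configuration | MCD | F0-RMSE | SingMOS | SECS |
|---|---|---|---|---|
| CoMelSinger | 4.17 | 0.042 | 4.32 | 0.912 |
| − CL | 4.91 | 0.080 | 4.12 | 0.895 |
| − SCL only | 4.53 | 0.062 | 4.25 | 0.900 |
| − FCL only | 4.82 | 0.075 | 4.18 | 0.892 |
| − SVT | 5.53 | 0.194 | 3.95 | 0.883 |
| − CL + SVT | 5.89 | 0.210 | 3.83 | 0.874 |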

Removing SVT hurts F0-RMSE the most (0.042 → 0.194). Removing FCL degrades pitch accuracy (F0-RMSE 0.042 → 0.075) more visibly than removing SCL degrades timbre similarity (SECS 0.912 → 0.900), consistent with their intended roles.
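To put the ablation on a common scale, the script below expresses each configuration's F0-RMSE as a multiple of the full model's. It only re-uses the numbers from the ablation table; the dictionary layout and function name are illustrative.

```python
# (MCD, F0-RMSE, SingMOS, SECS) copied from the component-ablation table.
ABLATION = {
    "CoMelSinger": (4.17, 0.042, 4.32, 0.912),
    "- CL": (4.91, 0.080, 4.12, 0.895),
    "- SCL only": (4.53, 0.062, 4.25, 0.900),
    "- FCL only": (4.82, 0.075, 4.18, 0.892),
    "- SVT": (5.53, 0.194, 3.95, 0.883),
    "- CL + SVT": (5.89, 0.210, 3.83, 0.874),
}
F0_RMSE = 1  # column index of F0-RMSE in each row

def f0_rmse_ratio(table, full="CoMelSinger"):
    """F0-RMSE of each ablated configuration relative to the full model."""
    base = table[full][F0_RMSE]
    return {
        name: round(row[F0_RMSE] / base, 2)
        for name, row in table.items()
        if name != full
    }

ratios = f0_rmse_ratio(ABLATION)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:12s} {ratio:.2f}x")
```

On these numbers, dropping SVT multiplies F0-RMSE by roughly 4.6, while dropping FCL alone costs more than dropping SCL alone.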

Two further analyses in the paper aren't reproduced here for space: a comparison against supervised pitch-aware baselines (XiaoiceSing, SingAug, RMSSinger), and a six-way fine-tuning-strategy ablation showing LoRA winning on every metric while updating only ~4.8% of parameters. See Section V of the paper.
Ground-truth mel-spectrogram and pitch contour

Ground truth pitch contour

The target contour provides the reference structure used to judge melody fidelity.

CoMelSinger mel-spectrogram and pitch contour

CoMelSinger output

The predicted trajectory tracks the reference closely while preserving the prompted timbre.

Ablation without contrastive learning

Without contrastive learning

Removing CL weakens the melody-timbre separation and visibly destabilizes local pitch detail.

Ablation without SVT

Without SVT

The largest F0-RMSE jump in Table V aligns with this degraded pitch trajectory.

IV. Seen-Singer Evaluation

Same lyric, five systems, one singer the model was trained on.

GT is the ground-truth recording re-synthesized through the codec, so it bounds what any codec-based system could possibly reach. Reference is the acoustic prompt. CoMelSinger is highlighted throughout.

V. Zero-Shot Evaluation

Four singers the model has never heard, each given only a short prompt.

No ground truth exists here — these singers' voices come only from a few seconds of speech-like prompt audio. The melody sparkline shows the real pitch sequence each system was asked to follow.