
SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System

Abstract of the paper

The precise control of singing techniques is of utmost importance in achieving emotionally expressive vocal performances. To bridge the gap between current Singing Voice Synthesis (SVS) systems and human singers, our paper focuses on developing an SVS system that allows for control over singing techniques. In this paper, we introduce SinTechSVS, a singing technique controllable SVS system composed of a singing technique annotator, a singing technique controllable synthesizer, and a singing technique recommender. Our approach leverages transfer learning for efficient singing technique annotation and adapts the DiffSinger framework with additional style encoders and an attention-based singing technique local score (STLS) module to enhance singing technique controllability. We also propose a Seq2Seq singing technique recommender for the new task of Singing Technique Recommendation (STR). Experimental results demonstrate that SinTechSVS significantly improves the quality and expressiveness of synthesized vocal performances, with comparable general synthesis capabilities to state-of-the-art SVS systems and enhanced control over singing techniques, as evidenced by objective and subjective evaluations. To the best of our knowledge, SinTechSVS is the first SVS capable of controlling singing techniques.

Overall Architecture

SinTechSVS consists of three key components: a singing technique annotator (STA), a singing voice synthesizer conditioned on singing techniques (SVS), and a singing technique recommender (STR). The 'OR' symbol in the figure means that the SVS input is either a user-provided singing technique sequence or the sequence predicted by the singing technique recommender.
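The 'OR' selection between the two input sources could be sketched as follows. This is only an illustration under assumptions, not the authors' code: `choose_techniques` and `constant_recommender` are hypothetical names, and the stub recommender simply labels every word REG in place of a trained STR model.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical type: a recommender maps word-level lyrics to technique labels.
Recommender = Callable[[Sequence[str]], List[str]]


def constant_recommender(words: Sequence[str]) -> List[str]:
    """Hypothetical stand-in for a trained recommender: label every word REG."""
    return ["REG" for _ in words]


def choose_techniques(
    words: Sequence[str],
    user_techniques: Optional[Sequence[str]] = None,
    recommender: Recommender = constant_recommender,
) -> List[str]:
    """Return the user's labels if given, otherwise the recommender's labels."""
    if user_techniques is not None:
        if len(user_techniques) != len(words):
            raise ValueError("expected one technique label per word")
        return list(user_techniques)
    return recommender(words)


words = ["ni", "hao", "ma"]
from_user = choose_techniques(words, ["REG", "BEL", "REG"])
from_recommender = choose_techniques(words)
```

Either way, the synthesizer would receive one technique label per word, which is why the sketch rejects user input of a different length.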


The training process of SinTechSVS consists of three steps, with each step laying the foundation for the next. Modules drawn with full shading are trainable throughout the training step, while those with half shading are first frozen and then unfrozen during training. (Top-left): Training of STA; (Bottom-left): Training of STA; (Bottom-right): Inference of SinTechSVS.

Singing Technique Annotations

In this section, we provide samples of the singing techniques that we annotated on the Opencpop dataset. In sentence-level samples, bolded words are sung with the specified technique.


Singing Voice Samples of Pitch Singing Techniques
Pitch Singing Techniques
Word-level Sample 1
Word-level Sample 2
Sentence-level Sample
Scooping
具(ju) 象(xiang)
Bend
没(mei) 限(xian) 期(qi)
Drop
再(zai) 给(gei) 我(wo) 两(liang) 分(fen) 钟(zhong)
Melisma
双(shuang) 眼(yan)
Singing Voice Samples of Timbre Singing Techniques
Vocal Fry
我(wo) 们(men) 都(dou) 需(xu) 要(yao) 勇(yong) 气(qi)
Falsetto
没(mei) 有(you) 你(ni) 根(gen) 本(ben) 不(bu) 想(xiang) 逃(tao)
Breathy
我(wo) 不(bu) 会(hui) 发(fa) 现(xian) 我(wo) 难(nan) 受(shou)
Belting
一(yi) 辈(bei) 子(zi) 暖(nuan) 暖(nuan) 的(de) 好(hao)

Data Acquirement

To obtain the manual singing technique annotation file for the Opencpop dataset, access here and request permission. The annotations are for research purposes only.
The request should include the following; otherwise, it will be rejected.
For the Opencpop dataset: Please strictly follow the instructions of Opencpop. We have no right to grant you access to Opencpop.

Annotated Data Statistics

SinTechSVS Samples

Singing Voice Synthesis with Singing Technique Control

In this section, we provide synthesized samples of SinTechSVS conditioned on singing techniques. In Word-level Lyric Sequence, bolded words are sung in the specific technique. Regular/Straight denotes synthesized audio with no singing techniques, serving as a reference for comparison.


SinTechSVS Synthesized Samples Conditioned on Pitch Singing Techniques
Pitch Singing Techniques
Word-level Lyric Sequence
Regular/Straight
SinTechSVS
Scooping
冰 刀 划 的 圈
bing dao hua de quan
Bend
又 无 可 奈 何
you wu ke nai he
Drop
小 火 车 摆 动 的 旋 律
xiao huo che bai dong de xuan lv
Melisma
你 在 世 俗 里 的 名 字 被 人 用 了
ni zai shi su li de ming zi bei ren yong le
SinTechSVS Synthesized Samples Conditioned on Timbre Singing Techniques
Vocal Fry
喔 喔
wo wo
Falsetto
我 恨 你
wo hen ni
Breathy
很 少 人 看 诗
hen shao ren kan shi
Belting
你 好 吗
ni hao ma

Singing Voice Synthesis with Singing Technique Recommendation

In this section, we provide synthesized samples of SinTechSVS conditioned on singing techniques recommended from the music score. STan denotes SinTechSVS conditioned on the annotated singing technique labels, whereas SinTechSVS uses the STR to predict the singing techniques for its input.


SinTechSVS Synthesized Samples Conditioned on Recommended Singing Techniques
Ground Truth
STan
SinTechSVS

Singing Voice Synthesis with Singing Technique Recommendation (Unseen)

This section showcases singing technique recommendations for unseen music score samples, together with their corresponding synthesized audio. For pitch singing techniques, the abbreviations are: (1) straight - STR; (2) scooping - SCO; (3) bend - BEND; (4) drop - DROP; (5) melisma - MEL. For timbre singing techniques, the abbreviations are: (1) regular - REG; (2) vocal fry - FRY; (3) falsetto - FAL; (4) breathy - BRE; (5) belting - BEL. Word-level pitch, lyric, slur, and singing technique entries are separated by |.
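As a minimal sketch of how these |-separated word-level sequences could be read, the Python snippet below parses the first three words of Unseen Sample 1 and expands the timbre abbreviations listed above. The helper names `split_words` and `parse_lyric` and the `TIMBRE_TECHNIQUES` dictionary are hypothetical and are not part of any SinTechSVS release.

```python
import re

# Abbreviation table copied from the description of timbre techniques.
TIMBRE_TECHNIQUES = {
    "REG": "regular",
    "FRY": "vocal fry",
    "FAL": "falsetto",
    "BRE": "breathy",
    "BEL": "belting",
}


def split_words(line):
    """Split a '|'-separated line into stripped word-level fields."""
    return [field.strip() for field in line.split("|")]


def parse_lyric(field):
    """Split a field such as '但(dan)' into (character, pinyin)."""
    match = re.fullmatch(r"(\S+?)\s*\((\w+)\)", field)
    if match is None:
        # A syllable written without pinyin, such as 'ha'.
        return field, None
    return match.group(1), match.group(2)


# First three words of Unseen Sample 1.
lyrics = "但(dan) | 我(wo) | 早(zao)"
pitch = "A3 | F#4/Gb4 E4 D4 | F#4/Gb4"
timbre = "REG | BEL | BEL"

words = [parse_lyric(field) for field in split_words(lyrics)]
# A word may span several notes, so each pitch field is split on whitespace.
notes = [field.split() for field in split_words(pitch)]
timbre_names = [TIMBRE_TECHNIQUES[abbr] for abbr in split_words(timbre)]
```

After parsing, `words`, `notes`, and `timbre_names` each hold one entry per word, so they can be zipped together for inspection.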


Unseen Sample 1
Input Sequence
Lyrics: 但(dan) | 我(wo) | 早(zao) | 已(yi) | 学(xue) | 会(hui) | 一(yi) | 个(ge) | 人(ren) | 想(xiang) | 你(ni)
Pitch: A3 | F#4/Gb4 E4 D4 | F#4/Gb4 | B4 | C#5/Db5 | B4 | A4 | F#4/Gb4 | E4 | D4 | F#4/Gb4 E4 D4
Slur: 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
Output Pitch Singing Techniques
STR | MEL | REG | REG | SCO | REG | REG | DROP | REG | REG | MEL
Output Timbre Singing Techniques
REG | BEL | BEL | FAL | FAL | FAL | FAL | BRE | BRE | REG | BRE
Synthesized Audio
Unseen Sample 2
Input Sequence
Lyrics: 看(kan) | 着(zhe) | 我(wo) | 的(de) | 脚(jiao) | 印(yin) | 一(yi) | 个(ge) | 人(ren) | 一(yi) | 步(bu) | 步(bu) | 好(hao) | 寂(ji) | 寞(mo)
Pitch: F4 | G4 | E4 | D4 | C4 | D4 C4 A4 | E4 | F4 | C5 | E4 | F4 | C5 | E4 | F4 | F4
Slur: 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
Output Pitch Singing Techniques
STR | STR | SCO | STR | REG | MEL | STR | STR | SCO | STR | STR | SCO | STR | STR | SCO
Output Timbre Singing Techniques
BEL | BEL | BEL | BEL | BEL | REG | REG | REG | FAL | REG | REG | FAL | REG | REG | BEL
Synthesized Audio
Unseen Sample 3
Input Sequence
Lyrics: 有(you) | 什(shen) | 么(me) | 方(fang) | 法(fa) | 让(rang) | 自(zi) | 己(ji) | 真(zhen) | 的(de) | 忘(wang) | 记(ji) | ha | ha
Pitch: G4 | A4 | A#4/Bb4 | A4 F4 | C4 | A#4/Bb4 | A4 | F4 | A#4/Bb4 | A4 | A#4/Bb4 | C5 | A#4/Bb4 A4 | A4 G4 F4
Slur: 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1
Output Pitch Singing Techniques
STR | STR | SCO | BEND | BEND | STR | STR | STR | BEND | STR | STR | SCO | DROP | MEL
Output Timbre Singing Techniques
REG | REG | REG | BRE | BRE | BEL | BEL | BEL | BRE | BRE | BRE | FAL | BRE | BRE
Synthesized Audio

Citation

Cite the IEEE/ACM Transactions on Audio, Speech, and Language Processing journal paper.

@ARTICLE{10509739,
  author={Zhao, Junchuan and Chetwin, Low Qi Hong and Wang, Ye},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System}, 
  year={2024},
  volume={},
  number={},
  pages={1-13},
  keywords={Hidden Markov models;Annotations;Timbre;Task analysis;Deep learning;Synthesizers;Controllability;Singing voice synthesis;singing voice synthesis conditioned on singing techniques;singing technique classification;singing technique recommendation;metric;deep learning},
  doi={10.1109/TASLP.2024.3394769}
}

Contact

If you have any questions about the paper, please contact the first author, Junchuan, at junchuan@comp.nus.edu.sg.

License