SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System

Abstract of the paper

The precise control of singing techniques is of utmost importance in achieving emotionally expressive vocal performances. To bridge the gap between current Singing Voice Synthesis (SVS) systems and human singers, our paper focuses on developing an SVS system that allows for control over singing techniques. In this paper, we introduce SinTechSVS, a singing technique controllable SVS system composed of a singing technique annotator, a singing technique controllable synthesizer, and a singing technique recommender. Our approach leverages transfer learning for efficient singing technique annotation and adapts the DiffSinger framework with additional style encoders and an attention-based singing technique local score (STLS) module to enhance singing technique controllability. We also propose a Seq2Seq singing technique recommender for the new task of Singing Technique Recommendation (STR). Experimental results demonstrate that SinTechSVS significantly improves the quality and expressiveness of synthesized vocal performances, with comparable general synthesis capabilities to state-of-the-art SVS systems and enhanced control over singing techniques, as evidenced by objective and subjective evaluations. To the best of our knowledge, SinTechSVS is the first SVS capable of controlling singing techniques.

Overall Architecture

SinTechSVS consists of three key components: a singing technique annotator (STA), a singing voice synthesizer conditioned on singing techniques (SVS), and a singing technique recommender (STR). The 'OR' symbol in the figure indicates that the SVS input is either a user-provided singing technique sequence or the sequence predicted by the singing technique recommender.
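The 'OR' choice described above can be sketched as a small selection step. This is only an illustrative sketch: the function name `select_technique_sequence`, the `dummy_recommender` stand-in, and the assumption of one technique label per word are hypothetical and are not taken from the SinTechSVS implementation.

```python
from typing import Callable, List, Optional, Sequence


def select_technique_sequence(
    score_words: Sequence[str],
    user_techniques: Optional[Sequence[str]],
    recommender: Callable[[Sequence[str]], Sequence[str]],
) -> List[str]:
    """Pick the technique sequence that conditions the synthesizer.

    Hypothetical interface: a user-provided sequence takes priority,
    otherwise the recommender's prediction is used.
    """
    if user_techniques is not None:
        techniques = list(user_techniques)
    else:
        techniques = list(recommender(score_words))
    # Assumption: exactly one technique label per word of the score.
    if len(techniques) != len(score_words):
        raise ValueError("expected one technique label per word")
    return techniques


def dummy_recommender(words: Sequence[str]) -> List[str]:
    # Stand-in for a trained recommender: labels every word "REG".
    return ["REG"] * len(words)


words = ["ni", "hao", "ma"]
from_user = select_technique_sequence(words, ["REG", "BEL", "BEL"], dummy_recommender)
from_model = select_technique_sequence(words, None, dummy_recommender)
```

Here `from_user` keeps the labels supplied by the user, while `from_model` falls back to the recommender's output.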

The training process of SinTechSVS consists of three steps, with each step laying the foundation for the next. Modules depicted with full shadows remain unfixed during the training step, while those with half shadows are first fixed and then unfixed during training. (Top-left): Training of STA; (Bottom-left): Training of STA; (Bottom-right): Inference of SinTechSVS.

Singing Technique Annotations

In this section, we provide samples of the singing techniques that we annotated on the Opencpop dataset. In the sentence-level samples, bolded words are sung with the specified technique.

Singing Voice Samples of Pitch Singing Techniques
Pitch Singing Techniques	Word-level Sample 1	Word-level Sample 2	Sentence-level Sample
Scooping			具(ju) 象(xiang)
Bend			没(mei) 限(xian) 期(qi)
Drop			再(zai) 给(gei) 我(wo) 两(liang) 分(fen) 钟(zhong)
Melisma			双(shuang) 眼(yan)
Singing Voice Samples of Timbre Singing Techniques
Vocal Fry			我(wo) 们(men) 都(dou) 需(xu) 要(yao) 勇(yong) 气(qi)
Falsetto			没(mei) 有(you) 你(ni) 根(gen) 本(ben) 不(bu) 想(xiang) 逃(tao)
Breathy			我(wo) 不(bu) 会(hui) 发(fa) 现(xian) 我(wo) 难(nan) 受(shou)
Belting			一(yi) 辈(bei) 子(zi) 暖(nuan) 暖(nuan) 的(de) 好(hao)

Data Access

To obtain the manual singing technique annotation file for the Opencpop dataset, request permission here. The annotation is for research purposes only.

The request must include all of the following; otherwise it will be rejected.

Name
Affiliation
Email Address
Agree to the License

For the Opencpop dataset itself: please strictly follow the instructions of Opencpop. We are not authorized to grant access to Opencpop.

Annotated Data Statistics

Distribution of the manually annotated portion of the Opencpop dataset. (The singing techniques "whisper" and "hiccup" were removed because they have too few labels.)

Distribution of duration of each singing technique.

SinTechSVS Samples

Singing Voice Synthesis with Singing Technique Control

In this section, we provide synthesized samples of SinTechSVS conditioned on singing techniques. In the Word-level Lyric Sequence column, bolded words are sung with the specified technique. Regular/Straight denotes audio synthesized with no singing techniques, serving as a reference for comparison.

SinTechSVS Synthesized Samples Conditioned on Pitch Singing Techniques
Pitch Singing Techniques	Word-level Lyric Sequence	Regular/Straight	SinTechSVS
Scooping	冰刀划的圈 bing dao hua de quan
Bend	又无可奈何 you wu ke nai he
Drop	小火车摆动的旋律 xiao huo che bai dong de xuan lv
Melisma	你在世俗里的名字被人用了 ni zai shi su li de ming zi bei ren yong le
SinTechSVS Synthesized Samples Conditioned on Timbre Singing Techniques
Vocal Fry	喔喔 wo wo
Falsetto	我恨你 wo hen ni
Breathy	很少人看诗 hen shao ren kan shi
Belting	你好吗 ni hao ma

Singing Voice Synthesis with Singing Technique Recommendation

In this section, we provide synthesized samples of SinTechSVS conditioned on singing techniques recommended from the music score. STan represents SinTechSVS with annotated singing technique labels. SinTechSVS uses the STR to predict the singing techniques used as input.

SinTechSVS Synthesized Samples Conditioned on Recommended Singing Techniques
Ground Truth	STan	SinTechSVS

Singing Voice Synthesis with Singing Technique Recommendation (Unseen)

This section showcases singing technique recommendations for unseen music score samples, together with the corresponding synthesized audio. For pitch singing techniques, the abbreviations are: (1) straight - STR; (2) scooping - SCO; (3) bend - BEND; (4) drop - DROP; (5) melisma - MEL. For timbre singing techniques, the abbreviations are: (1) regular - REG; (2) vocal fry - FRY; (3) falsetto - FAL; (4) breathy - BRE; (5) belting - BEL. Word-level pitches, lyrics, slurs, and singing techniques are separated by |.
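The format described above can be read with a few lines of Python. This is a minimal sketch, not part of the SinTechSVS code: the helper names `split_words` and `expand_labels` are hypothetical, and the two dictionaries simply restate the abbreviations listed above.

```python
from typing import Dict, List

# Abbreviation tables as listed in the paragraph above.
PITCH_ABBREVIATIONS: Dict[str, str] = {
    "STR": "straight",
    "SCO": "scooping",
    "BEND": "bend",
    "DROP": "drop",
    "MEL": "melisma",
}
TIMBRE_ABBREVIATIONS: Dict[str, str] = {
    "REG": "regular",
    "FRY": "vocal fry",
    "FAL": "falsetto",
    "BRE": "breathy",
    "BEL": "belting",
}


def split_words(field: str) -> List[str]:
    """Split a '|'-separated field into word-level items.

    An escaped separator ('\\|') is treated the same as a plain '|'.
    """
    normalized = field.replace("\\|", "|")
    return [item.strip() for item in normalized.split("|")]


def expand_labels(labels: List[str], table: Dict[str, str]) -> List[str]:
    # Labels missing from the table are passed through unchanged.
    return [table.get(label, label) for label in labels]


lyrics = split_words("有(you) | 什(shen) | 么(me)")
pitch_labels = split_words("STR | STR | SCO")
timbre_labels = split_words("REG | REG | REG")

pitch_names = expand_labels(pitch_labels, PITCH_ABBREVIATIONS)
timbre_names = expand_labels(timbre_labels, TIMBRE_ABBREVIATIONS)

for word, pitch, timbre in zip(lyrics, pitch_names, timbre_names):
    print(word, pitch, timbre)
```

Because the technique rows hold one item per word, zipping them with the lyrics pairs each word with its pitch and timbre technique.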

Unseen Sample 1
Input Sequence	Lyrics: 但(dan) \| 我(wo) \| 早(zao) \| 已(yi) \| 学(xue) \| 会(hui) \| 一(yi) \| 个(ge) \| 人(ren) \| 想(xiang) \| 你(ni) Pitch: A3 \| F#4/Gb4 E4 D4 \| F#4/Gb4 \| B4 \| C#5/Db5 \| B4 \| A4 \| F#4/Gb4 \| E4 \| D4 \| F#4/Gb4 E4 D4 Slur: 0 \| 0 \| 0 \| 1 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 1 \| 0 \| 0 \| 0 \| 1
Output Pitch Singing Techniques	STR \| MEL \| REG \| REG \| SCO \| REG \| REG \| DROP \| REG \| REG \| MEL
Output Timbre Singing Techniques	REG \| BEL \| BEL \| FAL \| FAL \| FAL \| FAL \| BRE \| BRE \| REG \| BRE
Synthesized Audio

Unseen Sample 2
Input Sequence	Lyrics: 看(kan) \| 着(zhe) \| 我(wo) \| 的(de) \| 脚(jiao) \| 印(yin) \| 一(yi) \| 个(ge) \| 人(ren) \| 一(yi) \| 步(bu) \| 步(bu) \| 好(hao) \| 寂(ji) \| 寞(mo) Pitch: F4 \| G4 \| E4 \| D4 \| C4 \| D4 C4 A4 \| E4 \| F4 \| C5 \| E4 \| F4 \| C5 \| E4 \| F4 \| F4 Slur: 0 \| 0 \| 0 \| 1 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 1 \| 0 \| 0 \| 0 \| 1
Output Pitch Singing Techniques	STR \| STR \| SCO \| STR \| REG \| MEL \| STR \| STR \| SCO \| STR \| STR \| SCO \| STR \| STR \| SCO
Output Timbre Singing Techniques	BEL \| BEL \| BEL \| BEL \| BEL \| REG \| REG \| REG \| FAL \| REG \| REG \| FAL \| REG \| REG \| BEL
Synthesized Audio

Unseen Sample 3
Input Sequence	Lyrics: 有(you) \| 什(shen) \| 么(me) \| 方(fang) \| 法(fa) \| 让(rang) \| 自(zi) \| 己(ji) \| 真(zhen) \| 的(de) \| 忘(wang) \| 记(ji) \| ha \| ha Pitch: G4 \| A4 \| A#4/Bb4 \| A4 F4 \| C4 \| A#4/Bb4 \| A4 \| F4 \| A#4/Bb4 \| A4 \| A#4/Bb4 \| C5 \| A#4/Bb4 A4 \| A4 G4 F4 Slur: 0 \| 0 \| 0 \| 0 \| 1 \| 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| 1 \| 0 \| 1
Output Pitch Singing Techniques	STR \| STR \| SCO \| BEND \| BEND \| STR \| STR \| STR \| BEND \| STR \| STR \| SCO \| DROP \| MEL
Output Timbre Singing Techniques	REG \| REG \| REG \| BRE \| BRE \| BEL \| BEL \| BEL \| BRE \| BRE \| BRE \| FAL \| BRE \| BRE
Synthesized Audio

Citation

If you use this work, please cite the IEEE/ACM Transactions on Audio, Speech, and Language Processing journal paper.

@ARTICLE{10509739,
  author={Zhao, Junchuan and Chetwin, Low Qi Hong and Wang, Ye},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System}, 
  year={2024},
  volume={},
  number={},
  pages={1-13},
  keywords={Hidden Markov models;Annotations;Timbre;Task analysis;Deep learning;Synthesizers;Controllability;Singing voice synthesis;singing voice synthesis conditioned on singing techniques;singing technique classification;singing technique recommendation;metric;deep learning},
  doi={10.1109/TASLP.2024.3394769}
}

Contact

If you have any questions about the paper, please contact the first author, Junchuan, at junchuan@comp.nus.edu.sg.

License

The singing technique annotation for Opencpop is available to download for non-commercial purposes under the Opencpop license.
This annotation may not be sold, leased, published or distributed to any third party without written permission from the administrator.
The National University of Singapore is not responsible for errors in the annotation's content or any damages resulting from its use. The administrator may update these conditions of use at any time.