SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System
Abstract of the paper
The precise control of singing techniques is of utmost importance in achieving emotionally expressive vocal performances. To bridge the gap between current Singing Voice Synthesis (SVS) systems and human singers, our paper focuses on developing an SVS system that allows for control over singing techniques.
In this paper, we introduce SinTechSVS, a singing technique controllable SVS system composed of a singing technique annotator, a singing technique controllable synthesizer, and a singing technique recommender. Our approach leverages transfer learning for efficient singing technique annotation and adapts the DiffSinger framework with additional style encoders and an attention-based singing technique local score (STLS) module to enhance singing technique controllability.
We also propose a Seq2Seq singing technique recommender for the new task of Singing Technique Recommendation (STR).
Experimental results demonstrate that SinTechSVS significantly improves the quality and expressiveness of synthesized vocal performances, with comparable general synthesis capabilities to state-of-the-art SVS systems and enhanced control over singing techniques, as evidenced by objective and subjective evaluations. To the best of our knowledge, SinTechSVS is the first SVS capable of controlling singing techniques.
Overall Architecture
SinTechSVS consists of three key components: a singing technique annotator (STA), a singing voice synthesizer conditioned on singing techniques (SVS), and a singing technique recommender (STR). The 'OR' symbol in the figure indicates that the SVS takes as input either a user-provided singing technique sequence or the singing technique sequence predicted by the singing technique recommender.
The training process of SinTechSVS consists of three steps, with each step laying the foundation for the next. Modules depicted with full shadows remain trainable (not frozen) throughout the training step, while those with half shadows are first frozen and then unfrozen during training. (Top-left): Training of STA; (Bottom-left): Training of STA; (Bottom-right): Inference of SinTechSVS.
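The 'OR' choice described above can be sketched as a small selection function. This is only an illustrative sketch, not the authors' implementation: the names `choose_techniques` and `always_regular`, the per-word label representation, and the stand-in recommender are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical type: a recommender maps the words of a music score
# to one singing technique label per word.
Recommender = Callable[[Sequence[str]], List[str]]


def choose_techniques(
    score_words: Sequence[str],
    user_techniques: Optional[Sequence[str]] = None,
    recommender: Optional[Recommender] = None,
) -> List[str]:
    """Return the technique sequence that conditions the synthesizer.

    A user-provided sequence takes priority; otherwise the recommender
    predicts one from the score (the 'OR' in the architecture figure).
    """
    if user_techniques is not None:
        if len(user_techniques) != len(score_words):
            raise ValueError("need exactly one technique label per word")
        return list(user_techniques)
    if recommender is None:
        raise ValueError("provide either user_techniques or a recommender")
    return recommender(score_words)


def always_regular(words: Sequence[str]) -> List[str]:
    """Stand-in recommender (assumption): labels every word as regular."""
    return ["REG"] * len(words)
```

For example, `choose_techniques(["ni", "hao", "ma"], recommender=always_regular)` falls back to the recommender, while passing `user_techniques` bypasses it.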
Singing Technique Annotations
In this section, we provide samples of the singing techniques that we annotated on the Opencpop dataset. In sentence-level samples, bolded words are sung in the specific technique.
Singing Voice Samples of Pitch Singing Techniques |
---|
Scooping |
Bend |
Drop |
Melisma |

Singing Voice Samples of Timbre Singing Techniques |
---|
Vocal Fry |
Falsetto |
Breathy |
Belting |
Data Acquisition
To obtain the manual singing technique annotation file for the Opencpop dataset, submit a permission request here. The annotation is for research purposes only.
The request must include the following; otherwise, it will be rejected.
- Name
- Affiliation
- Email Address
- Agreement to the License
For the Opencpop dataset itself: please strictly follow the instructions of Opencpop. We have no right to grant access to Opencpop.
Annotated Data Statistics
- Distribution of the manually annotated portion of the Opencpop dataset. (The singing techniques "whisper" and "hiccup" are removed due to the small number of labels.)
- Distribution of duration of each singing technique.
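Statistics of this kind can be computed with a short script. The sketch below is hedged: the annotation file format is not described here, so the `(technique, start, end)` tuples, the function name `technique_statistics`, the `min_count` threshold, and the sample values are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical annotation record: (technique label, start time, end time) in seconds.
Annotation = Tuple[str, float, float]


def technique_statistics(
    annotations: Iterable[Annotation], min_count: int
) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Count labels and sum durations per technique.

    Techniques with fewer than `min_count` labels are dropped, mirroring
    the removal of rarely annotated techniques (threshold is an assumption).
    """
    records = list(annotations)
    counts = Counter(label for label, _, _ in records)
    durations: Dict[str, float] = defaultdict(float)
    for label, start, end in records:
        durations[label] += end - start
    kept = [label for label, n in counts.items() if n >= min_count]
    return (
        {label: counts[label] for label in kept},
        {label: durations[label] for label in kept},
    )


# Made-up sample data for illustration only.
sample = [
    ("falsetto", 0.0, 0.5),
    ("falsetto", 1.0, 1.75),
    ("whisper", 2.0, 2.25),
]
label_counts, label_durations = technique_statistics(sample, min_count=2)
```

With this sample, "whisper" has a single label and is filtered out, leaving only "falsetto" in both result dictionaries.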
SinTechSVS Samples
Singing Voice Synthesis with Singing Technique Control
In this section, we provide synthesized samples of SinTechSVS conditioned on singing techniques. In Word-level Lyric Sequence, bolded words are sung in the specific technique. Regular/Straight denotes synthesized audio with no singing techniques, serving as a reference for comparison.
SinTechSVS Synthesized Samples Conditioned on Pitch Singing Techniques
Technique | Word-level Lyric Sequence |
---|---|
Scooping | bing dao hua de quan |
Bend | you wu ke nai he |
Drop | xiao huo che bai dong de xuan lv |
Melisma | ni zai shi su li de ming zi bei ren yong le |

SinTechSVS Synthesized Samples Conditioned on Timbre Singing Techniques
Technique | Word-level Lyric Sequence |
---|---|
Vocal Fry | wo wo |
Falsetto | wo hen ni |
Breathy | hen shao ren kan shi |
Belting | ni hao ma |
Singing Voice Synthesis with Singing Technique Recommendation
In this section, we provide synthesized samples of SinTechSVS conditioned on singing techniques recommended from the music score. STan represents SinTechSVS with annotated singing technique labels. SinTechSVS uses the STR to predict the singing techniques used as input.
SinTechSVS Synthesized Samples Conditioned on Recommended Singing Techniques |
---|
Singing Voice Synthesis with Singing Technique Recommendation (Unseen)
This section showcases singing technique recommendations through unseen music score samples and their corresponding synthesized audio demonstrations. For pitch singing techniques, the abbreviations are: (1) straight - STR; (2) scooping - SCO; (3) bend - BEND; (4) drop - DROP; (5) melisma - MEL. For timbre singing techniques, the abbreviations are: (1) regular - REG; (2) vocal fry - FRY; (3) falsetto - FAL; (4) breathy - BRE; (5) belting - BEL. The word-level pitch, lyric, slur, and singing technique are separated by |.
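The abbreviations and the `|` separator can be handled with a few lines of Python. This is a minimal sketch under stated assumptions: the two dictionaries follow the list above, but the helper names (`split_words`, `expand`, `align`), the assumption that each sequence is a single string with one `|`-separated item per word, and the technique labels in the usage example are hypothetical.

```python
from typing import List, Tuple

# Abbreviations as listed in the text above.
PITCH_TECHNIQUES = {
    "STR": "straight",
    "SCO": "scooping",
    "BEND": "bend",
    "DROP": "drop",
    "MEL": "melisma",
}
TIMBRE_TECHNIQUES = {
    "REG": "regular",
    "FRY": "vocal fry",
    "FAL": "falsetto",
    "BRE": "breathy",
    "BEL": "belting",
}


def split_words(sequence: str) -> List[str]:
    """Split a '|'-separated sequence into word-level items."""
    return [item.strip() for item in sequence.split("|")]


def expand(abbreviation: str) -> Tuple[str, str]:
    """Return (category, full technique name) for an abbreviation."""
    if abbreviation in PITCH_TECHNIQUES:
        return "pitch", PITCH_TECHNIQUES[abbreviation]
    if abbreviation in TIMBRE_TECHNIQUES:
        return "timbre", TIMBRE_TECHNIQUES[abbreviation]
    raise KeyError(f"unknown technique abbreviation: {abbreviation!r}")


def align(lyrics: str, techniques: str) -> List[Tuple[str, str, str]]:
    """Pair each word with its technique category and full name.

    Assumes (hypothetically) that both strings hold one item per word.
    """
    words = split_words(lyrics)
    labels = split_words(techniques)
    if len(words) != len(labels):
        raise ValueError("lyric and technique sequences differ in length")
    return [(word, *expand(label)) for word, label in zip(words, labels)]
```

For instance, `align("ni | hao | ma", "REG | REG | BEL")` pairs each word with a timbre technique; the label assignment here is invented for illustration and is not taken from the samples.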
Unseen Sample 1 |
---|
Input Sequence |

Unseen Sample 2 |
---|
Input Sequence |

Unseen Sample 3 |
---|
Input Sequence |
Citation
If you use this work, please cite the IEEE/ACM Transactions on Audio, Speech, and Language Processing journal paper.
@ARTICLE{10509739,
author={Zhao, Junchuan and Chetwin, Low Qi Hong and Wang, Ye},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System},
year={2024},
volume={},
number={},
pages={1-13},
keywords={Hidden Markov models;Annotations;Timbre;Task analysis;Deep learning;Synthesizers;Controllability;Singing voice synthesis;singing voice synthesis conditioned on singing techniques;singing technique classification;singing technique recommendation;metric;deep learning},
doi={10.1109/TASLP.2024.3394769}
}
Contact
If you have any questions about the paper, please contact the first author, Junchuan Zhao, at junchuan@comp.nus.edu.sg.
License
- The singing technique annotation for Opencpop is available to download for non-commercial purposes under the Opencpop license.
- This annotation may not be sold, leased, published or distributed to any third party without written permission from the administrator.
- The National University of Singapore is not responsible for errors in the annotation's content or any damages resulting from its use. The administrator may update these conditions of use at any time.