Abstract: Learning speech representations that generalize to unseen samples from different domains remains an increasingly important challenge. Although contrastive learning is a prominent class of representation learning approaches, state-of-the-art (SOTA) contrastive learning methods have been found to have limited ability to learn representations of unseen, out-of-domain speech. This paper presents SynCLR, a synthesis framework for contrastive learning of speech representations that generalize over unseen domains. Specifically, instead of relying on data augmentation, SynCLR employs data synthesis for multi-view generation. To ensure a highly varied conditional speech distribution in view generation, we design a novel diffusion-based speech synthesizer. We also propose a new contrastive loss that constructs multiple embedding spaces, each of which preserves view-sensitive information, to reduce domain reliance and improve disentanglement. In our experiments, SynCLR outperformed the SOTA contrastive learning methods, with a 17.2% relative reduction in EER for speaker verification on an unseen speech corpus and a 50.8% relative reduction in FID in a challenging speech-to-image translation task given out-of-domain test speech.
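The relative reductions quoted above (for EER and FID, both lower-is-better metrics) follow the usual definition, (baseline - improved) / baseline. A minimal sketch of that arithmetic is given below; the function name and the numbers are hypothetical placeholders for illustration and are not values from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical metric values, NOT results reported in the paper.
example = relative_reduction(baseline=10.0, improved=8.0)
print(f"relative reduction: {example:.1f}%")
```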
Paper: https://openreview.net/forum?id=S-sYYe0P0Hd/.
Demo Page: https://synclr.github.io/.
1. Diffusion-based Multi-view Data Synthesis
Note: "+" denotes FastSpeech 2 combined with the corresponding method. All diffusion-based neural vocoders generate samples within 6 reverse steps.
Text | and having, quote, somewhat bushy, end quote, hair. | since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President. | Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition. |
---|---|---|---|
+HiFi-GAN | | | |
+WaveGrad | | | |
+DiffWave | | | |
+SynGrad | | | |
2. View generation
A. Multi-view in speech-to-image translation
Multi-view | the medium sized bird has a dark grey color, a black downward curved beak, and long wings. | this large bird has a bright orange bill, a white colored belly, and white eyebrows and cheek patches. |
---|---|---|
Target image | | |
Base | ||
Text | ||
Prosody | ||
Speaker |
B. Multi-view in speaker verification
Multi-view | That will not signify; I never mind dirt." | To the understanding or to the senses? | Thus spake Zarathustra. | The congregation rose. |
---|---|---|---|---|
Speaker ID | 14 | 296 | 1392 | 2269 |
Base | ||||
Text | ||||
Prosody |
3. Ablation Study
Text | and having, quote, somewhat bushy, end quote, hair. | since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President. |
---|---|---|
GT | ||
w/o LVC | ||
w/o NP | ||
Continuous level 6 steps | ||
Continuous level 1000 steps | ||
Discrete index 6 steps | ||
Discrete index 1000 steps |