SynCLR: A Synthesis Framework for Contrastive Learning of Out-of-Domain Speech Representations

Abstract: Learning speech representations that generalize to unseen samples from different domains is a challenge of ever-increasing importance. Although contrastive learning is a prominent class of representation learning approaches, state-of-the-art (SOTA) contrastive learning methods have shown limited ability to learn unseen out-of-domain speech representations. This paper presents SynCLR, a synthesis framework for contrastive learning of speech representations that generalize over unseen domains. Specifically, instead of using data augmentation, SynCLR employs data synthesis for multi-view generation. To ensure a highly varied conditional speech distribution during view generation, we design a novel diffusion-based speech synthesizer. We also propose a new contrastive loss that constructs multiple embedding spaces, each preserving view-sensitive information, to reduce domain reliance for better disentanglement. Our experiments showed that SynCLR outperformed SOTA contrastive learning methods, with a 17.2% relative reduction in EER for speaker verification on an unseen speech corpus, and a considerable 50.8% relative reduction in FID on a challenging speech-to-image translation task with out-of-domain test speech.

Paper: https://openreview.net/forum?id=S-sYYe0P0Hd/.

Demo Page: https://synclr.github.io/.
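The abstract describes a contrastive loss that builds multiple embedding spaces, one per view factor, each preserving view-sensitive information. As a rough illustration of that idea (not the paper's exact formulation), the sketch below averages a standard InfoNCE loss over per-view embedding spaces; all names, shapes, and the toy data are assumptions.

```python
# Hedged sketch: a multi-view InfoNCE-style contrastive objective, assuming one
# embedding space per view factor (e.g. text / prosody / speaker). Illustrative
# only; the paper's actual loss may differ.
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def multi_view_loss(view_embeddings):
    """Average InfoNCE over per-view embedding spaces. `view_embeddings` maps a
    view name to an (anchor, positive) pair of (N, D) arrays already projected
    into that view's own space."""
    losses = [info_nce(a, p) for a, p in view_embeddings.values()]
    return float(np.mean(losses))

# Toy data: positives are near-copies of the anchors in each view's space.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
views = {
    "text":    (z, z + 0.01 * rng.normal(size=z.shape)),
    "prosody": (z, z + 0.01 * rng.normal(size=z.shape)),
}
loss = multi_view_loss(views)
```

With well-aligned anchor/positive pairs the loss is close to zero; mismatched pairs drive it up, which is what makes the objective usable as a training signal.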

1. Diffusion-based Multi-view Data Synthesis

Note: "+" denotes FastSpeech 2 combined with the corresponding vocoder. All diffusion-based neural vocoders generate samples within 6 reverse steps.

Text:
- "and having, quote, somewhat bushy, end quote, hair."
- "since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President."
- "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition."

Systems (audio samples): +HiFi-GAN, +WaveGrad, +DiffWave, +SynGrad
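The note above states that all diffusion-based vocoders here sample within 6 reverse steps. The sketch below shows what such a short ancestral reverse pass looks like in DDPM-style form; the denoiser is a stand-in (in a real vocoder it would be a neural network conditioned on a mel-spectrogram), and the schedule and all names are assumptions, not SynGrad's actual configuration.

```python
# Hedged sketch: a 6-step DDPM-style reverse (sampling) loop, illustrative of
# few-step diffusion vocoding. Not the actual SynGrad sampler.
import numpy as np

T = 6                                    # number of reverse steps
betas = np.linspace(1e-4, 0.5, T)        # short, aggressive noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(x_t, t):
    """Stand-in for the learned noise predictor eps_theta(x_t, t).
    It predicts zero noise here just so the loop is runnable."""
    return np.zeros_like(x_t)

def reverse_sample(length, rng):
    """Ancestral sampling: start from Gaussian noise, apply T reverse steps."""
    x = rng.normal(size=length)
    for t in reversed(range(T)):
        eps = toy_denoiser(x, t)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=length) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# One second of audio at 16 kHz (from the toy denoiser, this is just noise).
waveform = reverse_sample(16000, np.random.default_rng(0))
```

Collapsing 1000 training steps down to a handful of inference steps is what makes diffusion vocoders fast enough for practical synthesis.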

2. View generation

A. Multi-view in speech-to-image translation

Multi-view text:
- "the medium sized bird has a dark grey color, a black downward curved beak, and long wings."
- "this large bird has a bright orange bill, a white colored belly, and white eyebrows and cheek patches."

[Target image and generated views: Base, Text, Prosody, Speaker]

B. Multi-view in speaker verification


Multi-view transcripts:
- "That will not signify; I never mind dirt."
- "To the understanding or to the senses?"
- "Thus spake Zarathustra."
- "The congregation rose."

Speaker IDs: 14, 296, 1392, 2269
Views: Base, Text, Prosody

3. Ablation Study

Text:
- "and having, quote, somewhat bushy, end quote, hair."
- "since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President."

Conditions: GT; w/o LVC; w/o NP; continuous level (6 steps); continuous level (1000 steps); discrete index (6 steps); discrete index (1000 steps)
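The ablation contrasts conditioning the diffusion model on a continuous noise level versus a discrete step index, at 6 and 1000 steps. As a rough illustration of that distinction, the sketch below encodes either a discrete step index or a continuous noise level with a sinusoidal embedding; the embedding dimension, frequency scale, and function names are assumptions, not the paper's implementation.

```python
# Hedged sketch: two diffusion-step conditioning schemes, illustrative only.
import numpy as np

DIM = 128

def sinusoidal_embedding(value, dim=DIM):
    """Sinusoidal encoding of a scalar conditioning value."""
    half = dim // 2
    freqs = 10.0 ** (np.arange(half) * 4.0 / (half - 1))
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def discrete_index_condition(t):
    """Condition on the integer reverse-step index t in {0, ..., T-1}; the
    model is tied to the schedule length it was trained with."""
    return sinusoidal_embedding(float(t))

def continuous_level_condition(alpha_bar_t):
    """Condition on the continuous noise level sqrt(alpha_bar_t) in (0, 1],
    letting one model run with schedules of any length (e.g. 6 or 1000 steps)."""
    return sinusoidal_embedding(float(np.sqrt(alpha_bar_t)))

e_discrete = discrete_index_condition(3)
e_continuous = continuous_level_condition(0.42)
```

Continuous-level conditioning is the usual reason a model trained with a long schedule can still sample well in only 6 steps, which is the trade-off this ablation probes.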