This slide deck describes experiments using conditional GANs (cGANs) for speech style conversion from habitual to clear speech. Key findings:
1) In a speaker-dependent one-to-one mapping experiment, cGANs improved speech intelligibility over DNN mapping for 2 of 3 speakers, measured by keyword recall accuracy.
2) A speaker-independent many-to-one mapping experiment showed cGANs improved intelligibility for 1 of 3 speakers.
3) A speaker-independent many-to-many mapping experiment showed cGANs improved keyword recall for 2 speakers, but the results were not significant. The modest results were likely due to a small dataset and to not transforming additional acoustic features such as duration.
1. 1/26
Introduction
Experiment: One-to-one mapping
Experiment: Many-to-one mapping
Experiment: Many-to-many mapping
Conclusion
Improving Speech Intelligibility through Speaker-Dependent and Independent Spectral Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
Oregon Health & Science University, University at Buffalo
October 23, 2020
2. 2/26
Background
Approximately 28 × 10⁶ people in the United States have some degree of hearing loss
Speakers naturally adopt a special clear speaking style when talking to:
listeners with hearing loss
normal-hearing listeners in adverse environments
Clear speech features
high degree of articulation
slower speaking rate
more frequent and longer pauses
exact strategy varies from speaker to speaker
Clear speech is more intelligible than habitual speech
14-24% improvement in keyword recall in noise [Kain08]
4. 4/26
Hybridization
Replacing certain acoustic features of habitual speech with those from clear speech causes improved intelligibility
for typical speakers, incorporating [Kain08]:
clear spectrum and duration yielded a 24% improvement
for dysarthric speakers, incorporating [Tjaden14]:
clear energy yielded an 8.7% improvement
clear spectrum yielded an 18% improvement
clear spectrum and duration yielded a 13.4% improvement
5. 5/26
Style Conversion
Style conversion transforms one speaking style into another, here habitual (HAB) to clear (CLR)
Previously, mapping HAB to CLR VAE-12 features improved intelligibility for one speaker from 24% to 46% [Dinh19]
Parameters generated by DNN mapping can be over-smoothed
Generative adversarial nets (GANs) are a promising approach to address this over-smoothing
6. 6/26
Style Conversion
Aim
To further increase intelligibility automatically by style conversion, through the use of conditional GANs (cGANs)
Experiments showing the efficacy of cGANs in terms of speech intelligibility when performing:
1. speaker-dependent one-to-one mapping
2. speaker-independent many-to-one mapping
3. speaker-independent many-to-many mapping
7. GANs
A traditional GAN has two components, a generator (G) and a discriminator (D), that play a min-max game [Goodfellow14]
Figure: GANs
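For reference (not on the slide), the min-max game referred to here is the standard GAN objective from [Goodfellow14]:

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z\sim p_{z}}\big[\log\big(1-D(G(z))\big)\big]$$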
8. 8/26
Proposed cGANs for style conversion
Figure: cGAN framework for style conversion. G maps the current HAB VAE frame, together with left and right context, to a mapped VAE frame; D judges whether (HAB VAE, mapped VAE) and (HAB VAE, CLR VAE) pairs are real.
9. 9/26
Proposed Generator
Figure: Generator architecture. The current HAB VAE-12 frame is concatenated with 60-dimensional left and right context vectors, passed through two dense layers of 512 units, a further concatenation, two more dense layers of 512 units, and a 12-dimensional linear layer; the output is added to the current HAB VAE-12 frame to give the predicted current CLR VAE-12 frame.
No random noise z
The component G learns the differences between HAB VAE-12 and CLR VAE-12
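A minimal PyTorch sketch of this generator. The layer sizes follow the slide; the assumption that the second "Concat" re-attaches the input as a skip connection, and all names and the framework choice, are illustrative:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Residual generator: predicts the HAB-to-CLR difference in VAE-12 space."""

    def __init__(self, vae_dim=12, ctx_dim=60, hidden=512):
        super().__init__()
        in_dim = vae_dim + 2 * ctx_dim  # current frame + left/right context
        self.block1 = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        # Second "Concat" assumed to re-attach the original input (skip connection)
        self.block2 = nn.Sequential(
            nn.Linear(hidden + in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, vae_dim),  # 12-dimensional linear output
        )

    def forward(self, hab, left_ctx, right_ctx):
        x = torch.cat([hab, left_ctx, right_ctx], dim=-1)
        h = self.block1(x)
        delta = self.block2(torch.cat([h, x], dim=-1))
        return hab + delta  # "Add": residual connection to the HAB frame
```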
10. 10/26
Proposed Discriminator
The discriminator has two hidden layers of 256 nodes and an output layer of one node with a sigmoid activation
In addition to the adversarial loss, we use a mean-absolute-difference loss between the generator output and the aligned real data x
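A matching PyTorch sketch, assuming D scores (HAB, CLR-or-mapped) VAE-12 pairs as in the framework figure; the L1 weight lambda_l1 is a hypothetical value not given on the slides:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores whether a (HAB, CLR) VAE-12 pair is real or generated."""

    def __init__(self, vae_dim=12, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * vae_dim, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # single sigmoid output node
        )

    def forward(self, hab, clr):
        return self.net(torch.cat([hab, clr], dim=-1))

# Generator objective: adversarial loss plus mean-absolute-difference (L1) loss
# against the DTW-aligned real data; lambda_l1 is a hypothetical weight.
bce, l1 = nn.BCELoss(), nn.L1Loss()

def generator_loss(d_fake, mapped, target, lambda_l1=10.0):
    adversarial = bce(d_fake, torch.ones_like(d_fake))
    return adversarial + lambda_l1 * l1(mapped, target)
```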
11. 11/26
Tips and Tricks to Train cGANs
a leaky ReLU activation function with a negative slope of 0.2 for both G and D
a dropout layer following each hidden layer of D, with a dropout rate of 0.5
the Adam optimizer:
for D: learning rate 0.0001, momentum β1 0.5, learning rate decay 0.00001
for G: learning rate 0.0002, momentum β1 0.5, learning rate decay 0.00001
weights initialized from a zero-centered Normal distribution with standard deviation 0.02
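The same settings expressed as a PyTorch sketch, reusing the G and D modules sketched earlier. The slides do not name a framework; the per-update "learning rate decay" reads like Keras' decay argument and is approximated here with a scheduler, which is an assumption:

```python
import torch.nn as nn
from torch import optim

# Adam with beta1 = 0.5; lr 1e-4 for D and 2e-4 for G, as on the slide
opt_d = optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_g = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

# "Learning rate decay 0.00001" resembles Keras' per-update decay
# lr_t = lr / (1 + decay * t); one PyTorch approximation:
sched_d = optim.lr_scheduler.LambdaLR(opt_d, lambda t: 1.0 / (1.0 + 1e-5 * t))
sched_g = optim.lr_scheduler.LambdaLR(opt_g, lambda t: 1.0 / (1.0 + 1e-5 * t))

def init_weights(m):
    # Zero-centered Normal initialization with standard deviation 0.02
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        nn.init.zeros_(m.bias)

G.apply(init_weights)
D.apply(init_weights)
```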
12. 12/26
Experiment: One-to-one mapping
Train a speaker-dependent HAB-to-CLR mapping:
Requires parallel data of HAB and CLR speech
Database: a 78-speaker database consisting of:
control speakers (CS, N = 32)
speakers with multiple sclerosis (MS, N = 30)
speakers with Parkinson's disease (PD, N = 16)
Each speaker read 25 Harvard sentences in 2 speaking styles (HAB, CLR)
We selected the three speakers (PD_M6, C_M7, PD_F7) that showed the most benefit from the CLR spectrum
13. 13/26
Method
Figure: cGAN-based style mapping from HAB VAE-12 to CLR VAE-12
We aligned each HAB utterance to its parallel CLR utterance of the same speaker using DTW on 32nd-order log filter-bank features (a minimal DTW sketch follows this list)
Then, we pre-trained the generator mapping HAB VAE-12 to CLR VAE-12 to minimize a mean-squared-error loss
Then, we trained our proposed cGAN structure
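A minimal NumPy sketch of the DTW alignment step, operating on generic (frames × dims) feature matrices; extraction of the 32nd-order log filter-bank features themselves is omitted here:

```python
import numpy as np

def dtw_path(x, y):
    """Minimal DTW between two feature sequences (frames x dims).
    Returns index pairs aligning frames of x to frames of y."""
    nx, ny = len(x), len(y)
    # Pairwise Euclidean distances between all frame pairs
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path
    path, i, j = [], nx, ny
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```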
14. 14/26
Objective Evaluation: Log Spectral Distortion
Mapping   PD_F7   PD_M6   C_M7
DNN       16.8    16.67   16.44
GAN       12.85   12.58   12.67
Table: Average LSD (in dB); lower is better
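For reference, a common definition of log-spectral distortion between reference spectra $S_t$ and mapped spectra $\hat{S}_t$, averaged over $T$ frames and $K$ frequency bins (the slide does not specify the exact variant used):

$$\mathrm{LSD} = \frac{1}{T}\sum_{t=1}^{T}\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(20\log_{10}\frac{|S_t(k)|}{|\hat{S}_t(k)|}\right)^{2}}$$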
15. 15/26
Objective Evaluation: LSD
Figure: LSD (dB) for each of the 25 test sentences (Sentence ID on the x-axis), one panel per speaker (PD_F7, PD_M6, C_M7); GAN vs. DNN
16. 16/26
Objective Evaluation: Variance ratio
Figure: Variance ratio σ²_CLR / σ²_MAP between CLR VAE-12 (CLR) and mapped VAE-12 (MAP) for each VAE-12 component, comparing GAN and DNN, one panel per speaker (PD_F7, PD_M6, C_M7). Smaller is better.
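This metric is a one-liner in NumPy, assuming clr_vae and mapped_vae are (frames × 12) arrays of CLR and mapped VAE-12 features (hypothetical names):

```python
import numpy as np

# Per-component variance ratio; values near 1 mean the mapping preserves the
# target variance, while over-smoothed output drives the ratio above 1.
var_ratio = np.var(clr_vae, axis=0) / np.var(mapped_vae, axis=0)
```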
17. 17/26
Objective Evaluation: Example
Figure: Sentence: Four hours of steady work faced us.
18. 18/26
Subjective Evaluation
Loudness differences were minimized using an RMSA measure
Stimuli were mixed with babble noise at 0 dB SNR (a mixing sketch follows this list)
The test consisted of 25 sentences × 3 speakers × 5 conditions (2 purely vocoded, 1 hybrid, 2 mappings) = 375 unique trials
60 participants on AMT each listened to 25 sentences and typed what they heard
We manually counted the correctly recalled keywords in each sentence
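A sketch of mixing babble noise at a target SNR as described above; this is an illustrative implementation, not the authors' stimulus pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that makes 10*log10(p_speech / (gain**2 * p_noise)) == snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```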
26. 26/26
Conclusion
We applied cGANs to HAB-to-CLR style conversion:
1. In speaker-dependent one-to-one mapping, cGANs outperformed DNNs in terms of keyword recall accuracy, improving intelligibility for two of three speakers
2. In speaker-independent many-to-one mapping, cGANs improved speech intelligibility for one of three speakers
3. In speaker-independent many-to-many mapping, cGANs improved keyword recall accuracy for two speakers, but the results were not significant
The modest results of speaker-independent style conversion are likely due to the small dataset and the fact that we did not attempt to transform additional acoustic features, such as phoneme durations