Improving Speech Intelligibility through Speaker Dependent and Independent Spectral Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
Oregon Health & Science University, University at Buffalo
October 23, 2020
Background
- Approximately 28 × 10⁶ people in the United States have some degree of hearing loss
- Speakers naturally adopt a special clear speaking style when talking to
  - listeners with hearing loss
  - normal-hearing listeners in adverse environments
- Clear speech features:
  - a high degree of articulation
  - a slower speaking rate
  - more frequent and longer pauses
  - the exact strategy varies from speaker to speaker
- Clear speech is more intelligible than habitual speech:
  - 14–24% improvement in keyword recall in noise [Kain08]
Hybridization
Figure: Hybridization Algorithm Flowchart
- Replacing certain acoustic features of habitual speech with those from clear speech improves intelligibility
- For typical speakers, incorporating the clear spectrum and duration yielded a 24% improvement [Kain08]
- For dysarthric speakers [Tjaden14], incorporating the
  - clear energy yielded an 8.7% improvement
  - clear spectrum yielded an 18% improvement
  - clear spectrum and duration yielded a 13.4% improvement
Style Conversion
- Style conversion transforms one speaking style into another (here, habitual into clear)
- Previously, mapping habitual (HAB) VAE-12 features to clear (CLR) VAE-12 features improved intelligibility for one speaker from 24% to 46% [Dinh19]
- Parameters generated by DNN mapping can be over-smoothed
- Generative adversarial networks (GANs) are a promising approach to addressing this over-smoothing
Aim
To further increase intelligibility automatically by style conversion, through the use of conditional GANs (cGANs).
Experiments show the efficacy of cGANs in terms of speech intelligibility when performing:
1. speaker-dependent one-to-one mapping
2. speaker-independent many-to-one mapping
3. speaker-independent many-to-many mapping
GANs
A traditional GAN has two components: a generator (G) and a discriminator (D) that play a min-max game [Goodfellow14]
Figure: GANs
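For reference, the min-max objective that G and D play, as given in [Goodfellow14] (not spelled out on the slide), is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$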
Proposed cGANs for style conversion
(Diagram: the generator G takes the left context, the current HAB VAE, and the right context and produces a mapped VAE; the discriminator D is asked whether the (HAB VAE, mapped VAE) pair and the (HAB VAE, CLR VAE) pair are real pairs.)
Figure: cGAN framework for style conversion
Proposed Generator
(Diagram: the current HAB VAE (12), left context (60), and right context (60) are concatenated and passed through Dense 512 → Dense 512; the result is concatenated again (apparently a skip connection with the input), passed through Dense 512 → Dense 512 → Linear 12, and added to the current HAB VAE to produce the current CLR VAE (12).)
Figure: Generator architecture
- No random noise z is used
- The generator G learns the differences between HAB VAE-12 and CLR VAE-12 (see the sketch below)
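A minimal PyTorch sketch of this architecture; the second Concat is assumed to re-attach the 132-dimensional input as a skip connection, the final Add is read as a residual connection (so G predicts the CLR-minus-HAB difference), and the leaky-ReLU slope follows the training-tips slide:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Residual generator: predicts the CLR-minus-HAB VAE-12 difference."""
    def __init__(self, vae_dim=12, ctx_dim=60, hidden=512):
        super().__init__()
        in_dim = vae_dim + 2 * ctx_dim            # 12 + 60 + 60 = 132
        self.block1 = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        self.block2 = nn.Sequential(
            nn.Linear(hidden + in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, vae_dim),           # Linear 12, no activation
        )

    def forward(self, hab_vae, left_ctx, right_ctx):
        x = torch.cat([hab_vae, left_ctx, right_ctx], dim=-1)
        h = torch.cat([self.block1(x), x], dim=-1)  # assumed skip connection
        return hab_vae + self.block2(h)             # Add: residual output
```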
Proposed Discriminator
- The discriminator has 2 hidden layers of 256 nodes and an output layer of 1 node with a sigmoid function
- In addition to the adversarial loss, we use a mean-absolute-difference loss between the generator output and the aligned real data x (sketched below)
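A sketch under the same assumptions as the generator above; the 24-dimensional input (a concatenated HAB/candidate-CLR VAE-12 pair) and the weight lam on the L1 term are assumptions, while the leaky-ReLU slope and dropout rate are taken from the training-tips slide that follows:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Two hidden layers of 256 units, one sigmoid output unit."""
    def __init__(self, in_dim=24):  # assumed: (HAB VAE-12, candidate CLR VAE-12) pair
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(256, 256), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, hab_vae, candidate_clr_vae):
        return self.net(torch.cat([hab_vae, candidate_clr_vae], dim=-1))

bce, l1 = nn.BCELoss(), nn.L1Loss()

def generator_loss(d_fake, mapped_vae, real_clr_vae, lam=1.0):
    """Adversarial term plus mean-absolute difference to the aligned target."""
    adv = bce(d_fake, torch.ones_like(d_fake))
    return adv + lam * l1(mapped_vae, real_clr_vae)
```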
Tips and Tricks to Train cGANs
- a leaky ReLU activation function with a negative slope of 0.2 for both G and D
- a dropout layer following each hidden layer of D, with a dropout rate of 0.5
- the Adam optimizer with
  - learning rate 0.0001, momentum β₁ = 0.5, and learning-rate decay 0.00001 for D
  - learning rate 0.0002, momentum β₁ = 0.5, and learning-rate decay 0.00001 for G
- weights initialized from a zero-centered Normal distribution with standard deviation 0.02
(see the setup sketch below)
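A sketch of this setup in PyTorch; since torch.optim.Adam has no built-in learning-rate decay parameter, a Keras-style 1/(1 + decay·step) schedule is approximated here with LambdaLR, which is an assumption about what "learning rate decay" means on the slide:

```python
import torch

G, D = Generator(), Discriminator()   # classes sketched above

def init_weights(m):
    """Zero-centered Normal initialization with standard deviation 0.02."""
    if isinstance(m, torch.nn.Linear):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)
        torch.nn.init.zeros_(m.bias)

G.apply(init_weights)
D.apply(init_weights)

opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

decay = 1e-5  # per-step learning-rate decay (assumed Keras-style)
sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lambda step: 1.0 / (1.0 + decay * step))
sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lambda step: 1.0 / (1.0 + decay * step))
```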
Experiment: One-to-one mapping
- Training a speaker-dependent HAB-to-CLR mapping requires parallel HAB and CLR speech
- Database: a 78-speaker database consisting of
  - control speakers (CS, N = 32)
  - speakers with multiple sclerosis (MS, N = 30)
  - speakers with Parkinson's disease (PD, N = 16)
- Each speaker read 25 Harvard sentences in 2 speaking styles (HAB, CLR)
- We selected the three speakers (PDM6, CSM7, PDF7) that showed the most benefit from the CLR spectrum
Method
(Diagram: HAB VAE-12 → style mapping → CLR VAE-12)
Figure: cGANs-based mapping
- We aligned each HAB utterance to its parallel CLR utterance of the same speaker using DTW on 32nd-order log filter-bank features (sketched below)
- We then pre-trained the generator that maps HAB VAE-12 to CLR VAE-12 to minimize a mean-squared-error loss
- Finally, we trained our proposed cGAN structure
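A sketch of the alignment step with librosa; a 32-band log mel filter bank is assumed here as a stand-in for the 32nd-order log filter-bank features:

```python
import librosa
import numpy as np

def log_fbank(wav, sr, n_mels=32):
    """32-band log filter-bank features (mel bank assumed)."""
    S = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return np.log(S + 1e-10)

def align_frames(hab_wav, clr_wav, sr):
    """DTW-align a HAB utterance to its parallel CLR utterance;
    returns paired frame indices into each utterance."""
    X, Y = log_fbank(hab_wav, sr), log_fbank(clr_wav, sr)
    _, wp = librosa.sequence.dtw(X=X, Y=Y, metric='euclidean')
    wp = wp[::-1]             # the warping path is returned end-to-start
    return wp[:, 0], wp[:, 1] # HAB frame indices, CLR frame indices
```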
Objective Evaluation: Log Spectral Distortion
mapping   PD_F7   PD_M6   C_M7
DNN       16.80   16.67   16.44
GAN       12.85   12.58   12.67

Table: Average LSD (in dB)
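LSD is not defined on the slide; a common definition, used here as an assumption, is the RMS difference of the log-magnitude spectra averaged over frames:

```python
import numpy as np

def log_spectral_distortion(S_ref, S_map, eps=1e-10):
    """Average LSD in dB between two magnitude spectrograms
    of shape (n_bins, n_frames)."""
    diff = 20.0 * np.log10((np.abs(S_ref) + eps) / (np.abs(S_map) + eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=0)))
```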
Objective Evaluation: LSD
(Plot: LSD in dB versus sentence ID, one panel per speaker: PD_F7, PD_M6, C_M7; DNN vs. GAN.)
Figure: LSD of 25 test sentences for 3 speakers; GAN vs DNN
Objective Evaluation: Variance ratio
(Plot: variance ratio versus VAE-12 component (1–12), one panel per speaker: PD_F7, PD_M6, C_M7; DNN vs. GAN.)
Figure: Variance ratio σ²_CLR / σ²_MAP between CLR VAE-12 (CLR) and mapped VAE-12 (MAP); GAN vs. DNN. Smaller is better.
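Read as the per-component variance of the reference CLR features over that of the mapped features (an assumption consistent with the caption), the metric is a one-liner:

```python
import numpy as np

def variance_ratio(clr_vae, mapped_vae):
    """Per-component sigma^2_CLR / sigma^2_MAP for
    feature arrays of shape (n_frames, 12)."""
    return np.var(clr_vae, axis=0) / np.var(mapped_vae, axis=0)
```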
Objective Evaluation: Example
Figure: Sentence: Four hours of steady work faced us.
Subjective Evaluation
- Loudness differences were minimized using an RMS amplitude (RMSA) measure
- Stimuli were mixed with babble noise at 0 dB SNR (sketched below)
- The test consists of 25 sentences × 3 speakers × 5 conditions (2 purely vocoded, 1 hybrid, 2 mappings) = 375 unique trials
- 60 participants on Amazon Mechanical Turk (AMT) each listened to 25 sentences and transcribed them
- We manually counted the correctly recalled keywords in each sentence
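A sketch of the 0 dB SNR mixing step, assuming RMS-based gain scaling:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale noise so the speech-to-noise RMS ratio equals snr_db, then mix."""
    noise = noise[: len(speech)]
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    gain = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    return speech + gain * noise
```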
Subjective Evaluation
(Bar chart: average keyword accuracy (0–100%) per condition: vocoded HAB, DNN, GAN, hybrid, vocoded CLR; bars for speakers CSM7, PDF7, PDM6.)
Figure: Keyword recall accuracy
Experiment: Many-to-one mapping
(Diagram: HAB VAE-12 from speakers SPK-1 … SPK-N → mapping → the best speaker's CLR VAE-12.)
Figure: cGANs-based mapping
The mapping converts HAB speech from many speakers to the CLR speech of a single target speaker.
Subjective Evaluation
- Loudness differences were minimized using an RMS amplitude (RMSA) measure
- Stimuli were mixed with babble noise at 0 dB SNR
- The test consists of 25 sentences × 3 source speakers (CSM7, PDF7, PDM6) × 3 conditions (vocoded HAB, cGAN mapping, hybrid) + 25 sentences × 2 target speakers (CSM10, CSF15) × 1 condition (vocoded CLR) = 275 unique trials
- 44 participants on AMT
Subjective Evaluation
(Bar chart: average keyword accuracy (0–100%) per condition: vocoded HAB, GAN, hybrid, vocoded CLR; bars for speakers CSM7, PDF7, PDM6.)
Figure: Keyword recall accuracy
Experiment: Many-to-many mapping
(Diagram: for each speaker SPK-1 … SPK-N, that speaker's HAB VAE-12 → style mapping → the same speaker's CLR VAE-12.)
Figure: cGANs-based mapping
The mapping learns the style differences while preserving speaker identities.
Subjective Evaluation
- The test consists of 25 sentences × 3 speakers × 4 conditions (vocoded HAB, GAN, hybrid, vocoded CLR) = 300 unique trials
- 24 listeners participated
- Loudness differences were minimized using an RMS amplitude (RMSA) measure
- Stimuli were mixed with babble noise at 0 dB SNR
Subjective Evaluation
condition     CSM7   PDF7   PDM6
vocoded HAB   36.8   10.0   28.8
GAN           39.6   15.6   26.8
hybrid        62.0   22.8   57.6
vocoded CLR   66.8   22.4   48.0

Table: Average keyword accuracy (%)
Conclusion
We applied cGANs to HAB-to-CLR style conversion:
1. In speaker-dependent one-to-one mapping, cGANs outperform DNNs in terms of keyword recall accuracy; cGANs improved the intelligibility of two of three speakers
2. In speaker-independent many-to-one mapping, cGANs improved the speech intelligibility of one of three speakers
3. In speaker-independent many-to-many mapping, cGANs improved the keyword recall accuracy of two speakers, but the results were not significant
The modest results of speaker-independent style conversion are due to the small dataset and the fact that we did not attempt to transform additional acoustic features, such as phoneme durations.