This slide deck describes experiments using conditional GANs (cGANs) for speech style conversion from habitual to clear speech. Key findings:
1) In a speaker-dependent one-to-one mapping experiment, cGANs improved speech intelligibility over DNN mapping for 2 of 3 speakers, measured by keyword recall accuracy.
2) A speaker-independent many-to-one mapping experiment showed cGANs improved intelligibility for 1 of 3 speakers.
3) A speaker-independent many-to-many mapping experiment showed cGANs improved keyword recall for 2 speakers, but the results were not significant. The modest results were likely due to a small dataset and to not transforming additional acoustic features such as duration.
1. 1/26
Introduction
Experiment: One-to-one mapping
Experiment: Many-to-one mapping
Experiment: Many-to-many mapping
Conclusion
Improving Speech Intelligibility through Speaker-Dependent and Independent Spectral Style Conversion
Tuan Dinh, Alexander Kain, Kris Tjaden
Oregon Health & Science University, University at Buffalo
October 23, 2020
2. 2/26
Background
Approximately 28 × 10⁶ people in the United States have some degree of hearing loss
Speakers naturally adopt a special clear speaking style when talking to:
listeners with hearing loss
normal-hearing listeners in adverse environments
Clear speech features
high degree of articulation
slower speaking rate
more frequent and longer pauses
exact strategy varies from speaker to speaker
Clear speech is more intelligible than habitual speech
14-24% improvement in keyword recall in noise [Kain08]
4. 4/26
Hybridization
Replacing certain acoustic features of habitual speech with those from clear speech causes improved intelligibility
for typical speakers, incorporating [Kain08]:
clear spectrum and duration yielded a 24% improvement
for dysarthric speakers, incorporating [Tjaden14]:
clear energy yielded an 8.7% improvement
clear spectrum yielded an 18% improvement
clear spectrum and duration yielded a 13.4% improvement
5. 5/26
Style Conversion
Style conversion transforms one speaking style into another, here habitual (HAB) to clear (CLR)
Previously, mapping HAB to CLR VAE-12 features improved intelligibility for one speaker from 24% to 46% [Dinh19]
Parameters generated by DNN mapping can be over-smoothed
Generative adversarial nets (GANs) are a promising approach to address this over-smoothing
6. 6/26
Style Conversion
Aim
To further increase intelligibility automatically by style conversion, through the use of conditional GANs (cGANs)
Experiments showing the efficacy of cGANs in terms of speech intelligibility when performing:
1. speaker-dependent one-to-one mapping
2. speaker-independent many-to-one mapping
3. speaker-independent many-to-many mapping
7. GANs
A traditional GAN has two components, a generator (G) and a discriminator (D), that play a min-max game [Goodfellow14]
Figure: GANs
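For reference (not on the slide), the min-max game referred to here is the standard GAN objective from [Goodfellow14]:

$$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{\text{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z\sim p_{z}}\big[\log\big(1-D(G(z))\big)\big]$$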
8. 8/26
Proposed cGANs for style conversion
Figure: cGAN framework for style conversion. G maps the current HAB VAE frame, together with left and right context, to a mapped VAE frame; D judges whether (HAB VAE, mapped VAE) and (HAB VAE, CLR VAE) pairs are real.
9. 9/26
Proposed Generator
Figure: Generator architecture. The current HAB VAE-12 frame is concatenated with 60-dimensional left and right context vectors, passed through two dense layers of 512 units, a further concatenation, two more dense layers of 512 units, and a 12-dimensional linear layer; the output is added to the current HAB VAE-12 frame to give the predicted current CLR VAE-12 frame.
No random noise z
The component G learns the differences between HAB VAE-12 and CLR VAE-12
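A minimal PyTorch sketch of this generator. The layer sizes follow the slide; the assumption that the second "Concat" re-attaches the input as a skip connection, and all names and the framework choice, are illustrative:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Residual generator: predicts the HAB-to-CLR difference in VAE-12 space."""

    def __init__(self, vae_dim=12, ctx_dim=60, hidden=512):
        super().__init__()
        in_dim = vae_dim + 2 * ctx_dim  # current frame + left/right context
        self.block1 = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        )
        # Second "Concat" assumed to re-attach the original input (skip connection)
        self.block2 = nn.Sequential(
            nn.Linear(hidden + in_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, vae_dim),  # 12-dimensional linear output
        )

    def forward(self, hab, left_ctx, right_ctx):
        x = torch.cat([hab, left_ctx, right_ctx], dim=-1)
        h = self.block1(x)
        delta = self.block2(torch.cat([h, x], dim=-1))
        return hab + delta  # "Add": residual connection to the HAB frame
```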
10. 10/26
Proposed Discriminator
The discriminator has two hidden layers of 256 nodes and an output layer of one node with a sigmoid activation
In addition to the adversarial loss, we use a mean-absolute-difference loss between the generator output and the aligned real data x
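A matching PyTorch sketch, assuming D scores (HAB, CLR-or-mapped) VAE-12 pairs as in the framework figure; the L1 weight lambda_l1 is a hypothetical value not given on the slides:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores whether a (HAB, CLR) VAE-12 pair is real or generated."""

    def __init__(self, vae_dim=12, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * vae_dim, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # single sigmoid output node
        )

    def forward(self, hab, clr):
        return self.net(torch.cat([hab, clr], dim=-1))

# Generator objective: adversarial loss plus mean-absolute-difference (L1) loss
# against the DTW-aligned real data; lambda_l1 is a hypothetical weight.
bce, l1 = nn.BCELoss(), nn.L1Loss()

def generator_loss(d_fake, mapped, target, lambda_l1=10.0):
    adversarial = bce(d_fake, torch.ones_like(d_fake))
    return adversarial + lambda_l1 * l1(mapped, target)
```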
11. 11/26
Tips and Tricks to Train cGANs
a leaky ReLU activation function with a negative slope of 0.2 for both G and D
a dropout layer following each hidden layer of D, with a dropout rate of 0.5
the Adam optimizer:
for D: learning rate 0.0001, momentum β1 0.5, learning rate decay 0.00001
for G: learning rate 0.0002, momentum β1 0.5, learning rate decay 0.00001
weights initialized from a zero-centered Normal distribution with standard deviation 0.02
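The same settings expressed as a PyTorch sketch, reusing the G and D modules sketched earlier. The slides do not name a framework; the per-update "learning rate decay" reads like Keras' decay argument and is approximated here with a scheduler, which is an assumption:

```python
import torch.nn as nn
from torch import optim

# Adam with beta1 = 0.5; lr 1e-4 for D and 2e-4 for G, as on the slide
opt_d = optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_g = optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

# "Learning rate decay 0.00001" resembles Keras' per-update decay
# lr_t = lr / (1 + decay * t); one PyTorch approximation:
sched_d = optim.lr_scheduler.LambdaLR(opt_d, lambda t: 1.0 / (1.0 + 1e-5 * t))
sched_g = optim.lr_scheduler.LambdaLR(opt_g, lambda t: 1.0 / (1.0 + 1e-5 * t))

def init_weights(m):
    # Zero-centered Normal initialization with standard deviation 0.02
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        nn.init.zeros_(m.bias)

G.apply(init_weights)
D.apply(init_weights)
```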
12. 12/26
Experiment: One-to-one mapping
Train a speaker-dependent HAB-to-CLR mapping:
Requires parallel data of HAB and CLR speech
Database: a 78-speaker database consisting of:
control speakers (CS, N = 32)
speakers with multiple sclerosis (MS, N = 30)
speakers with Parkinson's disease (PD, N = 16)
Each speaker read 25 Harvard sentences in 2 speaking styles (HAB, CLR)
We selected the three speakers (PD_M6, C_M7, PD_F7) that showed the most benefit from the CLR spectrum
13. 13/26
Method
Figure: cGAN-based style mapping from HAB VAE-12 to CLR VAE-12
We aligned each HAB utterance to its parallel CLR utterance of the same speaker using DTW on 32nd-order log filter-bank features (a minimal DTW sketch follows this list)
Then, we pre-trained the generator mapping HAB VAE-12 to CLR VAE-12 to minimize a mean-squared-error loss
Then, we trained our proposed cGAN structure
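A minimal NumPy sketch of the DTW alignment step, operating on generic (frames × dims) feature matrices; extraction of the 32nd-order log filter-bank features themselves is omitted here:

```python
import numpy as np

def dtw_path(x, y):
    """Minimal DTW between two feature sequences (frames x dims).
    Returns index pairs aligning frames of x to frames of y."""
    nx, ny = len(x), len(y)
    # Pairwise Euclidean distances between all frame pairs
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path
    path, i, j = [], nx, ny
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```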
14. 14/26
Objective Evaluation: Log Spectral Distortion
Mapping   PD_F7   PD_M6   C_M7
DNN       16.8    16.67   16.44
GAN       12.85   12.58   12.67
Table: Average LSD (in dB); lower is better
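For reference, a common definition of log-spectral distortion between reference spectra $S_t$ and mapped spectra $\hat{S}_t$, averaged over $T$ frames and $K$ frequency bins (the slide does not specify the exact variant used):

$$\mathrm{LSD} = \frac{1}{T}\sum_{t=1}^{T}\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(20\log_{10}\frac{|S_t(k)|}{|\hat{S}_t(k)|}\right)^{2}}$$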
15. 15/26
Objective Evaluation: LSD
Figure: LSD (dB) for each of the 25 test sentences (Sentence ID on the x-axis), one panel per speaker (PD_F7, PD_M6, C_M7); GAN vs. DNN
16. 16/26
Objective Evaluation: Variance ratio
Figure: Variance ratio σ²_CLR / σ²_MAP between CLR VAE-12 (CLR) and mapped VAE-12 (MAP) for each VAE-12 component, comparing GAN and DNN, one panel per speaker (PD_F7, PD_M6, C_M7). Smaller is better.
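This metric is a one-liner in NumPy, assuming clr_vae and mapped_vae are (frames × 12) arrays of CLR and mapped VAE-12 features (hypothetical names):

```python
import numpy as np

# Per-component variance ratio; values near 1 mean the mapping preserves the
# target variance, while over-smoothed output drives the ratio above 1.
var_ratio = np.var(clr_vae, axis=0) / np.var(mapped_vae, axis=0)
```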
17. 17/26
Objective Evaluation: Example
Figure: Sentence: Four hours of steady work faced us.
18. 18/26
Subjective Evaluation
Loudness differences were minimized using an RMSA measure
Stimuli were mixed with babble noise at 0 dB SNR (a mixing sketch follows this list)
The test consisted of 25 sentences × 3 speakers × 5 conditions (2 purely vocoded, 1 hybrid, 2 mappings) = 375 unique trials
60 participants on AMT each listened to 25 sentences and typed what they heard
We manually counted the correctly recalled keywords in each sentence
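A sketch of mixing babble noise at a target SNR as described above; this is an illustrative implementation, not the authors' stimulus pipeline:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that makes 10*log10(p_speech / (gain**2 * p_noise)) == snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```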
26. 26/26
Conclusion
We applied cGANs to HAB-to-CLR style conversion:
1. In speaker-dependent one-to-one mapping, cGANs outperformed DNNs in terms of keyword recall accuracy, improving intelligibility for two of three speakers
2. In speaker-independent many-to-one mapping, cGANs improved speech intelligibility for one of three speakers
3. In speaker-independent many-to-many mapping, cGANs improved keyword recall accuracy for two speakers, but the results were not significant
The modest results of speaker-independent style conversion are likely due to the small dataset and the fact that we did not attempt to transform additional acoustic features, such as phoneme durations