ICLR 2020 Reviews on the Speech Domain
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
21 May 2020
ICLR 2020 (2020.04.26 ~ 2020.04.30)
Content
• DDSP: Differentiable Digital Signal Processing (Spotlight)
  • Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
• High Fidelity Speech Synthesis with Adversarial Networks (Talk)
  • Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
DDSP: Differentiable Digital Signal
Processing (Spotlight)
Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
Google Research, Brain Team
Overview
• Digital Signal Processing (DSP) is one of the backbones of modern society, integral to
  • telecommunications, transportation, audio, and many medical technologies
• Key idea
  • Use simple, interpretable DSP elements to create complex, realistic signals by precisely controlling their many parameters
  • E.g., a collection of linear filters and sinusoidal oscillators (DSP elements) can create the sound of a realistic violin
• In this paper,
  • use a neural network to convert a user's input into complex DSP controls that can produce more realistic signals
Challenges of "pure" neural audio synthesis
Audio is highly periodic
Ears are sensitive to discontinuities
DSP Components
• Oscillators (Harmonic Sinusoids) → Differentiable Additive Synthesizer (see the sketch below)
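The oscillator bank can be written as one short differentiable function. Below is a minimal sketch of a harmonic additive synthesizer in PyTorch, in the spirit of DDSP; the function name and argument layout are illustrative assumptions, not the authors' implementation.

import math
import torch

def harmonic_synth(f0, amplitudes, sample_rate=16000):
    """Sum of sinusoids at integer multiples of f0 (differentiable).

    f0:         [batch, time] fundamental frequency in Hz, per sample
    amplitudes: [batch, time, n_harmonics] per-harmonic amplitudes
    returns:    [batch, time] audio
    """
    n_harmonics = amplitudes.shape[-1]
    ratios = torch.arange(1, n_harmonics + 1, dtype=f0.dtype, device=f0.device)
    freqs = f0.unsqueeze(-1) * ratios              # [batch, time, n_harmonics]
    # Integrate instantaneous frequency over time to obtain phase.
    phases = 2.0 * math.pi * torch.cumsum(freqs / sample_rate, dim=1)
    # Mask harmonics above the Nyquist frequency to avoid aliasing.
    amplitudes = amplitudes * (freqs < sample_rate / 2).to(f0.dtype)
    return (amplitudes * torch.sin(phases)).sum(dim=-1)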
Filters (LTV-FIR)
• Linear Time-Varying FIR filter
• Frame-wise magnitude response (freq, t) → windowed impulse response (t), applied per frame (see the sketch below)
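As a rough sketch of the LTV-FIR idea: the network predicts a per-frame magnitude response, which is turned into a windowed, causal impulse response via an inverse rFFT. The helper name, IR size, and the Hann window choice are assumptions for illustration, not the paper's exact code.

import torch

def magnitudes_to_impulse_response(magnitudes, ir_size=128):
    """magnitudes: [batch, n_frames, n_freqs] -> IRs [batch, n_frames, ir_size]"""
    # Zero-phase filter: inverse rFFT of a purely real magnitude response.
    complex_mags = torch.complex(magnitudes, torch.zeros_like(magnitudes))
    ir = torch.fft.irfft(complex_mags, n=ir_size, dim=-1)
    # Shift the zero-phase IR to causal form and taper it with a Hann window.
    ir = torch.roll(ir, shifts=ir_size // 2, dims=-1)
    return ir * torch.hann_window(ir_size, device=ir.device, dtype=ir.dtype)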
Room Reverberation (Reverb)
• Very long 1-D convolution (filter size = 64k), sketched below
• Learned for a given dataset
• Can also be generated by other DDSP components
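A 64k-tap convolution is cheap in the frequency domain. Below is a minimal sketch of a reverb module with a learned impulse response, applied by FFT multiplication; this is an illustration under assumed names and initialization, not the DDSP library's API.

import torch

class LearnedReverb(torch.nn.Module):
    """Reverb as a very long learned FIR filter, applied via FFT convolution."""

    def __init__(self, ir_size=64_000):
        super().__init__()
        self.ir = torch.nn.Parameter(torch.randn(ir_size) * 1e-3)

    def forward(self, audio):                         # audio: [batch, time]
        n = audio.shape[-1] + self.ir.shape[-1] - 1   # full convolution length
        wet = torch.fft.irfft(torch.fft.rfft(audio, n=n)
                              * torch.fft.rfft(self.ir, n=n), n=n)
        return wet[..., : audio.shape[-1]]            # trim to input length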
Room Reverberation (Reverb)
Sabine's reverberation-time formula:

RT_60 = (24 ln 10 / c_20) · V / (S a) = 0.161 V / (S a)

where
• RT_60 = reverberation time, in seconds
• V = volume of the room, in m³ (the 0.161 constant is for metric units; it is 0.049 for cubic/square feet)
• S = surface area, in m²
• a = average absorption coefficient
• c_20 = speed of sound at 20 °C
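A quick worked example of the formula above, for a hypothetical 5 m × 4 m × 3 m room:

V = 5.0 * 4.0 * 3.0             # room volume in m^3 (= 60)
S = 2 * (5*4 + 5*3 + 4*3)       # total surface area in m^2 (= 94)
a = 0.3                         # average absorption coefficient
rt60 = 0.161 * V / (S * a)
print(f"RT60 = {rt60:.2f} s")   # RT60 = 0.34 s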
Overview up to this point
Additive Synthesizer Parameters
Noise Synthesizer and Reverb Parameters
Proposed model
Proposed Model (Encoder)
Proposed Model (Decoder)
Result
Timbre Transfer (singing voice to violin)
Extrapolation
Dereverberation and Acoustic Transfer
High Fidelity Speech Synthesis with
Adversarial Networks (Talk)
Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
DeepMind
Overview
• Neural Text-to-Speech (TTS)
  • Acoustic model: receives text and predicts an intermediate representation such as a mel-spectrogram (e.g., Tacotron, Transformer-TTS, MelNet)
  • Vocoder: converts the predicted mel-spectrogram into audible raw audio (e.g., WaveNet, WaveRNN, WaveGlow, MelGAN)
• Evaluation method for TTS
  • Mean Opinion Score (MOS), which is rated by humans
• Contribution
  • They use linguistic features to train a GAN generator that produces raw audio directly from text
  • They also present four quantitative metrics that can be used instead of MOS
GAN-TTS
Generator
• 567 input features per 5 ms window
• The generator gradually upsamples the representation
• Residual GBlocks use dilated convolutions and batch norm conditioned on the noise
• 30 layers in total
Generator block
• GBlocks are 4-layer residual blocks with 2 skip connections, upsampling, and dilated convolutions (see the sketch below)
• Shapes: inputs [batch, 567, time] (features) and [batch, latent_dim] (noise) → output [batch, 1, time] (waveform)
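A simplified GBlock sketch following the description above: two residual stages of two convolutions each (four conv layers, two skip connections), with upsampling and dilations 1, 2, 4, 8. The paper conditions batch norm on the latent z; that is reduced to plain BatchNorm1d here for brevity, so this is an illustration rather than DeepMind's code.

import torch
import torch.nn as nn

class GBlock(nn.Module):
    def __init__(self, in_ch, out_ch, upsample=2):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=upsample)
        # Four conv layers with dilations 1, 2, 4, 8 (padding keeps length).
        self.convs = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, 3, padding=1, dilation=1),
            nn.Conv1d(out_ch, out_ch, 3, padding=2, dilation=2),
            nn.Conv1d(out_ch, out_ch, 3, padding=4, dilation=4),
            nn.Conv1d(out_ch, out_ch, 3, padding=8, dilation=8),
        ])
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(c) for c in [in_ch, out_ch, out_ch, out_ch]])
        self.residual = nn.Conv1d(in_ch, out_ch, 1)

    def forward(self, x):
        # First residual stage: norm -> ReLU -> upsample -> conv, then conv.
        h = self.convs[0](self.upsample(torch.relu(self.norms[0](x))))
        h = self.convs[1](torch.relu(self.norms[1](h)))
        h = h + self.residual(self.upsample(x))   # skip connection 1
        # Second residual stage with larger dilations (skip connection 2).
        h2 = self.convs[2](torch.relu(self.norms[2](h)))
        h2 = self.convs[3](torch.relu(self.norms[3](h2)))
        return h + h2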
Dilated convolution
Compare (Dilated Conv. vs Upsampled Conv.)
Discriminator
• Inputs: [batch, 1, time] (audio) and [batch, 567, time] (conditioning)
• Downsampling by reshape: a window of length ω becomes [ω/k, k], where k is the downsample factor (e.g. k = 8 for input window size 1920); see the sketch below
• Output: [batch, 1]
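A sketch of the reshape-based downsampling used by the random window discriminators (the helper name is made up): a window of length ω becomes ω/k time steps, with the factor k folded into the channel dimension.

import torch

def downsample_window(window, k):
    """[batch, channels, w] -> [batch, channels * k, w // k] by reshaping."""
    b, c, w = window.shape
    assert w % k == 0, "window length must be divisible by k"
    return (window.reshape(b, c, w // k, k)      # split time into (w//k, k)
                  .permute(0, 1, 3, 2)           # move k next to channels
                  .reshape(b, c * k, w // k))

x = torch.randn(8, 1, 1920)                      # random 1920-sample windows
print(downsample_window(x, 8).shape)             # torch.Size([8, 8, 240])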
Discriminator
• [batch, 567, time] → [batch, 1]
• Discriminator Block
Pseudo-code of GAN-TTS
Algorithm 1 inputs (a skeletal training loop is sketched below):
• waveform length
• waveform-to-conditioning frequency ratio
• base window size
• number of training steps
• batch size
• discriminator and generator learning rates
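A skeletal version of the training loop behind Algorithm 1, using the hinge loss the paper employs; the optimizer choice, learning rates, latent size, and data-loader interface are assumptions for illustration.

import torch

def train_gan_tts(G, D, loader, n_steps, lr_g=1e-4, lr_d=2e-4, latent_dim=128):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr_g)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr_d)
    for step, (features, real_audio) in zip(range(n_steps), loader):
        z = torch.randn(real_audio.shape[0], latent_dim)   # latent noise
        fake_audio = G(features, z)

        # Discriminator step: hinge loss on real vs. generated audio.
        loss_d = (torch.relu(1.0 - D(real_audio, features)).mean()
                  + torch.relu(1.0 + D(fake_audio.detach(), features)).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator step: maximize the discriminator's score on fakes.
        loss_g = -D(fake_audio, features).mean()
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()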
Experiments and results
• Same scale → performance degradation
• Random windows → act as data augmentation and speed up learning
• If the input size is fixed, training can be accelerated with torch.backends.cudnn.benchmark = True (in PyTorch)
• Three times faster than Parallel WaveNet, with almost the same MOS
• Despite being a GAN, training was very stable
Note
• All figures are from the authors' papers and blogs
Reference
• Engel, Jesse, et al. "DDSP: Differentiable Digital Signal Processing." ICLR (2020).
• https://www.dsprelated.com/freebooks/filters/View_Linear_Time_Varying.html (A view of linear time-varying digital filters)
• https://www.bobgolds.com/RT60/rt60.htm (How to do an RT60 calculation)
• https://ccrma.stanford.edu/~adnanm/SCI220/Music318ir.pdf (Room impulse response measurement and analysis)
• Bińkowski, Mikołaj, et al. "High Fidelity Speech Synthesis with Adversarial Networks." ICLR (2020).
• Yu, Fisher, and Vladlen Koltun. "Multi-scale Context Aggregation by Dilated Convolutions." ICLR (2016).
Thank you!

Editor's Notes

  1. Hello everyone, I am June-Woo Kim from the ABR Lab. I will be presenting these ICLR 2020 papers.
  2. Here is the summary.
  3. RWD ensemble in the discriminator