1. DDSP: Differentiable Digital Signal Processing (Spotlight), Engel, Jesse, Chenjie Gu, and Adam Roberts. "DDSP: Differentiable Digital Signal Processing." International Conference on Learning Representations. 2020.
2. High Fidelity Speech Synthesis with Adversarial Networks (Talk), Bińkowski, Mikołaj, et al. "High Fidelity Speech Synthesis with Adversarial Networks." International Conference on Learning Representations. 2020.
review by June-Woo Kim
A review of two ICLR 2020 papers in the signal processing domain
1. ICLR2020 reviews on Speech domain
Presented by: June-Woo Kim
Artificial Brain Research Lab., School of Sensor and Display,
Kyungpook National University
21, May. 2020.
ICLR 2020 (2020.04.26 ~ 2020.04.30)
2. Content
• DDSP: Differentiable Digital Signal Processing (Spotlight)
  Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
• High Fidelity Speech Synthesis with Adversarial Networks (Talk)
  Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
3. DDSP: Differentiable Digital Signal
Processing (Spotlight)
Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
Google Research, Brain Team
4. Overview
• Digital Signal Processing (DSP) is one of the backbones of modern society, integral to
  • Telecommunications, transportation, audio, and many medical technologies
• Key idea
  • Use simple, interpretable DSP elements to create complex, realistic signals by precisely controlling their many parameters
  • E.g., a collection of linear filters and sinusoidal oscillators (DSP elements) can create the sound of a realistic violin
• In this paper
  • A neural network converts a user's input into complex DSP controls that can produce more realistic signals
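The "sinusoidal oscillators" idea above can be illustrated with a minimal harmonic additive synthesizer: a bank of sinusoids at integer multiples of a fundamental, each with a time-varying amplitude. This is a hedged NumPy sketch of the concept, not the paper's actual differentiable implementation:

```python
import numpy as np

def additive_synth(f0, amps, sr=16000):
    """Sum of harmonic sinusoids with time-varying per-harmonic amplitudes.

    f0:   fundamental frequency per sample, shape [T]
    amps: per-harmonic amplitude per sample, shape [T, K]
    """
    # Instantaneous phase: integrate frequency over time.
    phase = 2 * np.pi * np.cumsum(f0) / sr            # [T]
    harmonics = np.arange(1, amps.shape[1] + 1)       # 1, 2, ..., K
    # Harmonic k oscillates at k * f0; sum over harmonics.
    return np.sum(amps * np.sin(phase[:, None] * harmonics[None, :]), axis=1)

# 0.5 s of a 220 Hz tone with 4 exponentially decaying harmonics
T = 8000
f0 = np.full(T, 220.0)
amps = np.outer(np.exp(-np.linspace(0, 3, T)), [1.0, 0.5, 0.25, 0.125])
audio = additive_synth(f0, amps)
```

In DDSP the network predicts `f0` and `amps` (and a filtered-noise component) frame by frame, and the synthesizer stays differentiable so gradients flow back through it.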
5. Challenges of "pure" neural audio synthesis
Audio is highly periodic
Ears are sensitive to discontinuities
8. Room Reverberation (Reverb)
• Very long 1-D convolution (filter size = 64k taps)
• Learned for a given dataset
• Can also be generated by other DDSP components
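Reverb as a long 1-D convolution can be sketched directly: convolve the dry signal with an impulse response. Here the IR is a stand-in (exponentially decaying noise); in DDSP the ~64k-tap IR is learned:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000

# Dry signal: a short 440 Hz tone.
t = np.arange(sr // 2) / sr
dry = np.sin(2 * np.pi * 440 * t)

# Stand-in impulse response: exponentially decaying noise
# (in DDSP this filter is a learned, much longer 1-D kernel).
ir = rng.standard_normal(4096) * np.exp(-np.linspace(0, 6, 4096))
ir /= np.abs(ir).sum()  # normalize so the wet signal stays bounded

# Full convolution: output has len(dry) + len(ir) - 1 samples.
wet = np.convolve(dry, ir)
```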
9. Room Reverberation (Reverb)
Where:
• RT60 = reverberation time, in seconds
• V = volume of the room, in cubic feet (or m³)
• S = surface area, in square feet (or m²)
• a = average absorption coefficient
• c20 = speed of sound at 20 °C

RT60 = 24 ln(10) · V / (c20 · S · a)

RT60 ≈ 0.161 · V / (S · a)   (metric units)
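Sabine's metric formula (RT60 = 0.161 · V / (S · a)) is easy to evaluate directly. A worked example for a hypothetical 5 m × 4 m × 3 m room with average absorption 0.3:

```python
def rt60_metric(volume_m3, surface_m2, absorption):
    # Sabine's formula in metric units: RT60 = 0.161 * V / (S * a)
    return 0.161 * volume_m3 / (surface_m2 * absorption)

V = 5 * 4 * 3                # 60 m^3
S = 2 * (5*4 + 5*3 + 4*3)    # 94 m^2 of total surface area
print(round(rt60_metric(V, S, 0.3), 3))  # -> 0.343 seconds
```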
10. Overview up to this point
Additive Synthesizer Parameters
Noise Synthesizer and Reverb Parameters
19. High Fidelity Speech Synthesis with
Adversarial Networks (Talk)
Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan
DeepMind
20. Overview
• Neural Text-to-Speech (TTS)
  • Acoustic model: receives text and predicts an intermediate representation such as a mel-spectrogram (e.g., Tacotron, Transformer-TTS, MelNet)
  • Vocoder: converts the predicted mel-spectrogram into audible raw audio (e.g., WaveNet, WaveRNN, WaveGlow, MelGAN)
• Evaluation method for TTS
  • Mean Opinion Score (MOS), which is evaluated by humans
• Contributions
  • They leverage linguistic features to train a GAN generator that produces raw audio directly from text
  • They also present four quantitative metrics that can be used in place of MOS
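The proposed MOS alternatives (e.g., Fréchet DeepSpeech Distance) build on the Fréchet distance between Gaussians fitted to feature embeddings of real and generated speech. For diagonal covariances the distance has a simple closed form; this NumPy sketch shows only that distance computation (the DeepSpeech feature extractor itself is not shown):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between Gaussians with diagonal covariance.

    For diagonal covariances the trace term reduces to a per-dimension sum:
    d^2 = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))
    """
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1 + var2 - 2 * np.sqrt(var1 * var2))

# Identical distributions give distance 0.
print(frechet_distance_diag([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]))  # -> 0.0
```

A lower distance means the generated-speech embedding distribution is closer to the real one, which is what correlating with MOS requires.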
22. Generator
• 567 input features per 5 ms window
• The generator gradually upsamples the representation
• Residual GBlocks use dilated convolutions and batch normalization conditioned on the noise
• 30 layers in total
23. Generator block
• GBlocks are 4-layer residual blocks with 2 skip connections, upsampling, and dilated convolutions
• Tensor shapes: [batch, 567, time] features and [batch, latent dim] noise in; [batch, 1, time] waveform out
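The two signal-level ingredients of a GBlock, upsampling and dilated convolution, can be illustrated in NumPy. This is a shape-level sketch of the operations only, not the actual residual block with conditional batch normalization:

```python
import numpy as np

def upsample_nearest(x, factor):
    # Repeat each time step `factor` times: [T] -> [T * factor].
    return np.repeat(x, factor)

def dilated_conv1d(x, w, dilation):
    """Causal 1-D dilated convolution: taps spaced `dilation` samples apart.

    y[t] = sum_i w[i] * x[t - i*dilation], with zero left-padding.
    """
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.pad(x, (pad, 0))
    return sum(w[i] * xp[pad - i * dilation : pad - i * dilation + len(x)]
               for i in range(k))

x = np.arange(4.0)                                   # [0, 1, 2, 3]
y = dilated_conv1d(x, np.array([1.0, 1.0]), 2)       # y[t] = x[t] + x[t-2]
```

Stacking such convolutions with growing dilation rates widens the receptive field exponentially, which is why 30 layers suffice to cover long audio contexts.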
32. Pseudo code of TTS-GAN
Algorithm 1 notation:
• waveform length
• waveform-conditioning frequency ratio
• base window size
• number of training steps
• batch size
• discriminator and generator learning rates
33. Experiments and results
• Same scale → performance degradation
• Random windows → data augmentation and faster training
• If the input size is fixed, training can be accelerated with torch.backends.cudnn.benchmark = True (in PyTorch)
• Three times faster than Parallel WaveNet, with almost the same MOS
• Despite being a GAN, training was very stable
35. References
• Engel, Jesse, et al. "DDSP: Differentiable Digital Signal Processing." ICLR (2020).
• https://www.dsprelated.com/freebooks/filters/View_Linear_Time_Varying.html (A view of linear time-varying digital filters)
• https://www.bobgolds.com/RT60/rt60.htm (How to do an RT60 calculation)
• https://ccrma.stanford.edu/~adnanm/SCI220/Music318ir.pdf (Room impulse response measurement and analysis)
• Bińkowski, Mikołaj, et al. "High Fidelity Speech Synthesis with Adversarial Networks." ICLR (2020).
• Yu, Fisher, and Vladlen Koltun. "Multi-scale Context Aggregation by Dilated Convolutions." ICLR (2016).