12. Things not described and Guesses
• Kernel size of the dilated filters: 2
• Number of layers (ResNet blocks): 4×10 to 6×10
• Number of channels in hidden layers: hundreds? 256?
• Another activation function inside a Res-block? Maybe not
• Batch normalization: no reason not to use it
• Sampling frequency: "at least 16 kHz"
• Where do the skip connections branch out? Every 10 layers?
• Do the skip connections have weights? Yes?
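
The guesses above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not the authors' configuration: it assumes each block of 10 layers uses dilations 1, 2, 4, ..., 512, together with the guessed kernel size of 2 and a 16 kHz sampling rate. The helper name `receptive_field` is made up for illustration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    # Each layer extends the receptive field by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


SAMPLE_RATE_HZ = 16_000               # guess: "at least 16 kHz"
KERNEL_SIZE = 2                       # guess from the list above
BLOCK = [2 ** i for i in range(10)]   # assumed dilations per block: 1, 2, ..., 512

for num_blocks in (4, 5, 6):
    rf = receptive_field(KERNEL_SIZE, BLOCK * num_blocks)
    print(f"{num_blocks} blocks: {rf} samples, {1000 * rf / SAMPLE_RATE_HZ:.1f} ms")
```

Under these assumptions, 4 blocks (40 layers) give a receptive field of 4093 samples, which is roughly 256 ms at 16 kHz.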
14. Text-to-Speech (TTS)
• Single-speaker speech dataset
• North American English dataset: 24.6 hr
• Mandarin Chinese dataset: 34.8 hr
• Receptive field: 240 ms
• Ad hoc architecture, as in the diagram below
[Diagram labels: Linguistic feature h_i (possibly phoneme); "Another model"; Fundamental frequency F0(t); duration(t); "Yet another model"; Linguistic feature h(t); WaveNet; Audio(t)]
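
One possible reading of the diagram is that unit-level linguistic features h_i are stretched to the audio rate using the predicted durations, giving h(t). The sketch below illustrates only that step, under that assumption. The function name `upsample_features`, the one-hot features and the durations are all hypothetical, and a 16 kHz sampling rate is assumed.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # assumed sampling rate


def upsample_features(h_i, durations_s, sample_rate=SAMPLE_RATE_HZ):
    """Repeat each unit-level feature vector for its duration: (N, D) -> (T, D)."""
    # Convert durations in seconds to whole numbers of samples.
    repeats = np.round(np.asarray(durations_s) * sample_rate).astype(int)
    return np.repeat(h_i, repeats, axis=0)


h_i = np.eye(3)                    # three hypothetical phoneme-level feature vectors
durations_s = [0.05, 0.10, 0.08]   # made-up durations in seconds
h_t = upsample_features(h_i, durations_s)
print(h_t.shape)  # (3680, 3)
```

Each row of `h_t` is the feature vector of whichever unit is active at that sample, so it can be paired sample by sample with F0(t) as conditioning input.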
15. TTS: Mean Opinion Score
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
16. Speech Recognition
• TIMIT dataset (possibly ~4 hrs)
• Add a pooling layer after the dilated convolutions, with 160x downsampling (does that mean the 7th layer?)
• Then a few non-causal convolutions
• One loss to predict the next sample (same as the ordinary WaveNet)
• And another loss to classify the frame
• 18.8 PER, the best score among raw-audio models
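
To make the 160x downsampling concrete, here is a minimal sketch of a pooling step over the time axis. It assumes mean pooling and 16 kHz audio, so each pooled frame spans 10 ms. The function name `mean_pool_frames`, the channel count and the random activations are placeholders, not values from the slides.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # assumed sampling rate
POOL_FACTOR = 160        # downsampling factor from the slide


def mean_pool_frames(activations, factor=POOL_FACTOR):
    """Average consecutive groups of `factor` time steps: (T, C) -> (T // factor, C)."""
    num_frames = activations.shape[0] // factor
    trimmed = activations[: num_frames * factor]  # drop any incomplete final frame
    return trimmed.reshape(num_frames, factor, -1).mean(axis=1)


rng = np.random.default_rng(0)
acts = rng.normal(size=(SAMPLE_RATE_HZ, 8))  # 1 s of placeholder activations, 8 channels
frames = mean_pool_frames(acts)
print(frames.shape)                           # (100, 8)
print(1000 * POOL_FACTOR / SAMPLE_RATE_HZ)    # 10.0 (ms per frame)
```

Note that 160 is not a power of two (2^7 = 128, 2^8 = 256), so the factor does not line up exactly with the dilation of any single layer. It matches a 10 ms frame at 16 kHz instead.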