WaveNet: A Generative
Model for Raw Audio
TIS + Albert study session
2017/01/24
Tsuguo Mogami
tsuguo_mogami@albert2005.co.jp
Why?
• Autoregressive models (e.g. PixelCNN) have been very successful.
• ★ What about raw audio?
• We want to do it with a CNN, which is more efficient than an RNN.
Contributions
• Audio synthesis of unprecedented quality.
• An efficient architecture that uses dilated convolutions to keep a large receptive field nonetheless.
• Speech recognition, too.
What is a dilated convolution?
https://github.com/vdumoulin/conv_arithmetic
Roughly speaking: when you really want a filter with a large kernel size, a dilated convolution gives a result close to that large kernel without increasing the amount of computation.
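As a concrete illustration, here is a minimal NumPy sketch of a 1-D dilated causal convolution (the function name and the naive loop are mine, not from the paper; real implementations use optimized conv kernels):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """1-D causal convolution with the given dilation.

    x: input signal, shape (T,)
    w: filter taps, shape (K,); w[-1] sits on the current sample,
       earlier taps look `dilation` steps further into the past.
    """
    K = len(w)
    # left-pad so that output[t] depends only on x[<= t] (causality)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for k in range(K):
            y[t] += w[k] * xp[t + pad - dilation * (K - 1 - k)]
    return y

x = np.arange(8, dtype=float)
# kernel size 2, dilation 2: y[t] = x[t-2] + x[t] (zeros before t=0)
print(dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2))
```

With kernel size 2 (the size guessed later in this deck), each layer only adds `dilation` samples of context, yet stacking layers with doubling dilations grows the receptive field exponentially.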
stack of dilated causal convolutional layers
This is a conceptual picture of how the receptive field grows; the actual network is a repetition of ResNet-like blocks.
Repetition Structure
1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.
Suspected to be repeating the 1…512 blocks 16 times
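The receptive field of such a stack is easy to check by hand: each causal layer adds (kernel_size − 1) × dilation past samples. A small sketch, assuming the kernel size of 2 guessed later in this deck:

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal
    conv layers: each layer adds (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

one_block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(receptive_field(one_block))         # 1024 samples per block
print(receptive_field(one_block * 3))     # three repeats: 3070 samples (~0.19 s at 16 kHz)
```

Repeating the block multiplies the receptive field roughly linearly while the per-layer cost stays the same, which is why the 1…512 pattern is stacked several times.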
Autoregression
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
residual block and the entire architecture
Since this is a bit hard to follow, I redraw it in a more conventional form.
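The autoregressive part itself is just a sampling loop: draw one value from the predicted distribution over the 256 levels, append it to the context, repeat. A schematic sketch (`predict_next_distribution` is a stand-in for the trained network, not a real API):

```python
import numpy as np

def generate(predict_next_distribution, seed, n_samples,
             rng=np.random.default_rng(0)):
    """Autoregressive sampling: each new sample is drawn conditioned
    on everything generated so far, then fed back as input."""
    audio = list(seed)
    for _ in range(n_samples):
        p = predict_next_distribution(np.array(audio))  # softmax over 256 classes
        audio.append(int(rng.choice(len(p), p=p)))
    return np.array(audio[len(seed):])

# dummy "model": uniform distribution over the 256 quantization levels
uniform = lambda context: np.full(256, 1 / 256)
print(generate(uniform, seed=[128], n_samples=5))
```

This sequential feedback is what makes naive WaveNet generation slow: every output sample requires a full forward pass over the receptive field.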
Gated activation units
z = tanh(W_{f,k} * x + V_{f,k}^T h) ⊙ σ(W_{g,k} * x + V_{g,k}^T h)
• k: layer index, f for filter, g for gate
• ⊙: element-wise multiplication
• h: condition (speaker, text, etc.)
• Why?
• Introduced in PixelCNN (1606.05328)
• The authors suspected that earlier CNN generative models lost to PixelRNN because of the LSTM's gating structure, so they introduced an LSTM-like gate.
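In code, the gated unit is just a tanh branch multiplied element-wise by a sigmoid gate. A minimal sketch operating on precomputed pre-activations (the convolutions W_f * x and W_g * x, plus any condition terms, are assumed to have been applied already):

```python
import numpy as np

def gated_activation(filter_preact, gate_preact):
    """z = tanh(filter) ⊙ σ(gate): the tanh branch carries the signal,
    the sigmoid branch gates it element-wise, as in gated PixelCNN."""
    sigmoid = 1.0 / (1.0 + np.exp(-gate_preact))
    return np.tanh(filter_preact) * sigmoid

a = np.array([0.0, 1.0, -1.0])
# with a zero gate pre-activation, σ(0) = 0.5, so output = 0.5 * tanh(a)
print(gated_activation(a, np.zeros(3)))
```

The gate lets each channel modulate how much of the tanh signal passes through, mimicking the multiplicative interactions of an LSTM without recurrence.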
Input/output
http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/
Roughly speaking: quantize on a log scale and encode into 256 levels.
Things not described, and guesses
• Kernel size of the dilation filters: 2
• Number of layers (ResNet blocks): 4×10 ~ 6×10
• Number of channels in the hidden layers: hundreds? 256?
• Any other activation function in a Res-block? Probably none.
• Batch normalization? No reason not to use it.
• Sampling frequency: "at least 16 kHz"
• Where do the skip connections branch out? Every 10 layers?
• Do the skip connections have weights? Yes?
Experiments
Text-to-Speech (TTS)
• Single-speaker speech datasets
• North American English dataset: 24.6 hr
• Mandarin Chinese dataset: 34.8 hr
• Receptive field: 240 ms
• Ad hoc architecture, as in ★
[Diagram ★] WaveNet generates Audio(t), conditioned on linguistic features h_i (possibly phonemes) through yet another model, and on the fundamental frequency F0(t) and duration(t) predicted from linguistic features h(t) by another model.
☆ The notation here differs from the paper's.
TTS: Mean Opinion Score
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Speech Recognition
• TIMIT dataset (possibly ~4 hrs)
• Add a pooling layer after the dilated convolutions
• with 160× downsampling (does that mean the 7th layer?)
• Then a few non-causal convolutions
• A loss to predict the next sample (same as the ordinary WaveNet)
• and a loss to classify the frame
• 18.8 PER, the best score among raw-audio models
End
(Multi-speaker) Speech Generation
• Conditioned on the speaker
• 44 hours of data (from 109 speakers)
μ-law transformation (ITU-T, 1988)
f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ),  with μ = 255
• This splits the interval (−1, 1) into 256 levels.
• Roughly, it just encodes on a log scale.
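A sketch of the companding and its inverse, with μ = 255 as in the formula above (the exact rounding used to map onto integer levels 0..255 is my own choice, not specified here):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress x in (-1, 1) logarithmically, then quantize to 256 levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map (-1, 1) -> integer levels 0..255 (round to nearest)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(levels, mu=255):
    """Inverse: integer levels 0..255 back to waveform values in (-1, 1)."""
    compressed = 2 * (levels.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

x = np.array([-0.5, 0.0, 0.01, 0.5])
print(mu_law_encode(x))
print(np.round(mu_law_decode(mu_law_encode(x)), 3))
```

Note how the log scale spends most of the 256 levels near zero, where speech amplitudes concentrate, so small signals survive quantization much better than with a uniform grid.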
