WaveNet: A Generative
Model for Raw Audio
TIS + Albert study session
2017/01/24
Tsuguo Mogami
tsuguo_mogami@albert2005.co.jp
Why?
• Autoregressive models (e.g. PixelCNN) have been very successful.
• ★ What about raw audio?
• We want to do it with a CNN, which is more efficient than an RNN.
Contributions
• Audio synthesis of unprecedented quality.
• An efficient architecture that uses dilated convolutions to keep a large receptive field nonetheless.
• Speech recognition, too.
What is a dilated convolution?
https://github.com/vdumoulin/conv_arithmetic
Roughly speaking: when you really want a filter with a large kernel size, a dilated convolution gives a result close to that large kernel without increasing the amount of computation.
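As a concrete illustration, here is a minimal NumPy sketch of a 1-D dilated causal convolution (the function name and the naive loop are mine, not from the paper; real implementations use optimized conv kernels):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """1-D causal convolution with the given dilation.

    x: input signal, shape (T,)
    w: filter taps, shape (K,); w[-1] sits on the current sample,
       earlier taps look `dilation` steps further into the past.
    """
    K = len(w)
    # left-pad so that output[t] depends only on x[<= t] (causality)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for k in range(K):
            y[t] += w[k] * xp[t + pad - dilation * (K - 1 - k)]
    return y

x = np.arange(8, dtype=float)
# kernel size 2, dilation 2: y[t] = x[t-2] + x[t] (zeros before t=0)
print(dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2))
```

With kernel size 2 (the size guessed later in this deck), each layer only adds `dilation` samples of context, yet stacking layers with doubling dilations grows the receptive field exponentially.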
stack of dilated causal convolutional layers
This is a conceptual picture of how the receptive field grows; the actual network is a repetition of ResNet-like blocks.
Repetition Structure
1, 2, 4, …, 512, 1, 2, 4, …, 512, 1, 2, 4, …, 512.
Suspected to be repeating the 1…512 blocks 16 times
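The receptive field of such a stack is easy to check by hand: each causal layer adds (kernel_size − 1) × dilation past samples. A small sketch, assuming the kernel size of 2 guessed later in this deck:

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal
    conv layers: each layer adds (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

one_block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(receptive_field(one_block))         # 1024 samples per block
print(receptive_field(one_block * 3))     # three repeats: 3070 samples (~0.19 s at 16 kHz)
```

Repeating the block multiplies the receptive field roughly linearly while the per-layer cost stays the same, which is why the 1…512 pattern is stacked several times.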
Autoregression
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
residual block and the entire architecture
Since this is a bit hard to follow, I redraw it in a more conventional form.
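The autoregressive part itself is just a sampling loop: draw one value from the predicted distribution over the 256 levels, append it to the context, repeat. A schematic sketch (`predict_next_distribution` is a stand-in for the trained network, not a real API):

```python
import numpy as np

def generate(predict_next_distribution, seed, n_samples,
             rng=np.random.default_rng(0)):
    """Autoregressive sampling: each new sample is drawn conditioned
    on everything generated so far, then fed back as input."""
    audio = list(seed)
    for _ in range(n_samples):
        p = predict_next_distribution(np.array(audio))  # softmax over 256 classes
        audio.append(int(rng.choice(len(p), p=p)))
    return np.array(audio[len(seed):])

# dummy "model": uniform distribution over the 256 quantization levels
uniform = lambda context: np.full(256, 1 / 256)
print(generate(uniform, seed=[128], n_samples=5))
```

This sequential feedback is what makes naive WaveNet generation slow: every output sample requires a full forward pass over the receptive field.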
Gated activation units
z = tanh(W_{f,k} * x + V_{f,k}^T h) ⊙ σ(W_{g,k} * x + V_{g,k}^T h)
• k: layer index, f for filter, g for gate
• ⊙: element-wise multiplication
• h: condition (speaker, text, etc.)
• Why?
• Introduced in PixelCNN (1606.05328)
• The authors suspected that earlier CNN generative models lost to PixelRNN because of the LSTM's gating structure, so they introduced an LSTM-like gate.
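In code, the gated unit is just a tanh branch multiplied element-wise by a sigmoid gate. A minimal sketch operating on precomputed pre-activations (the convolutions W_f * x and W_g * x, plus any condition terms, are assumed to have been applied already):

```python
import numpy as np

def gated_activation(filter_preact, gate_preact):
    """z = tanh(filter) ⊙ σ(gate): the tanh branch carries the signal,
    the sigmoid branch gates it element-wise, as in gated PixelCNN."""
    sigmoid = 1.0 / (1.0 + np.exp(-gate_preact))
    return np.tanh(filter_preact) * sigmoid

a = np.array([0.0, 1.0, -1.0])
# with a zero gate pre-activation, σ(0) = 0.5, so output = 0.5 * tanh(a)
print(gated_activation(a, np.zeros(3)))
```

The gate lets each channel modulate how much of the tanh signal passes through, mimicking the multiplicative interactions of an LSTM without recurrence.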
Input/output
http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/
Roughly speaking: quantize on a log scale and encode into 256 levels.
Things not described, and guesses
• Kernel size of the dilation filters: 2
• Number of layers (ResNet blocks): 4×10 ~ 6×10
• Number of channels in the hidden layers: hundreds? 256?
• Any other activation function in a Res-block? Probably none.
• Batch normalization? No reason not to use it.
• Sampling frequency: "at least 16 kHz"
• Where do the skip connections branch out? Every 10 layers?
• Do the skip connections have weights? Yes?
Experiments
Text-to-Speech (TTS)
• Single-speaker speech datasets
• North American English dataset: 24.6 hr
• Mandarin Chinese dataset: 34.8 hr
• Receptive field: 240 ms
• Ad hoc architecture, as in ★
[Diagram ★] WaveNet generates Audio(t), conditioned on linguistic features h_i (possibly phonemes) through yet another model, and on the fundamental frequency F0(t) and duration(t) predicted from linguistic features h(t) by another model.
☆ The notation here differs from the paper's.
TTS: Mean Opinion Score
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Speech Recognition
• TIMIT dataset (possibly ~4 hrs)
• Add a pooling layer after the dilated convolutions
• with 160× downsampling (does that mean the 7th layer?)
• Then a few non-causal convolutions
• A loss to predict the next sample (same as the ordinary WaveNet)
• and a loss to classify the frame
• 18.8 PER, the best score among raw-audio models
End
(Multi-speaker) Speech Generation
• Conditioned on the speaker
• 44 hours of data (from 109 speakers)
μ-law transformation (ITU-T, 1988)
f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ),  with μ = 255
• This splits the interval (−1, 1) into 256 levels.
• Roughly, it just encodes on a log scale.
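A sketch of the companding and its inverse, with μ = 255 as in the formula above (the exact rounding used to map onto integer levels 0..255 is my own choice, not specified here):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress x in (-1, 1) logarithmically, then quantize to 256 levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map (-1, 1) -> integer levels 0..255 (round to nearest)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(levels, mu=255):
    """Inverse: integer levels 0..255 back to waveform values in (-1, 1)."""
    compressed = 2 * (levels.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

x = np.array([-0.5, 0.0, 0.01, 0.5])
print(mu_law_encode(x))
print(np.round(mu_law_decode(mu_law_encode(x)), 3))
```

Note how the log scale spends most of the 256 levels near zero, where speech amplitudes concentrate, so small signals survive quantization much better than with a uniform grid.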
