IV WORKSHOP NVIDIA DE GPU E CUDA
Audio Processing using
Convolutional Neural Network
Diego Augusto
September 6, 2016

Speech Activity Detection (SAD)
● Distinguish speech segments from noise segments.
● Estimate the start and end times of speech events.
Figure: waveform with two regions marked "speech":
#1, START: 1.2 sec, END: 2.5 sec
#2, START: 3.3 sec, END: 4.9 sec
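
The second task above, estimating start and end times, can be sketched as a pass over a sequence of 0/1 labels (0 = noise, 1 = speech). This is a minimal illustration, not the method used in the talk: the function name `labels_to_segments` and the 0.5-second label spacing are assumptions chosen only for the example.

```python
def labels_to_segments(labels, frame_step_sec):
    """Turn a sequence of 0/1 labels into (start_sec, end_sec) speech events.

    frame_step_sec is the assumed time covered by each label.
    """
    segments = []
    start = None  # index where the current speech run began, if any
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a speech run begins
        elif label != 1 and start is not None:
            segments.append((start * frame_step_sec, i * frame_step_sec))
            start = None  # the speech run ended just before index i
    if start is not None:
        # Speech continues until the end of the sequence.
        segments.append((start * frame_step_sec, len(labels) * frame_step_sec))
    return segments


labels = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
segments = labels_to_segments(labels, frame_step_sec=0.5)
for number, (start_sec, end_sec) in enumerate(segments, start=1):
    print("#%d, START: %.1f sec, END: %.1f sec" % (number, start_sec, end_sec))
```

A practical detector would usually also smooth the label sequence before this step, so that isolated single-label flips do not create very short events.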

Applications
● Segmentation of spontaneous speech:
  ■ Live language translation.
  ■ Speech transmission over audio codecs.
  ■ Retrieval of speech in video and social networks.
● Preprocessing for speech engines:
  ■ Speech Recognition - “what is being said?”
  ■ Speaker Authentication - “who is speaking?”
  ■ Speaker Diarization - “who spoke when?”

Challenges
● Large variety of noise types:
  ■ Clicking, motor sounds, background voices.
  ■ Voice distortion, overlapping sounds.

Convolutional Neural Network (CNN)
● CNN approach:
  ■ Features are extracted automatically by the network.
  ■ Inspired by the human visual system (visual cortex).
  ■ Extracts distinctive features.
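
To make "features are extracted by the network" concrete, the sketch below applies a single convolution filter followed by a ReLU to a toy time-frequency patch. It is an illustration only: in a trained CNN the kernel weights are learned from data, whereas the `onset_kernel` here is hand-picked, and the names `conv2d_valid` and `relu` are hypothetical helpers rather than any framework's API.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) without padding.

    As in most CNN frameworks, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy patch: rows are frequency bins, columns are time steps.
# Energy appears from the third time step onward.
patch = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Hand-picked kernel that responds to an energy increase along time.
onset_kernel = [
    [-1.0, 1.0],
    [-1.0, 1.0],
]
feature_map = relu(conv2d_valid(patch, onset_kernel))
print(feature_map)
```

The feature map is non-zero only at the column where the energy rises, which is the kind of distinctive local pattern a learned filter can pick up.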

CPqD Dataset
● More than 300 hours of speech and noise, with ground truth.
● Environments:
  ■ Phone conversations.
  ■ PCs and IoT devices (mobile apps).
● Split into two parts:
  ■ Development = 75%.
  ■ Evaluation = 25%.
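
The slides do not say how the 75% / 25% split was drawn. One simple possibility, assuming a seeded random split at the recording level, is sketched below; the function name `split_dev_eval` and the recording IDs are hypothetical.

```python
import random


def split_dev_eval(recording_ids, dev_fraction=0.75, seed=0):
    """Shuffle recording IDs reproducibly and split them into two parts."""
    ids = list(recording_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split repeatable
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]


recordings = ["rec_%03d" % i for i in range(8)]  # placeholder IDs
dev, evaluation = split_dev_eval(recordings)
print(len(dev), "development /", len(evaluation), "evaluation")
```

Splitting by recording rather than by individual sample keeps audio from one recording out of both parts at once, which avoids an optimistic evaluation.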

Speech/Noise Features
Figure: waveform and its spectrogram, with a row of labels aligned to the spectrogram (0 = NOISE, 1 = SPEECH).
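
The slides do not state how the spectrogram was computed. As a rough sketch, a magnitude spectrogram can be built by cutting the waveform into overlapping frames, windowing each one, and taking the magnitude of its discrete Fourier transform. The frame length, hop size, Hann window, and 8 kHz sample rate below are assumptions for illustration, and the naive DFT is used only to keep the example dependency-free (an FFT library would be used in practice).

```python
import cmath
import math


def spectrogram(samples, frame_len, hop):
    """Return one list of DFT magnitudes per Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = 0j
            for n, x in enumerate(frame):
                acc += x * cmath.exp(-2j * math.pi * k * n / frame_len)
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


sample_rate = 8000  # Hz, assumed
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames,", len(spec[0]), "bins, peak at bin", peak_bin)
```

For the 1000 Hz test tone, the peak falls at bin 1000 / 8000 x 64 = 8. Each frame of such a spectrogram is what would carry a 0 or 1 label.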

Deep Learning Platform
Diagram: NVIDIA DIGITS 4 running on a GRID K520 GPU, Linux 64-bit.
Stages: MANAGE, DEVELOPMENT, EVALUATION.
Blocks: FEAT. EXTRACT, TRAIN, TEST, REINFORCEMENT LEARNING.

NVIDIA DIGITS
Monitor, Train, Test Model
Screenshot: classification output showing the values 99.93 and 0.07 for classes 1 and 0.

Evaluation
Figure: ground truth, spectrogram, and hypothesis tracks with two regions marked "speech"; mismatches between ground truth and hypothesis are marked FA (false alarm) or MS (miss speech).
● Half-Total Error Rate: HTER = (MR + FAR) / 2
● Miss Speech Rate, MR (%):
  ■ (# speech samples not detected as speech / total number of speech samples) x 100
● False Alarm Rate, FAR (%):
  ■ (# nonspeech samples detected as speech / total number of nonspeech samples) x 100
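
The three rates defined above can be computed directly from aligned ground-truth and hypothesis label sequences. The sketch below assumes both sequences use 1 for speech and 0 for nonspeech and have the same length; the function name `sad_error_rates` is hypothetical.

```python
def sad_error_rates(reference, hypothesis):
    """Return (MR, FAR, HTER) in percent for aligned 0/1 label sequences."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    speech = sum(1 for r in reference if r == 1)
    nonspeech = len(reference) - speech  # assumes labels are only 0 or 1
    if speech == 0 or nonspeech == 0:
        raise ValueError("need at least one speech and one nonspeech sample")
    # Speech samples that the hypothesis labeled as nonspeech.
    missed = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    # Nonspeech samples that the hypothesis labeled as speech.
    false_alarms = sum(1 for r, h in zip(reference, hypothesis)
                       if r == 0 and h == 1)
    mr = 100.0 * missed / speech
    far = 100.0 * false_alarms / nonspeech
    return mr, far, (mr + far) / 2.0


reference = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
hypothesis = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
mr, far, hter = sad_error_rates(reference, hypothesis)
print("MR = %.1f%%, FAR = %.1f%%, HTER = %.1f%%" % (mr, far, hter))
```

Because HTER averages the two rates instead of pooling all samples, a detector cannot score well simply by favoring whichever class is more frequent.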

Evaluation
● QUT-NOISE-TIMIT:
  ■ Large-scale dataset for evaluating SAD algorithms.
● Technical challenges and future work:
  ■ Automatic adaptation to the environment.
  ■ Overlapping sound events.
  ■ Applying the CNN approach to other problems.

Features      Classifier   HTER
Energy        Threshold    26.3%
MFCC          GMM-HMM      4.7%
Spectrogram   CNN          3.2%
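
For context on the simplest row of the table, an energy-plus-threshold detector can be sketched as follows. The slides give no details of that baseline, so the non-overlapping frames, the decibel scale, and the -30 dB threshold are all assumptions, and this sketch is not the system that produced the reported figure.

```python
import math


def energy_sad(samples, frame_len, threshold_db):
    """Label each non-overlapping frame 1 if its mean energy exceeds a threshold."""
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        energy_db = 10.0 * math.log10(energy + 1e-12)  # offset avoids log(0)
        labels.append(1 if energy_db > threshold_db else 0)
    return labels


# Synthetic signal: quiet, loud, quiet (amplitudes chosen for illustration).
quiet = [0.001 * math.sin(0.3 * n) for n in range(160)]
loud = [0.5 * math.sin(0.3 * n) for n in range(160)]
labels = energy_sad(quiet + loud + quiet, frame_len=80, threshold_db=-30.0)
print(labels)
```

A fixed threshold like this labels any loud sound as speech, including clicks and motor noise, which is one plausible reason such a baseline trails feature-based classifiers.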

References
● J. Sohn, N. S. Kim, and W. Sung, “A statistical model based voice activity detection,” Signal Processing Letters, IEEE, vol. 6, no. 1, pp. 1–3, 1999.
● W. H. Abdulla, Z. Guan, and H. C. Sou, “Noise robust speech activity detection,” in Signal Processing and Information Technology (ISSPIT), 2009 IEEE International Symposium on. IEEE, 2009, pp. 473–477.
● D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, “The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms,” Proceedings of Interspeech 2010, 2010.
● D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
● S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 2519–2523.
● Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
● H. Ghaemmaghami, D. Dean, S. Kalantari, S. Sridharan, and C. Fookes, “Complete-linkage clustering for voice activity detection in audio and visual speech,” 2015.
● NVIDIA Deep Learning GPU Training System (DIGITS) 4. Retrieved July 18, 2016, from https://developer.nvidia.com/digits.

www.cpqd.com.br
TURNING
INTO REALITY
Diego Augusto
diegoa@cpqd.com.br
