IV WORKSHOP NVIDIA DE GPU E CUDA
Audio Processing using
Convolutional Neural Network
Diego Augusto
September 6, 2016
Speech Activity Detection (SAD)
● Distinguish speech and noise segments.
● Estimate start and end times of speech events.
[Figure: waveform with two segments marked "speech"]
#1, START: 1.2 sec, END: 2.5 sec
#2, START: 3.3 sec, END: 4.9 sec
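
The step from per-frame decisions to start and end times can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name `labels_to_segments`, the 0.5-second frame length, and the example label sequence are all assumptions made for the example.

```python
from typing import List, Tuple


def labels_to_segments(labels: List[int], frame_sec: float) -> List[Tuple[float, float]]:
    """Group consecutive speech frames (label 1) into (start, end) times in seconds."""
    segments = []
    start = None  # index of the first frame of the current speech run, if any
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a speech run begins at this frame
        elif label != 1 and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None  # the speech run ended just before this frame
    if start is not None:
        # The recording ends while speech is still active.
        segments.append((start * frame_sec, len(labels) * frame_sec))
    return segments


# Hypothetical frame decisions, one label per 0.5 s frame (0 = noise, 1 = speech).
frame_labels = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
speech_segments = labels_to_segments(frame_labels, frame_sec=0.5)
for n, (seg_start, seg_end) in enumerate(speech_segments, 1):
    print(f"#{n}, START: {seg_start:.1f} sec, END: {seg_end:.1f} sec")
```

Shorter frames give finer time resolution for the reported boundaries, at the cost of more decisions per second of audio.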
Applications
● Segmentation of spontaneous speech:
  ○ Live language translation.
  ○ Speech transmission over audio codecs.
  ○ Retrieval of speech in video and social networks.
● Preprocessing for speech engines:
  ○ Speech Recognition - “what is being said?”
  ○ Speaker Authentication - “who is speaking?”
  ○ Speaker Diarization - “who spoke when?”
Challenges
● Large variety of noise types:
  ○ Clicking, motor sounds, background voices.
  ○ Voice distortion, overlapping sounds.
Convolutional Neural Network (CNN)
● CNN approach:
  ○ Features are extracted automatically by the network.
  ○ Inspired by the human visual system (visual cortex).
  ○ Extracts distinctive features.
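
As a rough illustration of what a single convolutional filter computes on a spectrogram patch, the sketch below slides a small kernel over a 2-D array and applies a ReLU. The kernel here is hand-picked to respond to an energy onset along the time axis; in a real CNN the kernel values are learned during training. The function name `conv2d_valid` and the toy patch are assumptions for the example, not part of the presented system.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and apply ReLU to each response."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(max(0.0, acc))  # ReLU keeps only positive responses
        output.append(row)
    return output


# Toy spectrogram patch: rows are frequency bins, columns are time frames.
# Energy appears from the third frame onwards.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds where energy rises from one frame to the next.
onset_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(patch, onset_kernel)
print(feature_map)
```

The feature map is non-zero only at the column where the energy starts, which is the kind of localized pattern a learned filter can pick up.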
CPqD Dataset
● More than 300 hours of speech and noise, with ground truth.
● Environments:
  ○ Phone conversations.
  ○ PCs and IoT devices (mobile apps).
● Split into two parts:
  ○ Development = 75%.
  ○ Evaluation = 25%.
Speech/Noise Features
[Figure: waveform and its spectrogram, with one label per frame: 0 = NOISE, 1 = SPEECH]
Frame labels shown: 1 1 1 1 1 1, and 0 0 1 0 0 0 0 0 1 1 1 1 1
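
One way such per-frame labels could be derived from ground-truth segment times is sketched below. The rule used here (a frame is speech if its centre falls inside a speech segment), the 0.5-second frame length, and the 5-second total duration are assumptions for illustration; the slides do not state how the labels were produced. The segment times are the ones shown on the SAD slide.

```python
def segments_to_labels(segments, total_sec, frame_sec):
    """Return one label per frame: 1 if the frame centre lies in a speech segment, else 0."""
    n_frames = int(round(total_sec / frame_sec))
    labels = []
    for i in range(n_frames):
        center = (i + 0.5) * frame_sec  # time at the middle of frame i
        is_speech = any(start <= center < end for start, end in segments)
        labels.append(1 if is_speech else 0)
    return labels


# Speech events from the SAD slide; recording length and frame size are hypothetical.
ground_truth = [(1.2, 2.5), (3.3, 4.9)]
truth_labels = segments_to_labels(ground_truth, total_sec=5.0, frame_sec=0.5)
print(truth_labels)
```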
Deep Learning Platform
[Diagram: stages MANAGE, DEVELOPMENT, EVALUATION; blocks FEAT. EXTRACT, TRAIN, TEST, REINFORCEMENT LEARNING]
● NVIDIA DIGITS 4
● GPU GRID K520
● Linux 64-bit
NVIDIA DIGITS
[Screenshots: "Monitor Train" and "Test Model"; the classification output shows 99.93 and 0.07 for classes 1 and 0]
Evaluation
[Figure: spectrogram aligned with the ground truth (two "speech" segments) and the hypothesis; errors are marked FA (false alarm) and MS (miss speech), in the order FA MS FA MS FA]
● Half-Total Error Rate: HTER = (MR + FAR) / 2
● Miss Speech Rate, MR (%):
  ■ (# speech samples not detected as speech / total number of speech samples) x 100
● False Alarm Rate, FAR (%):
  ■ (# nonspeech samples detected as speech / total number of nonspeech samples) x 100
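
The three metrics above can be computed from two aligned label sequences as in the sketch below. The function name `sad_error_rates` and the example sequences are assumptions for illustration; the formulas follow the definitions on the slide.

```python
def sad_error_rates(truth, hypothesis):
    """Return (MR, FAR, HTER) in percent for aligned 0/1 label sequences."""
    if len(truth) != len(hypothesis):
        raise ValueError("truth and hypothesis must have the same length")
    n_speech = sum(1 for t in truth if t == 1)
    n_nonspeech = len(truth) - n_speech
    if n_speech == 0 or n_nonspeech == 0:
        raise ValueError("both speech and nonspeech samples are required")
    # Speech samples that the detector labelled as nonspeech.
    missed = sum(1 for t, h in zip(truth, hypothesis) if t == 1 and h == 0)
    # Nonspeech samples that the detector labelled as speech.
    false_alarms = sum(1 for t, h in zip(truth, hypothesis) if t == 0 and h == 1)
    mr = 100.0 * missed / n_speech
    far = 100.0 * false_alarms / n_nonspeech
    hter = (mr + far) / 2
    return mr, far, hter


# Hypothetical labels: 4 speech samples and 5 nonspeech samples.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 0]
hypothesis = [0, 1, 1, 1, 1, 0, 0, 0, 0]
mr, far, hter = sad_error_rates(truth, hypothesis)
print(f"MR = {mr:.1f}%, FAR = {far:.1f}%, HTER = {hter:.1f}%")
```

Because HTER averages the two rates rather than pooling all samples, it is not dominated by whichever class (speech or nonspeech) is more frequent in the recording.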
Evaluation
● QUT-NOISE-TIMIT:
  ○ Large-scale dataset for evaluating SAD algorithms.
● Technical challenges and future work:
  ○ Automatic adaptation to the environment.
  ○ Overlapping sound events.
  ○ Applying the CNN approach to other problems.
Features      Classifier   HTER
Energy        Threshold    26.3%
MFCC          GMM-HMM      4.7%
Spectrogram   CNN          3.2%
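
For context on the first row of the table, a generic energy-threshold detector can be sketched as below. This is not the baseline that was evaluated in the talk, whose details are not given; the frame length, the -40 dB threshold, and the synthetic test signal are all assumptions chosen for the example.

```python
import math


def frame_energies_db(samples, frame_len):
    """Mean-square energy of each non-overlapping frame, in dB."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(mean_sq + 1e-12))  # small offset avoids log(0)
    return energies


def energy_sad(samples, frame_len, threshold_db):
    """Label a frame 1 (speech) when its energy exceeds the threshold, else 0."""
    return [1 if e > threshold_db else 0 for e in frame_energies_db(samples, frame_len)]


# Synthetic signal: low-level noise, then a louder tone standing in for speech, then noise.
quiet = [0.001 * (-1) ** n for n in range(200)]
tone = [0.5 * math.sin(2 * math.pi * n / 20) for n in range(200)]
signal = quiet + tone + quiet
energy_labels = energy_sad(signal, frame_len=100, threshold_db=-40.0)
print(energy_labels)
```

A fixed threshold like this reacts to any loud sound, not only speech, which is consistent with energy-based detection scoring far worse than the trained classifiers in the table.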
References
● J. Sohn, N. S. Kim, and W. Sung, “A statistical model based voice activity detection,” Signal Processing Letters, IEEE, vol. 6, no. 1, pp. 1–3, 1999.
● W. H. Abdulla, Z. Guan, and H. C. Sou, “Noise robust speech activity detection,” in Signal Processing and Information Technology (ISSPIT), 2009 IEEE International Symposium on. IEEE, 2009, pp. 473–477.
● D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, “The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms,” Proceedings of Interspeech 2010, 2010.
● D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
● S. Thomas, S. Ganapathy, G. Saon, and H. Soltau, “Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 2519–2523.
● Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
● H. Ghaemmaghami, D. Dean, S. Kalantari, S. Sridharan, and C. Fookes, “Complete-linkage clustering for voice activity detection in audio and visual speech,” 2015.
● NVIDIA Deep Learning GPU Training System (DIGITS) 4. Retrieved July 18, 2016, from https://developer.nvidia.com/digits.
www.cpqd.com.br
TURNING
INTO REALITY
Diego Augusto
diegoa@cpqd.com.br
