The VoiceMOS Challenge 2022 aimed to encourage research in automatic prediction of mean opinion scores (MOS) for speech quality. It featured two tracks evaluating systems' ability to predict MOS ratings either from a large existing dataset or from a separate listening test; 21 teams participated in the main track and 15 in the out-of-domain track. Several teams outperformed the strongest baseline, which fine-tuned a self-supervised model, and the top-performing approaches generally involved ensembling or multi-task learning. While unseen systems were predictable, unseen listeners and speakers remained difficult, especially when generalizing to a new listening test. The challenge highlighted progress in MOS prediction but also the need for evaluation metrics that reflect both ranking and absolute accuracy.
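To make the closing point about ranking versus absolute accuracy concrete, the sketch below computes system-level mean squared error (absolute accuracy) and Spearman rank correlation (ranking) from per-utterance MOS predictions. It is an illustrative script, not the challenge's official scoring code, and the system names and scores in it are invented.

```python
# Minimal sketch of ranking vs. absolute-accuracy metrics for MOS prediction.
# System names and scores below are made up for illustration.
import numpy as np
from scipy.stats import spearmanr

# Per-utterance (system_id, true_MOS, predicted_MOS) triples.
ratings = [
    ("sysA", 4.1, 3.9), ("sysA", 3.8, 3.7),
    ("sysB", 2.9, 3.4), ("sysB", 3.1, 3.5),
    ("sysC", 4.5, 4.2), ("sysC", 4.4, 4.3),
]

def system_level(scores):
    """Average per-utterance true and predicted scores within each system."""
    by_sys = {}
    for sys_id, true_mos, pred_mos in scores:
        by_sys.setdefault(sys_id, []).append((true_mos, pred_mos))
    true_means = np.array([np.mean([t for t, _ in v]) for v in by_sys.values()])
    pred_means = np.array([np.mean([p for _, p in v]) for v in by_sys.values()])
    return true_means, pred_means

true_sys, pred_sys = system_level(ratings)
mse = float(np.mean((true_sys - pred_sys) ** 2))        # absolute accuracy
srcc = spearmanr(true_sys, pred_sys).correlation        # ranking accuracy
print(f"system-level MSE:  {mse:.3f}")
print(f"system-level SRCC: {srcc:.3f}")
```

A predictor can rank systems correctly (high SRCC) while being biased in absolute terms (high MSE), which is why both views matter.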
Investigation of Text-to-Speech based Synthetic Parallel Data for Sequence-to...
This document investigates the use of synthetic parallel data (SPD) to enhance non-parallel voice conversion (VC) through sequence-to-sequence modeling. The study evaluates the feasibility and influence of SPD on VC performance, analyzing various training pairs and the effectiveness of semiparallel datasets. Findings indicate that SPD is viable for VC, but its success depends on the training data quality and the size of the dataset used.
Interactive voice conversion for augmented speech production
This document discusses recent progress in interactive voice conversion techniques for augmenting speech production. It begins by explaining the physical limitations of natural speech production and how voice conversion can augment speech by giving the user control over additional speech attributes. It then discusses how interactive voice conversion enables quick response times, better controllability through real-time feedback, and inference of user intent from multimodal behavior signals. Recent advances covered include low-latency voice conversion networks, controllable waveform generation that respects the source-filter model of speech, and expression control driven by signals such as arm movements. The goal is cooperatively augmented speech that can help users who have lost the ability to speak.
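As a rough illustration of the low-latency processing mentioned above, the sketch below shows a generic chunk-based streaming loop with a small lookahead buffer. The conversion model is stubbed out with an identity function, and the chunk and lookahead sizes are arbitrary choices, not the settings of the systems presented in the slides.

```python
# Minimal sketch of chunk-based, low-latency audio processing. The "model"
# here is a placeholder identity function, not an actual VC network.
import numpy as np

SAMPLE_RATE = 16000
CHUNK = 160          # 10 ms hop processed per step
LOOKAHEAD = 320      # 20 ms of future context the model is allowed to see

def convert_chunk(context: np.ndarray) -> np.ndarray:
    """Placeholder for a streaming VC model: returns the current chunk unchanged."""
    return context[:CHUNK].copy()

def stream(signal: np.ndarray):
    """Feed audio chunk by chunk, keeping only CHUNK + LOOKAHEAD samples buffered."""
    out = []
    buf = np.zeros(0, dtype=signal.dtype)
    for start in range(0, len(signal) - CHUNK + 1, CHUNK):
        buf = np.concatenate([buf, signal[start:start + CHUNK]])
        if len(buf) >= CHUNK + LOOKAHEAD:          # enough context to emit a chunk
            out.append(convert_chunk(buf))
            buf = buf[CHUNK:]                      # slide the window forward
    algorithmic_latency_ms = 1000.0 * (CHUNK + LOOKAHEAD) / SAMPLE_RATE
    return (np.concatenate(out) if out else np.zeros(0)), algorithmic_latency_ms

audio = np.random.randn(SAMPLE_RATE)               # 1 s of dummy input
converted, latency = stream(audio)
print(f"emitted {len(converted)} samples, algorithmic latency = {latency:.0f} ms")
```

The point of the pattern is that algorithmic latency is fixed by chunk size plus lookahead, independent of how fast the network itself runs.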
Recent progress on voice conversion: What is next?
The document discusses recent advancements in voice conversion (VC) techniques, emphasizing the importance of preserving linguistic content while modifying non-linguistic features. It outlines the Voice Conversion Challenges (VCC) from 2016 to 2020, highlighting different training methods and the role of neural vocoders. The paper also suggests future directions for VC research, focusing on improving performance, developing interactive applications, and exploring higher-level feature conversions.
Weakly-Supervised Sound Event Detection with Self-Attention
This document presents a weakly-supervised sound event detection method using self-attention, aiming to enhance detection performance by exploiting weakly labeled data. The proposed approach introduces a special tag token to handle weak labels and employs a Transformer encoder for improved sequence modeling, yielding gains over a baseline CRNN model. Experimental results show a notable increase in sound event detection accuracy, with the new method outperforming the baseline across several evaluation metrics.
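A minimal sketch of the tag-token idea follows: a learned token is prepended to the frame sequence, a Transformer encoder processes both, the tag position predicts clip-level (weak) labels, and the remaining positions predict frame-level events. The layer sizes, head counts, and class count below are illustrative defaults, not the configuration reported in the paper.

```python
# Hedged sketch of tag-token weakly supervised sound event detection.
import torch
import torch.nn as nn

class TagTokenSED(nn.Module):
    def __init__(self, n_mels=64, d_model=144, n_heads=4, n_layers=3, n_classes=10):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)
        self.tag_token = nn.Parameter(torch.randn(1, 1, d_model))  # learned [TAG] embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.frame_head = nn.Linear(d_model, n_classes)  # strong (frame-level) outputs
        self.clip_head = nn.Linear(d_model, n_classes)   # weak (clip-level) outputs

    def forward(self, mel):                              # mel: (batch, frames, n_mels)
        x = self.frame_proj(mel)
        tag = self.tag_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([tag, x], dim=1))     # prepend the tag token
        clip_logits = self.clip_head(x[:, 0])            # tag position -> weak labels
        frame_logits = self.frame_head(x[:, 1:])         # remaining positions -> frames
        return torch.sigmoid(clip_logits), torch.sigmoid(frame_logits)

model = TagTokenSED()
mel = torch.randn(2, 500, 64)                  # two 500-frame log-mel clips
clip_probs, frame_probs = model(mel)
print(clip_probs.shape, frame_probs.shape)     # (2, 10), (2, 500, 10)
```

Weakly labeled clips can then supervise only the clip-level output, while strongly labeled data (if any) also supervises the frame-level output.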
Statistical voice conversion with direct waveform modeling
This document provides an outline for a tutorial on voice conversion (VC) techniques. It begins by stating the tutorial's goals: to help participants grasp the basics and recent progress of VC, develop a baseline VC system, and build a more sophisticated system using a neural vocoder. The tutorial includes an overview of VC techniques and an introduction to freely available software for building a VC system, with breaks between sessions. The first session covers the basics of VC, improvements to VC techniques, and an overview of recent progress in direct waveform modeling. The second session demonstrates how to develop a VC system using the WaveNet vocoder with freely available tools.
The document outlines a hands-on workshop for developing voice conversion (VC) systems using Sprocket, open-source software created at Nagoya University. It details the process of building a traditional GMM-based VC system and includes instructions for installing the software, preparing datasets, and configuring the system for speaker conversion. The overall goal is to provide participants with the knowledge and tools needed to start their own VC research and development.
The document discusses advancements in voice conversion (VC) techniques, focusing on its definition, necessity, methodologies, and research progress. VC is a method for altering speech to convey desired characteristics while retaining the linguistic content. It highlights various models, comparisons of techniques, training improvements, and applications in real-time communication and character voice modulation.
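For reference, the sketch below implements the classical joint-density GMM mapping that GMM-based VC systems like the one built in the workshop rely on: a GMM is fit on joint source/target feature vectors, and conversion takes the posterior-weighted conditional mean of the target features given a source frame. This is a from-scratch illustration on random data, not Sprocket's implementation, which includes further steps (such as trajectory-level generation and postfiltering) omitted here.

```python
# Hedged sketch of joint-density GMM spectral conversion on random stand-in data.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

D, M = 24, 8                                  # feature dim, number of mixtures
rng = np.random.default_rng(0)

# Stand-ins for time-aligned source/target spectral features (e.g. mel-cepstra).
src = rng.normal(size=(2000, D))
tgt = src @ rng.normal(scale=0.3, size=(D, D)) + rng.normal(scale=0.1, size=(2000, D))

# 1) Fit a GMM on joint [source; target] vectors.
joint = np.hstack([src, tgt])
gmm = GaussianMixture(n_components=M, covariance_type="full", random_state=0).fit(joint)

def convert(x):
    """Map one source frame to the expected target frame under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]
    cov = gmm.covariances_
    # Posterior P(m | x) from the marginal source model of each component.
    lik = np.array([multivariate_normal.pdf(x, mu_x[m], cov[m, :D, :D]) for m in range(M)])
    post = gmm.weights_ * lik
    post /= post.sum()
    # Component-wise conditional means E[y | x, m], mixed by the posterior.
    y = np.zeros(D)
    for m in range(M):
        cross = cov[m, D:, :D] @ np.linalg.inv(cov[m, :D, :D])
        y += post[m] * (mu_y[m] + cross @ (x - mu_x[m]))
    return y

converted = np.array([convert(f) for f in src[:5]])
print(converted.shape)                        # (5, 24)
```

Frame-by-frame conditional means like this tend to over-smooth spectra, which is one motivation for the trajectory-level generation and waveform-level modeling discussed in the tutorial.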
Missing Component Restoration for Masked Speech Signals based on Time-Domain Spectrogram Factorization
The document discusses a method for restoring masked speech signals using time-domain spectrogram factorization (TSF), addressing the sparse spectrograms left behind by aggressive noise suppression. It evaluates strategies for restoring speech features by exploiting low-rank structure, the redundancy of spectrograms, and distributions of clean speech features modeled with Gaussian mixture models (GMMs). Experimental results indicate that the proposed TSF method outperforms conventional non-negative matrix factorization (NMF) approaches in restoration quality.
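As a point of reference for the NMF baseline, the sketch below fills masked time-frequency bins of a magnitude spectrogram by fitting a low-rank non-negative factorization to the observed bins only (masked multiplicative updates) and reading the missing bins off the low-rank model. It is a generic illustration on random data, not the specific NMF variant or the TSF method evaluated in the paper.

```python
# Hedged sketch of masked-NMF restoration of missing spectrogram bins.
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 128, 200, 20                     # freq bins, frames, NMF rank

# Synthetic "clean" low-rank magnitude spectrogram and a binary observation mask.
V = np.abs(rng.normal(size=(F, K))) @ np.abs(rng.normal(size=(K, T)))
mask = (rng.random((F, T)) > 0.3).astype(float)   # 1 = observed, 0 = suppressed/missing
V_obs = V * mask

W = np.abs(rng.normal(size=(F, K)))
H = np.abs(rng.normal(size=(K, T)))
eps = 1e-12

# Multiplicative updates restricted to observed bins (masked Euclidean NMF).
for _ in range(200):
    WH = W @ H
    W *= ((mask * V_obs) @ H.T) / ((mask * WH) @ H.T + eps)
    WH = W @ H
    H *= (W.T @ (mask * V_obs)) / (W.T @ (mask * WH) + eps)

# Keep observed bins, fill the masked ones from the low-rank model.
V_restored = mask * V_obs + (1 - mask) * (W @ H)
err = np.linalg.norm((1 - mask) * (V_restored - V)) / np.linalg.norm((1 - mask) * V)
print(f"relative error on restored bins: {err:.3f}")
```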