http://mac.citi.sinica.edu.tw/~yang/
yhyang@ailabs.tw
Yi-Hsuan Yang Ph.D. 1,2
1 Taiwan AI Labs
2 Research Center for IT Innovation, Academia Sinica

20190625 Research at Taiwan AI Labs: Music and Speech AI

Music AI Research (in the Old Days)
• Algorithmic composition
  • MIDI in, MIDI out
• Limitations
  • Lack of diversity and expressivity
  • Some music genres are not “written language”
(Figure: music encoding used by OpenAI’s MuseNet model; figure labels: NLU, NLG)
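"MIDI in, MIDI out" treats a score like text, so the notes first have to become a token sequence. The following is a minimal sketch of one such encoding. The token names (NOTE_ON, NOTE_OFF, TIME_SHIFT) and the integer time grid are illustrative assumptions, not MuseNet's actual vocabulary.

```python
from typing import List, Tuple

# A note is (onset_step, duration_steps, midi_pitch). The time grid and the
# token names below are illustrative assumptions, not MuseNet's vocabulary.
Note = Tuple[int, int, int]


def encode(notes: List[Note]) -> List[str]:
    """Flatten notes into NOTE_ON / NOTE_OFF / TIME_SHIFT tokens."""
    events = []
    for onset, duration, pitch in notes:
        events.append((onset, 1, pitch))             # 1 = note on
        events.append((onset + duration, 0, pitch))  # 0 = note off
    events.sort()  # at equal times, note-offs (0) sort before note-ons (1)

    tokens, now = [], 0
    for time, is_on, pitch in events:
        if time > now:
            tokens.append(f"TIME_SHIFT_{time - now}")
            now = time
        tokens.append(f"{'NOTE_ON' if is_on else 'NOTE_OFF'}_{pitch}")
    return tokens


def decode(tokens: List[str]) -> List[Note]:
    """Invert encode(): rebuild (onset, duration, pitch) triples."""
    notes, active, now = [], {}, 0
    for token in tokens:
        kind, value = token.rsplit("_", 1)
        if kind == "TIME_SHIFT":
            now += int(value)
        elif kind == "NOTE_ON":
            active[int(value)] = now
        else:  # NOTE_OFF
            onset = active.pop(int(value))
            notes.append((onset, now - onset, int(value)))
    return sorted(notes)


melody = [(0, 2, 60), (2, 2, 64)]  # C4 then E4, two steps each
tokens = encode(melody)
```

A sequence model trained on such tokens only ever sees symbols, which is one way to read the limitations listed above: timbre and performance nuance are not part of the vocabulary.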

Music AI Research (at Taiwan AI Labs)
• Audio in, audio out
• audio → audio: source separation (SS) [cf. denoising]
• audio → score: music transcription (MT) [cf. ASR]
• score → score: composition [cf. NLG]
• score → audio: synthesis [cf. TTS]

Note: A Song is Composed of Multiple Tracks
“I Have Nothing” by Whitney Houston
(Slide made by Hao-Min Liu)

Step 1: Source Separation
• “Demix” the music signal
  • Input: audio mixture
  • Output: individual tracks
(image from the Internet)
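The input/output contract above (one mixture in, several tracks out) can be illustrated with a toy example. This is only a sketch under a strong assumption: the two sources occupy disjoint frequency bands, so a fixed binary frequency mask is enough. Real music separation, including whatever model the slides refer to, has to handle overlapping spectra and typically uses learned models. The names `dft`, `demix`, and `cutoff_bin` are hypothetical.

```python
import cmath
import math
from typing import List, Tuple


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for a toy signal)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def demix(mixture: List[float], cutoff_bin: int) -> Tuple[List[float], List[float]]:
    """Split a mixture into (low, high) tracks with a binary frequency mask."""
    n = len(mixture)
    spectrum = dft([complex(v) for v in mixture])
    # Bin k and bin n - k carry the same frequency for a real signal.
    is_low = [min(k, n - k) < cutoff_bin for k in range(n)]
    low = dft([s if m else 0 for s, m in zip(spectrum, is_low)], inverse=True)
    high = dft([0 if m else s for s, m in zip(spectrum, is_low)], inverse=True)
    return [v.real for v in low], [v.real for v in high]


# Toy "song": a low-pitched track plus a quieter high-pitched track.
n = 64
bass = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
melody = [0.5 * math.sin(2 * math.pi * 12 * t / n) for t in range(n)]
mixture = [b + m for b, m in zip(bass, melody)]

low, high = demix(mixture, cutoff_bin=8)  # low ~ bass, high ~ melody
```

Because the mask partitions the spectrum, the two output tracks always sum back to the mixture; whether each track matches a real instrument depends entirely on how well the mask fits the sources.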

Step 1: Source Separation
• https://ailabs.tw/human-interaction/transcription4generation/

Step 2: Music Transcription
• https://ailabs.tw/human-interaction/transcription4generation/
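Transcription maps audio to score. The very last step of that mapping, turning an estimated fundamental frequency into a note, follows the standard equal-temperament relation (A4 = 440 Hz = MIDI note 69). The sketch below covers only that step; detecting onsets, durations, and simultaneous notes is the hard part and is not shown. The function names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_midi(frequency_hz: float) -> int:
    """Nearest MIDI note number, with A4 = 440 Hz = note 69."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))


def midi_to_name(note: int) -> str:
    """Scientific pitch name, using the convention that note 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


print(midi_to_name(hz_to_midi(440.0)))   # A4
print(midi_to_name(hz_to_midi(261.63)))  # C4
```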

Beyond Piano
• Input
  • Mixture
• Output
  • Piano
  • Guitar
  • Drums

Step 3: Music Composition
• https://vibertthio.com/jazz-rnn/

• https://ailabs.tw/human-interaction/ai-jazz-bass-player/
• https://youtu.be/TS6pQdUM0Ws

(Figure: pipeline diagram. Labels: Data; Source Separation; Transcription (Training Data); JazzRNN Model (or any target style); Chord; Pop Style; Jazz Style)

Use SS for Making Hip-Hop Music
• https://youtu.be/WW_4sTMLIVg

Music AI Research (at Taiwan AI Labs)
• Human in the loop

Yating Transcriber (雅婷逐字稿): Why?
• Mission: define the future experiences with AI in Taiwan and for the world

Tasks: Tackled and In Progress
• Tasks tackled
  • Stream decoder pipeline
  • Data annotation pipeline
  • Automatic data/model management
  • TTS
• Tasks in progress
  • Code switching
  • Sequence-to-sequence ASR

ASR Data Labeling
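"The 若水 AI data services team is dedicated to providing secure, fast, and high-quality labeled data for all kinds of machine learning and AI applications. Our vision is 'to uncover the world's latent intelligence and supply AI engines with high-quality training data'… With the full support of our founder, Trend Micro chairman Steve Chang (張明正), we accelerate model learning and bring unlimited convenience to AI engineers and data scientists."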

Sequence to Sequence ASR
• Advantages
  1. Optimizes word accuracy directly
  2. Allows a smaller model
  3. Does not depend on a lexicon, which helps for some languages (e.g., Taiwanese, 台語)
• Disadvantages
  1. Needs more data than a traditional model (e.g., one built with Kaldi) to get comparable results
  2. Harder to make word-level adjustments (e.g., it is difficult to modify the probabilities of certain specific words)
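Word accuracy, as in advantage 1, is commonly reported through the word error rate (WER): the token-level edit distance between the reference and the recognized text, divided by the reference length. The slides do not say which exact metric was used, so the following is a generic sketch with hypothetical function names.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference token missing
                curr[j - 1] + 1,         # insertion: extra hypothesis token
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER over whitespace-separated tokens; assumes a non-empty reference."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

For languages written without spaces, the same `edit_distance` can be applied to lists of characters instead of words, which gives a character error rate.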
