http://mac.citi.sinica.edu.tw/~yang/
yhyang@ailabs.tw
Yi-Hsuan Yang Ph.D. 1,2
1 Taiwan AI Labs
2 Research Center for IT Innovation, Academia Sinica
20190625 Research at Taiwan AI Labs: Music and Speech AI
Music AI Research (in the Old Days)
• Algorithmic composition
  • MIDI in, MIDI out
• Limitations
  • Lacks diversity and expressivity
  • Some music genres are not written down as notation
NLU
NLG
(Music encoding used by OpenAI's MuseNet model)
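To make the "MIDI in, MIDI out" idea concrete, here is a minimal sketch of how notes can be flattened into a token sequence that a language-model-style system can read and generate. The `on:`/`off:`/`wait:` vocabulary and the `encode`/`decode` helpers are hypothetical and chosen for illustration; this is not MuseNet's actual encoding.

```python
from typing import List, Tuple

# A note is (onset_tick, midi_pitch, duration_ticks). This tuple layout is an
# assumption made for this sketch, not a format taken from the slides.
Note = Tuple[int, int, int]


def encode(notes: List[Note]) -> List[str]:
    """Flatten notes into note-on, note-off and wait tokens."""
    events = []
    for onset, pitch, duration in notes:
        events.append((onset, 1, f"on:{pitch}"))
        events.append((onset + duration, 0, f"off:{pitch}"))
    # Sort by time; at the same tick, note-offs (0) come before note-ons (1).
    events.sort()
    tokens, now = [], 0
    for time, _, name in events:
        if time > now:
            tokens.append(f"wait:{time - now}")
            now = time
        tokens.append(name)
    return tokens


def decode(tokens: List[str]) -> List[Note]:
    """Rebuild the note list from the token sequence."""
    notes, active, now = [], {}, 0
    for token in tokens:
        kind, value = token.split(":")
        number = int(value)
        if kind == "wait":
            now += number
        elif kind == "on":
            active[number] = now
        elif kind == "off":
            onset = active.pop(number)
            notes.append((onset, number, now - onset))
    return sorted(notes)


melody = [(0, 60, 4), (4, 64, 4), (8, 67, 8)]  # C, E, G
tokens = encode(melody)
print(tokens)
print(decode(tokens) == melody)
```

Once music is a token sequence like this, "understanding" and "generating" it can borrow NLU and NLG techniques, which is also why genres without written scores are hard to cover this way.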
Music AI Research (at Taiwan AI Labs)
• Audio in, audio out
• audio → audio: source separation (SS) [denoising]
• audio → score: music transcription (MT) [ASR]
• score → score: composition [NLG]
• score → audio: synthesis [TTS]
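The four tasks above form a small lookup from (input, output) representation to task, each with its speech or language analogue. The sketch below only restates that list as data; the `TASKS` table and the `chain` helper are illustrative names, not part of any library.

```python
# (input, output) -> (music task, speech/language analogue), as listed on the slide.
TASKS = {
    ("audio", "audio"): ("source separation", "denoising"),
    ("audio", "score"): ("music transcription", "ASR"),
    ("score", "score"): ("composition", "NLG"),
    ("score", "audio"): ("synthesis", "TTS"),
}


def chain(*representations):
    """Name the task needed for each hop in a sequence of representations."""
    hops = zip(representations, representations[1:])
    return [TASKS[hop][0] for hop in hops]


# One way to go "audio in, audio out" while composing in the score domain.
print(chain("audio", "score", "score", "audio"))
```

Chaining the hops shows how an audio-to-audio system can be assembled from transcription, composition, and synthesis.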
Note: A Song is Composed of Multiple Tracks
"I Have Nothing" by Whitney Houston
(Slide made by Hao-Min Liu)
Step 1: Source Separation
• Demix the music signal
  • Input: audio mixture
  • Output: individual tracks
(image from the Internet)
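As a toy illustration of "mixture in, individual tracks out", the sketch below separates two synthetic tones by masking the spectrum of their mixture. It assumes the sources occupy known, non-overlapping frequency bands (the `CUTOFF` bin is a hypothetical choice); real music separation, including whatever model the slides refer to, needs learned models because real instruments overlap in frequency.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for a short toy signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


N = 64
# Two synthetic "tracks": a low tone (2 cycles) and a quieter high tone (12 cycles).
bass = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]
lead = [0.5 * math.sin(2 * math.pi * 12 * t / N) for t in range(N)]
mixture = [b + l for b, l in zip(bass, lead)]

CUTOFF = 6  # hypothetical split point, in DFT bins
spectrum = dft(mixture)
# min(k, N - k) treats a bin and its mirror image alike, so outputs stay real.
low = [X if min(k, N - k) < CUTOFF else 0 for k, X in enumerate(spectrum)]
high = [X if min(k, N - k) >= CUTOFF else 0 for k, X in enumerate(spectrum)]

bass_estimate = idft(low)
lead_estimate = idft(high)
print(max(abs(a - b) for a, b in zip(bass_estimate, bass)))
print(max(abs(a - b) for a, b in zip(lead_estimate, lead)))
```

The two printed errors are at floating-point rounding level because the toy sources do not overlap in frequency.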
Step 1: Source Separation
• https://ailabs.tw/human-interaction/transcription4generation/
Step 2: Music Transcription
• https://ailabs.tw/human-interaction/transcription4generation/
Beyond Piano
• Input
  • mixture
• Output
  • piano
  • guitar
  • drums
Step 3: Music Composition
• https://vibertthio.com/jazz-rnn/
Step 3: Music Composition
• https://ailabs.tw/human-interaction/ai-jazz-bass-player/
• https://youtu.be/TS6pQdUM0Ws
Step 3: Music Composition
[Diagram labels: Source Separation; Transcription (Training Data); JazzRNN (or any target style); Data; Model; Chord; Pop Style; Jazz Style]
Use SS for Making Hip-Hop Music
• https://youtu.be/WW_4sTMLIVg
Music AI Research (at Taiwan AI Labs)
• Human in the loop
Yating Transcript (雅婷逐字稿) APP
Yating Transcript (雅婷逐字稿): Why?
• Mission: define the future experiences with AI in Taiwan and for the world
Tasks Tackled / Being Tackled
• Tasks tackled
  • Stream decoder pipeline
  • Data annotation pipeline
  • Automatic data/model management
  • TTS
• Tasks being tackled
  • Code switching
  • Sequence-to-sequence ASR
ASR Data Labeling
[Chinese-language text about an AI data-annotation team and high-quality labeling; largely unreadable in this extraction]
Sequence-to-Sequence ASR
• Advantages
  1. Optimizes word accuracy directly
  2. Allows a smaller model
  3. Does not depend on a lexicon, which helps for some languages (e.g., Taiwanese)
• Disadvantages
  1. Needs more data than a traditional model (e.g., Kaldi) to get comparable results
  2. Harder to make word-specific adjustments (e.g., hard to modify the probabilities of certain specific words)
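"Word accuracy" is commonly reported as one minus the word error rate (WER), the word-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The sketch below computes that metric only; `word_error_rate` is an illustrative helper, and it assumes whitespace-separated words, which would not hold for unsegmented Chinese text without a tokenizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("some" -> "sum") and one deletion ("music"): 2 errors / 4 words.
wer = word_error_rate("play some jazz music", "play sum jazz")
print(wer)  # 0.5
```

A lexicon-based system can lower WER on a particular word by editing its dictionary or language model, which is the kind of word-specific adjustment that the second disadvantage says is harder in a sequence-to-sequence model.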
You Are Welcome to Visit Us!
