The Effects of Noisy Labels
on deep convolutional neural networks for music tagging
Keunwoo Choi (keunwoo.choi@qmul.ac.uk), György Fazekas, Kyunghyun Cho, Mark Sandler
arXiv:1706.02361
Abstract
1. Introduction
@KeunwooChoi
2014--present: PhD, Queen Mary University of London

2016--present: Buzzmusiq Inc.

2016/ 06--12: Visiting PhD, NYU

2015/ 06--09: Intern, Naver Labs

2011--2014: Audio research team, ETRI

2009--2011: Applied Acoustic Lab, EECS, SNU

2005--2009: EECS, SNU

Papers on ISMIR/ICASSP/IEEE Trans./Etc.

Python/Keras/Pytorch
1. INTRODUCTION
Tagging
• Anyone can tag any words (or non-words) to any song
• The quality is ****.
• Poor, innocent, (financially) poor researchers need to use it
Tagging
(Tag, count)
rock 101071
pop 69159
alternative 55777
indie 48175
electronic 46270
female vocalists 42565
favorites 39921
00s 31432
Awesome 26248
american 22694
seen live 20705
cool 19581
Favorite 18864
Favourites 17722
female vocalist 17328
guitar 17302
loved 12483
favorite songs 12392
heard on Pandora 10470
USA 8725
2000s 8671
Favourite Songs 8661
drjazzmrfunkmusic 8364
77davez-all-tracks 7278
fav 6155
bass 3364
songs I absolutely love 3293
vocals 2369
drums 2281

[Chart: True/False proportions (0–100%) for Female vocalists, Male vocalist, Guitar, Bass, Vocals, Drums]
Questions
• How noisy?
• Is training alright?
• How about evaluation?
• What are they learning?
2. HOW NOISY?
IS TRAINING OK?
Measuring the noise
• We need strongly-labelled re-annotations
• Instrumentation labels are (sort of) objective:
  (instrumental, female vocal, male vocal, guitar)
• 242K songs are still a lot → select a subset (or two)!
"I can do it! ..but not all of them"
Strongly labelling: Subset100
• Subset100: random 50 from "True" + random 50 from "False" (for each label)

Label             | True     | False
Instrumental      | 50 songs | 50 songs
Female vocalists  | 50       | 50
Male vocalist     | 50       | 50
Guitar            | 50       | 50
Strongly labelling: Subset400
• Subset400: just a random 400 items
[Diagram: before = 242K songs × 50 tags; after = Subset400, 400 songs × 4 tags]
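A minimal sketch of how the two subsets could be drawn, assuming the groundtruth is a binary songs-by-tags table; the DataFrame and all names below are hypothetical stand-ins, not the paper's actual pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real groundtruth: one row per song,
# binary columns for the four instrumentation tags.
TAGS = ["instrumental", "female vocalists", "male vocalist", "guitar"]
rng = np.random.default_rng(0)
songs_x_tags = pd.DataFrame(rng.integers(0, 2, size=(242_000, len(TAGS))), columns=TAGS)

def make_subset100(df, tag, seed=0):
    """Subset100 for one label: 50 random 'True' songs + 50 random 'False' songs."""
    pos = df[df[tag] == 1].sample(50, random_state=seed)
    neg = df[df[tag] == 0].sample(50, random_state=seed)
    return pd.concat([pos, neg])

def make_subset400(df, seed=0):
    """Subset400: 400 songs drawn uniformly at random, keeping only the four tags."""
    return df.sample(400, random_state=seed)[TAGS]

subset100 = {tag: make_subset100(songs_x_tags, tag) for tag in TAGS}
subset400 = make_subset400(songs_x_tags)
```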
Evaluating groundtruth on Subset100
[Bar charts, 0–100: "+ error rate / Precision" and "− error rate / Recall" of the groundtruth, per label: Instrumental, female voc, male vocal, guitar]
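Concretely, "evaluating the groundtruth" can be read as scoring the crowd tags as if they were predictions against the strong re-annotations; a sketch of that per-label computation (the annotation arrays below are made up):

```python
from sklearn.metrics import precision_score, recall_score

def groundtruth_quality(noisy, strong):
    """Precision/recall of the noisy (crowd) labels against strong re-annotations.
    noisy, strong: binary arrays for one tag over the Subset100 songs."""
    return {
        "precision": precision_score(strong, noisy),  # complements the '+' error rate
        "recall": recall_score(strong, noisy),        # complements the '-' error rate
    }

# Toy example for one tag (made-up annotations):
noisy = [1, 1, 1, 0, 0, 0, 1, 0]
strong = [1, 1, 0, 0, 0, 1, 1, 0]
print(groundtruth_quality(noisy, strong))
```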
#Occurrences estimation
[Bar chart, 0–80: estimated occurrences per label (Instrumental, female voc, male vocal, guitar), comparing (i) in all, by GT, (ii) my estimation using S100, (iii) my re-annotation on S400]
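One plausible way to turn the Subset100 error rates into such an estimate (an assumption about the computation, not necessarily the paper's exact formula): keep the tagged songs that survive the precision check, and add back the untagged songs that the "False" half of Subset100 suggests were missed.

```python
def estimate_occurrences(n_total, n_tagged, precision, miss_rate):
    """Hypothetical estimator of how many songs truly have an attribute.
    precision : fraction of tagged songs that really have it (S100 'True' half)
    miss_rate : fraction of untagged songs that actually have it (S100 'False' half)
    """
    return n_tagged * precision + (n_total - n_tagged) * miss_rate

# Made-up numbers, for illustration only:
print(estimate_occurrences(n_total=242_000, n_tagged=42_565, precision=0.9, miss_rate=0.2))
```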
Again, with box plots:
{Instrumental, female vocalists} vs. {male vocalists, guitar}
Group A vs B, but why?
• Tagging vocals, drums, bass is like.. "***? What's on the desk?"
• They're not tag-worthy
• Let's call it "taggability"
[Chart: True/False proportions (0–100%) for Female vocalists, Male vocalist, Guitar, Bass, Vocals, Drums]
The hypothesis
If unusual → high taggability.

Instrumental, female vocal: high taggability
Male vocal, guitar: low taggability
The hypothesis
If unusual → high taggability.
If high taggability → fewer false negatives = higher recall (of GT)

Instrumental, female vocal: high taggability, fewer false negatives, higher recall
Male vocal, guitar: low taggability, more false negatives, lower recall
The hypothesis
If unusual → high taggability.
If high taggability → fewer false negatives = higher recall (of GT)
If higher recall (= more reliable GT) → ?
Hypothesis
If unusual → high taggability.
If high taggability → fewer false negatives = higher recall (of GT)
If higher recall (= more reliable GT) → ?  ...Performance (AUC)!!!
[Per-tag AUC figure from [33] Choi et al. 2017, "Convolutional recu..."]
The hypothesis
If unusual → high taggability.
If high taggability → fewer false negatives = higher recall (of GT)
If higher recall (= more reliable GT) → better classification

Instrumental, female vocal: high taggability, fewer false negatives, higher recall, better classification
Male vocal, guitar: low taggability, more false negatives, lower recall, worse classification
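The last link of the chain, "better classification", is read off per-tag scores; a minimal sketch of that per-tag AUC measurement with scikit-learn (the arrays here are random placeholders, not the paper's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

TAGS = ["instrumental", "female vocalists", "male vocalist", "guitar"]

def per_tag_auc(y_true, y_score, tags=TAGS):
    """AUC-ROC per tag. y_true is binary, y_score holds model outputs;
    both have shape (n_songs, n_tags)."""
    return {tag: roc_auc_score(y_true[:, i], y_score[:, i]) for i, tag in enumerate(tags)}

# Illustration with random placeholders only:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(400, len(TAGS)))
y_score = rng.random((400, len(TAGS)))
print(per_tag_auc(y_true, y_score))
```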
3. IS EVALUATION OK?
Really?
So, we evaluate the classifier based on..
I need a noise-free groundtruth...
Evaluate the evaluation
[Diagram: 242K songs × 50 tags → Subset400, 400 songs × 4 tags]
"HAHAHAH! Subset400!"
Results
Evaluate the evaluation
Interesting! With such noise, the results are still okay.
It's not perfect though.
"HAHAHA!"
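A hedged sketch of what "evaluating the evaluation" could look like in code: score the same model outputs once against noisy labels and once against strong re-annotations on the 400 songs, and compare per-tag AUCs. Everything below is a synthetic illustration, not the paper's data or exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

TAGS = ["instrumental", "female vocalists", "male vocalist", "guitar"]
rng = np.random.default_rng(0)

# Synthetic stand-ins for Subset400 (400 songs x 4 tags):
strong_gt = rng.integers(0, 2, size=(400, len(TAGS)))       # strong re-annotations
flip = rng.random(strong_gt.shape) < 0.2                     # simulate ~20% label noise
noisy_gt = np.where(flip, 1 - strong_gt, strong_gt)          # crowd-style noisy labels
scores = strong_gt + rng.normal(0.0, 0.6, strong_gt.shape)   # fake model outputs

for i, tag in enumerate(TAGS):
    auc_noisy = roc_auc_score(noisy_gt[:, i], scores[:, i])
    auc_strong = roc_auc_score(strong_gt[:, i], scores[:, i])
    print(f"{tag:17s}  AUC vs noisy GT: {auc_noisy:.3f}  vs strong GT: {auc_strong:.3f}")
```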
4. LABEL VECTOR
ANALYSIS
Label vector
[Figure; matrix of shape (50, 50)]
Label vector similarity
• Similarity between labels, according to the trained convnet
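A hedged sketch of one common way to obtain such label vectors (an assumption about the construction, not necessarily the paper's exact method): take the final dense layer's weights from the trained network, treat each tag's weight column as that tag's vector, and compare tags by cosine similarity.

```python
import numpy as np
# e.g. with a trained Keras tagger (hypothetical file name):
# from tensorflow.keras.models import load_model
# W = load_model("tagger.h5").layers[-1].get_weights()[0]   # shape (feature_dim, 50)

def label_similarity(W):
    """Cosine similarity between tags; column j of W is tag j's label vector."""
    W = W / np.linalg.norm(W, axis=0, keepdims=True)  # L2-normalise each column
    return W.T @ W                                     # (n_tags, n_tags) matrix

# Random stand-in for the trained layer: 256 hypothetical features -> 50 tags
rng = np.random.default_rng(0)
sim = label_similarity(rng.normal(size=(256, 50)))
print(sim.shape)   # (50, 50), matching the slide
```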
Label vector
Label vector vs co-occurrence (GT)
[Figure: label-vector similarity vs groundtruth co-occurrence]
Label vector vs co-occurrence (GT)
• Mostly, the label vectors reproduce the groundtruth co-occurrence.
• Except: some pairs are similar only by label vector:
  (sad, beautiful), (happy, catchy), (rnb, sexy)
  Sad songs are beautiful. Catchy songs are often happy songs. R&B claims to be sexy.
• Makes sense..
5. CONCLUSIONS
Conclusions
• We quantified how noisy the weakly-labelled groundtruth is.
• We conjectured why some labels are noisier than others.
• We showed what happens to the noisier labels in training and evaluation.
• We investigated what the convnet learns.
Links
My blog | blog post 1, blog post 2 | Paper!
