SLIDE 1

Richard Vogl (1,2), Gerhard Widmer (2), Peter Knees (1)

richard.vogl@tuwien.ac.at, gerhard.widmer@jku.at, peter.knees@tuwien.ac.at

TOWARDS MULTI-INSTRUMENT DRUM TRANSCRIPTION

(1) TU Wien, (2) Johannes Kepler University Linz

SLIDE 2

WHAT IS DRUM TRANSCRIPTION?

Input: popular music containing drums
Output: symbolic representation of the notes played by the drum instruments

SLIDE 3

STATE OF THE ART

Current state-of-the-art systems:

End-to-end / activation-function-based approaches
NN-based approaches and NMF approaches

Overview article:

Wu, C.-W., Dittmar, C., Southall, C., Vogl, R., Widmer, G., Hockman, J., Müller, M., Lerch, A.: “An Overview of Automatic Drum Transcription,” IEEE TASLP, vol. 26, no. 9, Sept. 2018.

[Figure: spectrogram and the corresponding activation functions for bass drum, snare drum, and hi-hat, both over t [ms]]

SLIDE 12

FOCUS OF THIS WORK

State-of-the-art (SotA) work focuses on bass drum (BD), snare drum (SD), and hi-hat (HH):

Make up the majority of notes in datasets
Beat-defining / most important
Well-separated spectral energy distribution

[Figure: bass drum, snare drum, hi-hat]

Other instruments are important!

→ Increase the number of instruments for drum transcription

SLIDE 13

SYSTEM OVERVIEW

audio → signal preprocessing (feature extraction) → NN (event detection, classification) → peak picking → events
train data → NN training → NN

[Figure panels: waveform over t [s]; spectrogram, f [Hz] over t [s]; activation functions for bass, snare, and hi-hat over t [s]; detected peaks for bass, snare, and hi-hat over t [s]]
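The peak-picking stage of the pipeline can be illustrated with a minimal sketch. It is not the method used in this work: it assumes a plain threshold plus local-maximum rule, and the names `pick_peaks`, `threshold`, `min_gap`, and the frame rate of 100 frames per second are hypothetical choices made for the example.

```python
def pick_peaks(activation, threshold=0.5, min_gap=2):
    """Return frame indices of local maxima that reach the threshold.

    activation: list of floats, one NN output value per frame
    threshold:  minimum activation for a peak (hypothetical default)
    min_gap:    minimum distance in frames between two reported peaks
    """
    peaks = []
    for i, value in enumerate(activation):
        left = activation[i - 1] if i > 0 else float("-inf")
        right = activation[i + 1] if i + 1 < len(activation) else float("-inf")
        # Strictly greater than the left neighbour, so only the first frame
        # of a plateau is reported.
        is_local_max = value > left and value >= right
        if is_local_max and value >= threshold:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return peaks


def frames_to_seconds(frames, frame_rate=100):
    """Convert frame indices to onset times (frame_rate is an assumption)."""
    return [frame / frame_rate for frame in frames]


# Toy activation function for a single instrument (made-up values).
snare_activation = [0.0, 0.1, 0.8, 0.2, 0.1, 0.6, 0.7, 0.1, 0.05]
snare_frames = pick_peaks(snare_activation)
snare_onsets = frames_to_seconds(snare_frames)
```

In a multi-instrument setting the same function would be run once per activation function, giving one onset list per instrument.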

SLIDE 16

NETWORK ARCHITECTURES

Convolutional NN (CNN)

Convolutions capture local correlations
Acoustic modeling of drum sounds

Convolutional RNN (CRNN)

"Best of both worlds"
Low-level CNN for acoustic modeling
Higher-level RNN for repetitive pattern modeling

[Figures: CNN train data sample; CRNN train data sample]

SLIDE 17

NETWORK ARCHITECTURES

CNN: 2 x conv (32 filters, 3x3, batch norm), max pool 1x3, 2 x conv (64 filters, 3x3, batch norm), max pool 1x3, 2 x dense (256)

CRNN: 2 x conv (32 filters, 3x3, batch norm), max pool 1x3, 2 x conv (64 filters, 3x3, batch norm), max pool 1x3, 3 x RNN (50 bidirectional GRUs)

model   frames   context   conv. layers   rec. layers                 dense layers
CNN     -        25        see above      -                           2 x 256
CRNN    400      13        see above      3 x 50 bidirectional GRU    -

Early stopping
Batch normalization
L2 norm
Dropout (30%)
ADAM optimizer
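To make the shared convolutional stack more concrete, the following sketch tracks how the channel count and the frequency axis change through the layers. Several things here are assumptions not stated on the slide: an input of 84 frequency bins, convolutions that preserve the input size ("same" padding), and 1x3 pooling applied along the frequency axis only. The names `CONV_STACK` and `conv_stack_output` are hypothetical.

```python
# Layer stack shared by the CNN and the CRNN, as listed on the slide:
# (kind, number of layers, filters for conv / pool factor for pool)
CONV_STACK = [
    ("conv", 2, 32),  # 2 x conv: 32 filters, 3x3, batch norm
    ("pool", 1, 3),   # max pool 1x3
    ("conv", 2, 64),  # 2 x conv: 64 filters, 3x3, batch norm
    ("pool", 1, 3),   # max pool 1x3
]


def conv_stack_output(freq_bins, stack=CONV_STACK):
    """Return (channels, freq_bins) after the stack.

    Assumes size-preserving convolutions and pooling over frequency only,
    so the time axis (frames / context) is left untouched.
    """
    channels = 1  # a single-channel spectrogram is assumed as input
    for kind, _count, size in stack:
        if kind == "conv":
            channels = size        # convolutions only change the channel count
        else:
            freq_bins //= size     # pooling shrinks the frequency axis
    return channels, freq_bins


channels, bins = conv_stack_output(84)   # 84 input bins is an assumption
features_per_frame = channels * bins     # size fed to the dense or GRU layers
```

Under these assumptions each frame is summarized by `channels * bins` values before the dense layers (CNN) or recurrent layers (CRNN) take over.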

SLIDE 24

DATASETS

ENST-Drums [Gillet and Richard 2006]

Recordings, three drummers / drum kits
64 tracks, total duration: 1h

MDB Drums [Southall et al. 2017]

Drum annotations for Medley DB subset
23 tracks, total duration: 20m

RBMA13-Drums [Vogl et al. 2017]

Music from 2013 Red Bull Music Academy, different styles
27 tracks, total duration: 1h 43m

SLIDE 27

DATASETS

number of classes
3     8     18     instrument name
BD    BD    BD     bass drum
SD    SD    SD     snare drum
            SS     side stick
            CLP    hand clap
      TT    HT     high tom
            MT     mid tom
            LT     low tom
HH    HH    CHH    closed hi-hat
            PHH    pedal hi-hat
            OHH    open hi-hat
            TB     tambourine
      RD    RD     ride cymbal
      BE    RB     ride bell
            CB     cowbell
      CY    CRC    crash cymbal
            SPC    splash cymbal
            CHC    Chinese cymbal
      CL    CL     clave/sticks

[Figure: relative frequency of instrument onsets for the 18-, 8-, and 3-class systems]
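The quantity plotted in the figure, the relative frequency of instrument onsets, is simply each instrument's share of all annotated onsets. A minimal sketch, using a made-up annotation list and the hypothetical helper name `relative_onset_frequency`:

```python
from collections import Counter


def relative_onset_frequency(onset_labels):
    """Map each instrument label to its share of all annotated onsets."""
    counts = Counter(onset_labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy annotation of one bar (made-up data, labels from the 18-class system).
toy_onsets = ["BD", "CHH", "SD", "CHH", "BD", "CHH", "CHH", "CRC"]
toy_frequencies = relative_onset_frequency(toy_onsets)
```

Computed over a whole dataset, a few labels typically dominate the resulting dictionary, which is the imbalance the figure visualizes.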

SLIDE 34

SYNTHETIC DATASET (NEW!)

Synthetic dataset from MIDI songs

Mix of different genres, full songs
Optional accompaniment
Diverse drum sounds (57 different drum kits, acoustic and electronic)
Varying quality, no vocals!
4197 tracks, total duration: 259h

SLIDE 38

SYNTHETIC DATASET

Follows the same relative instrument distribution as the real-world datasets

− Same bias for instruments: same problems during training
+ Datasets are representative samples

[Figure: relative frequency of instrument onsets for the 18-, 8-, and 3-class systems]

SLIDE 43

BALANCING OF SYNTHETIC DATASET

Swap instruments for individual tracks
Artificial balancing of instrument distribution

[Figure: relative frequency of instrument onsets for the 18-, 8-, and 3-class systems]
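The idea of swapping instruments in individual tracks can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to build the dataset: tracks are represented as lists of `(time, label)` pairs, and the policy of swapping in every second track, along with the names `swap_instruments` and `balance`, is hypothetical.

```python
from collections import Counter


def swap_instruments(track, a, b):
    """Return a copy of the track with instrument labels a and b exchanged."""
    mapping = {a: b, b: a}
    return [(time, mapping.get(label, label)) for time, label in track]


def balance(tracks, rare, common):
    """Swap `rare` and `common` in every second track (hypothetical policy)."""
    return [
        swap_instruments(track, rare, common) if index % 2 else track
        for index, track in enumerate(tracks)
    ]


def onset_counts(tracks):
    """Count onsets per instrument over all tracks."""
    return Counter(label for track in tracks for _time, label in track)


# Two toy tracks with a strong hi-hat bias (made-up data).
toy_tracks = [
    [(0.0, "HH"), (0.5, "HH"), (1.0, "CB")],
    [(0.0, "HH"), (0.5, "HH"), (1.0, "CB")],
]
before = onset_counts(toy_tracks)
after = onset_counts(balance(toy_tracks, rare="CB", common="HH"))
```

With MIDI source material, a swap like this would presumably be applied to the note events before the audio is synthesized, so that audio and annotations stay consistent.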

SLIDE 46

RESULTS ON SYNTHETIC DATA

Overall performance for MIDI bal. is worse

It is a harder task

Performance of underrepresented instruments improves

Providing more samples forces the network to learn formerly sparsely used instruments

[Figure: F-measure for MIDI and MIDI bal.]
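For reference, the F-measure reported in these plots combines precision and recall of detected onsets against annotated onsets. The sketch below assumes that a detection counts as correct if it lies within a tolerance window of an unmatched annotation; the tolerance value of 20 ms and the function name `f_measure` are assumptions for the example, not values taken from the slides.

```python
def f_measure(detected, reference, tolerance=0.02):
    """F-measure of detected onset times against reference onset times.

    Both inputs are lists of times in seconds for a single instrument.
    Each reference onset can be matched by at most one detection.
    """
    detected = sorted(detected)
    reference = sorted(reference)
    i = j = true_positives = 0
    while i < len(detected) and j < len(reference):
        diff = detected[i] - reference[j]
        if abs(diff) <= tolerance:
            true_positives += 1   # match: consume both onsets
            i += 1
            j += 1
        elif diff < 0:
            i += 1                # detection too early: false positive
        else:
            j += 1                # annotation missed: false negative
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three detections match two of three annotations.
score = f_measure([0.10, 0.51, 0.90], [0.10, 0.50, 1.20])
```

Evaluating each instrument separately and then combining the counts is one common way to arrive at an overall score; which aggregation the plots use is not stated on the slide.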

SLIDE 47

OVERALL PERFORMANCE ON REAL DATA

Model trained on synthetic data performs well on real-world data (ENST + MDB + RBMA)

[Figure: F-measure (0.00 to 1.00) over number of instrument classes (3, 8, 18); trained on: real, MIDI, MIDI bal.; evaluated on: real]

SLIDE 48

RESULTS FOR DIFFERENT SIZES

Performance decreases with smaller training sets, but not drastically

[Figure: F-measure (0.00 to 1.00) over number of instrument classes (3, 8, 18); trained on: MIDI 100%, MIDI 10%, MIDI 1%; evaluated on: real]

SLIDE 51

PERFORMANCE FOR INSTRUMENTS

Improvements observed on balanced synthetic data do not translate to real-world data
Small improvements using pre-training

[Figure: F-measure; trained on: real, real+MIDI, real+MIDI bal., MIDI bal., MIDI, pretrain MIDI, pt. MIDI bal.; evaluated on: real]

SLIDE 52

INSTRUMENT CONFUSIONS

[Figure: confusions, hidden onsets, and additional onsets; trained on: real+MIDI; evaluated on: real+MIDI]

SLIDE 61

CAN YOU HEAR THE DIFFERENCE?

Bass drum or low tom?

1: low tom 2: bass drum 3: bass drum

Which cymbal is it? (hi-hat, splash cymbal, crash cymbal, Chinese cymbal, ride cymbal, ?)

1: crash 2: ride 3: China

SLIDE 62

CONCLUSIONS

Publicly available large-scale synthetic dataset

Optionally with balanced instruments
Generalizes well to real data

Dataset size is important but not that critical
Balancing did not improve performance on real-world data

Recurrent layers learn untypical patterns

Pre-training with synthetic data provides a small improvement
Mistakes are understandable

Focus more on context

http://ifs.tuwien.ac.at/~vogl/dafx2018/