TOWARDS MULTI-INSTRUMENT DRUM TRANSCRIPTION


slide-1
SLIDE 1

TOWARDS MULTI-INSTRUMENT DRUM TRANSCRIPTION

Richard Vogl (1,2), Gerhard Widmer (2), Peter Knees (1)

richard.vogl@tuwien.ac.at, gerhard.widmer@jku.at, peter.knees@tuwien.ac.at

[Affiliation logos: 1, 2]

slide-2
SLIDE 2

WHAT IS DRUM TRANSCRIPTION?

Input: popular music containing drums

Output: symbolic representation of the notes played by drum instruments

slide-3
SLIDE 3

STATE OF THE ART

Current state-of-the-art systems:

  • End-to-end / activation-function-based approaches
  • NN-based approaches and NMF approaches

Overview article:

Wu, C.-W., Dittmar, C., Southall, C., Vogl, R., Widmer, G., Hockman, J., Müller, M., Lerch, A.: “An Overview of Automatic Drum Transcription,” IEEE TASLP, vol. 26, no. 9, Sept. 2018.

[Figure: spectrogram and activation functions for bass, snare and hi-hat; both panels over t [ms]]

slide-12
SLIDE 12

FOCUS OF THIS WORK

State-of-the-art (SotA) work focuses on bass drum (BD), snare drum (SD), and hi-hat (HH):

  • They make up the majority of notes in datasets
  • Beat-defining / most important
  • Well-separated spectral energy distribution

[Figure: panels for bass drum (BD), snare drum (SD) and hi-hat (HH)]

Other instruments are important!

→ Increase the number of instruments for drum transcription

slide-13
SLIDE 13

SYSTEM OVERVIEW

[Figure: processing pipeline. audio → signal preprocessing → NN (feature extraction, event detection, classification) → peak picking → events; a separate NN training stage uses the train data. Intermediate signals shown: waveform (t [s]), spectrogram (t [s], f [Hz]), activation functions and detected peaks for bass, snare and hi-hat (t [s]).]
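The peak-picking stage at the end of the pipeline can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `pick_peaks`, the threshold, the local-maximum window and the frame rate of 100 frames per second are all assumed values chosen for the example.

```python
import numpy as np

def pick_peaks(activation, threshold=0.2, window=2, fps=100):
    """Return onset times (in seconds) of local maxima above `threshold`.

    All parameter values are illustrative assumptions, not taken from the slides.
    """
    activation = np.asarray(activation, dtype=float)
    onsets = []
    for i, value in enumerate(activation):
        lo = max(0, i - window)
        hi = min(len(activation), i + window + 1)
        # A frame counts as a peak if it reaches the threshold and is the
        # maximum of its local neighbourhood.
        if value >= threshold and value == activation[lo:hi].max():
            onsets.append(i / fps)
    return onsets

# Toy activation function for a single instrument (made-up numbers).
act = [0.0, 0.1, 0.9, 0.3, 0.0, 0.0, 0.05, 0.6, 0.7, 0.2]
print(pick_peaks(act))  # [0.02, 0.08]
```

In a multi-instrument setting the same routine would run once per activation function (one per drum class), yielding a separate onset list for each instrument.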

slide-15
SLIDE 15

NETWORK ARCHITECTURES

Convolutional NN (CNN)

  • Convolutions capture local correlations
  • Acoustic modeling of drum sounds

[Figure: CNN, train data sample]

slide-16
SLIDE 16

NETWORK ARCHITECTURES

Convolutional RNN (CRNN)

  • "Best of both worlds"
  • Low-level CNN for acoustic modeling
  • Higher-level RNN for repetitive pattern modeling

[Figure: CRNN, train data sample]

slide-17
SLIDE 17

NETWORK ARCHITECTURES

CNN

  • 2 x conv: 32 x 3x3 (batch norm)
  • max pool: 1x3
  • 2 x conv: 64 x 3x3 (batch norm)
  • max pool: 1x3
  • 2 x dense: 256

CRNN

  • 2 x conv: 32 x 3x3 (batch norm)
  • max pool: 1x3
  • 2 x conv: 64 x 3x3 (batch norm)
  • max pool: 1x3
  • 3 x RNN: 50 BD GRU (BD here: bidirectional)

        frames   context   conv. layers   rec. layers      dense layers
CNN     -        25        see figure     -                2 x 256
CRNN    400      13        see figure     3 x 50 BD GRU    -

Training:

  • Early stopping
  • Batch normalization
  • L2 norm
  • Dropout (30%)
  • ADAM optimizer
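To get a feeling for the size of the shared convolutional stack, its weight count can be worked out by hand. The sketch below assumes a single-channel spectrogram input and one bias per filter, and it ignores batch-norm parameters; none of these details are stated on the slide, and `conv_params` is a helper introduced only for this example.

```python
def conv_params(in_channels, out_channels, kernel=(3, 3), bias=True):
    """Number of trainable weights in one 2-D convolution layer."""
    k_h, k_w = kernel
    return k_h * k_w * in_channels * out_channels + (out_channels if bias else 0)

# Conv stack from the slide: 2 x (32 filters, 3x3), then 2 x (64 filters, 3x3).
# Assumption: the input spectrogram has a single channel.
channels = [1, 32, 32, 64, 64]
per_layer = [conv_params(c_in, c_out)
             for c_in, c_out in zip(channels, channels[1:])]

print(per_layer)       # [320, 9248, 18496, 36928]
print(sum(per_layer))  # 64992
```

Under these assumptions the convolutional part is small (about 65k weights); the dense or recurrent layers on top depend on the feature-map size after pooling, which the slide does not specify.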

slide-24
SLIDE 24

DATASETS

ENST-Drums [Gillet and Richard 2006]

  • Recordings, three drummers / drum kits
  • 64 tracks, total duration: 1 h

MDB Drums [Southall et al. 2017]

  • Drum annotations for a MedleyDB subset
  • 23 tracks, total duration: 20 min

RBMA13-Drums [Vogl et al. 2017]

  • Music from the 2013 Red Bull Music Academy, different styles
  • 27 tracks, total duration: 1 h 43 min

slide-27
SLIDE 27

DATASETS

number of classes
3     8     18     instrument name
BD    BD    BD     bass drum
SD    SD    SD     snare drum
            SS     side stick
            CLP    hand clap
      TT    HT     high tom
            MT     mid tom
            LT     low tom
HH    HH    CHH    closed hi-hat
            PHH    pedal hi-hat
            OHH    open hi-hat
            TB     tambourine
      RD    RD     ride cymbal
      BE    RB     ride bell
            CB     cowbell
      CY    CRC    crash cymbal
            SPC    splash cymbal
            CHC    Chinese cymbal
      CL    CL     clave/sticks

[Figure: relative frequency of instrument onsets for the 18-, 8- and 3-class systems]
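The three class vocabularies can be related to each other with a simple lookup. The sketch below is an assumption-laden reading of the class table: each 18-class label is mapped to the 8-class label of the group it is listed under (the placement of TB under HH in particular is a guess), and the 3-class system is assumed to keep only BD, SD and HH and to drop everything else. `MAP_18_TO_8`, `KEEP_3` and `reduce_labels` are names made up for this example.

```python
# Assumed grouping, read off the row order of the class table.
MAP_18_TO_8 = {
    "BD": "BD",
    "SD": "SD", "SS": "SD", "CLP": "SD",
    "HT": "TT", "MT": "TT", "LT": "TT",
    "CHH": "HH", "PHH": "HH", "OHH": "HH", "TB": "HH",  # TB -> HH is a guess
    "RD": "RD",
    "RB": "BE", "CB": "BE",
    "CRC": "CY", "SPC": "CY", "CHC": "CY",
    "CL": "CL",
}
# Assumption: the 3-class system keeps BD, SD and HH and drops the rest.
KEEP_3 = {"BD", "SD", "HH"}

def reduce_labels(onsets, n_classes):
    """Map (time, 18-class label) pairs to the 8- or 3-class vocabulary."""
    if n_classes == 18:
        return list(onsets)
    mapped = [(t, MAP_18_TO_8[label]) for t, label in onsets]
    if n_classes == 8:
        return mapped
    if n_classes == 3:
        return [(t, label) for t, label in mapped if label in KEEP_3]
    raise ValueError("n_classes must be 3, 8 or 18")

onsets = [(0.0, "BD"), (0.5, "SS"), (0.5, "OHH"), (1.0, "CRC")]
print(reduce_labels(onsets, 8))  # [(0.0, 'BD'), (0.5, 'SD'), (0.5, 'HH'), (1.0, 'CY')]
print(reduce_labels(onsets, 3))  # [(0.0, 'BD'), (0.5, 'SD'), (0.5, 'HH')]
```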

slide-34
SLIDE 34

SYNTHETIC DATASET (NEW!)

Synthetic dataset from MIDI songs

  • Mix of different genres, full songs
  • Optional accompaniment
  • Diverse drum sounds (57 different drum kits, acoustic and electronic)
  • Varying quality, no vocals!
  • 4197 tracks, total duration: 259 h

slide-38
SLIDE 38

SYNTHETIC DATASET

Follows the same relative instrument distribution

  − Same bias for instruments, same problems during training
  + Datasets are representative samples

[Figure: relative frequency of instrument onsets for the 18-, 8- and 3-class systems]
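The quantity plotted here, the relative frequency of instrument onsets, is simply each instrument's share of all annotated onsets. A minimal sketch, using a made-up toy label list rather than real annotations (`relative_onset_frequency` is a name introduced for this example):

```python
from collections import Counter

def relative_onset_frequency(labels):
    """Share of each instrument label among all onsets, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

# Toy onset labels (invented for illustration).
labels = ["BD", "HH", "HH", "SD", "HH", "BD", "CRC", "HH"]
freq = relative_onset_frequency(labels)
print(freq)  # {'HH': 0.5, 'BD': 0.25, 'SD': 0.125, 'CRC': 0.125}
```

Even in this tiny example the hi-hat dominates while the crash cymbal is rare, which is the kind of imbalance the slide refers to.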

slide-43
SLIDE 43

BALANCING OF SYNTHETIC DATASET

  • Swap instruments for individual tracks
  • Artificial balancing of instrument distribution

[Figure: relative frequency of instrument onsets (18, 8, 3 classes)]
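The slide does not spell out which instruments are swapped in which tracks, so the following is only one hypothetical way to realize "swap instruments for individual tracks": in every second track, the globally most frequent and least frequent labels trade places. The selection rule, the `every` parameter and both function names are assumptions made for illustration and are not the authors' procedure.

```python
from collections import Counter

def swap_instruments(track, a, b):
    """Exchange labels `a` and `b` in one track's (time, label) onset list."""
    swap = {a: b, b: a}
    return [(t, swap.get(label, label)) for t, label in track]

def balance(tracks, every=2):
    """Hypothetical balancing rule: in every `every`-th track, swap the
    globally most frequent and least frequent instruments."""
    counts = Counter(label for track in tracks for _, label in track)
    ranked = counts.most_common()
    most, least = ranked[0][0], ranked[-1][0]
    return [swap_instruments(track, most, least) if i % every == 0 else track
            for i, track in enumerate(tracks)]

# Two toy tracks: hi-hat (HH) is frequent, cowbell (CB) is rare.
tracks = [
    [(0.0, "HH"), (0.5, "HH"), (1.0, "CB")],
    [(0.0, "HH"), (0.5, "HH"), (1.0, "HH")],
]
balanced = balance(tracks)
print(Counter(label for track in balanced for _, label in track))
# Counter({'HH': 4, 'CB': 2})
```

Because the swap happens at the symbolic (MIDI) level before synthesis, the rare instrument inherits the rhythmic role of the frequent one in the affected tracks.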

slide-46
SLIDE 46

RESULTS ON SYNTHETIC DATA

Overall performance for MIDI bal. is worse

  • It is a harder task

Performance of underrepresented instruments improves

  • Providing more samples forces the network to learn formerly sparsely used instruments

[Figure: F-measure for MIDI bal. and MIDI]
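For readers unfamiliar with the metric, an onset F-measure can be sketched as below. A detection counts as correct if it lies within a tolerance window around a not-yet-matched reference onset. The 20 ms tolerance and the simple greedy matching are assumptions for this example; the slides do not state the evaluation settings.

```python
def f_measure(reference, detected, tolerance=0.02):
    """Onset F-measure; times in seconds.

    Each reference onset can be matched by at most one detection within
    +/- `tolerance` (assumed value). Matching is greedy over sorted lists.
    """
    reference = sorted(reference)
    detected = sorted(detected)
    tp = 0
    i = j = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            tp += 1          # true positive: pair matched
            i += 1
            j += 1
        elif diff < 0:
            j += 1           # detection before any reference: false positive
        else:
            i += 1           # reference passed without detection: false negative
    if tp == 0:
        return 0.0
    precision = tp / len(detected)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

ref = [0.50, 1.00, 1.50]
det = [0.51, 1.20, 1.49, 2.00]
score = f_measure(ref, det)
print(round(score, 3))  # 0.571
```

In the toy example two of three reference onsets are found and two of four detections are spurious, giving precision 0.5, recall 2/3 and an F-measure of 4/7.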

slide-47
SLIDE 47

OVERALL PERFORMANCE ON REAL DATA

A model trained on synthetic data performs well on real-world data (ENST + MDB + RBMA)

[Figure: F-measure (0.00 to 1.00) by number of instrument classes (3, 8, 18); trained on: real, MIDI, MIDI bal.; evaluated on: real]

slide-48
SLIDE 48

RESULTS FOR DIFFERENT SIZES

Performance decreases with a smaller training set, but not drastically

[Figure: F-measure (0.00 to 1.00) by number of instrument classes (3, 8, 18); trained on: MIDI 100%, MIDI 10%, MIDI 1%; evaluated on: real]
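One plausible way to build the 100%, 10% and 1% training sets is to draw random subsets of tracks. The slides do not say how the subsets were chosen, so the track-level sampling, the fixed seed and the helper name `subsample_tracks` below are assumptions; only the track count of 4197 comes from the synthetic dataset description.

```python
import random

def subsample_tracks(track_ids, fraction, seed=0):
    """Draw a reproducible random subset holding `fraction` of the tracks."""
    rng = random.Random(seed)
    k = max(1, round(len(track_ids) * fraction))
    return sorted(rng.sample(track_ids, k))

track_ids = list(range(4197))  # one id per track of the synthetic dataset
sizes = {fraction: len(subsample_tracks(track_ids, fraction))
         for fraction in (1.0, 0.1, 0.01)}
print(sizes)  # {1.0: 4197, 0.1: 420, 0.01: 42}
```

Even the 1% subset would contain 42 full tracks under this scheme, which is of the same order as the track counts of the real-world datasets.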

slide-51
SLIDE 51

PERFORMANCE FOR INSTRUMENTS

  • Improvements observed on balanced synthetic data do not translate to real-world data
  • Small improvements using pre-training

[Figure: F-measure; trained on: real, real+MIDI, real+MIDI bal., MIDI bal., MIDI, pretrain MIDI, pt. MIDI bal.; evaluated on: real]

slide-52
SLIDE 52

INSTRUMENT CONFUSIONS

[Figure: confusions, hidden onsets and additional onsets; trained on: real+MIDI; evaluated on: real+MIDI]

slide-61
SLIDE 61

CAN YOU HEAR THE DIFFERENCE?

Bass drum or low tom? (audio examples)

  • 1: low tom
  • 2: bass drum
  • 3: bass drum

Which cymbal is it? Hi-hat, splash cymbal, crash cymbal, Chinese cymbal, or ride cymbal? (audio examples)

  • 1: crash
  • 2: ride
  • 3: China

slide-62
SLIDE 62

CONCLUSIONS

Publicly available large-scale synthetic dataset

  • Optionally with balanced instruments
  • Generalizes well to real data

Dataset size is important, but not that critical

Balancing did not improve performance on real-world data

  • Recurrent layers learn atypical patterns

Pre-training with synthetic data provides a small improvement

Mistakes are understandable

  • Focus more on context

http://ifs.tuwien.ac.at/~vogl/dafx2018/