Richard Vogl1,2, Gerhard Widmer2, Peter Knees1
richard.vogl@tuwien.ac.at, gerhard.widmer@jku.at, peter.knees@tuwien.ac.at
TOWARDS MULTI-INSTRUMENT DRUM TRANSCRIPTION
WHAT IS DRUM TRANSCRIPTION?
Input: popular music containing drums
Output: symbolic representation of notes played by drum instruments
Current state-of-the-art systems:
Overview Article
Wu, C.-W., Dittmar, C., Southall, C., Vogl, R., Widmer, G., Hockman, J., Müller, M., Lerch, A.: “An Overview of Automatic Drum Transcription,” IEEE TASLP, vol. 26, no. 9, Sept. 2018.
Figure: spectrogram and activation functions (bass, snare, hi-hat) over t [ms]
SotA works focus on bass drum (BD), snare drum (SD), and hi-hat (HH)
Other instruments are important!
Pipeline: audio → signal preprocessing → NN (feature extraction, event detection, classification) → peak picking → events; NN training uses train data
Intermediate stages: waveform (t [s]) → spectrogram (t [s], f [Hz]) → activation functions (bass, snare, hi-hat) → detected peaks (bass, snare, hi-hat)
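The last pipeline stage turns each instrument's activation function into discrete onsets. The slides do not state the exact peak-picking rule, so the following is only a minimal sketch under assumptions: a frame counts as an onset if its activation reaches a threshold and is the maximum within a small neighbourhood. The function name `pick_peaks` and the parameters `threshold`, `window`, and `frame_rate` are hypothetical and not taken from the presentation.

```python
def pick_peaks(activation, threshold=0.5, window=2, frame_rate=100.0):
    """Return onset times in seconds for one instrument's activation function.

    Assumed rule (not from the slides): a frame is an onset if its value is
    at least `threshold` and it is the first maximum within +/- `window` frames.
    """
    onsets = []
    n = len(activation)
    for i, value in enumerate(activation):
        if value < threshold:
            continue
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        neighbourhood = activation[lo:hi]
        # Keep only the first maximum so a plateau yields a single onset.
        if value == max(neighbourhood) and neighbourhood.index(value) == i - lo:
            onsets.append(i / frame_rate)
    return onsets


# Toy activation function for one instrument (values are made up).
snare_activation = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.6, 0.7, 0.3, 0.0]
print(pick_peaks(snare_activation))
```

Running the same function once per instrument (bass, snare, hi-hat) gives one onset list per activation function.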
Convolutional NN (CNN)
Convolutional RNN (CRNN)
Figure: CNN train data sample and CRNN train data sample
CNN:
2 x conv: 32 x 3x3 (batch norm)
max pool: 1x3
2 x conv: 64 x 3x3 (batch norm)
max pool: 1x3
2 x dense: 256

CRNN:
2 x conv: 32 x 3x3 (batch norm)
max pool: 1x3
2 x conv: 64 x 3x3 (batch norm)
max pool: 1x3
3 x RNN: 50 bidirectional GRU

model | frames | context | conv. layers | recurrent layers | dense layers
CNN | - | 25 | see figure | - | 2 x 256
CRNN | 400 | 13 | see figure | 3 x 50 bidirectional GRU | -

Training:
Early stopping
Batch normalization
L2 norm
Dropout (30%)
ADAM optimizer
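To get a feeling for the size of the convolutional stack listed above, the feature-map shape can be traced layer by layer. This is only a sketch under several assumptions the slides do not confirm: convolutions are unpadded ("valid") with stride 1, "1x3" pooling means 1 frame in time by 3 bins in frequency, and the input has 84 frequency bins (a hypothetical value chosen for illustration). Only the context of 25 frames and the layer sizes come from the slides.

```python
def conv_valid(shape, kernel=(3, 3)):
    """Shape (frames, bins) after one unpadded convolution with stride 1."""
    frames, bins = shape
    return frames - kernel[0] + 1, bins - kernel[1] + 1


def max_pool(shape, pool=(1, 3)):
    """Shape (frames, bins) after non-overlapping max pooling."""
    frames, bins = shape
    return frames // pool[0], bins // pool[1]


def trace_conv_stack(frames, bins):
    """Trace 2 x conv(32), pool, 2 x conv(64), pool as listed on the slide."""
    shape = (frames, bins)
    trace = [("input", 1, shape)]
    for filters in (32, 64):
        for _ in range(2):
            shape = conv_valid(shape)
            trace.append((f"conv {filters} x 3x3", filters, shape))
        shape = max_pool(shape)
        trace.append(("max pool 1x3", filters, shape))
    return trace


# 25 frames of context (CNN row of the table); 84 bins is an assumption.
for name, channels, (n_frames, n_bins) in trace_conv_stack(frames=25, bins=84):
    print(f"{name:>14}: {channels:>2} x {n_frames} x {n_bins}")
```

With padded convolutions the time and frequency sizes would shrink only at the pooling layers, so the numbers printed here depend directly on the padding assumption.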
ENST-Drums [Gillet and Richard 2006]
MDB Drums [Southall et al. 2017]
RBMA13-Drums [Vogl et al. 2017]
number of classes: 3 | 8 | 18 | instrument name
BD | BD | BD | bass drum
SD | SD | SD | snare drum
SD | SD | SS | side stick
SD | SD | CLP | hand clap
- | TT | HT | high tom
- | TT | MT | mid tom
- | TT | LT | low tom
HH | HH | CHH | closed hi-hat
HH | HH | PHH | pedal hi-hat
HH | HH | OHH | open hi-hat
HH | HH | TB | tambourine
- | RD | RD | ride cymbal
- | BE | RB | ride bell
- | BE | CB | cowbell
- | CY | CRC | crash cymbal
- | CY | SPC | splash cymbal
- | CY | CHC | Chinese cymbal
- | CL | CL | clave/sticks

Figure: relative frequency of instrument onsets for 18, 8, and 3 classes
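The three class systems can be read as a label-reduction mapping: every 18-class label belongs to one 8-class label, and only BD, SD, and HH survive in the 3-class system. The sketch below encodes that reading of the class table. Treating labels without their own 8-class or 3-class entry (for example SS, CLP, TB, CB) as part of the group listed above them is an assumption about merged table cells, and the names `TO_8`, `TO_3`, and `reduce_labels` are hypothetical.

```python
# 18-class label -> 8-class label (assumed reading of the class table).
TO_8 = {
    "BD": "BD",
    "SD": "SD", "SS": "SD", "CLP": "SD",
    "HT": "TT", "MT": "TT", "LT": "TT",
    "CHH": "HH", "PHH": "HH", "OHH": "HH", "TB": "HH",
    "RD": "RD",
    "RB": "BE", "CB": "BE",
    "CRC": "CY", "SPC": "CY", "CHC": "CY",
    "CL": "CL",
}

# 8-class labels that also exist in the 3-class system.
TO_3 = {"BD": "BD", "SD": "SD", "HH": "HH"}


def reduce_labels(events, n_classes):
    """Map (time, 18-class label) events to the 18-, 8-, or 3-class system.

    Events whose instrument has no counterpart in the target system are dropped.
    """
    if n_classes == 18:
        return list(events)
    reduced = [(time, TO_8[label]) for time, label in events]
    if n_classes == 8:
        return reduced
    if n_classes == 3:
        return [(time, TO_3[label]) for time, label in reduced if label in TO_3]
    raise ValueError("n_classes must be 3, 8, or 18")


events = [(0.0, "BD"), (0.5, "SS"), (1.0, "LT"), (1.5, "OHH")]
print(reduce_labels(events, 8))
print(reduce_labels(events, 3))
```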
NEW! Synthetic dataset from MIDI songs
Figure: relative frequency of instrument onsets (18, 8, and 3 classes)
Follows the same relative instrument distribution
− same bias for instruments
− same problems during training
Swap instruments for individual tracks
Artificial balancing of instrument distribution
Figure: relative frequency of instrument onsets (18, 8, and 3 classes)
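The idea of balancing by swapping instruments in individual tracks can be illustrated on symbolic data. The sketch below is not the procedure used for the dataset: the rule "swap the globally most and least frequent instrument in every second track" is an arbitrary assumption chosen to keep the example short, and all function names are hypothetical.

```python
from collections import Counter


def swap_instruments(track, a, b):
    """Return a copy of `track` with instrument labels `a` and `b` exchanged."""
    swap = {a: b, b: a}
    return [(time, swap.get(label, label)) for time, label in track]


def onset_counts(tracks):
    """Count onsets per instrument over all tracks."""
    return Counter(label for track in tracks for _, label in track)


def balance_by_swapping(tracks):
    """Swap the most and least frequent instruments in every second track."""
    ranked = [label for label, _ in onset_counts(tracks).most_common()]
    frequent, rare = ranked[0], ranked[-1]
    return [
        swap_instruments(track, frequent, rare) if index % 2 else track
        for index, track in enumerate(tracks)
    ]


# Four identical toy tracks: eight hi-hat onsets and one cowbell onset each.
track = [(i * 0.1, "HH") for i in range(8)] + [(1.0, "CB")]
tracks = [track] * 4

print(onset_counts(tracks))                       # before balancing
print(onset_counts(balance_by_swapping(tracks)))  # after balancing
```

In the toy data, the hi-hat dominates before the swap and both instruments end up equally frequent afterwards, which mirrors the flattened distribution the slide describes.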
Figure: F-measure for models trained on MIDI and MIDI bal.
Overall performance for MIDI bal. is worse
Performance of underrepresented instruments improves
the network to learn formerly sparsely used instruments
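All results are reported as F-measure. As a reminder of how an onset F-measure can be computed, here is a minimal sketch: detected onsets are matched one-to-one to reference onsets within a tolerance window, and precision and recall are combined. The greedy matching and the 20 ms default tolerance are assumptions for illustration; the slides do not state the evaluation settings.

```python
def onset_f_measure(reference, detected, tolerance=0.02):
    """F-measure for onset times in seconds, using greedy one-to-one matching.

    `tolerance` (assumed value) is the maximum distance between a detected
    and a reference onset for the pair to count as a true positive.
    """
    ref = sorted(reference)
    det = sorted(detected)
    i = j = true_pos = 0
    while i < len(ref) and j < len(det):
        if abs(ref[i] - det[j]) <= tolerance:
            true_pos += 1
            i += 1
            j += 1
        elif det[j] < ref[i]:
            j += 1  # detection with no matching reference: false positive
        else:
            i += 1  # reference onset that was missed: false negative
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(det)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [0.5, 1.0, 1.5, 2.0]
detected = [0.51, 1.0, 1.7]
print(round(onset_f_measure(reference, detected), 3))
```

In this toy case two of three detections are correct and two of four reference onsets are found, so precision is 2/3 and recall is 1/2.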
Model trained on synthetic data performs well on real-world data (ENST + MDB + RBMA)
Figure: F-measure (0.00 to 1.00) vs. number of instrument classes (3, 8, 18); trained on: real, MIDI, MIDI bal.; evaluated on: real
Performance decreases, but not drastically
Figure: F-measure (0.00 to 1.00) vs. number of instrument classes (3, 8, 18); trained on: MIDI 100%, MIDI 10%, MIDI 1%; evaluated on: real
Figure: F-measure; trained on: real, real+MIDI, real+MIDI bal., MIDI bal., MIDI, pretrain MIDI; evaluated on: real
Improvements observed on balanced synthetic data do not translate to real-world data
Small improvements using pre-training
Figure: confusions, hidden onsets, and additional onsets; trained on: real+MIDI; evaluated on: real+MIDI
Bass drum or low tom?
1: low tom 2: bass drum 3: bass drum
Which cymbal is it?
hi-hat, splash cymbal, crash cymbal, Chinese cymbal, ride cymbal?
1: crash 2: ride 3: China
Publicly available large-scale synthetic dataset
Dataset size important but not that critical
Balancing did not improve performance on real-world data
Pre-training with synthetic data provides small improvement
Mistakes are understandable
http://ifs.tuwien.ac.at/~vogl/dafx2018/