
SLIDE 1

Department of Statistics SVM based Classification of Instruments - Timbre Analysis

Uwe Ligges and Sebastian Krey
Department of Statistics, TU Dortmund
Reisensburg, Statistical Computing 2009

SLIDE 2

Introduction | Model Building | Timbre Features/Classification | Results | Summary

Introduction

Why Timbre Analysis? Why classification of voices or instruments?

Timbre generation

- objective criteria for the assessment of the quality of vocal performance
- support for singing teachers and students who try to improve their voices
- derive properties related to performance quality aspects of single tones, like solidity / softness / brilliance of tones

Uwe Ligges and Sebastian Krey: SVM based Classification of Instruments - Timbre Analysis Reisensburg, Statistical Computing 2009

SLIDE 3


Introduction

Music Recommender Systems
- quest for better features
- widely used on server infrastructure that supports music listeners who download music from the web to their home computers and even to their mobile devices

Speech Recognition
- where timbre should not make a difference

Hearing Aids (and other audio compression tasks)
- 'Vowel Classification by a Perceptually Motivated Neurophysiologically Parameterized Auditory Model' (Szepannek et al., 2006)
- perception analysis

SLIDE 4

Introduction

Automatic transcription of (polyphonic) music
- of interest for music publishers, music amateurs, and scientists (particularly those working in music psychology)
- parts of transcription algorithms are heavily used in music recommender systems

SLIDE 5

Pitch estimation

Several methods for pitch estimation (f0 tracking, ...) have been proposed:
- in the time domain (such as the model that follows shortly)
- in the frequency domain (such as our heuristic proposal)
- hybrid methods
- any combinations with, e.g., HMMs

None of them works really well on singing data, and none of them works on polyphonic data.

SLIDE 6

Pitch estimation model (monophonic)

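$$y_t = \cos\left[2\pi t f_0 + \phi\right] + \epsilon_t$$

- $f_0$ = fundamental frequency, the parameter of interest
- $\epsilon_t$ = error
- $t \in \left\{\frac{0}{S}, \frac{1}{S}, \ldots, \frac{T-1}{S}\right\}$ = time, $T$ = no. of observations, $S$ = sampling rate
- $\phi$ = phase displacement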
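A frequency-domain way to fit this single-cosine model is sketched below, assuming unit amplitude as in the formula and a grid of candidate frequencies. For a fixed candidate f, the best-fitting phase gives a residual sum of squares that shrinks as |Σ_t y_t exp(−2πi f t)| grows, so maximising the periodogram over the grid approximates the least-squares estimate of f_0. This is a generic illustration, not the authors' heuristic; `estimate_f0_periodogram` and its grid parameters are hypothetical.

```python
import numpy as np

def estimate_f0_periodogram(y, sr, f_min=50.0, f_max=1000.0, step=0.5):
    """Grid-search f0 for y_t = cos(2*pi*t*f0 + phi) + eps_t (hypothetical helper).

    For a fixed candidate f, the best phase gives a fit that improves with
    |sum_t y_t exp(-2*pi*i*f*t)|, so the periodogram maximiser approximates
    the least-squares estimate of f0.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y)) / sr                  # t in {0/S, 1/S, ..., (T-1)/S}
    grid = np.arange(f_min, f_max, step)        # candidate frequencies in Hz
    # One row per candidate frequency: sum_t y_t exp(-2*pi*i*f*t)
    dft = np.exp(-2j * np.pi * np.outer(grid, t)) @ y
    return float(grid[np.argmax(np.abs(dft) ** 2)])

# Usage: a noisy 440 Hz tone, T = 1000 samples at S = 8000 Hz.
rng = np.random.default_rng(0)
sr, T = 8000, 1000
t = np.arange(T) / sr
y = np.cos(2 * np.pi * 440.0 * t + 0.3) + 0.5 * rng.standard_normal(T)
f0_hat = estimate_f0_periodogram(y, sr)
```

The grid step limits the resolution of the estimate; the memory cost grows with (number of candidates) × T, which is why the usage example keeps both small.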

SLIDE 7

Pitch estimation model (monophonic)

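$$y_t = \sum_{h=1}^{H} \cos\left[2\pi t f_0 h + \phi_h\right] + \epsilon_t$$

- $f_0$ = fundamental frequency, the parameter of interest
- $\epsilon_t$ = error
- $t \in \left\{\frac{0}{S}, \frac{1}{S}, \ldots, \frac{T-1}{S}\right\}$ = time, $T$ = no. of observations, $S$ = sampling rate
- $\phi_h$ = phase displacement of the $h$-th partial
- $H$ = no. of partials in the model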

SLIDE 8

Pitch estimation model (monophonic)

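$$y_t = \sum_{h=1}^{H} B_h \cos\left[2\pi t f_0 h + \phi_h\right] + \epsilon_t$$

- $f_0$ = fundamental frequency, the parameter of interest
- $\epsilon_t$ = error
- $t \in \left\{\frac{0}{S}, \frac{1}{S}, \ldots, \frac{T-1}{S}\right\}$ = time, $T$ = no. of observations, $S$ = sampling rate
- $\phi_h$ = phase displacement of the $h$-th partial
- $H$ = no. of partials in the model
- $B_h$ = amplitude of the $h$-th partial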

SLIDE 9

Pitch estimation model (monophonic)

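$$y_t = \sum_{h=1}^{H} B_h \cos\left[2\pi t f_0 (h + \delta_h) + \phi_h\right] + \epsilon_t$$

- $f_0$ = fundamental frequency, the parameter of interest
- $\epsilon_t$ = error
- $t \in \left\{\frac{0}{S}, \frac{1}{S}, \ldots, \frac{T-1}{S}\right\}$ = time, $T$ = no. of observations, $S$ = sampling rate
- $\phi_h$ = phase displacement of the $h$-th partial
- $H$ = no. of partials in the model
- $B_h$ = amplitude of the $h$-th partial
- $\delta_h$ = frequency displacement of the $h$-th partial, where $\delta_1 := 0$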

SLIDE 10

Pitch estimation model (monophonic)

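$$y_t = \sum_{i=0}^{I} \Phi_i(t) \sum_{h=1}^{H} B_{h,i} \cos\left[2\pi t f_0 (h + \delta_h) + \phi_h\right] + \epsilon_t$$

- $B_{h,i}$ = amplitude of the $h$-th partial for the $i$-th basis function
- $i$ = index of the $I + 1$ basis functions
- $\Phi_i(t) := \cos^2\left(\pi \frac{tS - i\Delta}{2\Delta}\right) \mathbb{1}_{[(i-1)\Delta,\,(i+1)\Delta]}(t)$ = $i$-th basis function, defined on windows with 50% overlap, $\Delta := \frac{T-1}{I}$
- $\mathbb{1}$ = indicator function, $S$ = sampling rate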

SLIDE 11

Pitch estimation model (monophonic)

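$$y_t = \sum_{i=0}^{I} \Phi_i(t) \sum_{h=1}^{H} B_{h,i} \cos\left[2\pi t f_0 (h + \delta_h) + \phi_h + (h + \delta_h) A_v \sin(2\pi f_v t + \phi_v)\right] + \epsilon_t$$

- $B_{h,i}$ = amplitude of the $h$-th partial for the $i$-th basis function
- $i$ = index of the $I + 1$ basis functions
- $\Phi_i(t) := \cos^2\left(\pi \frac{tS - i\Delta}{2\Delta}\right) \mathbb{1}_{[(i-1)\Delta,\,(i+1)\Delta]}(t)$ = $i$-th basis function, defined on windows with 50% overlap, $\Delta := \frac{T-1}{I}$
- $\mathbb{1}$ = indicator function, $S$ = sampling rate
- $f_v$ = frequency of vibrato, $A_v$ = amplitude of vibrato, $\phi_v$ = phase displacement of vibrato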

SLIDE 12

Pitch estimation model (monophonic)

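$$y_t = \sum_{i=0}^{I} \Phi_i(t) \sum_{h=1}^{H} B_{h,i} \cos\left[2\pi t f_0 (h + \delta_h) + \phi_h + (h + \delta_h) A_v \sin(2\pi f_v t + \phi_v)\right] + \epsilon_t$$

- $B_{h,i}$ = amplitude of the $h$-th partial for the $i$-th basis function
- $i$ = index of the $I + 1$ basis functions
- $\Phi_i(t) := \cos^2\left(\pi \frac{tS - i\Delta}{2\Delta}\right) \mathbb{1}_{[(i-1)\Delta,\,(i+1)\Delta]}(t)$ = $i$-th basis function, defined on windows with 50% overlap, $\Delta := \frac{T-1}{I}$
- $\mathbb{1}$ = indicator function, $S$ = sampling rate
- $f_v$ = frequency of vibrato, $A_v$ = amplitude of vibrato, $\phi_v$ = phase displacement of vibrato

5 + 3H parameters to estimate, but H might be > 10.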
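To see how the pieces of the full monophonic model fit together, the sketch below simulates y_t from it, with optional Gaussian noise. Assumptions made here: Δ is measured in samples, so both the cos² factor and the indicator are evaluated at the sample index n = tS; B is stored as an (I + 1) × H array; A_v acts directly on the phase (in radians), exactly as in the formula. The function names are hypothetical. A useful check is that cos² windows with 50% overlap sum to one on [0, T − 1], so a B that is constant across i reduces the model to a plain sum of partials.

```python
import numpy as np

def basis_functions(n_samples, I):
    """Phi[n, i] = cos^2(pi*(n - i*Delta)/(2*Delta)) on [(i-1)Delta, (i+1)Delta], else 0.

    n = t*S is the sample index and Delta = (T-1)/I is measured in samples.
    """
    n = np.arange(n_samples)[:, None]
    i = np.arange(I + 1)[None, :]
    delta = (n_samples - 1) / I
    inside = (n >= (i - 1) * delta) & (n <= (i + 1) * delta)
    return np.cos(np.pi * (n - i * delta) / (2 * delta)) ** 2 * inside

def simulate_tone(sr, T, f0, B, delta_h, phi_h,
                  A_v=0.0, f_v=0.0, phi_v=0.0, sigma=0.0, seed=0):
    """Simulate the monophonic model with time-varying amplitudes and vibrato.

    B has shape (I+1, H) with B[i, h-1] = B_{h,i}; delta_h and phi_h have
    length H, and delta_h[0] should be 0 (delta_1 := 0).
    """
    t = np.arange(T) / sr
    B = np.asarray(B, dtype=float)
    mult = np.arange(1, B.shape[1] + 1) + np.asarray(delta_h, dtype=float)  # h + delta_h
    vibrato = A_v * np.sin(2 * np.pi * f_v * t + phi_v)                    # shape (T,)
    phase = (2 * np.pi * np.outer(t, f0 * mult)
             + np.asarray(phi_h, dtype=float)
             + np.outer(vibrato, mult))                                     # shape (T, H)
    Phi = basis_functions(T, B.shape[0] - 1)                                # shape (T, I+1)
    # y_t = sum_i Phi_i(t) * sum_h B_{h,i} * cos(phase[t, h]) + eps_t
    y = np.einsum("ti,ih,th->t", Phi, B, np.cos(phase))
    noise = sigma * np.random.default_rng(seed).standard_normal(T)
    return y + noise, Phi

# Usage: I = 2 (three windows), H = 3 partials, 0.1 s at 44100 Hz with vibrato.
B = [[1.0, 0.5, 0.2], [0.8, 0.4, 0.1], [0.3, 0.1, 0.0]]
y, Phi = simulate_tone(44100, 4410, 220.0, B, [0.0, 0.001, 0.003], [0.0, 0.5, 1.0],
                       A_v=0.5, f_v=5.0, sigma=0.01)
```

Estimation would invert this map, which is where the 5 + 3H (or more, once B depends on i) unknowns become a practical burden.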

SLIDE 13

Pitch estimation model (POLYphonic)

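$$y_t = \sum_{j=1}^{J} \sum_{i=0}^{I} \Phi_{i,j}(t) \sum_{h_j} B_{h,i,j} \cos\left[2\pi t f_{0,j} (h_j + \delta_{h,j}) + \phi_{h,j} + (h_j + \delta_{h,j}) A_{v,j} \sin(2\pi f_{v,j} t + \phi_{v,j})\right] + \epsilon_t$$

- $J$ = number of polyphonic tones
- Identifiability ?!

Joint work in progress (?) with Katrin Sommer and Claus Weihs; cooperation with Technical University of Tampere.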

SLIDE 14

The Timbre Problem

Timbre Classification

Joint work with Sebastian Krey

Specific task: classification of instruments based on a given audio track of one tone

Data: McGill Instrument Database
- 38 instruments played in 59 ways (e.g. bowed vs. pizzicato)
- each with 6-88 differently pitched tones
- altogether 1976 wave files (44100 Hz, 16 bit, 3-5 seconds each)

SLIDE 15

Let’s start

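Pre-emphasis filtering to boost the higher partials:
$$y_t = x_t - 0.97\, x_{t-1}$$

Short-Time Fourier Transformation (on overlapping windows):
$$F(t, k) = \sum_{j=1-M}^{N-M} w(j - t)\, x_j \exp\left(-2i\pi j \frac{k}{N}\right)$$

Hamming windows (width: 25 ms, overlap: 10 ms):
$$w(t) = \begin{cases} 0.54 + 0.46 \cos\left(\frac{2\pi t}{T}\right), & -\frac{T}{2} \le t \le \frac{T}{2} \\ 0, & \text{otherwise} \end{cases}$$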
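The pre-processing steps above can be sketched in NumPy as follows. Assumptions: the 10 ms is treated as the step between successive window starts; the centred Hamming window is shifted to sample indices 0..n−1 (which is exactly NumPy's `np.hamming`); and the transform is computed per frame with `np.fft.rfft` rather than through the explicit sum over j. The function names are hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """y_t = x_t - alpha * x_{t-1}; the first sample is passed through unchanged."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

def hamming(n):
    """Hamming window of n samples: 0.54 + 0.46 cos(2 pi t / T), T = n - 1,
    shifted from t in [-T/2, T/2] to k = t + T/2 in 0..n-1."""
    k = np.arange(n)
    return 0.54 - 0.46 * np.cos(2 * np.pi * k / (n - 1))

def stft_power(x, sr, win_s=0.025, hop_s=0.010):
    """|F(t, k)|^2 for Hamming-windowed frames; one row per frame."""
    n_win = int(round(win_s * sr))
    hop = int(round(hop_s * sr))
    n_frames = 1 + (len(x) - n_win) // hop
    w = hamming(n_win)
    frames = np.stack([x[m * hop:m * hop + n_win] * w for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# Usage: 1 s of a 440 Hz sine at S = 8000 Hz -> 200-sample windows, 80-sample hop.
x = np.sin(2 * np.pi * 440.0 * np.arange(8000) / 8000)
P = stft_power(pre_emphasis(x), 8000)
```

With 200-sample windows the bin spacing is 8000 / 200 = 40 Hz, so the 440 Hz tone peaks in bin 11 of every frame.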

SLIDE 16

Let’s start

Mel scale: transformation of FFT frequencies to the Mel scale in order to model the perception of the human ear (for example, the ear resolves frequency differences more finely below about 1 kHz than above):
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \quad f \text{ in Hz}$$
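A small sketch of the Mel conversion and its inverse. The helper `mel_band_centres` (a hypothetical name) places filter centres equally spaced on the Mel scale, which makes the bands wider in Hz as frequency increases; the 20 bands and the 8000 Hz upper limit are illustrative assumptions, not values from the slides.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700), with f in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse transformation: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_centres(n_bands, f_max):
    """Centre frequencies (Hz) of n_bands filters equally spaced in Mel up to f_max."""
    mels = np.linspace(0.0, float(hz_to_mel(f_max)), n_bands + 2)[1:-1]
    return mel_to_hz(mels)

centres = mel_band_centres(20, 8000.0)
spacing = np.diff(centres)   # grows with frequency: coarser resolution in Hz up high
```

By construction the scale maps 1000 Hz to (almost exactly) 1000 Mel.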

SLIDE 17

Feature Extraction

Using features well known from speech recognition, e.g. (Perceptual) Linear Predictive Coding (LPC/PLP):
- filter further in order to obtain a roughly uniform loudness impression over the whole frequency range (PLP)
- loudness compression by taking cubic roots of the amplitudes (PLP)
- transformation back to the time domain by inverse Fourier transformation
- fit an autoregressive model of order $p$ (by the Levinson-Durbin recursion):
$$y_t = \sum_{j=1}^{p} a_j\, y_{t-j} + e_t$$
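The AR fit in the last step amounts to solving the Yule-Walker equations, which the Levinson-Durbin recursion does in O(p²) operations. A minimal sketch, assuming biased sample autocorrelations r_0, ..., r_p as input; `autocorr` and `levinson_durbin` are hypothetical names. In the PLP pipeline the autocorrelations would come from the inverse Fourier transform of the compressed power spectrum rather than from a time series, as in the usage example here.

```python
import numpy as np

def autocorr(x, max_lag):
    """Biased sample autocorrelations r_0, ..., r_max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for y_t = sum_{j=1}^p a_j y_{t-j} + e_t.

    Returns (a, err): a = (a_1, ..., a_p), err = prediction error variance.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(p)
    err = r[0]
    for k in range(p):
        # reflection coefficient of order k + 1
        kappa = (r[k + 1] - np.dot(a[:k], r[k:0:-1])) / err
        prev = a[:k].copy()
        a[:k] = prev - kappa * prev[::-1]
        a[k] = kappa
        err *= 1.0 - kappa ** 2
    return a, err

# Usage: recover the coefficients of a simulated AR(2) process.
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(2, 5000):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + e[t]
a_hat, err = levinson_durbin(autocorr(y, 2), 2)
```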

SLIDE 18

Feature Extraction

Mel Frequency Cepstral Coefficients (MFCC)

- logarithm for loudness compression
- discrete cosine transformation (DCT)
- keep only the first p DCT coefficients
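The MFCC steps can be sketched as: take logarithms of the Mel band energies, apply a DCT, keep the first p coefficients. Assumptions: an orthonormal DCT-II, p = 13 and a small floor before the logarithm; none of these choices is taken from the slides, and the function names are hypothetical.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II along the last axis."""
    x = np.asarray(x, dtype=float)
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))       # basis[k, m]
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

def mfcc_from_mel_energies(mel_energies, p=13, floor=1e-10):
    """Log loudness compression, DCT, keep the first p coefficients.

    mel_energies: array (..., n_bands) of Mel filterbank energies per window.
    """
    log_e = np.log(np.maximum(mel_energies, floor))
    return dct_ii(log_e)[..., :p]

# Usage: 5 windows with 20 Mel bands of (made-up) positive energies.
energies = np.random.default_rng(0).uniform(0.1, 10.0, size=(5, 20))
coeffs = mfcc_from_mel_energies(energies)    # shape (5, 13)
```

A flat log spectrum puts all its weight into the first coefficient, so the higher coefficients describe the shape of the spectral envelope.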

SLIDE 19

Clustering

- Many features times many time frames (windows) of the signal result in too many features.
- Different tones have different lengths, i.e. different numbers of windows are used.
- There might be silence (or noise such as breathing) at the start / end of a tone.

Hence the (vectors of) coefficients found for all windows are clustered using k-means.
- Number of clusters: 3-4, motivated by the different phases of a tone: attack, (sustain), decay, silence/noise.
- The cluster centroids are used as the (only) features for the classification task.
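A minimal sketch of this step, assuming plain Lloyd iterations with a random start (the slides do not say how k-means was initialised). Some consistent ordering of the centroids is needed before concatenating them, so that feature vectors of different tones line up; sorting by the first coefficient is a hypothetical rule chosen here. `kmeans` and `tone_features` are hypothetical names.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distance of every window to every centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

def tone_features(coeffs, k=4, seed=0):
    """coeffs: (n_windows, n_coefficients) for one tone -> fixed-length vector."""
    centroids, _ = kmeans(coeffs, k, seed=seed)
    # Hypothetical ordering rule: sort centroids by their first coefficient.
    return centroids[np.argsort(centroids[:, 0])].ravel()

# Usage: 300 windows of 13 coefficients -> one vector of length 4 * 13 = 52.
feat = tone_features(np.random.default_rng(0).normal(size=(300, 13)), k=4)
```

Whatever the tone length, the result has k × (number of coefficients) entries, which is what makes tones of different durations comparable for the classifier.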

SLIDE 20

Hierarchical Clustering

[Figure: Cluster Dendrogram, Piano MFCCs; y-axis: Height]

SLIDE 21

k-means Clustering

[Figure: Cluster assignment of DCT frames; x-axis: DCT frame number; clusters: Attack, Sustain, Decay, Noise]

SLIDE 22

Classification

Support Vector Machines, tried kernels:
- linear: $K(x_i, x_j) = x_i^\top x_j$
- polynomial: $K(x_i, x_j) = (\gamma\, x_i^\top x_j + r)^d,\ \gamma > 0$
- RBF: $K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$
- sigmoid: $K(x_i, x_j) = \tanh(\gamma\, x_i^\top x_j + r)$ (extremely bad)

Linear Discriminant Analysis
Random Forests
some more, not reported
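The four kernels can be written out directly; the sketch below also builds a Gram matrix. Parameter names follow the formulas above, the function names are hypothetical, and this is not the kernlab implementation used by the authors.

```python
import numpy as np

def linear_kernel(xi, xj):
    return float(np.dot(xi, xj))

def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return float((gamma * np.dot(xi, xj) + r) ** d)

def rbf_kernel(xi, xj, gamma=1.0):
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return float(np.tanh(gamma * np.dot(xi, xj) + r))

def gram_matrix(X, kernel, **params):
    """K[a, b] = kernel(X[a], X[b]) for all pairs of rows of X."""
    n = len(X)
    K = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            K[a, b] = kernel(X[a], X[b], **params)
    return K

# Usage: Gram matrix of three feature vectors under the RBF kernel.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K = gram_matrix(X, rbf_kernel, gamma=1.4)
```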

SLIDE 23

Software

- Port of functions from the Matlab package rastamat to R
- Some more speech processing functions implemented, to be published in the package tuneR
- SVM implementation from the R package kernlab, and (the not yet published) classifieR for optimization and validation of classification results

SLIDE 24

Results

All results are based on a double 5-fold cross-validation (inner loop for parameter optimization, outer loop for assessment); 59 classes.

LPC coefficients: all misclassification rates > 85%.

PLP coefficients:

| Classifier | Parameters | Error | Std. error |
|---|---|---|---|
| SVM-Poly | γ = 1.4, d = 3 | 0.33 | 0.03 |
| SVM-RBF | γ = 1.4, σ = 0.023 | 0.44 | 0.03 |
| SVM-Lin | γ = 1.5 | 0.51 | 0.03 |
| RandFor | U = 1500, V = 3 | 0.32 | 0.03 |
| LDA | | 0.55 | 0.02 |
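The nested scheme (inner loop for tuning, outer loop for assessment) can be sketched as follows. Assumptions: a k-nearest-neighbour classifier stands in for the SVMs, so the tuned parameter is k rather than γ, d or σ; folds are drawn at random; the same inner split is reused for every candidate value; and the standard error is taken over the five outer folds, which may differ from how the slides computed it. All names are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])

def folds(n, n_folds, rng):
    """Random partition of 0..n-1 into n_folds roughly equal parts."""
    return np.array_split(rng.permutation(n), n_folds)

def nested_cv(X, y, candidates, n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    outer_errors = []
    for test in folds(len(y), n_folds, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        X_tr, y_tr = X[train], y[train]
        inner = folds(len(y_tr), n_folds, rng)   # same inner split for every candidate
        mean_err = {}
        for k in candidates:
            errs = []
            for val in inner:
                fit = np.setdiff1d(np.arange(len(y_tr)), val)
                pred = knn_predict(X_tr[fit], y_tr[fit], X_tr[val], k)
                errs.append(np.mean(pred != y_tr[val]))
            mean_err[k] = np.mean(errs)
        best_k = min(mean_err, key=mean_err.get)       # inner loop: parameter optimization
        pred = knn_predict(X_tr, y_tr, X[test], best_k)
        outer_errors.append(np.mean(pred != y[test]))  # outer loop: assessment
    outer_errors = np.array(outer_errors)
    return outer_errors.mean(), outer_errors.std(ddof=1) / np.sqrt(n_folds)

# Usage: two well-separated classes in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(4.0, 0.3, (30, 2))])
y = np.repeat([0, 1], 30)
err, se = nested_cv(X, y, candidates=[1, 3, 5])
```

Keeping the test fold out of the tuning step is the point of the inner loop: the reported error then does not benefit from having chosen parameters on the data it is assessed on.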

SLIDE 25

Results

MFCC + PLP:

| Classifier | Parameters | Error | Std. error |
|---|---|---|---|
| SVM-Poly | γ = 1.4, d = 2 | 0.18 | 0.02 |
| SVM-RBF | γ = 1.5, σ = 0.007 | 0.23 | 0.02 |
| SVM-Lin | γ = 0.6 | 0.18 | 0.03 |
| RandFor | U = 1000, V = 6 | 0.22 | 0.03 |
| LDA | | 0.28 | 0.02 |

SLIDE 26

Summary

- We are working on 59 classes, i.e. random guessing implies a misclassification error of 0.98.
- Best misclassification rate: 0.18 (comparable to what trained humans can achieve).
- It turns out that the choice and construction of appropriate variables is (as in so many other classification tasks) much more important than the particular classification method that is finally used.
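Assuming uniform random guessing over the 59 classes, the quoted baseline is the probability of picking a wrong class, so the best rate of 0.18 corresponds to 82% correctly classified tones:

$$\Pr(\text{misclassification} \mid \text{uniform guessing}) = 1 - \frac{1}{59} = \frac{58}{59} \approx 0.983, \qquad 1 - 0.18 = 0.82$$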

SLIDE 27

References I

von Ameln, F. (2001): Blind source separation in der Praxis. Diploma Thesis, Fachbereich Statistik, Universität Dortmund, Germany.

Cano, P., Loscos, A., Bonada, J. (1999): Score-Performance Matching using HMMs. In: Proceedings of the International Computer Music Conference. Beijing, China.

Cemgil, T., Desain, P., Kappen, B. (2000): Rhythm Quantization for Transcription. Computer Music Journal 24 (2), 60–76.

Ellis, D. P. W. (2005): PLP and RASTA (and MFCC, and inversion) in Matlab. URL http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/

Garczarek, U., Weihs, C., Ligges, U. (2003): Prediction of Notes from Vocal Time Series. Technical Report 1/2003, SFB475, Department of Statistics, University of Dortmund. http://www.sfb475.uni-dortmund.de.

Hastie, T., Tibshirani, R., Friedman, J. (2001): The Elements of Statistical Learning. Springer, New York.

Hsu, C.-W., Chang, C.-C., Lin, C.-J. (2008): A Practical Guide to Support Vector Classification. National Taiwan University, Taipei. URL http://www.csie.ntu.edu.tw/~cjlin

SLIDE 28

References II

Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A. (2004): kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software 11 (9), 1–20. URL http://www.jstatsoft.org/v11/i09/

Kleber, B. (2002): Evaluation von Stimmqualität in westlichem, klassischen Gesang. Diploma Thesis, Fachbereich Psychologie, Universität Konstanz.

Ligges, U. (2006): Transkription monophoner Gesangszeitreihen. Dissertation, Fachbereich Statistik, Universität Dortmund. http://hdl.handle.net/2003/22521.

Ligges, U., Weihs, C., Hasse-Becker, P. (2002): Detection of Locally Stationary Segments in Time Series. In: W. Härdle and B. Rönz (Eds.): CompStat2002 – Proceedings in Computational Statistics – 15th Symposium held in Berlin, Germany. Physika Verlag, Heidelberg, 285–290.

Nienhuys, H.-W., Nieuwenhuizen, J., et al. (2004): GNU LilyPond – The Music Typesetter. Free Software Foundation, http://www.lilypond.org, Version 2.0.3.

Opolko, F., Wapnick, J. (1987): McGill University Master Samples (CDs).

R Development Core Team (2009): R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, ISBN 3-900051-00-3, http://www.R-project.org.

SLIDE 29

References III

Reuter, C. (2002): Klangfarbe und Instrumentation – Geschichte – Ursachen – Wirkung. Peter Lang, Frankfurt/M.

Roever, C. (2003): Musikinstrumentenerkennung mit Hilfe der Hough-Transformation. URL http://www.aei.mpg.de/~chroev/publications/RoeverDiplom.pdf

Rosell, M. (2006): An Introduction to Front-End Processing and Acoustic Features for Automatic Speech Recognition. URL www.nada.kth.se/~rosell/courses/rosell_acoustic_features.pdf

Weihs, C., Berghoff, S., Hasse-Becker, P., Ligges, U. (2001): Assessment of Purity of Intonation in Singing Presentations by Discriminant Analysis. In: J. Kunert and G. Trenkler (Eds.): Mathematical Statistics and Biometrical Applications. Josef Eul, Bergisch-Gladbach, Köln, 395–410.

Weihs, C., Ligges, U. (2003): Automatic Transcription of Singing Performances. Bulletin of the International Statistical Institute, 54th Session, Proceedings, Volume LX, Book 2, 507–510.

Weihs, C., Ligges, U., Güttner, J., Hasse-Becker, P., Berghoff, S. (2003): Classification and Clustering of Vocal Performances. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 118–127.
