SVM based Classification of Instruments - Timbre Analysis
Uwe Ligges and Sebastian Krey
Department of Statistics, TU Dortmund
Reisensburg, Statistical Computing 2009
Outline: Introduction, Model Building, Timbre Features/Classification, Results, Summary
Introduction
Why Timbre Analysis? Why classification of voices or instruments?
Timbre generation
- objective criteria for the assessment of the quality of vocal performance
- support for singing teachers and students who try to improve voices
- derive properties related to performance quality aspects of single tones, like solidity / softness / brilliance of tones
Uwe Ligges and Sebastian Krey: SVM based Classification of Instruments - Timbre Analysis Reisensburg, Statistical Computing 2009
Music Recommender Systems
The quest for better features: such systems are widely used on server infrastructure that supports music listeners who download music from the web to their home computers and even to their mobile devices.
Speech Recognition
Here timbre should not make a difference.
Hearing Aids (and other audio compression tasks)
Perception analysis, e.g. 'Vowel Classification by a Perceptually Motivated Neurophysiologically Parameterized Auditory Model' (Szepannek et al., 2006).
Automatic transcription of (polyphonic) music:
- of interest for music publishers, music amateurs, and scientists (particularly those working in music psychology)
- parts of transcription algorithms are heavily used in music recommender systems
Pitch estimation
Several methods for pitch estimation (f0 tracking, ...) have been proposed:
- in the time domain (such as the model that follows shortly)
- in the frequency domain (such as our heuristic proposal)
- hybrid methods
- any combinations with, e.g., HMMs
None of them works really well on singing data, and none of them works on polyphonic data.
Pitch estimation model (monophonic)
$$y_t = \cos\left(2\pi t f_0 + \varphi\right) + \varepsilon_t$$
f0 = fundamental frequency, the parameter of interest
φ = phase displacement
εt = error
t ∈ {0/S, 1/S, ..., (T−1)/S} = time, with T = number of observations and S = sampling rate
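The slides do not say how f0 is estimated in this basic model. As a minimal sketch under our own assumptions: with unit amplitude and unknown phase φ, minimizing the squared error over φ is approximately the same as maximizing the periodogram |Σ_t y_t exp(−2πi f t)|² over f, so a grid search over trial frequencies yields an estimate of f0. The function names, sampling rate, noise level and grid are hypothetical.

```python
import cmath
import math
import random

def simulate(f0, phi, n_obs, rate, noise_sd, seed=1):
    """Simulate y_t = cos(2 pi t f0 + phi) + eps_t at t = 0/S, 1/S, ..., (T-1)/S."""
    rng = random.Random(seed)
    return [math.cos(2 * math.pi * (n / rate) * f0 + phi) + rng.gauss(0.0, noise_sd)
            for n in range(n_obs)]

def periodogram(y, f, rate):
    """|sum_t y_t exp(-2 pi i f t)|^2 / T at a single trial frequency f (Hz)."""
    s = sum(v * cmath.exp(-2j * math.pi * f * n / rate) for n, v in enumerate(y))
    return abs(s) ** 2 / len(y)

def estimate_f0(y, rate, f_min, f_max, step):
    """Grid search: the trial frequency with the largest periodogram value."""
    n_steps = int(round((f_max - f_min) / step))
    grid = [f_min + k * step for k in range(n_steps + 1)]
    return max(grid, key=lambda f: periodogram(y, f, rate))

rate = 8000  # hypothetical sampling rate S in Hz
y = simulate(f0=440.0, phi=0.3, n_obs=1000, rate=rate, noise_sd=0.5)
f_hat = estimate_f0(y, rate, f_min=300.0, f_max=600.0, step=1.0)
print(f_hat)
```

The grid step bounds the precision; a finer grid or a local least-squares refinement around the maximum would be natural next steps.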
$$y_t = \sum_{h=1}^{H} \cos\left(2\pi t f_0 h + \varphi_h\right) + \varepsilon_t$$
φh = phase displacement of the h-th partial
H = number of partials in the model
(f0, εt, t and T as above)
$$y_t = \sum_{h=1}^{H} B_h \cos\left(2\pi t f_0 h + \varphi_h\right) + \varepsilon_t$$
Bh = amplitude of the h-th partial
(all other symbols as above)
$$y_t = \sum_{h=1}^{H} B_h \cos\left(2\pi t f_0 (h + \delta_h) + \varphi_h\right) + \varepsilon_t$$
δh = frequency displacement of the h-th partial, where δ1 := 0
(all other symbols as above)
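A minimal sketch that simulates this model with amplitudes B_h and frequency displacements δ_h, using hypothetical parameter values. It makes explicit that the h-th partial oscillates at f0(h + δh) Hz, so δh ≠ 0 lets partials deviate slightly from exact integer multiples of f0.

```python
import math

def harmonic_signal(f0, amps, deltas, phases, n_obs, rate):
    """Noise-free y_t = sum_h B_h cos(2 pi t f0 (h + delta_h) + phi_h) at t = n / S."""
    if deltas[0] != 0.0:
        raise ValueError("delta_1 := 0 by definition")
    y = []
    for n in range(n_obs):
        t = n / rate
        y.append(sum(b * math.cos(2 * math.pi * t * f0 * (h + d) + p)
                     for h, (b, d, p) in enumerate(zip(amps, deltas, phases), start=1)))
    return y

def partial_frequencies(f0, deltas):
    """Frequency in Hz of the h-th partial: f0 * (h + delta_h)."""
    return [f0 * (h + d) for h, d in enumerate(deltas, start=1)]

f0 = 220.0
amps = [1.0, 0.5, 0.25]      # B_1, B_2, B_3 (hypothetical)
deltas = [0.0, 0.01, -0.02]  # delta_1 := 0, then small displacements (hypothetical)
phases = [0.0, 1.0, 2.0]     # phi_1, phi_2, phi_3 (hypothetical)
y = harmonic_signal(f0, amps, deltas, phases, n_obs=441, rate=44100)
freqs = partial_frequencies(f0, deltas)
print(freqs)
```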
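$$y_t = \sum_{h=1}^{H} \sum_{i=0}^{I} \Phi_i(t)\, B_{h,i} \cos\left(2\pi t f_0 (h + \delta_h) + \varphi_h\right) + \varepsilon_t$$
Bh,i = amplitude of the h-th partial for the i-th basis function
i = index of the I + 1 basis functions
$$\Phi_i(t) := \cos^2\left(\pi\, \frac{tS - i\Delta}{2\Delta}\right) \mathbb{1}_{[(i-1)\Delta,\,(i+1)\Delta]}(tS)$$
Φi(t) = i-th basis function, defined on windows with 50% overlap, with Δ := (T−1)/I, 𝟙 the indicator function and S the sampling rate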
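The cos² basis functions with 50% overlap form a partition of unity: on [iΔ, (i+1)Δ] only Φi and Φi+1 are non-zero, and they equal cos²(u) and sin²(u) for the same u, so they sum to one and the amplitudes Bh,i are blended smoothly between window centres. A minimal sketch that checks this numerically; it works on the sample index n = tS (our reading, so that n and iΔ share one scale) with Δ = (T−1)/I, and the values of T and I are hypothetical.

```python
import math

def basis(i, x, delta):
    """Phi_i at sample position x: cos^2(pi (x - i delta) / (2 delta)) on [(i-1) delta, (i+1) delta], else 0."""
    if (i - 1) * delta <= x <= (i + 1) * delta:
        return math.cos(math.pi * (x - i * delta) / (2 * delta)) ** 2
    return 0.0

def basis_matrix(n_obs, n_windows):
    """Rows: samples n = 0, ..., T-1; columns: basis functions i = 0, ..., I."""
    delta = (n_obs - 1) / n_windows
    return [[basis(i, n, delta) for i in range(n_windows + 1)] for n in range(n_obs)]

T, I = 1001, 4  # hypothetical: T observations, I + 1 = 5 basis functions
phi = basis_matrix(T, I)
row_sums = [sum(row) for row in phi]
print(min(row_sums), max(row_sums))  # both equal 1 up to rounding error
```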
$$y_t = \sum_{h=1}^{H} \sum_{i=0}^{I} \Phi_i(t)\, B_{h,i} \cos\left(2\pi t f_0 (h + \delta_h) + \varphi_h + (h + \delta_h) A_v \sin(2\pi f_v t + \varphi_v)\right) + \varepsilon_t$$
with Bh,i, Φi(t), Δ and S as above, and
fv = frequency of vibrato
Av = amplitude of vibrato
φv = phase displacement of vibrato
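Differentiating the cosine argument of the h-th partial and dividing by 2π gives its instantaneous frequency, f0(h + δh) + (h + δh) Av fv cos(2π fv t + φv). The vibrato term therefore modulates every partial in proportion to h + δh, which keeps the partials (nearly) harmonic during vibrato. A minimal sketch comparing this derivative with a finite-difference approximation; all parameter values are hypothetical.

```python
import math

def partial_phase(t, f0, h, delta_h, phi_h, a_v, f_v, phi_v):
    """Argument of the cosine of the h-th partial in the vibrato model, in radians."""
    k = h + delta_h
    return 2 * math.pi * t * f0 * k + phi_h + k * a_v * math.sin(2 * math.pi * f_v * t + phi_v)

def inst_freq_analytic(t, f0, h, delta_h, phi_h, a_v, f_v, phi_v):
    """Derivative of the phase divided by 2 pi (phi_h is a constant and drops out)."""
    k = h + delta_h
    return f0 * k + k * a_v * f_v * math.cos(2 * math.pi * f_v * t + phi_v)

def inst_freq_numeric(t, eps=1e-6, **params):
    """Central finite difference of the phase, divided by 2 pi."""
    slope = (partial_phase(t + eps, **params) - partial_phase(t - eps, **params)) / (2 * eps)
    return slope / (2 * math.pi)

# hypothetical parameters: 2nd partial of a 440 Hz tone with a 5.5 Hz vibrato
params = dict(f0=440.0, h=2, delta_h=0.0, phi_h=0.0, a_v=2.0, f_v=5.5, phi_v=0.0)
t = 0.1
numeric = inst_freq_numeric(t, **params)
analytic = inst_freq_analytic(t, **params)
print(numeric, analytic)
# peak frequency deviation of partial h in Hz: (h + delta_h) * A_v * f_v
print((params["h"] + params["delta_h"]) * params["a_v"] * params["f_v"])
```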
This model has 5 + 3H parameters to estimate, but H might be > 10.
Pitch estimation model (POLYphonic)
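$$y_t = \sum_{j=1}^{J} \sum_{h=1}^{H} \sum_{i=0}^{I} \Phi_{i,j}(t)\, B_{h,i,j} \cos\left(2\pi t f_{0,j} (h_j + \delta_{h,j}) + \varphi_{h,j} + (h_j + \delta_{h,j}) A_{v,j} \sin(2\pi f_{v,j} t + \varphi_{v,j})\right) + \varepsilon_t$$
J = number of polyphonic tones; all other parameters as in the monophonic model, now indexed by tone j.
Identifiability?!
Joint work in progress (?) with Katrin Sommer and Claus Weihs; cooperation with the Technical University of Tampere.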
The Timbre Problem
Timbre Classification
Joint work with Sebastian Krey.
Specific task: classification of instruments based on a given audio track of one tone.
Data: McGill Instrument Database, 38 instruments played in 59 ways (e.g. bowed vs. pizzicato), each with 6-88 differently pitched tones; altogether 1976 wave files (44100 Hz, 16 bit, 3-5 seconds each).
Let’s start
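Pre-emphasis filtering to increase higher partials:
$$y_t = x_t - 0.97\, x_{t-1}$$
Short Time Fourier Transformation (on overlapping windows):
$$F(t, k) = \sum_{j=1-M}^{N-M} w(j - t)\, x_j \exp\left(-\frac{2 i \pi j k}{N}\right)$$
Hamming windows (width: 25 ms, overlap: 10 ms), where T is the window width:
$$w(t) = \begin{cases} 0.54 + 0.46 \cos\left(\frac{2\pi t}{T}\right), & -\frac{T}{2} \le t \le \frac{T}{2}, \\ 0, & \text{otherwise.} \end{cases}$$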
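A minimal sketch of this front end in plain Python, under stated assumptions: the 25 ms width is taken as the frame length and the 10 ms as the step between frame starts (our reading of "overlap"); the Hamming window is written on k = 0, ..., n−1, which is the centred window shifted by half its width; and a naive O(N²) DFT stands in for an FFT to keep the example self-contained. All function names are hypothetical.

```python
import cmath
import math

def pre_emphasis(x, coef=0.97):
    """y_t = x_t - coef * x_{t-1}; the first sample is passed through unchanged."""
    return [x[0]] + [x[t] - coef * x[t - 1] for t in range(1, len(x))]

def hamming(n):
    """Hamming window of length n: 0.54 - 0.46 cos(2 pi k / (n - 1)), k = 0, ..., n - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frames(x, rate, width_s=0.025, step_s=0.010):
    """Overlapping frames of width_s seconds whose start points are step_s seconds apart."""
    width, step = int(round(width_s * rate)), int(round(step_s * rate))
    return [x[s:s + width] for s in range(0, len(x) - width + 1, step)]

def dft(frame):
    """Naive DFT: F(k) = sum_j frame[j] exp(-2 i pi j k / N)."""
    n = len(frame)
    return [sum(v * cmath.exp(-2j * math.pi * j * k / n) for j, v in enumerate(frame))
            for k in range(n)]

def stft_magnitudes(x, rate):
    """Magnitude spectra (bins 0..N/2) of the pre-emphasised, Hamming-weighted frames."""
    spectra = []
    for fr in frames(pre_emphasis(x), rate):
        w = hamming(len(fr))
        spec = dft([v * wk for v, wk in zip(fr, w)])
        spectra.append([abs(c) for c in spec[: len(spec) // 2 + 1]])
    return spectra

rate = 8000  # hypothetical sampling rate in Hz (the McGill files use 44100 Hz)
x = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate // 10)]  # 0.1 s, 1 kHz tone
spec = stft_magnitudes(x, rate)
n_fft = len(frames(x, rate)[0])
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), peak_bin * rate / n_fft)  # number of frames, peak frequency in Hz
```

For this 0.1 s test tone the sketch produces 8 frames of 200 samples, and the spectral peak falls in the bin of the 1 kHz input.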
Mel scale: transformation of FFT frequencies to the mel scale in order to model the perception of the human ear (e.g., the ear resolves frequency differences more finely below about 1 kHz than above):
$$\mathrm{Mel}(hz) = 2595 \log_{10}\left(1 + \frac{hz}{700}\right)$$
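The mel mapping, its inverse, and band edges spaced equally on the mel scale, as one would use to place a filter bank. This is a sketch: the helper names are hypothetical and the band layout is not taken from rastamat. Equal mel steps translate into Hz bands that widen with frequency, mirroring the coarser frequency resolution of the ear at high frequencies.

```python
import math

def hz_to_mel(hz):
    """Mel(hz) = 2595 log10(1 + hz / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_low, f_high, n_bands):
    """n_bands + 2 edge frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    return [mel_to_hz(lo + k * (hi - lo) / (n_bands + 1)) for k in range(n_bands + 2)]

print(round(hz_to_mel(1000.0), 1))
edges = mel_band_edges(0.0, 8000.0, 10)
print([round(e) for e in edges])
```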
Feature Extraction
Using features well known from speech recognition, e.g. (Perceptual) Linear Predictive Coding (LPC/PLP):
- filter further in order to get a roughly uniform loudness impression over the whole frequency range (PLP)
- loudness compression by taking cube roots of the amplitudes (PLP)
- transformation back to the time domain by inverse Fourier transformation
- fit an autoregressive model (by Levinson-Durbin recursion):
$$y_t = \sum_{j=1}^{p} a_j\, y_{t-j} + e_t$$
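A minimal sketch of the Levinson-Durbin recursion for the AR model above: given autocorrelations r_0, ..., r_p it solves the Yule-Walker equations for a_1, ..., a_p in O(p²) operations and also returns the prediction error variance. In PLP the autocorrelations come from the inverse Fourier transform of the compressed spectrum; here, as a hypothetical check, they are computed analytically for an AR(2) process with known coefficients.

```python
def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for y_t = sum_{j=1}^p a_j y_{t-j} + e_t.

    r: autocorrelations r[0..p]. Returns (a, err) with a = [a_1, ..., a_p]
    and err the prediction error variance of the order-p model.
    """
    a = []
    err = r[0]
    for k in range(1, p + 1):
        # reflection coefficient of order k: (r_k - sum_{j<k} a_j r_{k-j}) / err
        acc = r[k] - sum(a[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / err
        # update lower-order coefficients: a_j <- a_j - kappa * a_{k-j}, then append a_k = kappa
        a = [a[j] - kappa * a[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= 1.0 - kappa ** 2
    return a, err

# Autocorrelations (r_0 = 1) of a hypothetical AR(2) process with a_1 = 0.5, a_2 = -0.3,
# obtained from the Yule-Walker equations r_1 = a_1 / (1 - a_2), r_2 = a_1 r_1 + a_2.
a1, a2 = 0.5, -0.3
r1 = a1 / (1 - a2)
r = [1.0, r1, a1 * r1 + a2]
coef, err = levinson_durbin(r, 2)
print(coef, err)
```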
Mel Frequency Cepstral Coefficients (MFCC):
- loudness compression by taking the logarithm
- discrete cosine transformation (DCT)
- keep only the first p DCT coefficients
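A minimal sketch of these final MFCC steps, assuming the mel-band energies of one frame are already available: take logarithms, apply a DCT and keep the first p coefficients. The unnormalized DCT-II and the small floor inside the logarithm are our choices; toolboxes such as rastamat may scale the coefficients differently. The energies are hypothetical.

```python
import math

def dct2(v):
    """Unnormalized DCT-II: c_k = sum_n v_n cos(pi k (n + 0.5) / N)."""
    n_len = len(v)
    return [sum(x * math.cos(math.pi * k * (n + 0.5) / n_len) for n, x in enumerate(v))
            for k in range(n_len)]

def mfcc_from_band_energies(energies, p, floor=1e-10):
    """Log-compress the mel-band energies, apply the DCT-II and keep the first p coefficients."""
    log_e = [math.log(max(e, floor)) for e in energies]
    return dct2(log_e)[:p]

energies = [4.0, 9.0, 7.5, 3.2, 1.1, 0.6, 0.3, 0.2]  # hypothetical mel-band energies
print(mfcc_from_band_energies(energies, p=4))
```

A flat spectrum yields a non-zero 0th coefficient only, which is a quick sanity check of the DCT step.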
Clustering
Many features times many time frames (windows) of the signal result in too many features.
Different tones have different lengths, i.e. different numbers of windows are used.
There might be silence (or noise such as breathing) at the start / end of a tone.
Hence the (vectors of) coefficients of all windows are clustered using k-means.
Number of clusters: 3-4, motivated by the different phases of a tone: attack, (sustain), decay, silence/noise.
The cluster centroids are used as the (only) features for the classification task.
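A minimal k-means (Lloyd's algorithm) sketch showing how a variable number of per-window coefficient vectors becomes a fixed-length feature vector of k centroids. The deterministic initialization from evenly spaced frames and the lexicographic sorting of the centroids (to give the features a reproducible order across tones) are our assumptions; the slides do not say how this was handled. The toy "attack", "sustain" and "noise" frames are hypothetical.

```python
def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm on equal-length tuples; returns (centroids, labels)."""
    step = max(1, len(points) // k)
    centroids = [list(points[c * step]) for c in range(k)]  # deterministic initialization
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: nearest centroid in squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((x - m) ** 2 for x, m in zip(pt, centroids[c])))
                  for pt in points]
        # update step: mean of the assigned points (an empty cluster keeps its centroid)
        new = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else centroids[c])
        if new == centroids:  # converged: the centroids no longer change
            break
        centroids = new
    return centroids, labels

def tone_features(frame_vectors, k=4):
    """Concatenate the k centroids, sorted lexicographically, into one feature vector."""
    centroids, _ = kmeans(frame_vectors, k)
    return [v for c in sorted(centroids) for v in c]

# hypothetical 2-D coefficient vectors for the windows of one tone
attack = [(10.0 + 0.1 * i, 1.0) for i in range(5)]
sustain = [(0.1 * i, 0.0) for i in range(20)]
noise = [(-8.0 + 0.1 * i, 3.0) for i in range(10)]
feats = tone_features(attack + sustain + noise, k=3)
print(feats)
```

Whatever the number of windows of a tone, the feature vector has length k times the number of coefficients per window.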
Hierarchical Clustering
[Figure: Cluster Dendrogram, Piano MFCCs (y-axis: Height)]
K-means Clustering
[Figure: Cluster assignment of DCT frames (x-axis: DCT frame number; clusters: Attack, Sustain, Decay, Noise)]
Classification
Support Vector Machines, tried kernels:
- linear: $K(x_i, x_j) = x_i' x_j$
- polynomial: $K(x_i, x_j) = (\gamma\, x_i' x_j + r)^d$, γ > 0
- rbf: $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$
- sigmoid: $K(x_i, x_j) = \tanh(\gamma\, x_i' x_j + r)$ (extremely bad)
Other classifiers: Linear Discriminant Analysis, Random Forests, and some more (not reported).
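The four kernels as plain functions plus a Gram-matrix helper, as a sketch of what an SVM evaluates on the centroid features. Parameter names follow the formulas above; kernlab, used for the experiments, has its own parametrization (e.g. its RBF kernel takes a parameter called sigma). Unlike the other three, the sigmoid kernel is not positive semi-definite for every γ and r (even K(x, x) = tanh(γ x'x + r) can be negative), a known weakness of that kernel.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(x, y) = x'y"""
    return dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=3):
    """K(x, y) = (gamma x'y + r)^d with gamma > 0"""
    return (gamma * dot(x, y) + r) ** d

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    """K(x, y) = tanh(gamma x'y + r)"""
    return math.tanh(gamma * dot(x, y) + r)

def gram_matrix(points, kernel, **params):
    """Kernel matrix K[i][j] = kernel(points[i], points[j], **params)."""
    return [[kernel(p, q, **params) for q in points] for p in points]

pts = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.5)]  # hypothetical feature vectors
K = gram_matrix(pts, rbf_kernel, gamma=0.5)
print(K)
```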
Software
- port of functions from the Matlab package rastamat to R
- some more speech processing functions implemented, to be published in package tuneR
- SVM implementation from R package kernlab
- (the not yet published) classifieR for optimization and validation of classification results
Results
All results are based on a doubled 5-fold cross-validation (inner loop for parameter optimization, outer loop for assessment); 59 classes.
LPC coefficients: all misclassification rates > 85%.
PLP coefficients:
classif.   parameter             error   std. error
SVM-Poly   γ = 1.4, d = 3        0.33    0.03
SVM-RBF    γ = 1.4, σ = 0.023    0.44    0.03
SVM-Lin    γ = 1.5               0.51    0.03
RandFor    U = 1500, V = 3       0.32    0.03
LDA        -                     0.55    0.02
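A minimal sketch of the bookkeeping behind a doubled (nested) 5-fold cross-validation: inner folds built only from the outer training data select the parameter, and the held-out outer fold assesses that choice. The round-robin fold assignment, the fixed seeds, the fit_and_score callback and the toy threshold classifier are hypothetical and are not taken from classifieR.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 with a fixed seed and deal the indices round-robin into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def nested_cv(n, param_grid, fit_and_score, k_outer=5, k_inner=5, seed=0):
    """Doubled CV: inner folds choose the parameter, outer folds assess the choice.

    fit_and_score(train_idx, test_idx, param) must return a misclassification rate.
    Returns one assessment error per outer fold.
    """
    outer_errors = []
    for test in kfold_indices(n, k_outer, seed):
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        inner = kfold_indices(len(train), k_inner, seed + 1)

        def inner_error(param):
            # mean validation error over the inner folds; the outer test fold is never used
            errs = []
            for fold in inner:
                val = [train[j] for j in fold]
                val_set = set(val)
                fit = [i for i in train if i not in val_set]
                errs.append(fit_and_score(fit, val, param))
            return sum(errs) / len(errs)

        best = min(param_grid, key=inner_error)  # the first grid value wins ties
        outer_errors.append(fit_and_score(train, test, best))
    return outer_errors

# toy data and a toy "classifier" (hypothetical): predict 1 iff x >= threshold
xs = [float(i) for i in range(100)]
ys = [int(x >= 50) for x in xs]

def fit_and_score(train_idx, test_idx, threshold):
    """Nothing is fitted on train_idx here; a real classifier would be trained on it."""
    wrong = sum(int((xs[i] >= threshold) != bool(ys[i])) for i in test_idx)
    return wrong / len(test_idx)

errors = nested_cv(len(xs), [50.0, 30.0, 70.0], fit_and_score)
print(errors, sum(errors) / len(errors))
```

Reported error and standard error would then be the mean and the spread of the outer-fold errors.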
MFCC + PLP coefficients:
classif.   parameter             error   std. error
SVM-Poly   γ = 1.4, d = 2        0.18    0.02
SVM-RBF    γ = 1.5, σ = 0.007    0.23    0.02
SVM-Lin    γ = 0.6               0.18    0.03
RandFor    U = 1000, V = 6       0.22    0.03
LDA        -                     0.28    0.02
Summary
We are working on 59 classes, i.e. random guessing implies a misclassification error of 0.98.
Best misclassification rate: 0.18 (comparable to what trained humans can achieve).
It turns out that the choice and construction of appropriate variables is (as in so many other classification tasks) much more important than the particular classification method that is finally used.
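The guessing baseline follows from the number of classes: guessing uniformly at random among 59 classes is correct with probability 1/59, whatever the class frequencies, so the expected misclassification rate is

$$1 - \frac{1}{59} = \frac{58}{59} \approx 0.983.$$

By comparison, the best rate of 0.18 means 82% correct classifications, roughly 48 times the 1/59 ≈ 1.7% expected from guessing.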
References
von Ameln, F. (2001): Blind source separation in der Praxis. Diploma Thesis, Fachbereich Statistik, Universität Dortmund, Germany.
Cano, P., Loscos, A., Bonada, J. (1999): Score-Performance Matching using HMMs. In: Proceedings of the International Computer Music Conference. Beijing, China.
Cemgil, T., Desain, P., Kappen, B. (2000): Rhythm Quantization for Transcription. Computer Music Journal 24 (2), 60–76.
Ellis, D. P. W. (2005): PLP and RASTA (and MFCC, and inversion) in Matlab. URL http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/
Garczarek, U., Weihs, C., Ligges, U. (2003): Prediction of Notes from Vocal Time Series. Technical Report 1/2003, SFB475, Department of Statistics, University of Dortmund. http://www.sfb475.uni-dortmund.de.
Hastie, T., Tibshirani, R., Friedman, J. (2001): The Elements of Statistical Learning. Springer, New York.
Hsu, C.-W., Chang, C.-C., Lin, C.-J. (2008): A Practical Guide to Support Vector Classification. National Taiwan University, Taipei. URL http://www.csie.ntu.edu.tw/~cjlin
Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A. (2004): kernlab – An S4 Package for Kernel Methods in R. Journal of Statistical Software 11 (9), 1–20. URL http://www.jstatsoft.org/v11/i09/
Kleber, B. (2002): Evaluation von Stimmqualität in westlichem, klassischen Gesang. Diploma Thesis, Fachbereich Psychologie, Universität Konstanz.
Ligges, U. (2006): Transkription monophoner Gesangszeitreihen. Dissertation, Fachbereich Statistik, Universität Dortmund. http://hdl.handle.net/2003/22521.
Ligges, U., Weihs, C., Hasse-Becker, P. (2002): Detection of Locally Stationary Segments in Time Series. In: W. Härdle and B. Rönz (Eds.): CompStat2002 – Proceedings in Computational Statistics – 15th Symposium held in Berlin, Germany. Physika Verlag, Heidelberg, 285–290.
Nienhuys, H.-W., Nieuwenhuizen, J., et al. (2004): GNU LilyPond – The Music Typesetter. Free Software Foundation, http://www.lilypond.org, Version 2.0.3.
Opolko, F., Wapnick, J. (1987): McGill University Master Samples (CDs).
R Development Core Team (2009): R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-00-3, http://www.R-project.org.
Reuter, C. (2002): Klangfarbe und Instrumentation – Geschichte – Ursachen – Wirkung. Peter Lang, Frankfurt/M.
Roever, C. (2003): Musikinstrumentenerkennung mit Hilfe der Hough-Transformation. URL http://www.aei.mpg.de/~chroev/publications/RoeverDiplom.pdf
Rosell, M. (2006): An Introduction to Front-End Processing and Acoustic Features for Automatic Speech Recognition. URL www.nada.kth.se/~rosell/courses/rosell_acoustic_features.pdf
Weihs, C., Berghoff, S., Hasse-Becker, P., Ligges, U. (2001): Assessment of Purity of Intonation in Singing Presentations by Discriminant Analysis. In: J. Kunert and G. Trenkler (Eds.): Mathematical Statistics and Biometrical Applications. Josef Eul, Bergisch-Gladbach, Köln, 395–410.
Weihs, C., Ligges, U. (2003): Automatic Transcription of Singing Performances. Bulletin of the International Statistical Institute, 54th Session, Proceedings, Volume LX, Book 2, 507–510.
Weihs, C., Ligges, U., Güttner, J., Hasse-Becker, P., Berghoff, S. (2003): Classification and Clustering of Vocal Performances. In: M. Schader, W. Gaul and M. Vichi (Eds.): Between Data Science and Applied Data Analysis. Springer, Berlin, 118–127.