[PPT] - Preeti Rao 2 nd CompMusic Workshop, Istanbul 2012 o Music signal PowerPoint Presentation

SLIDE 1

Preeti Rao 2nd CompMusic Workshop, Istanbul 2012

SLIDE 2

Music signal characteristics
Perceptual attributes and acoustic properties
Signal representations for pitch detection
STFT
Sinusoidal model
Pitch detection algorithms
Polyphonic context and predominant pitch tracking
Applications in MIR


SLIDE 3

WiSSAP 2007

*The Physics Classroom: http://www.glenbrook.k12.il.us/gbssci/phys/Class/sound/u11l2a.html

Digital audio format: PCM

Sampling rate: 44.1 kHz, 22.05 kHz
Amplitude resolution: 16 bits/sample

SLIDE 4

Department of Electrical Engineering, IIT Bombay

Interesting sounds are typically coded in the form of a temporal sequence of “atomic sound events”.
E.g. speech -> a sequence of phones
music -> an evolving pattern of notes
An atomic sound event, or a single gestalt, can be a complex acoustical signal described by a set of temporal and spectral properties => an evoked sensation.

SLIDE 5


A sound of given frequency components and sound pressure levels leads to perceived sensations that can be distinguished in terms of:

loudness <-- intensity
pitch <-- fundamental frequency
timbre (“quality” or “colour”) <-- other spectro-temporal properties

SLIDE 6


low pitch tone: T0 = 10 msec, Frequency = 100 Hz
high pitch tone: T0 = 3.3 msec, Frequency = 300 Hz

1 Hertz = 1 vibration/sec

SLIDE 7


Musical pitch scale

[Figure: musical pitch scale running from low pitch to high pitch]

semitone = frequency ratio of $2^{1/12}$
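
As a small worked illustration of the semitone ratio, the sketch below converts between frequency ratios and semitone counts. The function names and the example frequencies are illustrative choices, not taken from the slides.

```python
import math

SEMITONE = 2 ** (1 / 12)  # frequency ratio between adjacent semitones

def shift_semitones(freq_hz, n):
    """Return the frequency n semitones above (n > 0) or below (n < 0) freq_hz."""
    return freq_hz * SEMITONE ** n

def interval_in_semitones(f1_hz, f2_hz):
    """Return the interval from f1_hz up to f2_hz, measured in semitones."""
    return 12 * math.log2(f2_hz / f1_hz)

# Twelve semitones make one octave, i.e. a doubling of frequency.
octave_up = shift_semitones(220.0, 12)
# A 3:2 frequency ratio is close to, but not exactly, 7 semitones.
fifth = interval_in_semitones(200.0, 300.0)
```

The second value shows why a scale built from equal semitones only approximates the small whole-number ratios.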

SLIDE 8


The construction of a musical scale is based on two assumptions about the human hearing process:
The ear is sensitive to ratios of fundamental frequencies (pitches), not so much to absolute pitch.
The preferred “musical intervals”, i.e. those perceived to be most consonant, are the ratios of small whole numbers.

A musical sound typically comprises several frequencies.
The frequencies are evident if we observe the “spectrum” of the sound.

SLIDE 9


[Figure: waveforms of 300 Hz, 600 Hz and 900 Hz tones, and of the sums 300 Hz + 600 Hz and 300 Hz + 600 Hz + 900 Hz]
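
The tones named on this slide can be summed numerically. The sketch below is one way to check that the sum of 300, 600 and 900 Hz components repeats with the period of the 300 Hz fundamental. The sampling rate is an assumption, chosen so that this period is a whole number of samples.

```python
import math

FS = 48000  # assumed sampling rate in Hz; 300 Hz then has a 160-sample period

def tone_sum(freqs_hz, n_samples, fs=FS):
    """Sum of unit-amplitude sine tones at the given frequencies."""
    return [sum(math.sin(2 * math.pi * f * n / fs) for f in freqs_hz)
            for n in range(n_samples)]

complex_tone = tone_sum([300, 600, 900], 480)
period = FS // 300  # 160 samples, about 3.3 ms
# Largest difference between the signal and itself shifted by one period.
max_diff = max(abs(complex_tone[n] - complex_tone[n + period]) for n in range(320))
```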

SLIDE 10

Sound “atoms”: single tone signal
[Figure: waveform x1(t) versus t (ms) and spectrum X1(f) versus f (Hz)]

SLIDE 11

Non-tonal signal
[Figure: waveform x2(t) versus t (ms) and spectrum X2(f) versus f (Hz)]

SLIDE 12

Complex tone signal
[Figure: waveform x3(t) versus t (ms) and spectrum X3(f) versus f (Hz)]

SLIDE 13

Bandpass noise signal
[Figure: waveform x4(t) versus t (ms) and spectrum X4(f) versus f (Hz)]

SLIDE 14

A flute note
[Figure: waveform x1(t) versus t (ms) and spectrum X1(f) in dB versus f (kHz)]

SLIDE 15

We see that the distinctive signal characteristics are more evident in the frequency domain.
The ear is a frequency analyzer. It represents a unique combination of analysis and synthesis => we do not perceive spectral components but rather the composite sounds.
We observe that a single “note” is perceived as one entity of well-defined subjective sensations. This is due to the spatial pattern recognition process achieved by the central auditory system.


SLIDE 16

Major dimensions of music for retrieval are melody, rhythm, harmony and timbre.

Melody, harmony -> based on pitch content
Rhythm -> based on timing information
Timbre -> relates to instrumentation, texture

A representation of these high-level attributes can be obtained from pitch, timing and spectro-temporal information extracted by audio signal analysis. Representations are then compared via a similarity measure to achieve retrieval.


SLIDE 17

The temporal pattern of frame-level features can offer important cues to signal identity

[Diagram: feature extraction. Audio signal → analysis windows (duration: 50 – 100 ms) → frame-level features → texture windows (duration: 0.5 – 1.0 s) → feature summary → feature vector]

M. F. McKinney and J. Breebaart, "Features for Audio and Music Classification," in Proc. ISMIR, 2003.
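
The two-level windowing on this slide can be sketched as follows. This is a minimal illustration, not the cited authors' feature set: RMS energy stands in for the frame-level feature, the summary is a (mean, standard deviation) pair, and the sampling rate, window lengths and function names are all assumptions.

```python
import math
import statistics

def frame_rms(signal, frame_len):
    """Frame-level feature: RMS energy of consecutive non-overlapping frames."""
    return [math.sqrt(sum(v * v for v in signal[i:i + frame_len]) / frame_len)
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def texture_summary(frame_features, frames_per_texture):
    """Summarise frame features over each texture window as (mean, std)."""
    out = []
    for i in range(0, len(frame_features) - frames_per_texture + 1, frames_per_texture):
        chunk = frame_features[i:i + frames_per_texture]
        out.append((statistics.mean(chunk), statistics.pstdev(chunk)))
    return out

fs = 8000  # assumed sampling rate in Hz
signal = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]  # 1 s of a 200 Hz tone
frames = frame_rms(signal, frame_len=400)                  # 50 ms analysis windows
vectors = texture_summary(frames, frames_per_texture=10)   # 0.5 s texture windows
```

Each entry of `vectors` is one feature vector describing half a second of audio.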

SLIDE 18

Melody: pitch-related feature
[Figure: melody shown as frequency/note versus time]

Melody is the temporal sequence of notes usually played by a single instrument (fixed timbre). The discrete notes (pitches) are typically selected from a musical scale.

SLIDE 19

Typical implementation:
Pitch detection is carried out on the audio signal at uniformly spaced intervals
The pitch sequence is segmented into notes (regions of relatively steady pitch)
Notes are labeled
Note patterns are matched to determine melodic similarity

Challenges:
Note segmentation can be a difficult task
Pitch detection in polyphonic music is tough
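
The segmentation and labeling steps can be sketched roughly as below. Two things are assumptions for illustration: notes are labeled with equal-tempered MIDI note numbers, and a region counts as a note only if the rounded label is steady for at least `min_frames` frames. A pitch value of 0 marks an unvoiced frame.

```python
import math

def hz_to_midi(f_hz):
    """Map a frequency to a (fractional) MIDI note number, with A4 = 440 Hz = 69."""
    return 69 + 12 * math.log2(f_hz / 440.0)

def segment_notes(pitch_hz, min_frames=3):
    """Group consecutive frames sharing a rounded note label.

    Returns a list of (note_label, start_frame, length_in_frames).
    """
    labels = [round(hz_to_midi(f)) if f > 0 else None for f in pitch_hz]
    notes, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            # Close the current region; keep it only if voiced and long enough.
            if labels[start] is not None and i - start >= min_frames:
                notes.append((labels[start], start, i - start))
            start = i
    return notes

track = [0, 0, 220, 221, 219, 220, 247, 246, 247, 248, 0, 262]
notes = segment_notes(track)
```

Real pitch tracks contain glides and ornaments, which is where this simple rounding rule breaks down and segmentation becomes difficult.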

SLIDE 20


Monophonic signal: cues to perceived pitch
[Figure: waveform, spectrum, “Schroeder histogram”, PDA]

A. de Cheveigne, "Multiple F0 estimation," in D.-L. Wang and G. J. Brown, editors, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, IEEE Press / Wiley, 2006.

SLIDE 21

Time (Lag) domain: maximise autocorrelation value
Frequency domain: minimise error between estimated and predicted harmonic structures
Other
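
A minimal sketch of the time (lag) domain approach follows: the lag that maximises the autocorrelation within a search range is taken as the pitch period. The search range, sampling rate and test signal are assumptions. The range is deliberately narrow because multiples of the true period also maximise the autocorrelation.

```python
import math

def autocorr_pitch(frame, fs, fmin, fmax):
    """Estimate F0 as fs / lag, where lag maximises the autocorrelation
    over the lag range [fs / fmax, fs / fmin]."""
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    n_terms = len(frame) - lag_max  # same number of products for every lag
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(n_terms))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

fs = 8000  # assumed sampling rate in Hz
# Test tone: 200 Hz fundamental plus a weaker second harmonic.
frame = [math.sin(2 * math.pi * 200 * n / fs) + 0.5 * math.sin(2 * math.pi * 400 * n / fs)
         for n in range(400)]
f0 = autocorr_pitch(frame, fs, fmin=150.0, fmax=400.0)
```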


SLIDE 22


SLIDE 23


Music and speech signals are typically time-varying in nature => a time-frequency representation is required to visualize signal characteristics. The short-time Fourier transform (STFT) affords such a representation based on an assumption of signal quasi-stationarity. The window shape dictates the time and frequency resolution trade-off.

$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m}$$
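
A discrete version of the STFT can be sketched with a direct DFT of each windowed frame. This is an unoptimised illustration: the Hann window, frame length, hop size and test signal are assumptions, and time is measured from the start of each frame, so phases differ from the definition above by a linear phase term while magnitudes are unaffected.

```python
import cmath
import math

def stft(x, win_len, hop):
    """Return a list of frames, each holding win_len // 2 + 1 complex DFT bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]  # Hann
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(win_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len) for n in range(win_len))
            for k in range(win_len // 2 + 1)
        ])
    return frames

fs = 8000  # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
spec = stft(x, win_len=64, hop=32)
# Bin k corresponds to k * fs / win_len Hz, so 1000 Hz should peak at bin 8.
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in spec]
```

A longer `win_len` narrows the spacing between bins at the cost of averaging over more time, which is the resolution trade-off noted above.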

SLIDE 24

[Figure: STFT computation. The signal x(m) is multiplied by the sliding window w(n-m), and the DFT of x(m)w(n-m) gives X(n, ω) for ω up to π]

SLIDE 25

SLIDE 26

$$\hat{x}[t] = \sum_{i=1}^{I[t]} a_i[t] \cos(\Phi_i[t]) + e[t]$$

$a_i[t]$: amplitude variation of ith sinusoidal component (“partial”)
$\Phi_i[t]$: total phase (represents both frequency and phase variation)
$I[t]$: number of partials, can vary with time

$$\Phi_i[t] = \omega_i[t]\, t + \varphi_i[t]$$

Model parameters to be estimated: $\{a_i, \omega_i, \varphi_i\}_l$
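
The sum-of-sinusoids part of the model can be sketched as additive synthesis. This illustration assumes each partial keeps a constant amplitude, frequency and phase over the synthesised frame and leaves out the residual term; the parameter values are made up.

```python
import math

def synthesize(partials, n_samples, fs):
    """Additive synthesis of partials given as (amplitude, freq_hz, phase_rad),
    all held constant over the frame."""
    return [sum(a * math.cos(2 * math.pi * f * n / fs + ph) for a, f, ph in partials)
            for n in range(n_samples)]

fs = 8000  # assumed sampling rate in Hz
# Three harmonically related partials with zero initial phase.
partials = [(1.0, 200.0, 0.0), (0.5, 400.0, 0.0), (0.25, 600.0, 0.0)]
tonal = synthesize(partials, n_samples=160, fs=fs)
```

Subtracting such a synthesised tonal component from the original signal is what yields a residual.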

SLIDE 27

[Block diagram: Audio signal → Window → DFT → Peak detection → Peak tracking → Sinusoid parameters $\{a_i, \omega_i, \varphi_i\}_l$ → Additive synthesis → Tonal component; Residual = Audio signal − Tonal component]

For the smooth evolution of the signal, sine components are detected in each frame and linked to tracks from the previous frame based on frequency proximity.
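
Linking by frequency proximity can be sketched with a greedy nearest-peak rule. The maximum allowed frequency jump, the greedy matching order and the data layout are assumptions made for illustration; practical trackers use more careful matching.

```python
def track_peaks(frames, max_jump_hz=30.0):
    """Link per-frame peak frequencies into tracks by frequency proximity.

    frames: list of lists of peak frequencies (Hz), one list per frame.
    Returns a list of tracks; each track is a list of (frame_index, freq_hz).
    """
    finished, active = [], []
    for j, peaks in enumerate(frames):
        unused = list(peaks)
        still_active = []
        for track in active:
            last_freq = track[-1][1]
            best = min(unused, key=lambda f: abs(f - last_freq), default=None)
            if best is not None and abs(best - last_freq) <= max_jump_hz:
                track.append((j, best))   # track continues
                unused.remove(best)
                still_active.append(track)
            else:
                finished.append(track)    # track dies
        # Every unmatched peak starts a new track.
        still_active.extend([[(j, f)] for f in unused])
        active = still_active
    return finished + active

frames = [[200, 400], [205, 410], [210], [212, 600]]
tracks = track_peaks(frames)
```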

SLIDE 28

[Figure: spectral magnitude (dB) versus frequency (Hz), showing a fixed threshold (MaxPeak - 40 dB) and the final peaks picked]

[Figure: spectral magnitude (dB) versus frequency (Hz), showing thresholds at Envelope - 20 dB, Envelope - 25 dB and Envelope - 30 dB]
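
The fixed-threshold variant can be sketched as follows: keep local maxima of the dB spectrum that lie within 40 dB of the largest value. The function name and the toy spectrum are illustrative; the envelope-relative thresholds would need a spectral envelope estimate that is not shown here.

```python
def pick_peaks(mag_db, threshold_db=40.0):
    """Return indices of local maxima within threshold_db of the largest magnitude."""
    floor = max(mag_db) - threshold_db
    return [k for k in range(1, len(mag_db) - 1)
            if mag_db[k] > mag_db[k - 1]
            and mag_db[k] >= mag_db[k + 1]
            and mag_db[k] >= floor]

spectrum_db = [-60, -20, 0, -25, -50, -30, -45, -70, -41, -65]
peaks = pick_peaks(spectrum_db)  # the local maximum at -41 dB falls below the threshold
```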

SLIDE 29


Match spectrum around peak with that of ideal sinusoid. Apply threshold to the error.

SLIDE 30

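Peak tracking
[Figure: sine peaks plotted as frequency versus time over frames 0 to 4 and linked into tracks A to D, with the points where a track is born and where a track dies marked]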

SLIDE 31

[Figure: spectrogram, frequency (Hz) versus time (sec), with the sources annotated: tabla (percussion; strokes Ghe, Na, Tun), tanpura (drone), singer (main melody), harmonium (secondary melody)]

SLIDE 32


Input: magnitudes + locations of sinusoids
For a range of trial fundamentals, generate predicted harmonics
Minimise TWM error w.r.t. trial fundamentals

$$Err_{total} = \frac{Err_{p \to m}}{N} + \rho \, \frac{Err_{m \to p}}{K}$$

[Figure: nearest neighbour matching between predicted components (100 to 800 Hz in steps of 100 Hz) and measured components (100, 200, 375, 420, 700, 800 Hz)]

SLIDE 33


SLIDE 34


[Figure: trellis of pitch candidates p versus frame j, with measurement costs E(p,j) and E(p',j+1) and transition cost W(p,p')]

p → Pitch candidates, j → Frame (time instant)
E → Measurement cost (local), W → Smoothness cost

Minimize the global transition cost over the singing spurt
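
The global minimisation can be sketched as a dynamic programming search over the candidate trellis. The form of the smoothness cost is an assumption here (a weight times the absolute pitch change in octaves), as are the candidate values and measurement costs in the example.

```python
import math

def dp_track(cands, costs, weight=1.0):
    """Return the pitch path minimising the sum of E plus W over all frames.

    cands[j][i]: i-th pitch candidate (Hz) in frame j
    costs[j][i]: its measurement cost E
    W(p, p') is assumed to be weight * |log2(p' / p)|.
    """
    total = [list(costs[0])]
    back = []
    for j in range(1, len(cands)):
        row, ptr = [], []
        for p_new, e in zip(cands[j], costs[j]):
            trans = [total[-1][i] + weight * abs(math.log2(p_new / p_old))
                     for i, p_old in enumerate(cands[j - 1])]
            i_best = min(range(len(trans)), key=trans.__getitem__)
            row.append(trans[i_best] + e)
            ptr.append(i_best)
        total.append(row)
        back.append(ptr)
    # Backtrack from the cheapest final state.
    i = min(range(len(total[-1])), key=total[-1].__getitem__)
    path = [i]
    for ptr in reversed(back):
        i = ptr[i]
        path.append(i)
    path.reverse()
    return [cands[j][i] for j, i in enumerate(path)]

cands = [[200, 400], [200, 400], [200, 400]]
costs = [[0.1, 0.5], [0.3, 0.2], [0.1, 0.6]]  # frame 1 locally prefers the octave error
smooth = dp_track(cands, costs)
framewise = [c[e.index(min(e))] for c, e in zip(cands, costs)]
```

In this toy case the frame-wise minimum jumps an octave in the middle frame, while the path chosen with the smoothness cost stays at 200 Hz.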

SLIDE 35


SLIDE 36

[Block diagram: Polyphonic audio signal → Signal representation → Multi-F0 analysis → Predominant-F0 trajectory extraction → Singing voice detection → Voice F0 contour]

SLIDE 37


SLIDE 38

“Pitch class profile”

Pitch histogram
Similarity measure involves match between histograms
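
A pitch-class histogram and a histogram match can be sketched as below. The 12-bin resolution, the folding of all octaves relative to a given tonic, and the use of histogram intersection as the similarity measure are assumptions; the slides do not say which measure is used.

```python
import math

def pitch_class_histogram(pitch_hz, tonic_hz, bins=12):
    """Normalised histogram of pitch classes, folded into one octave above the tonic."""
    hist = [0.0] * bins
    for f in pitch_hz:
        if f > 0:  # skip unvoiced frames
            semis = 12 * math.log2(f / tonic_hz)
            hist[round(semis * bins / 12) % bins] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical normalised histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

a = pitch_class_histogram([220, 220, 330, 440], tonic_hz=220)
b = pitch_class_histogram([440, 660, 880, 880], tonic_hz=220)  # same melody, one octave up
c = pitch_class_histogram([247], tonic_hz=220)
sim_ab = histogram_similarity(a, b)
sim_ac = histogram_similarity(a, c)
```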

SLIDE 39

SLIDE 40

[Figure: positive phrases and negative phrase]

SLIDE 41

SLIDE 42

Detects phrases melodically similar to ‘Guru Bina’
[Figure: pitch contour with positive phrases and a negative phrase marked; emphatic beat (sam); swaras: S S N R]

SLIDE 43


SLIDE 44

[Block diagram: Polyphonic audio signal → Signal representation → Multi-F0 analysis → Predominant-F0 trajectory extraction → Singing voice detection → Voice F0 contour]

SLIDE 45


Input: magnitudes + locations of sinusoids
For a range of trial fundamentals, generate predicted harmonics
Minimise TWM error w.r.t. trial fundamentals

$$Err_{total} = \frac{Err_{p \to m}}{N} + \rho \, \frac{Err_{m \to p}}{K}$$

[Figure: nearest neighbour matching between predicted components (100 to 800 Hz in steps of 100 Hz) and measured components (100, 200, 375, 420, 700, 800 Hz)]

SLIDE 46


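Predicted to measured error

$$Err_{p \to m} = \sum_{n=1}^{N} \left[ \Delta f_n \cdot (f_n)^{-p} + \frac{a_n}{A_{max}} \times \left( q \, \Delta f_n \cdot (f_n)^{-p} - r \right) \right]$$

Significant term: $\Delta f / (f)^p$
$\Delta f$ = frequency mismatch error
$f$ = partial frequency

Measured to predicted error

$$Err_{m \to p} = \sum_{k=1}^{K} \left[ \Delta f_k \cdot (f_k)^{-p} + \frac{a_k}{A_{max}} \times \left( q \, \Delta f_k \cdot (f_k)^{-p} - r \right) \right]$$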
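
The two-way mismatch (TWM) error can be sketched in a few lines. Several details are assumptions for illustration: the default values of p, q, r and ρ, the rule that predicted harmonics extend up to the highest measured peak, and the use of the matched measured peak's amplitude in the predicted-to-measured term.

```python
import math

def twm_error(f0, peaks, p=0.5, q=1.4, r=0.5, rho=0.33):
    """Total TWM error of trial fundamental f0 against peaks [(freq_hz, amplitude), ...]."""
    a_max = max(a for _, a in peaks)
    f_top = max(f for f, _ in peaks)
    n_harm = max(1, math.ceil(f_top / f0))
    predicted = [f0 * h for h in range(1, n_harm + 1)]

    def term(df, f, a):
        w = df * f ** (-p)  # frequency mismatch weighted by f^(-p)
        return w + (a / a_max) * (q * w - r)

    # Predicted-to-measured: each predicted harmonic finds its nearest measured peak.
    err_pm = 0.0
    for fp in predicted:
        fm, am = min(peaks, key=lambda pk: abs(pk[0] - fp))
        err_pm += term(abs(fm - fp), fp, am)
    # Measured-to-predicted: each measured peak finds its nearest predicted harmonic.
    err_mp = 0.0
    for fm, am in peaks:
        fp = min(predicted, key=lambda f: abs(f - fm))
        err_mp += term(abs(fm - fp), fm, am)
    return err_pm / len(predicted) + rho * err_mp / len(peaks)

peaks = [(200.0, 1.0), (400.0, 0.8), (600.0, 0.5), (800.0, 0.3)]
candidates = [100.0, 150.0, 200.0, 300.0, 400.0]
best_f0 = min(candidates, key=lambda f0: twm_error(f0, peaks))
```

The sub-octave trial (100 Hz) is penalised by predicted harmonics that find no measured partner, and the octave trial (400 Hz) by measured peaks left unexplained, which is the point of matching in both directions.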

SLIDE 47

Melody detection system [1]

SLIDE 48


F0 search range (male/female)
p, q, r
ρ (male/female)
Window length (pitch range and rate of variation)
Smoothness cost parameter (rate of pitch variation)
Voicing threshold

SLIDE 49


Window length is an analysis parameter that influences the accuracy of sinusoidal modeling of the signal
Closely-spaced components in the polyphony => need for higher frequency resolution = longer windows
Pitch variation with time can be rapid in ornamented regions => need for better time resolution = shorter windows

SLIDE 50

Easily computable measures for adapting window length
Signal sparsity: a sparse spectrum is more “concentrated” => better represented sinusoidal components
Window length selection (20, 30, 40 ms) based on maximizing signal sparsity
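
The selection rule can be sketched as follows. The sparsity measure is an assumption: kurtosis of the magnitude spectrum is used here as one easily computable choice, and the slides do not say which measure is meant. The Hann window, sampling rate and test tone are also illustrative.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of the DFT of a Hann-windowed frame (bins 0 .. N/2)."""
    n_len = len(frame)
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_len) for n in range(n_len)]
    return [abs(sum(frame[n] * win[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                    for n in range(n_len)))
            for k in range(n_len // 2 + 1)]

def kurtosis(values):
    """Fourth central moment over squared variance; larger means more concentrated."""
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m4 = sum((v - mean) ** 4 for v in values) / len(values)
    return m4 / (m2 * m2)

def choose_window_ms(signal, fs, lengths_ms=(20, 30, 40)):
    """Pick the window length whose spectrum is sparsest under the assumed measure."""
    scores = {ms: kurtosis(magnitude_spectrum(signal[:fs * ms // 1000]))
              for ms in lengths_ms}
    return max(scores, key=scores.get)

fs = 8000  # assumed sampling rate in Hz
steady = [math.sin(2 * math.pi * 400 * n / fs) for n in range(320)]
best_ms = choose_window_ms(steady, fs)
```

For this steady tone the longest window gives the most concentrated spectrum; a rapidly varying pitch smears the spectrum of a long window, which is what would push the choice toward shorter ones.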

SLIDE 51

1. V. Rao and P. Rao, “Vocal melody extraction in the presence of pitched accompaniment in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145–2154, Nov. 2010.

2. V. Rao, P. Gaddipati and P. Rao, “Signal-driven window adaptation for sinusoid identification in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, Jan. 2012.
