SLIDE 1

Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking

ECE 417, Multimedia Signal Processing
October 10, 2019

SLIDE 2

Outline

The LPC-10 speech synthesis model
Autocorrelation-based pitch tracking
Inter-frame interpolation of pitch and energy contours
The LPC-10 excitation model: white noise, pulse train
Linear predictive coding: how to find the coefficients
Linear predictive coding: how to make sure the coefficients are stable

SLIDE 3

The LPC-10 speech synthesis model

SLIDE 4

The LPC-10 Speech Coder: Transmitted Parameters

Each frame is 54 bits, and is used to synthesize 22.5 ms of speech: (54 bits/frame)/(0.0225 seconds/frame) = 2400 bits/second

Pitch: 7 bits/frame (127 distinguishable non-zero pitch periods)
Energy: 5 bits/frame (32 levels, on a logRMS scale)
10 linear predictive coefficients (LPC): 41 bits/frame
Synchronization: 1 bit/frame

SLIDE 5

The LPC-10 speech synthesis model

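[Figure: block diagram of the LPC-10 synthesizer]

Excitation $e[n]$, chosen by a binary control switch: Voiced ($P>0$) vs. Unvoiced ($P=0$)
Voiced Speech, pitch period $P$: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
Unvoiced Speech: $e[n] \sim \mathcal{N}(0,1)$
Gain: $G = e^{\log \mathrm{RMS}}$
Vocal Tract: Modeled by an LPC synthesis filter, $H(e^{j\omega})$, whose output is the speech signal $s[n]$.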

SLIDE 7

Autocorrelation is maximum at n=0

$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-n]$$

SLIDE 8

Autocorrelation is maximum at n=0

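$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-n] = x[n] * x[-n] = \mathcal{F}^{-1}\left\{ |X(\omega)|^2 \right\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\, d\omega$$

Notice that, for n=0, this becomes just Parseval’s theorem:

$$r_{xx}[0] = \sum_{m=-\infty}^{\infty} x^2[m] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\, d\omega$$

But since $|X(\omega)|^2$ is positive and real, any value of $e^{j\omega n}$ that is NOT positive and real will reduce the value of the integral!

$$r_{xx}[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\, d\omega \le \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(\omega)|^2\, d\omega = r_{xx}[0]$$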
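The two views of the autocorrelation above (time-domain sum and inverse transform of the power spectrum) can be checked numerically. The following is a minimal NumPy sketch, not code from the lecture; the function name `autocorrelation` and the random test signal are illustrative assumptions.

```python
import numpy as np

def autocorrelation(x):
    """Return lags and r_xx[n] = sum_m x[m] x[m-n] for a finite-length signal.

    `autocorrelation` is a hypothetical helper name, not part of any library.
    """
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")        # lags -(N-1) ... (N-1)
    lags = np.arange(-(len(x) - 1), len(x))
    return lags, r

# Illustrative test signal (assumption): 256 samples of Gaussian noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
N = len(x)

lags, r = autocorrelation(x)
r0 = r[lags == 0][0]                           # r_xx[0], the signal energy

# Frequency-domain route: inverse DFT of |X|^2, zero-padded to 2N-1 points
# so that the circular autocorrelation equals the linear one.
X = np.fft.fft(x, 2 * N - 1)
r_fft = np.fft.ifft(np.abs(X) ** 2).real       # indices 0..N-1 are lags 0..N-1

print(np.isclose(r0, np.sum(x ** 2)))          # Parseval at lag 0
print(np.all(np.abs(r) <= r0 + 1e-9))          # no lag exceeds r_xx[0]
print(np.allclose(r_fft[:N], r[N - 1:]))       # both routes agree
```

Zero-padding the DFT to 2N-1 points is a design choice of this sketch; without it the inverse transform would give a circular autocorrelation instead.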

SLIDE 9

Example of an autocorrelation function computed from file0.wav, “Four score and seven years ago…”

SLIDE 10

Autocorrelation of a periodic signal

Suppose x[n] is periodic, $x[n] = x[n-P]$. Then the autocorrelation is also periodic:

$$r_{xx}[P] = \sum_{m=-\infty}^{\infty} x[m]\,x[m-P] = \sum_{m=-\infty}^{\infty} x^2[m] = r_{xx}[0]$$

SLIDE 11

Autocorrelation of a periodic signal is periodic

[Figure: two plots, each annotated "Pitch period = 9 ms = 99 samples"]

SLIDE 12

Autocorrelation pitch tracking

Compute the autocorrelation
Find the pitch period:

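$$P = \arg\max_{P_{\min} \le m \le P_{\max}} r_{xx}[m]$$

The search limits, $P_{\min}$ and $P_{\max}$, are important for good performance:

$P_{\min}$ corresponds to a high woman’s pitch, about $F_s/P_{\min} \approx 250$ Hz
$P_{\max}$ corresponds to a low man’s pitch, about $F_s/P_{\max} \approx 80$ Hz

[Figure: autocorrelation with the search limits $P_{\min}$ and $P_{\max}$ marked]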
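The argmax search can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function name `estimate_pitch_period`, the 8 kHz sampling rate, and the synthetic test frame are all made up for the example, not taken from the lecture.

```python
import numpy as np

def estimate_pitch_period(frame, fs, f_lo=80.0, f_hi=250.0):
    """Pick the lag in [P_min, P_max] that maximizes the autocorrelation.

    Hypothetical helper; f_lo and f_hi are the approximate pitch limits
    quoted on the slide (80 Hz and 250 Hz).
    """
    frame = np.asarray(frame, dtype=float)
    # Keep only non-negative lags: r[k] is r_xx[k] for k = 0 .. N-1.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    p_min = int(np.floor(fs / f_hi))           # shortest period to consider
    p_max = int(np.ceil(fs / f_lo))            # longest period to consider
    p_max = min(p_max, len(r) - 1)
    return p_min + int(np.argmax(r[p_min:p_max + 1]))

# Synthetic voiced-like frame (assumption): a pulse train with period 64
# samples, smoothed by a short decaying impulse response.
fs = 8000
pulses = np.zeros(480)
pulses[::64] = 1.0
h = 0.9 ** np.arange(20)
frame = np.convolve(pulses, h)[:len(pulses)]

P_hat = estimate_pitch_period(frame, fs)
print(P_hat, fs / P_hat)                       # estimated period and pitch in Hz
```

Restricting the search to [P_min, P_max] is what keeps the trivial maximum at lag 0 from winning.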

SLIDE 14

The voiced/unvoiced decision

$x[n]$ voiced: $r_{xx}[P] \approx r_{xx}[0]$
$x[n]$ unvoiced (white noise): $r_{xx}[n] \approx \delta[n]$, which means that $r_{xx}[P] \ll r_{xx}[0]$

So a reasonable V/UV decision is:

$\frac{r_{xx}[P]}{r_{xx}[0]} \ge \text{threshold}$: say the frame is voiced.
$\frac{r_{xx}[P]}{r_{xx}[0]} < \text{threshold}$: say the frame is unvoiced.

Setting threshold ≈ 0.25 works reasonably well.

[Figure: a voiced frame, where $x[n+P] \approx x[n]$, and an unvoiced frame, where $E[x[m]\,x[m-n]] \approx \delta[n]$]
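A minimal sketch of this voiced/unvoiced rule, assuming NumPy. The function name `is_voiced`, the search limits, and the two test frames are illustrative assumptions; only the ratio test and the 0.25 threshold come from the slide.

```python
import numpy as np

def is_voiced(frame, p_min, p_max, threshold=0.25):
    """Return (voiced, P): P is the best lag if voiced, else 0.

    Hypothetical helper implementing r_xx[P] / r_xx[0] >= threshold.
    """
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    if r[0] <= 0.0:                            # silent frame: call it unvoiced
        return False, 0
    P = p_min + int(np.argmax(r[p_min:p_max + 1]))
    voiced = bool(r[P] / r[0] >= threshold)
    return voiced, (P if voiced else 0)

# Illustrative frames (assumptions): a 125 Hz sinusoid sampled at 8 kHz,
# and Gaussian white noise of the same length.
n = np.arange(480)
voiced_frame = np.sin(2 * np.pi * n / 64)
noise_frame = np.random.default_rng(1).standard_normal(480)

print(is_voiced(voiced_frame, p_min=32, p_max=100))
print(is_voiced(noise_frame, p_min=32, p_max=100))
```

Returning P = 0 for unvoiced frames mirrors the "P=0" code word used by the synthesizer's control switch.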

SLIDE 16

Inter-frame interpolation of pitch contours

We don’t want the pitch period to change suddenly at frame boundaries; it sounds weird.

[Figure: pitch period vs. sample number $n$, with frame boundaries marked]

SLIDE 17

Inter-frame interpolation of pitch contours

Linear interpolation sounds much better. We can accomplish linear interpolation using a formula like

$$P[n] = (1-f)\,P_t + f\,P_{t+1}$$

Where

$P_t$ is the pitch period in frame $t$
$f = \frac{n - tS}{S}$ is how far sample $n$ is from the beginning of frame $t$
$S$ is the frame-skip.

[Figure: linearly interpolated pitch period vs. sample number $n$, with frame boundaries marked]
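The interpolation formula translates directly into vectorized NumPy. This is a sketch under assumptions: the name `interpolate_pitch` is hypothetical, and the code treats every frame as voiced (it does not handle the P = 0 code word).

```python
import numpy as np

def interpolate_pitch(P_frames, S):
    """Per-sample pitch period P[n] = (1 - f) P_t + f P_{t+1}.

    P_frames: one pitch period per frame (all assumed voiced here).
    S: frame skip in samples.
    Returns (len(P_frames) - 1) * S samples, since the last frame has no
    successor to interpolate toward.
    """
    P_frames = np.asarray(P_frames, dtype=float)
    n = np.arange((len(P_frames) - 1) * S)
    t = n // S                                 # frame index of sample n
    f = (n - t * S) / S                        # fractional position in frame t
    return (1.0 - f) * P_frames[t] + f * P_frames[t + 1]

# Tiny example with a made-up frame skip of 4 samples.
print(interpolate_pitch([80, 100, 90], S=4))
```

With S = 4 the contour climbs from 80 to 100 in steps of 5 and then descends toward 90 in steps of 2.5, with no jump at the frame boundary.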

SLIDE 18

Inter-frame interpolation of energy

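Linear interpolation is also useful for energy, EXCEPT: it sounds better if we interpolate log energy, not energy.

$$\log \mathrm{RMS}_t = \log \sqrt{\frac{1}{L} \sum_{n=tS}^{tS+L-1} x^2[n]}$$

Here $L$ is the number of samples in the analysis window of frame $t$, and $S$ is the frame-skip.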
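A minimal sketch of log-domain energy interpolation, assuming NumPy. The names `frame_log_rms` and `interpolate_gain`, the small floor added inside the logarithm, and the toy signal are assumptions made for illustration.

```python
import numpy as np

def frame_log_rms(x, S, L):
    """log RMS_t over the window n = tS .. tS + L - 1, for each full frame t."""
    x = np.asarray(x, dtype=float)
    n_frames = (len(x) - L) // S + 1
    out = np.empty(n_frames)
    for t in range(n_frames):
        seg = x[t * S : t * S + L]
        # 0.5 * log(mean square) = log(sqrt(mean square)).
        # The 1e-12 floor is an assumption that avoids log(0) on silence.
        out[t] = 0.5 * np.log(np.mean(seg ** 2) + 1e-12)
    return out

def interpolate_gain(log_rms, S):
    """Interpolate linearly in the log domain, then return G[n] = exp(log RMS[n])."""
    log_rms = np.asarray(log_rms, dtype=float)
    n = np.arange((len(log_rms) - 1) * S)
    t = n // S
    f = (n - t * S) / S
    return np.exp((1.0 - f) * log_rms[t] + f * log_rms[t + 1])

# Toy signal (assumption): one frame at amplitude 1, the next at amplitude 4.
x = np.concatenate([np.ones(4), 4.0 * np.ones(4)])
log_rms = frame_log_rms(x, S=4, L=4)
G = interpolate_gain(log_rms, S=4)
print(G)
```

Halfway between the two frames the gain is 2 (the geometric mean of 1 and 4) instead of the 2.5 that linear interpolation of the RMS itself would give.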

SLIDE 20

The LPC-10 speech synthesis model

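[Figure: block diagram of the LPC-10 synthesizer]

Excitation $e[n]$, chosen by a binary control switch: Voiced vs. Unvoiced
Voiced Speech, pitch period $P$: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
Unvoiced Speech: $e[n] \sim \mathcal{N}(0,1)$
Gain: $G = e^{\log \mathrm{RMS}}$
Vocal Tract: Modeled by an LPC synthesis filter, $H(e^{j\omega})$, whose output is the speech signal $s[n]$.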

SLIDE 21

Unvoiced speech: e[n]=white noise

Use zero-mean, unit-variance Gaussian white noise.

The choice to use “unvoiced speech” is communicated by the special code word “P=0”.

Image credit: By Morn - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24084756

SLIDE 22

Voiced speech: e[n]=pulse train

The basic idea:

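$$e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$$

Modification #1: in order for the RMS to equal 1.0, we need to scale each pulse by $\sqrt{P}$ (one pulse of energy $P$ every $P$ samples gives a mean square of 1):

$$e[n] = \sqrt{P} \sum_{p=-\infty}^{\infty} \delta[n - pP]$$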
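The unit-RMS scaling is easy to confirm numerically. A minimal sketch, assuming NumPy; the name `pulse_train` and the choice of P = 80 over 800 samples are illustrative, and the first pulse is placed at n = 0 for simplicity.

```python
import numpy as np

def pulse_train(P, num_samples):
    """Pulse train with one pulse of height sqrt(P) every P samples."""
    e = np.zeros(num_samples)
    e[::P] = np.sqrt(P)                        # pulses at n = 0, P, 2P, ...
    return e

e = pulse_train(P=80, num_samples=800)         # exactly 10 pitch periods
rms = np.sqrt(np.mean(e ** 2))
print(rms)
```

Over a whole number of pitch periods the RMS is 1 regardless of P, which keeps voiced and unvoiced excitation at the same level before the gain G is applied.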

SLIDE 23

Modification #2: the first pulse is not at n=0

Pitch period = 80 samples ⇒ first pulse in frame 31 can’t occur until the 70th sample of the frame

SLIDE 24

A mechanism for keeping track of pitch phase from one frame to the next

Start out, at the beginning of the speech, with a pitch phase equal to zero:

$$\phi[0] = 0$$

For every sample thereafter:
If the sample is unvoiced ($P[n]=0$), don’t increment the pitch phase
If the sample is voiced ($P[n]>0$), then increment the pitch phase:

$$\phi[n] = \phi[n-1] + \frac{2\pi}{P[n]}$$

Every time the phase passes a multiple of $2\pi$, output a pitch pulse:

$$e[n] = \begin{cases} \sqrt{P[n]} & \left\lfloor \dfrac{\phi[n]}{2\pi} \right\rfloor - \left\lfloor \dfrac{\phi[n-1]}{2\pi} \right\rfloor > 0 \\ 0 & \text{else} \end{cases}$$
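The phase-tracking rule can be sketched as a simple per-sample loop. This is an illustrative NumPy sketch, not the course's reference code: the name `pitch_phase_excitation` is hypothetical, and filling unvoiced samples with unit-variance Gaussian noise is an assumption borrowed from the unvoiced excitation model.

```python
import numpy as np

def pitch_phase_excitation(P, rng=None):
    """Generate e[n] from a per-sample pitch period P[n] (0 means unvoiced)."""
    rng = np.random.default_rng(0) if rng is None else rng
    P = np.asarray(P, dtype=float)
    e = np.zeros(len(P))
    two_pi = 2.0 * np.pi
    phase_prev = 0.0                           # phi[0] = 0
    for n in range(1, len(P)):
        if P[n] > 0:                           # voiced: advance the phase
            phase = phase_prev + two_pi / P[n]
            # Emit a pulse when the phase crosses a multiple of 2*pi.
            if np.floor(phase / two_pi) > np.floor(phase_prev / two_pi):
                e[n] = np.sqrt(P[n])
            phase_prev = phase
        else:                                  # unvoiced: phase is frozen
            e[n] = rng.standard_normal()
    return e

# Constant pitch period of 80 samples for 800 samples (illustrative values).
e = pitch_phase_excitation(np.full(800, 80.0))
pulse_positions = np.flatnonzero(e)
print(pulse_positions)
```

Because the phase is carried across frame boundaries, a pulse lands roughly every P samples even when a frame boundary falls in the middle of a pitch period. Floating-point rounding in the accumulated phase can shift an individual pulse by one sample.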

SLIDE 25

The pitch phase method: generate an excitation pulse whenever pitch phase crosses a $2\pi$ level

[Figure: phase $\phi[n]$ vs. sample number $n$, with the levels $2\pi$, $4\pi$, $6\pi$, $8\pi$ marked, and the resulting excitation $e[n]$]

SLIDE 27

Speech is predictable

Speech is not just white noise and pulse trains. In fact, each sample is highly predictable from previous samples:

$$x[n] \approx \sum_{m=1}^{10} \alpha_m x[n-m]$$

In fact, the pitch pulses are the only major exception to this predictability!

SLIDE 28

Linear predictive coding (LPC)

The LPC idea:

1. Model the excitation as error:

$$e[n] = x[n] - \sum_{m=1}^{10} \alpha_m x[n-m]$$

2. Force the coefficients $\alpha_m$ to explain as much as they can, so that $e[n]$ is as close to zero as possible.

[Figure: $e[n]$ and $x[n]$]

SLIDE 29

Linear predictive coding (LPC)

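$$\varepsilon = E\left[ e^2[n] \right] = E\left[ \left( x[n] - \sum_{i=1}^{10} \alpha_i x[n-i] \right)^2 \right]$$

$$\frac{\partial \varepsilon}{\partial \alpha_j} = -2 E\left[ x[n-j] \left( x[n] - \sum_{i=1}^{10} \alpha_i x[n-i] \right) \right]$$

Setting $\frac{\partial \varepsilon}{\partial \alpha_j} = 0$ gives

$$\underbrace{E\left[ x[n-j]\, x[n] \right]}_{R_{xx}[j]} = \sum_{i=1}^{10} \alpha_i \underbrace{E\left[ x[n-j]\, x[n-i] \right]}_{R_{xx}[|i-j|]}$$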

SLIDE 30

Linear predictive coding (LPC)

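So we have a set of linked equations, for $1 \le j \le 10$:

$$R_{xx}[j] = \sum_{i=1}^{10} \alpha_i R_{xx}[|i-j|]$$

We can write these 10 equations as a 10x10 matrix equation:

$$\vec{\gamma} = R\,\vec{\alpha}$$

…which immediately gives the solution:

$$\vec{\alpha} = R^{-1}\vec{\gamma}$$

…where

$$\vec{\gamma} = \begin{bmatrix} R_{xx}[1] \\ \vdots \\ R_{xx}[10] \end{bmatrix}, \quad R = \begin{bmatrix} R_{xx}[0] & R_{xx}[1] & \cdots \\ R_{xx}[1] & R_{xx}[0] & \cdots \\ \vdots & \vdots & R_{xx}[0] \end{bmatrix}, \quad \vec{\alpha} = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_{10} \end{bmatrix}$$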
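The matrix solution can be sketched with NumPy alone. The function name `lpc_coefficients` is hypothetical, and the sketch makes one explicit assumption: the expectation $R_{xx}[k]$ is replaced by the sample autocorrelation of a finite signal. The second-order test process is invented purely to check the result.

```python
import numpy as np

def lpc_coefficients(x, order=10):
    """Solve gamma = R alpha for the predictor coefficients alpha_1..alpha_order."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Sample autocorrelation r[k] = sum_n x[n] x[n-k], for k = 0..order.
    r = np.array([np.dot(x[k:], x[:N - k]) for k in range(order + 1)])
    idx = np.arange(order)
    R = r[np.abs(idx[:, None] - idx[None, :])]  # Toeplitz matrix, R[i, j] = r[|i-j|]
    gamma = r[1:order + 1]
    return np.linalg.solve(R, gamma)            # same as R^{-1} gamma, but more stable

# Check on a synthetic signal (assumption) with known predictor coefficients:
# x[n] = 1.3 x[n-1] - 0.6 x[n-2] + w[n], with w[n] white Gaussian noise.
rng = np.random.default_rng(0)
w = rng.standard_normal(20000)
x = np.zeros_like(w)
for n in range(len(w)):
    x[n] = w[n]
    if n >= 1:
        x[n] += 1.3 * x[n - 1]
    if n >= 2:
        x[n] -= 0.6 * x[n - 2]

alpha = lpc_coefficients(x, order=2)
print(alpha)                                    # close to [1.3, -0.6]
```

`np.linalg.solve` is used in place of an explicit matrix inverse. Because R is Toeplitz, a Levinson-Durbin style solver such as `scipy.linalg.solve_toeplitz` would be a faster alternative, if SciPy is available.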

SLIDE 32

Speech -> Excitation -> Speech

Now that we know how to find the LPC coefficients, we can imagine an end-to-end LPC analysis-by-synthesis:

$x[n]$ -> LPC analysis -> $e[n]$ -> Model excitation using pulse train and white noise -> $e[n]$ -> LPC synthesis -> $s[n]$

LPC analysis:

$$e[n] = x[n] - \sum_{m=1}^{10} \alpha_m x[n-m]$$

LPC synthesis:

$$s[n] = e[n] + \sum_{m=1}^{10} \alpha_m s[n-m]$$

SLIDE 33

The LPC Analysis Filter

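The LPC Analysis Filter is an all-zeros filter (FIR = finite impulse response):

$$e[n] = x[n] - \sum_{m=1}^{10} \alpha_m x[n-m] \quad \leftrightarrow \quad E(z) = A(z)\,X(z)$$

…where…

$$A(z) = 1 - \sum_{m=1}^{10} \alpha_m z^{-m}$$

It’s often useful to write this as a polynomial

$$A(z) = a_0 + a_1 z^{-1} + \cdots + a_{10} z^{-10}$$

where

$$a_j = \begin{cases} 1 & j = 0 \\ -\alpha_j & 1 \le j \le 10 \end{cases}$$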

SLIDE 34

The LPC Synthesis Filter

The LPC Synthesis Filter is an all-poles filter (IIR = infinite impulse response):

$$s[n] = e[n] + \sum_{m=1}^{10} \alpha_m s[n-m] \quad \leftrightarrow \quad S(z) = H(z)\,E(z)$$

…where…

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{m=1}^{10} \alpha_m z^{-m}}$$
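The analysis and synthesis filters are exact inverses of each other, which a short round trip can confirm. A minimal sketch assuming NumPy only; the function names and the second-order coefficients are illustrative, and both filters are assumed to start at rest (all samples before n = 0 are zero).

```python
import numpy as np

def lpc_analysis(x, alpha):
    """FIR analysis filter A(z): e[n] = x[n] - sum_m alpha_m x[n-m]."""
    a = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))  # a_0 .. a_p
    return np.convolve(x, a)[:len(x)]

def lpc_synthesis(e, alpha):
    """IIR synthesis filter 1/A(z): s[n] = e[n] + sum_m alpha_m s[n-m]."""
    alpha = np.asarray(alpha, dtype=float)
    s = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for m in range(1, len(alpha) + 1):
            if n - m >= 0:
                acc += alpha[m - 1] * s[n - m]
        s[n] = acc
    return s

# Round trip with made-up, stable coefficients.
alpha = [1.3, -0.6]
x = np.random.default_rng(0).standard_normal(200)
e = lpc_analysis(x, alpha)
s = lpc_synthesis(e, alpha)
print(np.allclose(s, x))
```

The explicit loop is written for clarity. With SciPy available, `scipy.signal.lfilter(a, [1.0], x)` and `scipy.signal.lfilter([1.0], a, e)` would be the usual way to apply A(z) and 1/A(z).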

SLIDE 35

Speech -> Excitation -> Speech

$x[n]$ -> $A(z)$ -> $e[n]$ -> Excitation Model -> $e[n]$ -> $\frac{1}{A(z)}$ -> $s[n]$

SLIDE 36

The Stability Problem

The analysis filter is guaranteed to be stable, as long as the coefficients are finite. Suppose you know that $|x[n]| \le X_{\max}$, and $|\alpha_m| \le \alpha_{\max}$. Then, even in the worst possible case, $|e[n]| \le (1 + 10\alpha_{\max})\,X_{\max}$, which is at most $11\,\alpha_{\max} X_{\max}$ whenever $\alpha_{\max} \ge 1$.

The synthesis filter has no such guarantee. For example, suppose $e[n]$ is just a delta function ($e[n] = \delta[n]$), and suppose all of the $\alpha_m = 0$ except the first one, $\alpha_1 = 1.1$. Then, for $n \ge 0$,

$$s[n] = \delta[n] + 1.1\,s[n-1] = 1.1^n$$

which overflows your 16-bit sample buffer after only 110 samples. Your output will be full of NaNs, and you’ll be saying “What happened…?”
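The growth of the unstable example can be reproduced in a few lines. A minimal NumPy sketch; the signal length of 200 samples and the variable names are arbitrary choices for illustration.

```python
import numpy as np

alpha_1 = 1.1                                  # the single nonzero coefficient
num_samples = 200                              # arbitrary length for the demo
s = np.zeros(num_samples)
for n in range(num_samples):
    e_n = 1.0 if n == 0 else 0.0               # e[n] = delta[n]
    s_prev = s[n - 1] if n > 0 else 0.0        # filter starts at rest
    s[n] = e_n + alpha_1 * s_prev              # s[n] = e[n] + 1.1 s[n-1]

int16_max = np.iinfo(np.int16).max             # 32767
first_overflow = int(np.argmax(s > int16_max)) # first n with s[n] > 32767
print(first_overflow)
```

Samples n = 0 through 109 still fit in 16 bits, so the first value that no longer fits is s[110], consistent with the 110 samples quoted above.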

SLIDE 37

How to Guarantee Stability

Fortunately, the LPC synthesis filter is causal, so it’s easy to guarantee stability. We just need to make sure that all of the poles have magnitude less than 1:

$$|r_k| < 1$$

We find the poles like this:

$$H(z) = \frac{1}{A(z)} = \frac{1}{\sum_{m=0}^{10} a_m z^{-m}} = \frac{1}{\prod_{k=1}^{10} \left( 1 - r_k z^{-1} \right)}$$

in other words,

$$r_k = \text{roots}(A(z))$$

…which you can do using np.roots, if you define the polynomial correctly. Then you just truncate the magnitude,

$$r_k \leftarrow \min\left( |r_k|,\ 0.999 \right) e^{j \angle r_k}$$

…and then use np.poly to convert back from roots to polynomial.
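The root-clipping recipe can be sketched as follows. The function name `stabilize` is hypothetical. "Defining the polynomial correctly" is interpreted here as passing $[a_0, a_1, \ldots, a_{10}]$ to `np.roots`, which treats the list as coefficients of descending powers of $z$, so its roots are the poles $r_k$.

```python
import numpy as np

def stabilize(alpha, max_radius=0.999):
    """Clip every pole of 1/A(z) to magnitude <= max_radius; return new alphas."""
    alpha = np.asarray(alpha, dtype=float)
    a = np.concatenate(([1.0], -alpha))        # a_0 = 1, a_m = -alpha_m
    # z^p A(z) = a_0 z^p + a_1 z^(p-1) + ... + a_p, whose roots are the poles.
    r = np.roots(a)
    magnitude = np.minimum(np.abs(r), max_radius)
    r_clipped = magnitude * np.exp(1j * np.angle(r))
    # np.poly rebuilds the monic polynomial [1, a_1, ..., a_p] from its roots.
    # Conjugate pairs stay conjugate after clipping, so the result is real.
    a_new = np.real(np.poly(r_clipped))
    return -a_new[1:]

print(stabilize([1.1]))          # the unstable one-pole example: pole moved to 0.999
print(stabilize([1.3, -0.6]))    # already stable: returned essentially unchanged
```

Clipping only the magnitude keeps each pole's angle, so each resonance stays at the same frequency while its bandwidth widens slightly.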