Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking


slide-1
SLIDE 1

Lecture 14: LPC speech synthesis and autocorrelation-based pitch tracking

ECE 417, Multimedia Signal Processing October 10, 2019

slide-2
SLIDE 2

Outline

  • The LPC-10 speech synthesis model
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
slide-3
SLIDE 3

The LPC-10 speech synthesis model

slide-4
SLIDE 4

The LPC-10 Speech Coder: Transmitted Parameters

Each frame is 54 bits, and is used to synthesize 22.5ms of speech. (54 bits/frame)/(0.0225 seconds/frame)=2400 bits/second

  • Pitch: 7 bits/frame (127 distinguishable non-zero pitch periods)
  • Energy: 5 bits/frame (32 levels, on a logRMS scale)
  • 10 linear predictive coefficients (LPC): 41 bits/frame
  • Synchronization: 1 bit/frame
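
The bit budget above can be checked with a few lines of arithmetic. This is only a sketch: the names `BITS_PER_FRAME`, `FRAME_SECONDS`, `pitch_codes` and `energy_levels` are illustrative and are not taken from the LPC-10 standard.

```python
# Bit allocation per LPC-10 frame, as listed on the slide.
BITS_PER_FRAME = {
    "pitch": 7,
    "energy": 5,
    "lpc": 41,
    "sync": 1,
}
FRAME_SECONDS = 0.0225  # each frame synthesizes 22.5 ms of speech

total_bits = sum(BITS_PER_FRAME.values())
bit_rate = total_bits / FRAME_SECONDS  # bits per second

# 7 bits give 2**7 = 128 code words. One of them (P=0) signals "unvoiced",
# which leaves 127 non-zero pitch periods.
pitch_codes = 2 ** BITS_PER_FRAME["pitch"] - 1
energy_levels = 2 ** BITS_PER_FRAME["energy"]

print(total_bits, round(bit_rate), pitch_codes, energy_levels)
```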
slide-5
SLIDE 5

The LPC-10 speech synthesis model

[Block diagram of the synthesizer; its labels are listed below]

  • Voiced speech, pitch period P: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
  • Unvoiced speech: $e[n] \sim \mathcal{N}(0, 1)$
  • Binary control switch: voiced (P>0) vs. unvoiced (P=0)
  • Gain: $G = e^{\log RMS}$
  • Vocal tract: modeled by an LPC synthesis filter $H(e^{j\omega})$, whose output is the speech signal $s[n]$

slide-6
SLIDE 6

Outline

  • The LPC-10 speech synthesis model
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
slide-7
SLIDE 7

Autocorrelation is maximum at n=0

$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-n]$$

slide-8
SLIDE 8

Autocorrelation is maximum at n=0

$$r_{xx}[n] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-n] = x[n] * x[-n] = \mathcal{F}^{-1}\left\{ |X(\omega)|^2 \right\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\, d\omega$$

Notice that, for n=0, this becomes just Parseval's theorem:

$$r_{xx}[0] = \sum_{m=-\infty}^{\infty} x^2[m] = \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(\omega)|^2\, d\omega$$

But since $|X(\omega)|^2$ is positive and real, any value of $e^{j\omega n}$ that is NOT positive and real will reduce the value of the integral!

$$r_{xx}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(\omega)|^2 e^{j\omega n}\, d\omega \le \frac{1}{2\pi} \int_{-\pi}^{\pi} |X(\omega)|^2\, d\omega = r_{xx}[0]$$

slide-9
SLIDE 9

Example of an autocorrelation function computed from file0.wav, "Four score and seven years ago…"
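
The two forms of the autocorrelation (the sum over products and the inverse transform of the power spectrum) can be compared numerically. The sketch below assumes NumPy and uses a synthetic noise frame rather than file0.wav, which is not available here; the helper name `autocorrelation` is illustrative.

```python
import numpy as np

def autocorrelation(x):
    """Return r_xx[n] for lags n = 0 .. len(x)-1 (finite-length sum)."""
    x = np.asarray(x, dtype=float)
    full = np.correlate(x, x, mode="full")  # lags -(N-1) .. (N-1)
    return full[len(x) - 1:]                # keep the non-negative lags

rng = np.random.default_rng(0)
x = rng.standard_normal(256)   # stand-in for one frame of speech samples
r = autocorrelation(x)

# Same result via the power spectrum: r_xx = inverse DFT of |X|^2.
# Zero-padding to 2N makes the circular correlation equal the linear one.
nfft = 2 * len(x)
X = np.fft.rfft(x, nfft)
r_fft = np.fft.irfft(np.abs(X) ** 2, nfft)[:len(x)]

print(np.argmax(r))            # lag at which r_xx is largest
print(np.allclose(r, r_fft))   # do the two computations agree?
```

For a real recording the frame would simply replace `x`; the maximum should still sit at lag 0, where the value equals the frame energy.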

slide-10
SLIDE 10

Autocorrelation of a periodic signal

Suppose x[n] is periodic, $x[n] = x[n-P]$. Then the autocorrelation is also periodic:

$$r_{xx}[P] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-P] = \sum_{m=-\infty}^{\infty} x^2[m] = r_{xx}[0]$$

slide-11
SLIDE 11

Autocorrelation of a periodic signal is periodic

Pitch period = 9ms = 99 samples

slide-12
SLIDE 12

Autocorrelation pitch tracking

  • Compute the autocorrelation
  • Find the pitch period:

$$P = \operatorname*{argmax}_{P_{MIN} \le m \le P_{MAX}} r_{xx}[m]$$

  • The search limits, $P_{MIN}$ and $P_{MAX}$, are important for good performance:
    • $P_{MIN}$ corresponds to a high woman's pitch, about $F_S/P_{MIN} \approx 250$ Hz
    • $P_{MAX}$ corresponds to a low man's pitch, about $F_S/P_{MAX} \approx 80$ Hz

slide-13
SLIDE 13

The LPC-10 speech synthesis model

[Block diagram of the synthesizer; its labels are listed below]

  • Voiced speech, pitch period P: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
  • Unvoiced speech: $e[n] \sim \mathcal{N}(0, 1)$
  • Binary control switch: voiced (P>0) vs. unvoiced (P=0)
  • Gain: $G = e^{\log RMS}$
  • Vocal tract: modeled by an LPC synthesis filter $H(e^{j\omega})$, whose output is the speech signal $s[n]$

slide-14
SLIDE 14

The voiced/unvoiced decision

  • $x[n]$ voiced ($x[n+P] \approx x[n]$): $r_{xx}[P] \approx r_{xx}[0]$
  • $x[n]$ unvoiced (white noise, $E\left[x[m]\, x[m-n]\right] \approx \delta[n]$): $r_{xx}[n] \approx \delta[n]$, which means that $r_{xx}[P] \ll r_{xx}[0]$

So a reasonable V/UV decision is:

  • $\frac{r_{xx}[P]}{r_{xx}[0]} \ge$ threshold: say the frame is voiced.
  • $\frac{r_{xx}[P]}{r_{xx}[0]} <$ threshold: say the frame is unvoiced.

Setting threshold ≈ 0.25 works reasonably well.
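
The pitch search and the voiced/unvoiced test can be combined in one small function. The following is a minimal sketch, assuming NumPy, an 11025 Hz sampling rate, and a frame longer than the longest candidate period; the function name `estimate_pitch` and its parameters are illustrative, and the test signals are synthetic.

```python
import numpy as np

def estimate_pitch(frame, fs, f_min=80.0, f_max=250.0, threshold=0.25):
    """Return the pitch period in samples, or 0 if the frame looks unvoiced."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:]  # r[k] for k >= 0
    if r[0] <= 0.0:
        return 0                          # silent frame: call it unvoiced
    p_min = int(fs / f_max)               # shortest period (highest pitch)
    p_max = min(int(fs / f_min), n - 1)   # longest period (lowest pitch)
    period = p_min + int(np.argmax(r[p_min:p_max + 1]))
    # Voiced/unvoiced decision: compare r[P] / r[0] against the threshold.
    return period if r[period] / r[0] >= threshold else 0

fs = 11025                                # assumed sampling rate in Hz
voiced = np.zeros(1024)
voiced[::99] = 1.0                        # pulse train with a 99-sample period
unvoiced = np.random.default_rng(0).standard_normal(1024)

print(estimate_pitch(voiced, fs), estimate_pitch(unvoiced, fs))
```

Returning 0 for unvoiced frames mirrors the "P=0" code word used by the synthesizer. On real speech, a finite frame makes the autocorrelation decay with lag, so the threshold and search limits may need tuning.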

slide-15
SLIDE 15

Outline

  • The LPC-10 speech synthesis model
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
slide-16
SLIDE 16

Inter-frame interpolation of pitch contours

We don't want the pitch period to change suddenly at frame boundaries; it sounds weird.

[Figure: pitch period vs. sample number n, with frame boundaries marked]

slide-17
SLIDE 17

Inter-frame interpolation of pitch contours

Linear interpolation sounds much better. We can accomplish linear interpolation using a formula like

$$P[n] = (1 - f) P_t + f P_{t+1}$$

where

  • $P_t$ is the pitch period in frame t
  • $f = \frac{n - tS}{S}$ is how far sample n is from the beginning of frame t, as a fraction of the frame skip
  • S is the frame-skip.

[Figure: pitch period vs. sample number n, with frame boundaries marked]
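
The interpolation formula translates directly into a loop over frames and samples. This is a sketch under the assumption that every frame is voiced (no P=0 entries); the name `interpolate_pitch` is illustrative.

```python
def interpolate_pitch(frame_pitch, frame_skip):
    """Linearly interpolate per-frame pitch periods to per-sample values.

    frame_pitch[t] is the pitch period P_t of frame t; frame_skip is S.
    Sample n in frame t gets (1 - f) * P_t + f * P_{t+1}, f = (n - t*S) / S.
    The last frame has no successor, so it is not expanded here.
    """
    per_sample = []
    for t in range(len(frame_pitch) - 1):
        for n in range(t * frame_skip, (t + 1) * frame_skip):
            f = (n - t * frame_skip) / frame_skip
            per_sample.append((1 - f) * frame_pitch[t] + f * frame_pitch[t + 1])
    return per_sample

contour = interpolate_pitch([80, 100, 90], frame_skip=4)
print(contour)
```

The same helper could be applied to per-frame log-RMS values, in which case the result would be exponentiated afterwards to recover a per-sample gain.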

slide-18
SLIDE 18

Inter-frame interpolation of energy

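Linear interpolation is also useful for energy, EXCEPT: it sounds better if we interpolate log energy, not energy. The log RMS of frame t is computed over a window of L samples starting at the beginning of the frame:

$$\log RMS_t = \log \sqrt{\frac{1}{L} \sum_{n=tS}^{tS+L-1} x^2[n]}$$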

slide-19
SLIDE 19

Outline

  • The LPC-10 speech synthesis model
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
slide-20
SLIDE 20

The LPC-10 speech synthesis model

[Block diagram of the synthesizer; its labels are listed below]

  • Voiced speech, pitch period P: $e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$
  • Unvoiced speech: $e[n] \sim \mathcal{N}(0, 1)$
  • Binary control switch: voiced vs. unvoiced
  • Gain: $G = e^{\log RMS}$
  • Vocal tract: modeled by an LPC synthesis filter $H(e^{j\omega})$, whose output is the speech signal $s[n]$

slide-21
SLIDE 21

Unvoiced speech: e[n]=white noise

  • Use zero-mean, unit-variance Gaussian white noise
  • The choice to use "unvoiced speech" is communicated by the special code word "P=0"

By Morn - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24084756

slide-22
SLIDE 22

Voiced speech: e[n]=pulse train

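  • The basic idea:

$$e[n] = \sum_{p=-\infty}^{\infty} \delta[n - pP]$$

  • Modification #1: in order for the RMS to equal 1.0, we need to scale each pulse by $\sqrt{P}$ (one pulse of height $\sqrt{P}$ every P samples has a mean square of $P/P = 1$):

$$e[n] = \sqrt{P} \sum_{p=-\infty}^{\infty} \delta[n - pP]$$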

slide-23
SLIDE 23

Modification #2: the first pulse is not at n=0

Pitch period = 80 samples ⇒ first pulse in frame 31 can't occur until the 70th sample of the frame

slide-24
SLIDE 24

A mechanism for keeping track of pitch phase from one frame to the next

  • Start out, at the beginning of the speech, with a pitch phase equal to zero, $\phi[0] = 0$
  • For every sample thereafter:
    • If the sample is unvoiced (P[n]=0), don't increment the pitch phase
    • If the sample is voiced (P[n]>0), then increment the pitch phase:

$$\phi[n] = \phi[n-1] + \frac{2\pi}{P[n]}$$

  • Every time the phase passes a multiple of $2\pi$, output a pitch pulse:

$$e[n] = \begin{cases} \sqrt{P[n]} & \left\lfloor \frac{\phi[n]}{2\pi} \right\rfloor - \left\lfloor \frac{\phi[n-1]}{2\pi} \right\rfloor > 0 \\ 0 & \text{else} \end{cases}$$

slide-25
SLIDE 25

The pitch phase method: generate an excitation pulse whenever pitch phase crosses a $2\pi$ level

[Figure: phase $\phi[n]$ vs. sample number n, with the levels $2\pi$, $4\pi$, $6\pi$, $8\pi$ marked, and the resulting excitation $e[n]$]
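
The pitch phase mechanism can be sketched as a single pass over a per-sample pitch contour. This is an illustrative implementation, not the course's reference code: the name `voiced_excitation` is hypothetical, the phase is taken to be zero just before the first sample, and each pulse is scaled by the square root of the current pitch period so that the RMS stays near 1.

```python
import math

def voiced_excitation(pitch_period):
    """Generate e[n] from a per-sample pitch-period contour P[n].

    P[n] == 0 marks an unvoiced sample: the phase is not incremented and no
    pulse is produced (the caller would substitute white noise there).
    """
    excitation = []
    phase = 0.0                            # pitch phase starts at zero
    two_pi = 2.0 * math.pi
    for p in pitch_period:
        previous = phase
        if p > 0:
            phase += two_pi / p
        # A pulse fires when the phase crosses a multiple of 2*pi.
        crossed = math.floor(phase / two_pi) > math.floor(previous / two_pi)
        excitation.append(math.sqrt(p) if crossed else 0.0)
    return excitation

e = voiced_excitation([80.0] * 8000)       # constant 80-sample pitch period
pulse_times = [n for n, v in enumerate(e) if v > 0]
rms = math.sqrt(sum(v * v for v in e) / len(e))
print(len(pulse_times), round(rms, 2))
```

Because the phase is carried across frame boundaries, the first pulse of a new frame lands wherever the accumulated phase dictates rather than at the frame's first sample. Floating-point rounding can shift an individual pulse by one sample.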

slide-26
SLIDE 26

Outline

  • The LPC-10 speech synthesis model
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
slide-27
SLIDE 27

Speech is predictable

  • Speech is not just white noise and pulse train. In fact, each sample is highly predictable from previous samples:

$$x[n] \approx \sum_{m=1}^{10} \alpha_m x[n-m]$$

  • In fact, the pitch pulses are the only major exception to this predictability!

slide-28
SLIDE 28

Linear predictive coding (LPC)

The LPC idea:

  1. Model the excitation as error:

$$e[n] = x[n] - \sum_{m=1}^{10} \alpha_m x[n-m]$$

  2. Force the coefficients $\alpha_m$ to explain as much as they can, so that $e[n]$ is as close to zero as possible.

[Figure: waveforms of $e[n]$ and $x[n]$]

slide-29
SLIDE 29

Linear predictive coding (LPC)

$$\epsilon = E\left[e^2[n]\right] = E\left[\left(x[n] - \sum_{i=1}^{10} \alpha_i x[n-i]\right)^2\right]$$

$$\frac{\partial \epsilon}{\partial \alpha_j} = -2 E\left[x[n-j]\left(x[n] - \sum_{i=1}^{10} \alpha_i x[n-i]\right)\right]$$

Setting $\frac{\partial \epsilon}{\partial \alpha_j} = 0$ gives

$$\underbrace{E\left[x[n-j]\, x[n]\right]}_{R_{xx}[j]} = \sum_{i=1}^{10} \alpha_i \underbrace{E\left[x[n-j]\, x[n-i]\right]}_{R_{xx}[|i-j|]}$$

slide-30
SLIDE 30

Linear predictive coding (LPC)

So we have a set of linked equations, for $1 \le j \le 10$:

$$R_{xx}[j] = \sum_{i=1}^{10} \alpha_i R_{xx}[|i-j|]$$

  • We can write these 10 equations as a 10x10 matrix equation: $\vec{\gamma} = R\vec{\alpha}$
  • …which immediately gives the solution: $\vec{\alpha} = R^{-1}\vec{\gamma}$
  • …where

$$\vec{\gamma} = \begin{bmatrix} R_{xx}[1] \\ \vdots \\ R_{xx}[10] \end{bmatrix}, \quad R = \begin{bmatrix} R_{xx}[0] & R_{xx}[1] & \cdots \\ R_{xx}[1] & R_{xx}[0] & \cdots \\ \vdots & \vdots & R_{xx}[0] \end{bmatrix}, \quad \vec{\alpha} = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_{10} \end{bmatrix}$$
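
The matrix equation can be solved numerically in a few lines. The sketch below assumes NumPy, estimates $R_{xx}$ from the frame itself, and uses `np.linalg.solve` in place of an explicit matrix inverse; the function name `lpc_coefficients` and the synthetic second-order test signal are illustrative, not part of the lecture.

```python
import numpy as np

def lpc_coefficients(x, order=10):
    """Solve R alpha = gamma for the predictor coefficients alpha_1..alpha_order."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation R_xx[0..order], estimated from the frame itself.
    r = np.array([np.dot(x[k:], x[:n - k]) for k in range(order + 1)])
    # Toeplitz matrix with entries R[i, j] = R_xx[|i - j|].
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags]
    gamma = r[1:order + 1]
    return np.linalg.solve(R, gamma)

# Synthetic test: a signal that really is predictable from two past samples.
rng = np.random.default_rng(0)
true_alpha = np.array([1.3, -0.6])
noise = rng.standard_normal(20000)
x = np.zeros_like(noise)
for n in range(len(x)):
    x[n] = noise[n]
    if n >= 1:
        x[n] += true_alpha[0] * x[n - 1]
    if n >= 2:
        x[n] += true_alpha[1] * x[n - 2]

alpha = lpc_coefficients(x, order=2)
print(np.round(alpha, 2))
```

Because R is a symmetric Toeplitz matrix, faster dedicated solvers exist, but a general linear solve is enough to illustrate the idea for a 10x10 system.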

slide-31
SLIDE 31

Outline

  • The LPC-10 speech synthesis model
  • Autocorrelation-based pitch tracking
  • Inter-frame interpolation of pitch and energy contours
  • The LPC-10 excitation model: white noise, pulse train
  • Linear predictive coding: how to find the coefficients
  • Linear predictive coding: how to make sure the coefficients are stable
slide-32
SLIDE 32

Speech -> Excitation -> Speech

Now that we know how to find the LPC coefficients, we can imagine an end-to-end LPC analysis-by-synthesis:

$x[n]$ → LPC analysis → $e[n]$ → Model excitation using pulse train and white noise → $e[n]$ → LPC synthesis → $s[n]$

LPC analysis:

$$e[n] = x[n] - \sum_{m=1}^{10} \alpha_m x[n-m]$$

LPC synthesis:

$$s[n] = e[n] + \sum_{m=1}^{10} \alpha_m s[n-m]$$

slide-33
SLIDE 33

The LPC Analysis Filter

The LPC Analysis Filter is an all-zeros filter (FIR = finite impulse response):

$$e[n] = x[n] - \sum_{m=1}^{10} \alpha_m x[n-m] \;\leftrightarrow\; E(z) = A(z) X(z)$$

…where…

$$A(z) = 1 - \sum_{m=1}^{10} \alpha_m z^{-m}$$

It's often useful to write this as a polynomial $A(z) = a_0 + a_1 z^{-1} + \cdots$ where

$$a_j = \begin{cases} 1 & j = 0 \\ -\alpha_j & 1 \le j \le 10 \end{cases}$$

slide-34
SLIDE 34

The LPC Synthesis Filter

The LPC Synthesis Filter is an all-poles filter (IIR = infinite impulse response):

$$s[n] = e[n] + \sum_{m=1}^{10} \alpha_m s[n-m] \;\leftrightarrow\; S(z) = H(z) E(z)$$

…where…

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{m=1}^{10} \alpha_m z^{-m}}$$

slide-35
SLIDE 35

Speech -> Excitation -> Speech

$x[n]$ → $A(z)$ → $e[n]$ → Excitation Model → $e[n]$ → $\frac{1}{A(z)}$ → $s[n]$

slide-36
SLIDE 36

The Stability Problem

  • The analysis filter is guaranteed to be stable, as long as the coefficients are finite. Suppose you know that $|x[n]| \le X_{MAX}$ and $|\alpha_m| \le \alpha_{MAX}$. Then, even in the worst possible case, $|e[n]| \le (1 + 10\alpha_{MAX}) X_{MAX}$, which is at most $11\alpha_{MAX} X_{MAX}$ whenever $\alpha_{MAX} \ge 1$.
  • The synthesis filter has no such guarantee. For example, suppose $e[n]$ is just a delta function ($e[n] = \delta[n]$), and suppose all of the $\alpha_m = 0$ except the first one, $\alpha_1 = 1.1$. Then $s[n] = \delta[n] + 1.1\, s[n-1] = (1.1)^n$ for $n \ge 0$, which overflows your 16-bit sample buffer after only 110 samples. Your output will be full of NaNs, and you'll be saying "What happened…?"
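
The runaway example can be reproduced by running the synthesis recursion directly. This is a plain-Python sketch; `synthesize`, `INT16_MAX` and `first_overflow` are illustrative names, and the recursion works in floating point rather than in an actual 16-bit buffer.

```python
def synthesize(excitation, alpha):
    """All-pole synthesis: s[n] = e[n] + sum_m alpha[m-1] * s[n-m]."""
    s = []
    for n, e_n in enumerate(excitation):
        value = e_n
        for m, a in enumerate(alpha, start=1):
            if n - m >= 0:
                value += a * s[n - m]
        s.append(value)
    return s

impulse = [1.0] + [0.0] * 199                 # e[n] = delta[n]
s = synthesize(impulse, [1.1] + [0.0] * 9)    # alpha_1 = 1.1, all others zero

INT16_MAX = 32767                             # largest 16-bit signed sample
first_overflow = next(n for n, v in enumerate(s) if v > INT16_MAX)
print(first_overflow)
```

The output grows as a geometric sequence, so the index printed here shows how quickly a single pole outside the unit circle exhausts the 16-bit range.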
slide-37
SLIDE 37

How to Guarantee Stability

Fortunately, the LPC synthesis filter is causal, so it's easy to guarantee stability. We just need to make sure that all of the poles have magnitude less than 1:

$$|r_k| < 1$$

We find the poles like this:

$$H(z) = \frac{1}{A(z)} = \frac{1}{\sum_{m=0}^{10} a_m z^{-m}} = \frac{1}{\prod_{k=1}^{10} \left(1 - r_k z^{-1}\right)}$$

in other words, $r_k = \text{roots}(A(z))$

…which you can do using np.roots, if you define the polynomial correctly. Then you just truncate the magnitude,

$$r_k \leftarrow \min\left(|r_k|, 0.999\right) e^{j \angle r_k}$$

…and then use np.poly to convert back from roots to polynomial.
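
One way to put the recipe together is sketched below, assuming NumPy. The function name `stabilize` and the parameter `max_radius` are illustrative. "Defining the polynomial correctly" is interpreted here as passing the coefficients $[a_0, a_1, \ldots, a_{10}]$ in that order, which corresponds to multiplying $A(z)$ by $z^{10}$ so that np.roots sees a polynomial in $z$ with the highest power first.

```python
import numpy as np

def stabilize(alpha, max_radius=0.999):
    """Return predictor coefficients whose synthesis poles lie inside the unit circle."""
    # A(z) = 1 - sum_m alpha_m z^{-m}, i.e. a = [1, -alpha_1, ..., -alpha_order].
    a = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))
    poles = np.roots(a)                          # the r_k
    radius = np.minimum(np.abs(poles), max_radius)
    poles = radius * np.exp(1j * np.angle(poles))
    a_stable = np.real(np.poly(poles))           # back to [1, a_1, ..., a_order]
    return -a_stable[1:]                         # alpha_m = -a_m

unstable = [1.1] + [0.0] * 9                     # the runaway example: pole at 1.1
stable = stabilize(unstable)
print(np.round(stable, 3))
```

Poles that are already inside the radius limit keep their magnitude, so a well-behaved set of coefficients should pass through essentially unchanged apart from rounding.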