


SLIDE 1

9/4/19

Speech

Hynek Hermansky
Electrical and Computer Engineering
Hackerman 324Fp

[Diagram: message → sound → message?]

P(wolf | sound) ≈ P(sound | wolf) x P(wolf)
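
The relation above can be worked through numerically. The sketch below is a minimal illustration for the two-class case (wolf / no wolf); the function name and all probabilities are hypothetical and chosen only for the example. Dividing by the sum of both joint terms supplies the normalization that the "≈" on the slide leaves out.

```python
def posterior_wolf(p_sound_given_wolf, p_sound_given_no_wolf, p_wolf):
    """Return P(wolf | sound) for a two-class problem using Bayes' rule."""
    joint_wolf = p_sound_given_wolf * p_wolf
    joint_no_wolf = p_sound_given_no_wolf * (1.0 - p_wolf)
    # Normalize so that P(wolf | sound) + P(no wolf | sound) = 1.
    return joint_wolf / (joint_wolf + joint_no_wolf)


# Hypothetical numbers: wolves are rare (5%), but the sound is
# eight times more likely when a wolf is present.
p = posterior_wolf(p_sound_given_wolf=0.8,
                   p_sound_given_no_wolf=0.1,
                   p_wolf=0.05)
print(f"P(wolf | sound) = {p:.3f}")
```

Even a sound that strongly suggests a wolf yields a modest posterior when the prior P(wolf) is small, which is why both factors appear in the formula.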

SLIDE 2

[Figure: P(sound | wolf) for the classes "wolf" and "no wolf", shown over two dimensions of the sound: loudness and timbre (sound "color")]

More dimensions of the sound – better chance to recognize it
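
One way to see why more dimensions help is a toy calculation. The sketch below rests on assumptions that are not in the slides: two equally likely classes, each dimension an independent unit-variance Gaussian, and class means that differ by the same amount (`separation`, a hypothetical parameter) in every dimension. Under those assumptions the error of the optimal classifier has a closed form and falls as dimensions are added.

```python
import math


def bayes_error(n_dims, separation=1.0):
    """Error rate of the optimal classifier for two equally likely classes.

    Assumes independent unit-variance Gaussian features whose class
    means differ by `separation` in each of the n_dims dimensions.
    """
    # Euclidean distance between the two class means.
    distance = math.sqrt(n_dims) * separation
    # Error = Phi(-distance / 2), written with the complementary error function.
    return 0.5 * math.erfc(distance / (2.0 * math.sqrt(2.0)))


for n in (1, 2, 4, 8):
    print(n, round(bayes_error(n), 4))
```

The independence assumption matters: dimensions that merely duplicate each other would not reduce the error.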

Evolution of hearing

Pristerodon → Homo sapiens: 200 000 000 years

Driven by: environment (survival)

We hear to survive

... sensory neurons are adapted to the statistical properties of the signals to which they are exposed.

Simoncelli and Olshausen

Evolution of speech

200 000 years

Driven by: hearing (communication)

Human speech evolved to fit properties of human hearing

We speak to hear

We speak in order to be heard and need to be heard in order to be understood.

Jakobson and Waugh p.95

SLIDE 3

Human vocal tract

a means of generating many different sounds (many dimensions)

[Figure labels: nasal cavity, mouth, teeth, lips, tongue, lungs, velum, larynx]

Functions: breathing, eating, biting ... speaking?

When more than one signal is available (e.g., audio x1 and visual x2), and the signals are conditionally independent given the object o:

P(o | x1, x2) = P(x1 | o) P(x2 | o) P(o) / P(x1, x2)
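
The two-signal rule can be sketched as a product of per-signal likelihoods followed by normalization over the candidate objects. The function name `fuse` and all numbers below are hypothetical; the sketch assumes the signals are conditionally independent given the object, and it obtains the denominator by summing the numerators over all objects.

```python
def fuse(prior, likelihoods_per_signal):
    """Combine several signals into a posterior over objects.

    prior: dict mapping object -> P(o)
    likelihoods_per_signal: list of dicts, each mapping object -> P(x_i | o)
    """
    scores = dict(prior)
    for likelihood in likelihoods_per_signal:
        for obj in scores:
            scores[obj] *= likelihood[obj]
    # The sum over objects plays the role of the denominator.
    total = sum(scores.values())
    return {obj: score / total for obj, score in scores.items()}


# Hypothetical numbers: each cue alone is only mildly informative.
prior = {"wolf": 0.5, "no wolf": 0.5}
audio = {"wolf": 0.6, "no wolf": 0.4}
visual = {"wolf": 0.7, "no wolf": 0.3}

print(fuse(prior, [audio]))          # audio alone
print(fuse(prior, [audio, visual]))  # audio and visual together
```

With these numbers the combined posterior for "wolf" is higher than either cue gives alone, because the two weak pieces of evidence point the same way.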

[Diagram: message → word → message]

P(wolf | word) ≈ P(word | wolf) x P(wolf)

SLIDE 4

McGurk effect: acoustic /ba/ combined with visual /ga/ is perceived as /da/ or /tha/

HEARING

SLIDE 5

Physiology of Hearing

[Figure labels: outer ear, eardrum, middle ear (hammer, anvil, stirrup), oval window, round window, inner ear (basilar membrane, tectorial membrane, hairs), to higher processing levels]

SLIDE 6

Basilar membrane as a mechanical frequency analyzer

[Figure labels: stiff basal end (0.05 mm), pliable apical end (0.5 mm), 500 Hz, 100 Hz]

basilar membrane movements => bending of hair cells => electrical pulses

inner hair cells: ~40 hairs/cell
outer hair cells: ~140 hairs/cell

[Figure labels: auditory nerve fibers, tectorial membrane, basilar membrane, tunnel of Corti]

  • inner hair cells: firmly connected only to the basilar membrane; convey information
  • outer hair cells: firmly connected to both the tectorial and the basilar membranes; govern cochlear mechanics (cochlear amplifier, positive feedback)

[Figure labels: inner ear, middle ear, outer ear, organ of Corti]
SLIDE 7

[Figure: auditory pathway from the sensory organ to the cortex, with bottom-up and top-down connections. Input: speech signal. Output: message? who? where from?]

sensory organ: ~100,000 spiking neurons, inter-spike interval ~1 ms
cortex: ~100,000,000 neurons, up to 10,000,000 active in a given task, inter-spike interval ~100 ms

  • massive increase in number of neurons from lower processing levels to cortex
  • decrease in average spiking rates from periphery to cortex
  • spikes in cortex are sparse (< 5% of cortical neurons active at any moment)

Hromadka et al PLOS Biology 2008
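
The order-of-magnitude figures on this slide can be combined in a short calculation. The sketch below treats the rounded numbers as exact and takes the firing rate as the reciprocal of the inter-spike interval; both are simplifying assumptions, and "up to 10,000,000 active" is used as an upper bound for the cortex.

```python
# Rounded figures from the slide (orders of magnitude only).
periphery_neurons = 100_000
periphery_isi_s = 1e-3        # ~1 ms between spikes

cortex_active_neurons = 10_000_000
cortex_isi_s = 100e-3         # ~100 ms between spikes


def firing_rate_hz(inter_spike_interval_s):
    """Average firing rate implied by an inter-spike interval."""
    return 1.0 / inter_spike_interval_s


def total_spikes_per_s(n_neurons, inter_spike_interval_s):
    """Aggregate spike rate of a population firing at the same average rate."""
    return n_neurons * firing_rate_hz(inter_spike_interval_s)


periphery_total = total_spikes_per_s(periphery_neurons, periphery_isi_s)
cortex_total = total_spikes_per_s(cortex_active_neurons, cortex_isi_s)
print(f"periphery: {periphery_total:.1e} spikes/s")
print(f"cortex:    {cortex_total:.1e} spikes/s")
```

Under these assumptions both totals are on the order of 10^8 spikes per second: the cortex trades a hundredfold lower rate per neuron for a hundredfold larger active population.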

SLIDE 8

TONOTOPY

  • different frequencies excite different parts of the cochlea (base: high frequencies, apex: low frequencies)
  • different frequencies excite different parts of the cortex

[Figure labels: BASE, APEX, processing stages]
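
The slides give no formula for the place-to-frequency map of the cochlea. As an illustration only, the sketch below swaps in Greenwood's function, a commonly used empirical fit for the human cochlea; the constants (165.4, 2.1, 0.88) come from that fit, not from this lecture, and the function name is hypothetical.

```python
def greenwood_frequency_hz(position_from_apex):
    """Characteristic frequency at a relative position along the cochlea.

    position_from_apex: 0.0 at the apex, 1.0 at the base.
    Constants are Greenwood's fit for the human cochlea (an assumption here).
    """
    if not 0.0 <= position_from_apex <= 1.0:
        raise ValueError("position must lie between 0 (apex) and 1 (base)")
    return 165.4 * (10.0 ** (2.1 * position_from_apex) - 0.88)


# Low frequencies map to the apex, high frequencies to the base.
for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:.2f}  ->  {greenwood_frequency_hz(x):8.0f} Hz")
```

The map is roughly logarithmic over most of its length, so equal distances along the membrane correspond to roughly equal frequency ratios.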

Sensitivity of hearing

SLIDE 9

[Figure labels: threshold, target, masker, masked threshold]

Simultaneous masking

Frequency selectivity of hearing (Critical bands of hearing)
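
Critical bands are often expressed on the Bark scale, where one Bark corresponds to roughly one critical band. The slides do not specify a conversion; the sketch below uses Traunmüller's approximation as an assumption, and the function name is hypothetical.

```python
def hz_to_bark(frequency_hz):
    """Approximate critical-band rate in Bark (Traunmueller's formula).

    This particular approximation is an assumption; the slides do not
    state which critical-band scale is meant.
    """
    return 26.81 * frequency_hz / (1960.0 + frequency_hz) - 0.53


# The scale is nearly linear at low frequencies and compressive above.
for f in (100, 500, 1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz  ->  {hz_to_bark(f):5.2f} Bark")
```

A masker raises the threshold of a target mainly when both fall within about one Bark of each other, which is the sense in which hearing is frequency selective.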


SLIDE 10

SPEAKING

[Figure labels: nasal cavity, mouth, teeth, lips, tongue, lungs, velum, larynx]

Functions: breathing, eating, biting ... speaking?

SLIDE 11

motor control → critical elements (tongue, lips, velum) → shape of the whole vocal tract → spectrum of speech signal

(redundant contributions of movements of critical elements in different frequency bands)

INFORMATION ABOUT TRACT SHAPES DISTRIBUTED IN FREQUENCY

intended speech sounds → sluggishness of vocal organs → produced speech sounds

movements of vocal organs are rather sluggish

INFORMATION ABOUT TRACT SHAPES DISTRIBUTED IN TIME

[Figure from Shri Narayanan]
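
How sluggishness spreads information in time can be illustrated with a toy model. The sketch below treats the vocal organs as a first-order low-pass filter that moves only a fraction `alpha` of the way toward each intended target per time step; the model, the parameter `alpha`, and the target sequence are all hypothetical and stand in for real articulator dynamics.

```python
def sluggish(targets, alpha=0.3):
    """Smooth a sequence of intended targets with a first-order low-pass.

    At each step the state moves a fraction `alpha` of the remaining
    distance toward the current target.
    """
    produced = []
    state = targets[0]
    for target in targets:
        state += alpha * (target - state)
        produced.append(state)
    return produced


# Three intended sounds, five time steps each: low, high, low.
intended = [0.0] * 5 + [1.0] * 5 + [0.0] * 5
produced = sluggish(intended)

for step, (i, p) in enumerate(zip(intended, produced)):
    print(f"{step:2d}  intended={i:.1f}  produced={p:.3f}")
```

The trajectory never reaches the middle target and is still well above zero when the third sound begins, so evidence about the middle sound is present in the time span of its neighbor.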

SLIDE 12

Linear model of speech production (Chiba and Kajiyama 1942)

source → filter → filtered source signal

Carrier nature of speech (Dudley 1940)

message (in movements of vocal tract) → modulator ← carrier (voiced or unvoiced, to make the tract movements audible)

modulator → message-modulated carrier
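
The source-filter idea can be sketched with a few lines of signal processing. The example below is a toy under stated assumptions, not a model taken from the slides: the voiced source is an impulse train at 100 Hz, the vocal tract is reduced to a single two-pole resonance at 500 Hz, and the sampling rate, bandwidth, and all function names are hypothetical.

```python
import math


def impulse_train(n_samples, period):
    """Voiced source: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole resonance: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        y0 = sample + a1 * y1 + a2 * y2
        y.append(y0)
        y2, y1 = y1, y0
    return y


def magnitude_at(x, freq_hz, fs):
    """Magnitude of the discrete Fourier sum of x at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(v * math.cos(w * n) for n, v in enumerate(x))
    im = sum(v * math.sin(w * n) for n, v in enumerate(x))
    return math.hypot(re, im)


fs = 8000                                      # samples per second
source = impulse_train(fs // 2, period=80)     # 100 Hz pitch, 0.5 s
speech = resonator(source, freq_hz=500.0, bandwidth_hz=100.0, fs=fs)

for f in (500.0, 1500.0):
    print(f, magnitude_at(source, f, fs), magnitude_at(speech, f, fs))
```

The source has equal energy at all of its harmonics; after filtering, the harmonic at the resonance is much stronger than the one at 1500 Hz. In the carrier view, the source is the carrier and the slowly changing filter carries the message.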

SLIDE 13

[Figure: vocal source → vocal tract → speech signal; the signal combines vocal tract shape contributions and vocal source contributions]

[Figure labels: nasal cavity, mouth, teeth, lips, tongue, lungs, velum, larynx]

  • every change of the tract shape shows at all frequencies of the speech spectrum
  • tract shape changes do not happen very fast

Redundant spread of information

[Figure: ear (BASE, APEX) → brain, via auditory nerve, cochlear nucleus, superior olive, inferior colliculus, medial geniculate body]

  • frequency selective (about 20 bands)
  • sluggish (tenths of seconds)
SLIDE 14

PRODUCTION: message (< 50 bps) → coding (introduce redundancies in frequency and in time) → speech signal (> 50 kbps)

TRANSMISSION: speech signal + noise

PERCEPTION: decoding (use redundancies for reliable extraction of the message) → message (< 50 bps)

redundancy in frequency
  • production: tract acoustics distributes the information to all frequencies of the speech spectrum
  • perception: hearing selectivity allows for decoding the information in separate frequency bands

redundancy in time
  • production: tract sluggishness (coarticulation) distributes information about each speech sound in time
  • perception: temporal sluggishness of hearing collects the information distributed in time
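
Why redundancy makes extraction of the message reliable can be shown with the simplest possible code, offered here only as an analogy: each message bit is sent several times over a noisy channel and the receiver takes a majority vote. Speech does not use a repetition code; the flip probability and the number of copies below are hypothetical.

```python
from math import comb


def majority_error(p_flip, n_copies):
    """Probability that a majority vote over noisy copies is wrong.

    p_flip: probability that any single copy is corrupted
    n_copies: odd number of independent copies of the same bit
    """
    if n_copies % 2 == 0:
        raise ValueError("use an odd number of copies to avoid ties")
    needed = n_copies // 2 + 1  # corrupted copies needed to flip the vote
    return sum(
        comb(n_copies, k) * p_flip**k * (1.0 - p_flip) ** (n_copies - k)
        for k in range(needed, n_copies + 1)
    )


# Error after decoding drops as redundancy grows.
for n in (1, 3, 5, 9):
    print(n, round(majority_error(0.2, n), 5))
```

The price is rate: more transmitted bits per message bit. The same trade appears on the slide, where a message of under 50 bps is carried by a signal of over 50 kbps, an expansion of at least a thousandfold.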

PRODUCTION

intended sound sequence → vocal tract physiology → movements of vocal tract → vocal tract acoustics → speech signal

  • redundancies in time through sluggishness of the vocal tract
  • redundancies in frequency through the effect of tract movements on the speech spectrum
  • representation of speech sounds in frequency and in time

[Figure axes: time, frequency]

PERCEPTION

corrupted speech signal → formation of spectral streams → cortical time-frequency filters → representations of sound sequences in individual streams → fusion of multiple streams → perceived sound sequence

metacognitive performance monitoring

[Figure axes: time, frequency]

periphery: ~100 000 active neurons, ~1000 Hz firing rates
higher perceptual levels: ~10 000 000 active neurons, ~10 Hz firing rates