Mobile Speech Processing
David Huggins-Daines
Language Technologies Institute
Carnegie Mellon University
September 19, 2008
Outline
◮ Mobile Devices
◮ What are they?
◮ What would we like to do with them?
◮ Mobile Speech Applications
◮ Mobile Speech Technologies
◮ Current Research
Mobile Devices
◮ What is a “mobile device”?
◮ A hammer is a device, and you can carry it around with you!
◮ But no, that’s not what we mean here
Mobile Devices
◮ What is a “mobile device”?
◮ A device that goes everywhere with you
◮ ... which provides some or all of the functions of a computer
◮ ... and some things it doesn’t, such as a cell phone or GPS.
Speech on Mobile Devices
◮ Why do we care about speech processing on these devices?
◮ Because they are the future of computers
◮ Because speech is actually a useful way to interact with them, unlike full-sized computers
◮ What kind of speech processing do we care about?
◮ Speech coding to improve voice quality for cellular and VoIP
◮ Speech recognition for hands-free input to apps
◮ Speech synthesis for eyes-free output from apps
◮ In some cases, speech is a natural and convenient modality
◮ In other cases, it is a necessity (e.g. in-car navigation)
Speech on Mobiles vs. Mobile Speech
◮ None of this necessarily implies doing actual speech processing (aside from coding) on the device itself
◮ Telephone dialog systems are “mobile” by any definition
◮ Let’s Go - bus scheduling information
◮ HealthLine - medical information for rural health workers
◮ But all synthesis and recognition is done on a server
◮ This can be a good thing, especially in the latter case
◮ You can’t run a speech recognizer on a Motofone or a Nokia 1010
◮ Speech processing on the device is useful for:
◮ Multimodal applications
◮ Disconnected applications
◮ Access to local data
Some Mobile Speech Applications
◮ GPS navigation
◮ Older systems used a small number of recorded prompts (“turn left”, “100 metres”, etc.)
◮ More recently, TTS has been used to speak street names
◮ Even more recently, ASR is used for input
◮ Voice dialing
◮ Old systems used DTW and required training (a rough sketch of DTW follows this slide)
◮ Newer ones build models from your address book
◮ Cactus for iPhone - uses CMU Flite and Sphinx
◮ Voice-driven search (local, web, etc)
◮ Nuance, Vlingo, TellMe, Microsoft are all doing this
◮ Voice-to-text
◮ Typically server-based, requires a data connection
◮ “on-line”, ASR-based: Vlingo, Nuance
◮ “off-line”, human-assisted: SpinVox, Jott, ReQall
◮ Speech to Speech Translation
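The DTW-based voice dialers mentioned above matched an incoming utterance against stored per-contact templates. As a rough illustration of the technique (not the code of any particular product; all names below are made up), a minimal dynamic time warping distance might look like this:

    # Illustrative dynamic time warping (DTW) between two feature sequences,
    # the matching technique used by old template-based voice dialers.
    # Real systems add path constraints and pruning.
    import numpy as np

    def dtw_distance(template, query):
        """Cumulative DTW alignment cost between two (T, D) feature arrays."""
        n, m = len(template), len(query)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(template[i - 1] - query[j - 1])  # local distance
                # extend the cheapest of the three allowed predecessor paths
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # A dialer would enroll one template per contact and pick the closest one:
    # best = min(templates, key=lambda name: dtw_distance(templates[name], query))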
Mobile Speech Technologies
◮ Speech Coding
◮ Efficient digital representation of speech signals
◮ Fundamental for 2G and 3G cell networks and VoIP
◮ Speech Synthesis
◮ Speech output for commands, directions
◮ Text-to-speech for messages, books, other content
◮ Speech Recognition
◮ Command and control (“voice control”)
◮ Dictation (speech-to-text for e-mail, SMS)
◮ Search input (questions, keywords)
◮ Dialogue
Speech Coding
◮ A fairly mature technology (started in the 1960s)
◮ Early versions were mostly for military applications
◮ Digital cell phone networks changed this dramatically
◮ Almost universally based on linear prediction and the source-filter model.
◮ Each sample is a weighted sum of the P previous samples.
◮ The weights are linear prediction coefficients (LPCs), calculated to minimize mean squared error (a small sketch follows this slide).
◮ Conveniently enough, this is actually a good model of the frequency response of the vocal tract (given enough LPCs).
◮ An “excitation function” models the glottal source.
◮ Everything else is just tweaking
◮ Better excitation functions (CELP)
◮ Variable bit rates (AMR)
◮ Compression tricks (VAD + comfort noise)
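As a concrete illustration of the linear prediction step above, here is a minimal sketch of the standard autocorrelation (Levinson-Durbin) method for computing LPCs from one windowed speech frame. This is the textbook recipe, not the exact code of any codec, and it omits pre-emphasis, windowing, and quantization.

    # Minimal LPC sketch: find P coefficients so each sample is predicted from
    # the P previous ones with minimum mean squared error, via Levinson-Durbin.
    import numpy as np

    def lpc(frame, order):
        """Return coefficients a[0..order] (with a[0] = 1) for one windowed frame."""
        # Autocorrelation r[0..order] of the frame
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                      for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            # Reflection coefficient for this model order
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] += k * a[i - 1::-1][:i]   # update a[1..i]
            err *= (1.0 - k * k)                 # remaining prediction error
        return a  # predictor: s[n] is approximated by -sum_k a[k] * s[n-k]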
Mobile Speech Synthesis
◮ Two traditional categories, one new one
◮ Synthesis by rule, e.g. formant synthesis
◮ Concatenative synthesis, e.g. diphone, unit selection
◮ Statistical-parametric synthesis (“HMM synthesis”)
◮ We have had very efficient (often hardware-based) implementations of TTS for decades
◮ They sound terrible (but are often quite intelligible)
◮ The challenges for mobile devices are:
◮ Achieving natural-sounding speech
◮ Dealing with very large, irregular vocabularies
◮ Dealing with raw and diverse input text
Mobile Speech Synthesis
◮ Unit selection currently gives the most natural output
◮ But it is very ill-suited to mobile implementations
◮ Best systems use gigabytes of speech data
◮ But, you say... I have an 8GB microSD card in my phone!
◮ Search time: finding the right units of speech
◮ Access time: loading them from the storage medium
◮ Signal generation can also be time-consuming if not efficiently implemented
◮ Some ways to improve efficiency:
◮ Compress the speech database
◮ Prune the speech database by discarding units that are infrequently or never used (a pruning sketch follows this slide)
◮ Approximate search algorithms (much like ASR)
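One way to make the pruning idea concrete: synthesize a large text corpus offline, log which unit instances actually get selected, and keep only those, plus a floor per unit type so coverage is preserved. The data structures and names below are hypothetical; real voices store much richer join and target cost information.

    # Hypothetical offline pruning pass over a unit-selection database.
    from collections import Counter

    def prune_unit_database(units, selection_log, min_per_type=2):
        """units: iterable of (unit_type, unit_id, waveform) tuples.
        selection_log: unit_ids chosen while synthesizing a large corpus."""
        usage = Counter(selection_log)
        kept, per_type = [], {}
        # Consider the most frequently used instances first
        for unit_type, unit_id, waveform in sorted(
                units, key=lambda u: usage[u[1]], reverse=True):
            n = per_type.get(unit_type, 0)
            # Keep instances that were ever used, plus a minimum per type
            if usage[unit_id] > 0 or n < min_per_type:
                kept.append((unit_type, unit_id, waveform))
                per_type[unit_type] = n + 1
        return kept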
Mobile Speech Synthesis
◮ Statistical-parametric synthesis is quite promising
◮ Models are quite small (1-2MB)
◮ The search problem is nonexistent
◮ Parameter and waveform generation are currently the most time-consuming parts
◮ Requires higher dimensionality parameterizations than concatenative synthesis
◮ Output parameters are smoothed using an iterative algorithm (similar to EM)
◮ Waveform generation from mcep is much slower than LPC
◮ Dictionary compression and text normalization
◮ The dictionary can be compressed by building letter-to-sound models and listing only the exceptions (a sketch follows this slide)
◮ Efficient finite-state transducer representations can be created for pronunciation and text processing rules
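A sketch of the exception-listing idea for dictionary compression: run the letter-to-sound model over every entry and keep only the words it gets wrong. `predict_pronunciation` is a stand-in for whatever LTS model (decision trees, an FST, etc.) the system actually uses.

    # Dictionary compression by exception listing (illustrative).
    def compress_dictionary(full_dict, predict_pronunciation):
        exceptions = {}
        for word, phones in full_dict.items():
            if predict_pronunciation(word) != phones:
                exceptions[word] = phones   # the LTS model fails here, so store it
        return exceptions

    def lookup(word, exceptions, predict_pronunciation):
        # Runtime lookup: small exception list first, then fall back to LTS rules
        return exceptions.get(word) or predict_pronunciation(word)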
Mobile Speech Recognition
◮ Challenges for mobile devices are:
◮ Variable and noisy acoustic environments
◮ Large vocabularies
◮ Open domain dictation input
◮ As with speech synthesis, simple ASR is not very resource-intensive, although it has not been as widely implemented
◮ Even with large vocabularies, ASR can be done efficiently
◮ The most important factor is the complexity of the grammar
◮ Commercial systems achieve impressive performance based on very constrained grammars
◮ Systems tend to be extensively tuned for a given application
Mobile Speech Recognition: Acoustic Issues
◮ How do you talk to a device?
◮ This depends on the application, user, and environment
◮ Acoustic feature vectors can look very different
◮ Microphones may not be optimized for all positions
◮ Noisy environments
◮ Mobile devices are more likely to be used in noisy environments
◮ Worse, they are more likely to be used in difficult ones
◮ Non-stationary noise, crosstalk, human babble
◮ Array processing is not well suited to handheld devices
◮ On the bright side:
◮ Usually a mobile device has only one user
◮ Speaker adaptation can improve acoustic modeling (a small adaptation sketch follows this slide)
◮ Speaker identification can be used to filter out babble and crosstalk
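As one standard example of speaker adaptation (a textbook recipe, not necessarily what any particular mobile recognizer does), MAP adaptation of Gaussian means pulls each mean toward the user's data in proportion to how many frames aligned to that Gaussian.

    # MAP adaptation of Gaussian means (illustrative sketch).
    import numpy as np

    def map_adapt_means(prior_means, frame_sums, frame_counts, tau=10.0):
        """prior_means:  (G, D) speaker-independent means.
        frame_sums:   (G, D) per-Gaussian sum of the user's feature vectors.
        frame_counts: (G,)   per-Gaussian number of aligned frames.
        tau: how much data it takes to move away from the prior."""
        return (tau * prior_means + frame_sums) / (tau + frame_counts[:, None])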
Mobile Speech Recognition: Computational Issues
◮ Acoustic feature extraction
◮ Efficient, as long as it is implemented properly
◮ Fixed-point arithmetic, data-parallel processing (a fixed-point sketch follows this slide)
◮ Most processing time is consumed by, in roughly equal amounts:
◮ Acoustic model evaluation
◮ Search (hypothesis generation and evaluation)
◮ These can be made computationally efficient but must also be made memory efficient, search in particular.
◮ This necessarily involves tuning heuristics because a complete solution is intractable.
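To make the fixed-point point concrete, here is an illustrative Q15 sketch of the kind of arithmetic a front end uses on devices without fast floating point. The helper names are made up; real front ends do this in C with saturating 16/32-bit integer operations.

    # Illustrative Q15 fixed-point arithmetic: values in [-1, 1) scaled by 2**15.
    Q = 15

    def float_to_q15(x):
        return int(round(x * (1 << Q)))

    def q15_mul(a, b):
        # 16x16 -> 32-bit multiply, then shift back down to Q15
        return (a * b) >> Q

    def q15_dot(xs, ys):
        # Accumulate in a wide accumulator (a DSP would use 32 or 40 bits)
        acc = 0
        for a, b in zip(xs, ys):
            acc += a * b
        return acc >> Q

    # Example: 0.5 * 0.25 = 0.125
    assert q15_mul(float_to_q15(0.5), float_to_q15(0.25)) == float_to_q15(0.125)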
Mobile Speech Recognition: Acoustic Modeling
◮ Exact acoustic model evaluation is intractable
P(o \mid s_i, \lambda) = \sum_{k=1}^{K} w_{ik} \, \frac{1}{\sqrt{(2\pi)^D \, |\Sigma_{ik}|}} \exp\!\left( \sum_{d=1}^{D} \frac{-(o_d - \mu_{ikd})^2}{2\sigma_{ikd}^2} \right)
◮ Typical continuous-density acoustic model:
◮ 5000 tied states, each with
◮ 32 Gaussian densities, of
◮ 39 dimensions
◮ Complete evaluation of all log-likelihoods for one 10ms frame:
◮ 155,000 log-additions
◮ 12,480,000 subtractions
◮ 12,480,000 multiplications
◮ That’s 2500 million operations per second!
◮ Your new MacBook Pro can do that, but just barely
◮ (yes, its video card can do it easily)
◮ (a sketch of this per-frame computation follows this slide)
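A numpy sketch of the per-frame computation above for a diagonal-covariance model, done in the log domain as real decoders do it. The array shapes follow the slide (states x densities x dimensions); the function name and argument layout are mine.

    # Log-likelihood of one frame under every density of every tied state.
    import numpy as np

    def frame_log_likelihoods(o, means, inv_var, log_weights, log_norm):
        """o: (D,) frame.  means, inv_var: (S, K, D).  log_weights, log_norm: (S, K).
        log_norm holds -0.5 * (D*log(2*pi) + log|Sigma|), precomputed per density."""
        diff = o - means                                       # the subtractions
        expo = -0.5 * np.sum(diff * diff * inv_var, axis=-1)   # the multiplications
        per_density = log_weights + log_norm + expo            # (S, K)
        # Combine the K mixture components of each state with log-sum-exp
        # (the "log-additions" on the slide)
        m = per_density.max(axis=1)
        return m + np.log(np.sum(np.exp(per_density - m[:, None]), axis=1))

    # With S, K, D = 5000, 32, 39 this is the per-10ms cost estimated above.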
Mobile Speech Recognition: Acoustic Modeling
◮ How do we make this fast enough?
◮ Only evaluate densities for “active” phones in search
◮ Predict which densities will score highly using a smaller, approximate model set, and only evaluate these ones (a selection sketch follows this slide)
◮ Use fewer densities and:
◮ Share them between all HMM states (semi-continuous HMM)
◮ or all the states for some phonetic class (phonetically-tied HMM)
◮ Make density computation faster by quantizing acoustic features and parameters
◮ Skip some frames in the input, either by
◮ Blindly computing only multiples of N (usually 2 or 3)
◮ Detecting “interesting” regions in the input and only computing densities there (landmark detection)
◮ Every ASR system in existence uses some combination of these
◮ However, too many approximations can make the system slower
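A sketch of the "predict which densities will score highly" idea: score a small codebook of cluster centres first, and fully evaluate only the states whose cluster falls within a beam of the best. The state-to-cluster mapping and the beam value are illustrative, not taken from any specific system.

    # Gaussian selection / shortlisting sketch.
    import numpy as np

    def select_active_states(o, cluster_means, state_to_cluster, beam=10.0):
        """cluster_means: (C, D) coarse codebook.  state_to_cluster: (S,) ints."""
        # Cheap score: squared Euclidean distance to each cluster centre
        score = -0.5 * np.sum((cluster_means - o) ** 2, axis=1)   # (C,)
        active_clusters = score >= score.max() - beam             # boolean (C,)
        # Indices of states whose cluster survived; only these get the full
        # mixture evaluation, the rest back off to a floor score
        return np.nonzero(active_clusters[state_to_cluster])[0]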
Mobile Speech Recognition: Search
◮ Search is not arithmetically intensive
◮ It largely consists of adding up scores and comparing them to other scores
◮ However, it is very memory intensive
◮ The search module in an ASR system touches:
◮ Acoustic scores
◮ Language model scores
◮ Dictionary entries
◮ Viterbi path scores and backpointers
◮ Backpointer table entries
◮ In other words, pretty much every piece of memory except the acoustic model parameters
◮ Worse yet, there are sequential dependencies between all these memory accesses
Mobile Speech Recognition: Search
◮ Fundamentally, the speed of the recognizer is proportional to the number of different hypotheses it considers at once
◮ Optimizing search is entirely devoted to reducing this number without significantly affecting accuracy
◮ This includes:
◮ Careful tuning of various thresholds (beams) for word transitions, phone transitions, etc. (a pruning sketch follows this list)
◮ Absolute pruning - hard limits on words per frame
◮ Phonetic lookahead
◮ Language model lookahead (factorization / weight pushing)
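A sketch of per-frame beam and absolute pruning of the active hypothesis list; the threshold values are exactly the kind of knobs that get tuned per application, and the data layout here is simplified.

    # Beam pruning plus an absolute cap on active hypotheses (illustrative).
    def prune_hypotheses(hyps, beam=200.0, max_active=2000):
        """hyps: dict mapping hypothesis id -> Viterbi path score (log domain)."""
        if not hyps:
            return hyps
        best = max(hyps.values())
        # Relative (beam) pruning: drop anything too far below the best score
        survivors = {h: s for h, s in hyps.items() if s >= best - beam}
        # Absolute pruning: keep at most max_active of what remains
        if len(survivors) > max_active:
            keep = sorted(survivors, key=survivors.get, reverse=True)[:max_active]
            survivors = {h: survivors[h] for h in keep}
        return survivors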
◮ Finite-state transducer systems can be very fast
◮ Dictionary, grammar, and (part of) acoustic model are composed into a single decoding network (a toy expansion sketch follows this slide)
◮ Determinization - allows exact language model search
◮ Minimization - merges common subpaths
◮ Weight pushing - more general kind of LM lookahead
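A toy sketch of the first step of building such a network: expanding word-level grammar arcs into phone-level arcs using the pronunciation dictionary. Real systems do this with weighted FST toolkits and then determinize, minimize, and push weights; the arc representation below is made up for illustration.

    # Expand a word-level grammar into a phone-level decoding network (toy).
    def expand_grammar(word_arcs, lexicon):
        """word_arcs: list of (src, dst, word, weight).  lexicon: word -> [phones]."""
        phone_arcs = []
        next_state = 1 + max(max(s, d) for s, d, _, _ in word_arcs)
        for src, dst, word, weight in word_arcs:
            phones = lexicon[word]
            prev = src
            for i, phone in enumerate(phones):
                last = (i == len(phones) - 1)
                nxt = dst if last else next_state
                if not last:
                    next_state += 1
                # Put the word's weight on its first phone arc; weight pushing
                # would later move weights as early as possible anyway
                phone_arcs.append((prev, nxt, phone, weight if i == 0 else 0.0))
                prev = nxt
        return phone_arcs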
Common Problems for Mobile Speech Processing
◮ Moore’s Law works differently for mobile devices
◮ Instead of getting faster, they get smaller and cheaper
◮ Storage gets bigger, RAM doesn’t
◮ Memory doesn’t get much faster
◮ Memory bandwidth is a major bottleneck
◮ Making things smaller almost always makes them faster
◮ Memory allocations can be very expensive (depending on the operating system)
◮ Audio input quality is often much lower
◮ Typically 8kHz or 11kHz maximum sampling rate
◮ Dubious microphones