Mobile Speech Processing
David Huggins-Daines
Language Technologies Institute
Carnegie Mellon University
September 19, 2008
Outline
◮ Mobile Devices
◮ What are they?
◮ What would we like to do with them?
◮ Mobile Speech Applications
◮ Mobile Speech Technologies
◮ Current Research
Mobile Devices
◮ What is a “mobile device”?
◮ A hammer is a device, and you can carry it around with you!
◮ But no, that’s not what we mean here
Mobile Devices
◮ What is a “mobile device”?
◮ A device that goes everywhere with you
◮ ... which provides some or all of the functions of a computer
◮ ... and some things it doesn’t, such as a cell phone or GPS.
Speech on Mobile Devices
◮ Why do we care about speech processing on these devices?
◮ Because they are the future of computers
◮ Because speech is actually a useful way to interact with them, unlike full-sized computers
◮ What kind of speech processing do we care about?
◮ Speech coding to improve voice quality for cellular and VoIP
◮ Speech recognition for hands-free input to apps
◮ Speech synthesis for eyes-free output from apps
◮ In some cases, speech is a natural and convenient modality
◮ In other cases, it is a necessity (e.g. in-car navigation)
Speech on Mobiles vs. Mobile Speech
◮ None of this necessarily implies doing actual speech processing (aside from coding) on the device itself
◮ Telephone dialog systems are “mobile” by any definition
◮ Let’s Go - bus scheduling information
◮ HealthLine - medical information for rural health workers
◮ But all synthesis and recognition is done on a server
◮ This can be a good thing, especially in the latter case
◮ You can’t run a speech recognizer on a Motofone or a Nokia 1010
◮ Speech processing on the device is useful for:
◮ Multimodal applications
◮ Disconnected applications
◮ Access to local data
Some Mobile Speech Applications
◮ GPS navigation
◮ Older systems used a small number of recorded prompts (“turn left”, “100 metres”, etc.)
◮ More recently, TTS has been used to speak street names
◮ Even more recently, ASR is used for input
◮ Voice dialing
◮ Old systems used DTW and required training (a rough sketch of DTW follows this slide)
◮ Newer ones build models from your address book
◮ Cactus for iPhone - uses CMU Flite and Sphinx
◮ Voice-driven search (local, web, etc)
◮ Nuance, Vlingo, TellMe, Microsoft are all doing this
◮ Voice-to-text
◮ Typically server-based, requires a data connection
◮ “on-line”, ASR-based: Vlingo, Nuance
◮ “off-line”, human-assisted: SpinVox, Jott, ReQall
◮ Speech to Speech Translation
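The DTW-based voice dialers mentioned above matched an incoming utterance against stored per-contact templates. As a rough illustration of the technique (not the code of any particular product; all names below are made up), a minimal dynamic time warping distance might look like this:

    # Illustrative dynamic time warping (DTW) between two feature sequences,
    # the matching technique used by old template-based voice dialers.
    # Real systems add path constraints and pruning.
    import numpy as np

    def dtw_distance(template, query):
        """Cumulative DTW alignment cost between two (T, D) feature arrays."""
        n, m = len(template), len(query)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(template[i - 1] - query[j - 1])  # local distance
                # extend the cheapest of the three allowed predecessor paths
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # A dialer would enroll one template per contact and pick the closest one:
    # best = min(templates, key=lambda name: dtw_distance(templates[name], query))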
Mobile Speech Technologies
◮ Speech Coding
◮ Efficient digital representation of speech signals
◮ Fundamental for 2G and 3G cell networks and VoIP
◮ Speech Synthesis
◮ Speech output for commands, directions
◮ Text-to-speech for messages, books, other content
◮ Speech Recognition
◮ Command and control (“voice control”)
◮ Dictation (speech-to-text for e-mail, SMS)
◮ Search input (questions, keywords)
◮ Dialogue
Speech Coding
◮ A fairly mature technology (started in the 1960s)
◮ Early versions were mostly for military applications
◮ Digital cell phone networks changed this dramatically
◮ Almost universally based on linear prediction and the source-filter model.
◮ Each sample is a weighted sum of the P previous samples.
◮ The weights are linear prediction coefficients (LPCs), calculated to minimize mean squared error (a small sketch follows this slide).
◮ Conveniently enough, this is actually a good model of the frequency response of the vocal tract (given enough LPCs).
◮ An “excitation function” models the glottal source.
◮ Everything else is just tweaking
◮ Better excitation functions (CELP)
◮ Variable bit rates (AMR)
◮ Compression tricks (VAD + comfort noise)
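As a concrete illustration of the linear prediction step above, here is a minimal sketch of the standard autocorrelation (Levinson-Durbin) method for computing LPCs from one windowed speech frame. This is the textbook recipe, not the exact code of any codec, and it omits pre-emphasis, windowing, and quantization.

    # Minimal LPC sketch: find P coefficients so each sample is predicted from
    # the P previous ones with minimum mean squared error, via Levinson-Durbin.
    import numpy as np

    def lpc(frame, order):
        """Return coefficients a[0..order] (with a[0] = 1) for one windowed frame."""
        # Autocorrelation r[0..order] of the frame
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                      for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            # Reflection coefficient for this model order
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] += k * a[i - 1::-1][:i]   # update a[1..i]
            err *= (1.0 - k * k)                 # remaining prediction error
        return a  # predictor: s[n] is approximated by -sum_k a[k] * s[n-k]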
Mobile Speech Synthesis
◮ Two traditional categories, one new one
◮ Synthesis by rule, e.g. formant synthesis
◮ Concatenative synthesis, e.g. diphone, unit selection
◮ Statistical-parametric synthesis (“HMM synthesis”)
◮ We have had very efficient (often hardware-based) implementations of TTS for decades
◮ They sound terrible (but are often quite intelligible)
◮ The challenges for mobile devices are:
◮ Achieving natural-sounding speech
◮ Dealing with very large, irregular vocabularies
◮ Dealing with raw and diverse input text
Mobile Speech Synthesis
◮ Unit selection currently gives the most natural output
◮ But it is very ill-suited to mobile implementations
◮ Best systems use gigabytes of speech data
◮ But, you say... I have an 8GB microSD card in my phone!
◮ Search time: finding the right units of speech
◮ Access time: loading them from the storage medium
◮ Signal generation can also be time-consuming if not efficiently implemented
◮ Some ways to improve efficiency:
◮ Compress the speech database
◮ Prune the speech database by discarding units that are infrequently or never used (a pruning sketch follows this slide)
◮ Approximate search algorithms (much like ASR)
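One way to make the pruning idea concrete: synthesize a large text corpus offline, log which unit instances actually get selected, and keep only those, plus a floor per unit type so coverage is preserved. The data structures and names below are hypothetical; real voices store much richer join and target cost information.

    # Hypothetical offline pruning pass over a unit-selection database.
    from collections import Counter

    def prune_unit_database(units, selection_log, min_per_type=2):
        """units: iterable of (unit_type, unit_id, waveform) tuples.
        selection_log: unit_ids chosen while synthesizing a large corpus."""
        usage = Counter(selection_log)
        kept, per_type = [], {}
        # Consider the most frequently used instances first
        for unit_type, unit_id, waveform in sorted(
                units, key=lambda u: usage[u[1]], reverse=True):
            n = per_type.get(unit_type, 0)
            # Keep instances that were ever used, plus a minimum per type
            if usage[unit_id] > 0 or n < min_per_type:
                kept.append((unit_type, unit_id, waveform))
                per_type[unit_type] = n + 1
        return kept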
Mobile Speech Synthesis
◮ Statistical-parametric synthesis is quite promising
◮ Models are quite small (1-2MB)
◮ The search problem is nonexistent
◮ Parameter and waveform generation are currently the most time-consuming parts
◮ Requires higher dimensionality parameterizations than concatenative synthesis
◮ Output parameters are smoothed using an iterative algorithm (similar to EM)
◮ Waveform generation from mcep is much slower than LPC
◮ Dictionary compression and text normalization
◮ The dictionary can be compressed by building letter-to-sound models and listing only the exceptions (a sketch follows this slide)
◮ Efficient finite-state transducer representations can be created for pronunciation and text processing rules
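A sketch of the exception-listing idea for dictionary compression: run the letter-to-sound model over every entry and keep only the words it gets wrong. `predict_pronunciation` is a stand-in for whatever LTS model (decision trees, an FST, etc.) the system actually uses.

    # Dictionary compression by exception listing (illustrative).
    def compress_dictionary(full_dict, predict_pronunciation):
        exceptions = {}
        for word, phones in full_dict.items():
            if predict_pronunciation(word) != phones:
                exceptions[word] = phones   # the LTS model fails here, so store it
        return exceptions

    def lookup(word, exceptions, predict_pronunciation):
        # Runtime lookup: small exception list first, then fall back to LTS rules
        return exceptions.get(word) or predict_pronunciation(word)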
Mobile Speech Recognition
◮ Challenges for mobile devices are:
◮ Variable and noisy acoustic environments
◮ Large vocabularies
◮ Open domain dictation input
◮ As with speech synthesis, simple ASR is not very resource-intensive, although it has not been as widely implemented
◮ Even with large vocabularies, ASR can be done efficiently
◮ The most important factor is the complexity of the grammar
◮ Commercial systems achieve impressive performance based on very constrained grammars
◮ Systems tend to be extensively tuned for a given application
Mobile Speech Recognition: Acoustic Issues
◮ How do you talk to a device?
◮ This depends on the application, user, and environment
◮ Acoustic feature vectors can look very different
◮ Microphones may not be optimized for all positions
◮ Noisy environments
◮ Mobile devices are more likely to be used in noisy environments
◮ Worse, they are more likely to be used in difficult ones
◮ Non-stationary noise, crosstalk, human babble
◮ Array processing is not well suited to handheld devices
◮ On the bright side:
◮ Usually a mobile device has only one user
◮ Speaker adaptation can improve acoustic modeling (a small adaptation sketch follows this slide)
◮ Speaker identification can be used to filter out babble and crosstalk
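As one standard example of speaker adaptation (a textbook recipe, not necessarily what any particular mobile recognizer does), MAP adaptation of Gaussian means pulls each mean toward the user's data in proportion to how many frames aligned to that Gaussian.

    # MAP adaptation of Gaussian means (illustrative sketch).
    import numpy as np

    def map_adapt_means(prior_means, frame_sums, frame_counts, tau=10.0):
        """prior_means:  (G, D) speaker-independent means.
        frame_sums:   (G, D) per-Gaussian sum of the user's feature vectors.
        frame_counts: (G,)   per-Gaussian number of aligned frames.
        tau: how much data it takes to move away from the prior."""
        return (tau * prior_means + frame_sums) / (tau + frame_counts[:, None])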
Mobile Speech Recognition: Computational Issues
◮ Acoustic feature extraction
◮ Efficient, as long as it is implemented properly
◮ Fixed-point arithmetic, data-parallel processing (a fixed-point sketch follows this slide)
◮ Most processing time is consumed by, in roughly equal amounts:
◮ Acoustic model evaluation
◮ Search (hypothesis generation and evaluation)
◮ These can be made computationally efficient but must also be made memory efficient, search in particular.
◮ This necessarily involves tuning heuristics because a complete solution is intractable.
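To make the fixed-point point concrete, here is an illustrative Q15 sketch of the kind of arithmetic a front end uses on devices without fast floating point. The helper names are made up; real front ends do this in C with saturating 16/32-bit integer operations.

    # Illustrative Q15 fixed-point arithmetic: values in [-1, 1) scaled by 2**15.
    Q = 15

    def float_to_q15(x):
        return int(round(x * (1 << Q)))

    def q15_mul(a, b):
        # 16x16 -> 32-bit multiply, then shift back down to Q15
        return (a * b) >> Q

    def q15_dot(xs, ys):
        # Accumulate in a wide accumulator (a DSP would use 32 or 40 bits)
        acc = 0
        for a, b in zip(xs, ys):
            acc += a * b
        return acc >> Q

    # Example: 0.5 * 0.25 = 0.125
    assert q15_mul(float_to_q15(0.5), float_to_q15(0.25)) == float_to_q15(0.125)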
Mobile Speech Recognition: Acoustic Modeling
◮ Exact acoustic model evaluation is intractable
P(o \mid s_i, \lambda) = \sum_{k=1}^{K} w_{ik} \, \frac{1}{\sqrt{(2\pi)^D \, |\Sigma_{ik}|}} \exp\!\left( \sum_{d=1}^{D} \frac{-(o_d - \mu_{ikd})^2}{2\sigma_{ikd}^2} \right)
◮ Typical continuous-density acoustic model:
◮ 5000 tied states, each with
◮ 32 Gaussian densities, of
◮ 39 dimensions
◮ Complete evaluation of all log-likelihoods for one 10ms frame:
◮ 155,000 log-additions
◮ 12,480,000 subtractions
◮ 12,480,000 multiplications
◮ That’s 2500 million operations per second!
◮ Your new MacBook Pro can do that, but just barely
◮ (yes, its video card can do it easily)
◮ (a sketch of this per-frame computation follows this slide)
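A numpy sketch of the per-frame computation above for a diagonal-covariance model, done in the log domain as real decoders do it. The array shapes follow the slide (states x densities x dimensions); the function name and argument layout are mine.

    # Log-likelihood of one frame under every density of every tied state.
    import numpy as np

    def frame_log_likelihoods(o, means, inv_var, log_weights, log_norm):
        """o: (D,) frame.  means, inv_var: (S, K, D).  log_weights, log_norm: (S, K).
        log_norm holds -0.5 * (D*log(2*pi) + log|Sigma|), precomputed per density."""
        diff = o - means                                       # the subtractions
        expo = -0.5 * np.sum(diff * diff * inv_var, axis=-1)   # the multiplications
        per_density = log_weights + log_norm + expo            # (S, K)
        # Combine the K mixture components of each state with log-sum-exp
        # (the "log-additions" on the slide)
        m = per_density.max(axis=1)
        return m + np.log(np.sum(np.exp(per_density - m[:, None]), axis=1))

    # With S, K, D = 5000, 32, 39 this is the per-10ms cost estimated above.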
Mobile Speech Recognition: Acoustic Modeling
◮ How do we make this fast enough?
◮ Only evaluate densities for “active” phones in search
◮ Predict which densities will score highly using a smaller, approximate model set, and only evaluate these ones (a selection sketch follows this slide)
◮ Use fewer densities and:
◮ Share them between all HMM states (semi-continuous HMM)
◮ or all the states for some phonetic class (phonetically-tied HMM)
◮ Make density computation faster by quantizing acoustic features and parameters
◮ Skip some frames in the input, either by
◮ Blindly computing only multiples of N (usually 2 or 3)
◮ Detecting “interesting” regions in the input and only computing densities there (landmark detection)
◮ Every ASR system in existence uses some combination of these
◮ However, too many approximations can make the system slower
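A sketch of the "predict which densities will score highly" idea: score a small codebook of cluster centres first, and fully evaluate only the states whose cluster falls within a beam of the best. The state-to-cluster mapping and the beam value are illustrative, not taken from any specific system.

    # Gaussian selection / shortlisting sketch.
    import numpy as np

    def select_active_states(o, cluster_means, state_to_cluster, beam=10.0):
        """cluster_means: (C, D) coarse codebook.  state_to_cluster: (S,) ints."""
        # Cheap score: squared Euclidean distance to each cluster centre
        score = -0.5 * np.sum((cluster_means - o) ** 2, axis=1)   # (C,)
        active_clusters = score >= score.max() - beam             # boolean (C,)
        # Indices of states whose cluster survived; only these get the full
        # mixture evaluation, the rest back off to a floor score
        return np.nonzero(active_clusters[state_to_cluster])[0]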
Mobile Speech Recognition: Search
◮ Search is not arithmetically intensive
◮ It largely consists of adding up scores and comparing them to other scores
◮ However, it is very memory intensive
◮ The search module in an ASR system touches:
◮ Acoustic scores
◮ Language model scores
◮ Dictionary entries
◮ Viterbi path scores and backpointers
◮ Backpointer table entries
◮ In other words, pretty much every piece of memory except the acoustic model parameters
◮ Worse yet, there are sequential dependencies between all these memory accesses
Mobile Speech Recognition: Search
◮ Fundamentally, the speed of the recognizer is proportional to the number of different hypotheses it considers at once
◮ Optimizing search is entirely devoted to reducing this number without significantly affecting accuracy
◮ This includes:
◮ Careful tuning of various thresholds (beams) for word transitions, phone transitions, etc. (a pruning sketch follows this list)
◮ Absolute pruning - hard limits on words per frame
◮ Phonetic lookahead
◮ Language model lookahead (factorization / weight pushing)
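A sketch of per-frame beam and absolute pruning of the active hypothesis list; the threshold values are exactly the kind of knobs that get tuned per application, and the data layout here is simplified.

    # Beam pruning plus an absolute cap on active hypotheses (illustrative).
    def prune_hypotheses(hyps, beam=200.0, max_active=2000):
        """hyps: dict mapping hypothesis id -> Viterbi path score (log domain)."""
        if not hyps:
            return hyps
        best = max(hyps.values())
        # Relative (beam) pruning: drop anything too far below the best score
        survivors = {h: s for h, s in hyps.items() if s >= best - beam}
        # Absolute pruning: keep at most max_active of what remains
        if len(survivors) > max_active:
            keep = sorted(survivors, key=survivors.get, reverse=True)[:max_active]
            survivors = {h: survivors[h] for h in keep}
        return survivors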
◮ Finite-state transducer systems can be very fast
◮ Dictionary, grammar, and (part of) acoustic model are composed into a single decoding network (a toy expansion sketch follows this slide)
◮ Determinization - allows exact language model search
◮ Minimization - merges common subpaths
◮ Weight pushing - more general kind of LM lookahead
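A toy sketch of the first step of building such a network: expanding word-level grammar arcs into phone-level arcs using the pronunciation dictionary. Real systems do this with weighted FST toolkits and then determinize, minimize, and push weights; the arc representation below is made up for illustration.

    # Expand a word-level grammar into a phone-level decoding network (toy).
    def expand_grammar(word_arcs, lexicon):
        """word_arcs: list of (src, dst, word, weight).  lexicon: word -> [phones]."""
        phone_arcs = []
        next_state = 1 + max(max(s, d) for s, d, _, _ in word_arcs)
        for src, dst, word, weight in word_arcs:
            phones = lexicon[word]
            prev = src
            for i, phone in enumerate(phones):
                last = (i == len(phones) - 1)
                nxt = dst if last else next_state
                if not last:
                    next_state += 1
                # Put the word's weight on its first phone arc; weight pushing
                # would later move weights as early as possible anyway
                phone_arcs.append((prev, nxt, phone, weight if i == 0 else 0.0))
                prev = nxt
        return phone_arcs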
Common Problems for Mobile Speech Processing
◮ Moore’s Law works differently for mobile devices
◮ Instead of getting faster, they get smaller and cheaper
◮ Storage gets bigger, RAM doesn’t
◮ Memory doesn’t get much faster
◮ Memory bandwidth is a major bottleneck
◮ Making things smaller almost always makes them faster
◮ Memory allocations can be very expensive (depending on the operating system)
◮ Audio input quality is often much lower
◮ Typically 8kHz or 11kHz maximum sampling rate
◮ Dubious microphones