The really great thing about deep networks isn't that they're more accurate. It's that they're radically simpler.
Current speech recognizers are basically layer upon layer of tricks discovered by researchers over the course of decades. Chop up the input signal. Then take a Fourier transform. Take the log to even the signal out. Do another transform to de-correlate different components of the audio. Add noise to the input. Project down to a subspace. Switch objective functions halfway through training to trade off different kinds of errors. Use more Gaussians here. Use fewer there. Pump it into a language model.
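To make the "stack of tricks" concrete, here's a toy sketch of that classic front-end in numpy: chop, Fourier transform, log, then a de-correlating DCT (a real MFCC pipeline also runs a mel filterbank and other steps I'm omitting; the frame sizes here are just illustrative):

```python
import numpy as np

def mfcc_like_features(signal, frame_len=400, hop=160, n_coeffs=13):
    """Toy front-end mirroring the classic pipeline (mel filterbank omitted)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]       # chop up the input signal
        spectrum = np.abs(np.fft.rfft(frame)) ** 2    # Fourier transform -> power
        log_spec = np.log(spectrum + 1e-10)           # log evens the signal out
        # DCT-II de-correlates the components; keep the first few coefficients
        n = len(log_spec)
        k = np.arange(n)
        dct = np.array([np.sum(log_spec * np.cos(np.pi * c * (2 * k + 1) / (2 * n)))
                        for c in range(n_coeffs)])
        frames.append(dct)
    return np.array(frames)
```

Each of those steps is one of the hand-designed tricks the comment above is describing, and each has tunable knobs (frame length, number of coefficients, the log floor) that were settled by decades of experimentation.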
It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.
The nice thing is that a deep belief network can figure out a lot of this structure automatically, much closer to how the brain works.
This paper is actually incremental, not a "leap forward." They've basically replaced two of the middle layers of a speech recognizer (the Gaussian mixture model and hidden Markov model) with a modified neural network. But the exciting thing is that the neural network can start there, and slowly eat its way toward the outer layers, replacing a big stack of hacks with one simple algorithm.
> It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.
i'm not sure about this attitude. it reminds me of a quote by dijkstra:
"The question of whether Machines Can Think... is about as relevant as the question of whether Submarines Can Swim."
why demand that intelligence proceed from a single parsimonious gesture?
I doubt brandonb thinks anything so simple. His post suggests a strong familiarity with the subject.
I believe his emphasis is that previously, much of the structure was not discovered by the algorithm but hand-tuned and built in manually. The new algorithm would seem to require less work to implement because it uncovers on its own much of the complex structure that previously had to be tweaked in by experimentation. In this sense we can talk about which is smarter in terms of required hand-holding and massaging.
The question of what really constitutes intelligence is orthogonal to his post.
We can agree that what computers can do these days is quite different to human-style thinking.
But to imply that what machines do is somehow different from thinking, just as submarine propulsion is different from swimming, is to imply that thinking is different from computation.
Whether computation, a mechanical process, encompasses what we consider thinking is not a settled question, but I think there's a pretty good case for believing that computation does encompass thinking.
just as our sense of what swimming entails is wrongly constrained by our familiarity with specific implementations, beyond "moving about under water", we shouldn't limit "thinking" to mean "activity in a neural network", etc.
I kind of second the earlier replier in not really liking the attitude here:
>It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.
>The nice thing is that a deep belief network can figure out a lot of this structure automatically, much closer to how the brain works.
Really, the brain works very much like a neural net? I was under the impression it was hacked together, over many years, by the statistical process known as evolution... I'm wondering if this idea of "this time it's not a mere math hack!" is a case of the 'Lemon Glazing Fallacy': http://lesswrong.com/lw/vv/logical_or_connectionist_ai/
I do agree with you that it's hardly a leap forward. Marketing is fun.
> I was under the impression it was hacked together, over many years, by the statistical process known as evolution...
The problem with that argument is that it comes close to Chomsky's concept of a "built-in" universal language, which has been demonstrated to be mostly wrong (some African languages differ wildly; attempts to produce a universal grammar mostly failed; humans who grow up in isolation never really acquire language).
Whatever evolution has bestowed upon our brain seems to be first of all flexibility and adaptability, not a series of inflexible statistical "hacks".
Refutations of Chomsky haven't really refuted a more general built-in-functionality argument, looked at on a wider plane than sentence parsing. Not many people in neuroscience seriously reject the idea in the bigger picture, e.g. the idea that there are parts of the brain hard-wired for vision processing, and that we seem to have evolved to be more adept at certain kinds of vision-processing tasks than others. It's even widely believed (though more controversially) that we have built-in face-recognition machinery, which operates differently from the general object-recognition machinery. Similarly, we seem to have wiring fairly specialized for tasks like "maintain balance" and "process vibrations in the ear".
> Not many people in neuroscience seriously reject the idea in the bigger picture
Absolutely correct, but I was using the more narrow interpretation of "human intelligence" as "traits that make us distinct from most other mammals". Most higher animals have specialized structures for vision processing and most of them are also good at "maintaining balance". What makes humans unique is the adaptability of our brains I was talking about and the ability to acquire different symbolic systems with relative ease (at least as young children).
Did they actually replace the HMM with a neural network? I'm only going by the abstract, but it seems that they just replaced the usual Gaussian mixture models with their neural networks.
No, the HMM is not replaced in that work. The GMM is replaced, as you surmise. There are three problems with standard ASR: HMMs, GMMs, and n-gram language models. The GMM is the easiest to remove. Keeping the HMM allows simple, efficient decoding algorithms.
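For anyone curious what "replacing the GMM but keeping the HMM" means mechanically: in the hybrid setup, the network's per-frame state posteriors are divided by state priors to get scaled likelihoods, which slot into the exact same Viterbi decoder the GMM scores used to feed. A toy sketch (all the numbers and the two-state setup are made up for illustration):

```python
import numpy as np

def dnn_to_scaled_likelihood(posteriors, state_priors):
    """The hybrid trick: p(x|s) is proportional to p(s|x) / p(s)."""
    return posteriors / state_priors

def viterbi(scaled_lik, log_trans, log_init):
    """Standard HMM Viterbi decode; the acoustic scores can come from a
    GMM or, in the hybrid setup, from a neural net's scaled posteriors."""
    T, S = scaled_lik.shape
    log_lik = np.log(scaled_lik + 1e-30)
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # score[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The point is that only `dnn_to_scaled_likelihood` changes when you swap in the network; `viterbi` and the whole decoding machinery stay put, which is why keeping the HMM keeps decoding simple and efficient.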
Now I pose another question to you. What is "learning?"
When you feed a bunch of input to an ANN, isn't it doing exactly what it's told? The ANN itself does not seek out knowledge, it only adjusts weights according to an algorithm depending on inputs.
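That's literally all the "learning" is, mechanically. Here's a single logistic neuron, the fixed update rule applied over and over to whatever inputs it is fed (a minimal sketch, not how a production recognizer trains):

```python
import numpy as np

def train_neuron(X, y, lr=0.5, epochs=2000):
    """One logistic neuron: all the 'learning' is this fixed update rule,
    applied mechanically to whatever inputs it is fed."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # prediction
        grad = p - y                            # gradient of cross-entropy loss
        w -= lr * X.T @ grad / len(y)           # adjust weights...
        b -= lr * grad.mean()                   # ...by the same fixed rule
    return w, b
```

It never "seeks out" anything; it just follows the gradient it's told to follow. Whether that counts as learning is exactly the question being posed.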
The allure to having a fixed, visible algorithm is that you are aware of many of the limitations. Having a black box obscures those limitations, but I won't deny it's alluring in other ways.
Which is better, something which learns exactly as we programmed it to, or something which learns in some unknown way, in which the big ol' bucket of bytes orders itself?
Or, if you had a magic learning machine, wouldn't you want to know exactly what its learning mechanisms are?
The black box can adapt to new circumstances. A lot of the algorithms used today in speech recognition were established on much smaller data sets (thousands of hours of speech); the tradeoffs made then may not apply when you have 1000x the data. The more automatic the algorithm, the more it can change.
Existing speech recognizers aren't really a "white box". You can't look at a Gaussian Mixture Model and understand what it's doing.
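To see why: a GMM acoustic score is just a weighted sum of Gaussian densities, and the trained means and variances are piles of numbers with no human-readable meaning. A minimal diagonal-covariance sketch:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.
    The trained parameters are just numbers; nothing here is interpretable."""
    comps = []
    for w, m, v in zip(weights, means, variances):
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * v))
        log_exp = -0.5 * np.sum((x - m) ** 2 / v)
        comps.append(np.log(w) + log_norm + log_exp)
    return np.logaddexp.reduce(comps)  # log-sum-exp over mixture components
```

Staring at the fitted `means` and `variances` of a real recognizer tells you roughly as much as staring at a neural net's weight matrices, which is the point: neither model was ever a white box.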
The more you automate the whole training process, the more the system can keep adapting.
Is it just me or does 18% seem like a high error rate - and this is after improvement?
I've used technologies (Nuance??) that have significantly lower error rates than this, even for systems I have not trained personally. Is there something I'm missing?
The difference in error rates is in large part due to the difference between dictated speech and spontaneous, informal conversational speech.
Switchboard (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...) is a set of telephone conversations between two people. Speakers tend to say a lot of "ums", abruptly restart an utterance in progress, talk past the telephone handset, etc. Dictated speech, especially when speakers know they're talking to a computer, has less acoustic and linguistic noise.
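For context, the number being discussed is a word error rate: the minimum number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. It's a straightforward edit-distance computation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance recurrence."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)
```

So an 18% WER on conversational telephone speech means roughly one word in five is wrong, which sounds bad but is a very different regime from clean dictation.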
Nuance has speaker-independent systems as well. If you have an iPhone, try an app like Dragon Dictation or Siri. Neither one requires any training, and will start off with high-quality recognition right out of the box.
I was going to ask the same thing. It doesn't seem like that significant an improvement but I'm no expert. Also I do wonder if the performance improvements are mostly due to gpu acceleration as opposed to a switch to a different software model.
Probably not. This web page is from the PR department of Microsoft Research, and the probability of that would be low even if the claims had come from the researchers rather than from PR.
1. Nothing. People ARE implementing similar things. It takes time, effort, and lots of computation.
2. People often prefer to implement their own ideas and compete (especially researchers).
3. Potentially, the lack of patent protection might discourage other firms from doing it.
No. I worked at a small IVR systems company in 2000 and at Nuance in 2001. I also worked with the tech during my undergraduate years. My opinion on speech recognition is that it's very pie-in-the-sky and not yet ready for general applications. I don't say this because the technology itself isn't ready; it's that humans aren't ready for it.
Having stated my bias: Speech recognition systems are actually not that complex at their core. It's a blending of statistical models. Getting good data is a problem. You need a good acoustic model that's adapted to your users and the environment in which they will be using your application. Everything from the fluency of speakers, to physical environment, to the characteristics of the channel over which the speech is sent needs to be considered.
If you have a good acoustic model, now you have to worry about your language model. Are you going to try to accept all words in a language, or just restrict your users to a particular domain of language? If you have a good language model, then you need to worry about the dialog management. How do you keep context in a conversation? It's not an easy problem.
The primary problem with speech recognition systems is that human beings set their expectations of them too high. It's a psychological factor. When those expectations are not met, the user is frustrated and angry. Consider this. Whenever you call AT&T, your health insurance company, or credit card company, do you enjoy the experience of the IVR system that routes your call? Probably not. You probably don't even talk to it and resort to pressing the buttons instead. Unfortunately that's the experience most people have with speech recognition. I think it's the worst possible application of it.
If you're making a small, toy application whose vocabulary is pretty restricted and whose functionality set is small, then you're probably okay. If you venture into full dialog/anything-goes type applications, the chances are high that your app will be a bomb.
These researchers can swap out all the lower-level statistical models they want, but it won't fundamentally improve the technology. There are systems out there with word error rates very close to that of humans, but the systems higher up in the stack that interpret what is recognized are still very crude.