The really great thing about deep networks isn't that they're more accurate. It's that they're radically simpler.
Current speech recognizers are basically layer upon layer of tricks discovered by researchers over the course of decades. Chop up the input signal. Then take a Fourier transform. Take the log to even the signal out. Do another transform to de-correlate different components of the audio. Add noise to the input. Project down to a subspace. Switch objective functions halfway through training to trade off different kinds of errors. Use more Gaussians here. Use fewer there. Pump it into a language model.
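To make the "stack of tricks" concrete, here's a toy sketch of that classic front-end in numpy: chop, Fourier transform, log, then a de-correlating DCT (a real MFCC pipeline also runs a mel filterbank and other steps I'm omitting; the frame sizes here are just illustrative):

```python
import numpy as np

def mfcc_like_features(signal, frame_len=400, hop=160, n_coeffs=13):
    """Toy front-end mirroring the classic pipeline (mel filterbank omitted)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]       # chop up the input signal
        spectrum = np.abs(np.fft.rfft(frame)) ** 2    # Fourier transform -> power
        log_spec = np.log(spectrum + 1e-10)           # log evens the signal out
        # DCT-II de-correlates the components; keep the first few coefficients
        n = len(log_spec)
        k = np.arange(n)
        dct = np.array([np.sum(log_spec * np.cos(np.pi * c * (2 * k + 1) / (2 * n)))
                        for c in range(n_coeffs)])
        frames.append(dct)
    return np.array(frames)
```

Each of those steps is one of the hand-designed tricks the comment above is describing, and each has tunable knobs (frame length, number of coefficients, the log floor) that were settled by decades of experimentation.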
It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.
The nice thing is that a deep belief network can figure out a lot of this structure automatically, much closer to how the brain works.
This paper is actually incremental, not a "leap forward." They've basically replaced two of the middle layers of a speech recognizer (the Gaussian mixture model and hidden Markov model) with a modified neural network. But the exciting thing is that the neural network can start there, and slowly eat its way toward the outer layers, replacing a big stack of hacks with one simple algorithm.
> It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.
i'm not sure about this attitude. it reminds me of a quote by dijkstra:
"The question of whether Machines Can Think... is about as relevant as the question of whether Submarines Can Swim."
why demand that intelligence proceed from a single parsimonious gesture?
I doubt brandonb thinks anything so simple. His post suggests a strong familiarity with the subject.
I believe his emphasis is that previously, much of the structure was not discovered by the algorithm but hand-tuned and built in manually. The new algorithm would seem to require less work to implement because it uncovers on its own much of the complex structure that previously had to be tweaked in by experimentation. In this sense we can talk about which is smarter in terms of required hand-holding and massaging.
The question of what really constitutes intelligence is orthogonal to his post.
We can agree that what computers can do these days is quite different to human-style thinking.
But to imply that what machines do is somehow different from thinking, just as submarine propulsion is different from swimming, is to imply that thinking is different from computation.
Whether computation, a mechanical process, encompasses what we consider thinking is not a settled question, but I think there's a pretty good case for believing that computation does encompass thinking.
just as our sense of what swimming entails is wrongly constrained by our familiarity with specific implementations, beyond "moving about under water", we shouldn't limit "thinking" to mean "activity in a neural network", etc.
I kind of second the earlier replier in not really liking the attitude here:
>It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.
>The nice thing is that a deep belief network can figure out a lot of this structure automatically, much closer to how the brain works.
Really, the brain works very much like a neural net? I was under the impression it was hacked together, over many years, by the statistical process known as evolution... I'm wondering if this idea of "this time it's not a mere math hack!" is a case of the 'Lemon Glazing Fallacy': http://lesswrong.com/lw/vv/logical_or_connectionist_ai/
I do agree with you that it's hardly a leap forward. Marketing is fun.
> I was under the impression it was hacked together, over many years, by the statistical process known as evolution...
The problem with that argument is that it comes close to Chomsky's concept of a "built-in" universal language, which has been demonstrated to be mostly wrong (some African languages differ wildly; attempts to produce a universal grammar mostly failed; humans who grow up in isolation never really acquire language).
Whatever evolution has bestowed upon our brain seems to be first of all flexibility and adaptability, not a series of inflexible statistical "hacks".
Refutations of Chomsky haven't really refuted a more general built-in-functionality argument, looked at on a wider plane than sentence parsing. Not many people in neuroscience seriously reject the idea in the bigger picture, e.g. the idea that there are parts of the brain hard-wired for vision processing, and that we seem to have evolved to be more adept at certain kinds of vision-processing tasks than others. It's even widely believed (though more controversially) that we have built-in face-recognition machinery, which operates differently from the general object-recognition machinery. Similarly, we seem to have wiring fairly specialized for tasks like "maintain balance" and "process vibrations in the ear".
> Not many people in neuroscience seriously reject the idea in the bigger picture
Absolutely correct, but I was using the more narrow interpretation of "human intelligence" as "traits that make us distinct from most other mammals". Most higher animals have specialized structures for vision processing and most of them are also good at "maintaining balance". What makes humans unique is the adaptability of our brains I was talking about and the ability to acquire different symbolic systems with relative ease (at least as young children).
Did they actually replace the HMM with a neural network? I'm only going by the abstract, but it seems that they just replaced the usual Gaussian mixture models with their neural networks.
No, the HMM is not replaced in that work. The GMM is replaced, as you surmise. There are three problems with standard ASR: HMMs, GMMs, and n-gram language models. The GMM is the easiest to remove. Keeping the HMM allows simple, efficient decoding algorithms.
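For anyone curious what "replacing the GMM but keeping the HMM" means mechanically: in the hybrid setup, the network's per-frame state posteriors are divided by state priors to get scaled likelihoods, which slot into the exact same Viterbi decoder the GMM scores used to feed. A toy sketch (all the numbers and the two-state setup are made up for illustration):

```python
import numpy as np

def dnn_to_scaled_likelihood(posteriors, state_priors):
    """The hybrid trick: p(x|s) is proportional to p(s|x) / p(s)."""
    return posteriors / state_priors

def viterbi(scaled_lik, log_trans, log_init):
    """Standard HMM Viterbi decode; the acoustic scores can come from a
    GMM or, in the hybrid setup, from a neural net's scaled posteriors."""
    T, S = scaled_lik.shape
    log_lik = np.log(scaled_lik + 1e-30)
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # score[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The point is that only `dnn_to_scaled_likelihood` changes when you swap in the network; `viterbi` and the whole decoding machinery stay put, which is why keeping the HMM keeps decoding simple and efficient.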
Now I pose another question to you. What is "learning?"
When you feed a bunch of input to an ANN, isn't it doing exactly what it's told? The ANN itself does not seek out knowledge, it only adjusts weights according to an algorithm depending on inputs.
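That's literally all the "learning" is, mechanically. Here's a single logistic neuron, the fixed update rule applied over and over to whatever inputs it is fed (a minimal sketch, not how a production recognizer trains):

```python
import numpy as np

def train_neuron(X, y, lr=0.5, epochs=2000):
    """One logistic neuron: all the 'learning' is this fixed update rule,
    applied mechanically to whatever inputs it is fed."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # prediction
        grad = p - y                            # gradient of cross-entropy loss
        w -= lr * X.T @ grad / len(y)           # adjust weights...
        b -= lr * grad.mean()                   # ...by the same fixed rule
    return w, b
```

It never "seeks out" anything; it just follows the gradient it's told to follow. Whether that counts as learning is exactly the question being posed.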
The allure to having a fixed, visible algorithm is that you are aware of many of the limitations. Having a black box obscures those limitations, but I won't deny it's alluring in other ways.
Which is better, something which learns exactly as we programmed it to, or something which learns in some unknown way, in which the big ol' bucket of bytes orders itself?
Or, if you had a magic learning machine, wouldn't you want to know exactly what its learning mechanisms are?
The black box can adapt to new circumstances. A lot of the algorithms used today in speech recognition were established on much smaller data sets (thousands of hours of speech); the tradeoffs made then may not apply when you have 1000x the data. The more automatic the algorithm, the more it can change.
Existing speech recognizers aren't really a "white box". You can't look at a Gaussian Mixture Model and understand what it's doing.
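To see why: a GMM acoustic score is just a weighted sum of Gaussian densities, and the trained means and variances are piles of numbers with no human-readable meaning. A minimal diagonal-covariance sketch:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.
    The trained parameters are just numbers; nothing here is interpretable."""
    comps = []
    for w, m, v in zip(weights, means, variances):
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * v))
        log_exp = -0.5 * np.sum((x - m) ** 2 / v)
        comps.append(np.log(w) + log_norm + log_exp)
    return np.logaddexp.reduce(comps)  # log-sum-exp over mixture components
```

Staring at the fitted `means` and `variances` of a real recognizer tells you roughly as much as staring at a neural net's weight matrices, which is the point: neither model was ever a white box.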
The more you automate the whole training process, the more the system can keep adapting.
Is it just me or does 18% seem like a high error rate - and this is after improvement?
I've used technologies (Nuance??) that have significantly lower error rates than this, even for systems I have not trained personally. Is there something I'm missing?
The difference in error rates is in large part due to the difference between dictated speech and spontaneous, informal conversational speech.
Switchboard (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...) is a set of telephone conversations between two people. Speakers tend to say a lot of "ums", abruptly restart an utterance in progress, talk past the telephone handset, etc. Dictated speech, especially when speakers know they're talking to a computer, has less acoustic and linguistic noise.
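For context, the number being discussed is a word error rate: the minimum number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. It's a straightforward edit-distance computation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance recurrence."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)
```

So an 18% WER on conversational telephone speech means roughly one word in five is wrong, which sounds bad but is a very different regime from clean dictation.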
Nuance has speaker-independent systems as well. If you have an iPhone, try an app like Dragon Dictation or Siri. Neither one requires any training, and will start off with high-quality recognition right out of the box.
I was going to ask the same thing. It doesn't seem like that significant an improvement but I'm no expert. Also I do wonder if the performance improvements are mostly due to gpu acceleration as opposed to a switch to a different software model.
Probably not. This web page is from the PR department of Microsoft Research, and the probability of that would be low even if the claims had come from the researchers rather than from PR.
1. Nothing. People ARE implementing similar things. It takes time, effort, and lots of computation.
2. People often prefer to implement their own ideas and compete (especially researchers).
3. Potentially, the lack of patent protection might discourage other firms from doing it.
No. I worked at a small IVR systems company in 2000 and at Nuance in 2001. I also worked with the tech during my undergraduate years. My opinion on speech recognition is that it's very pie-in-the-sky and not yet ready for general applications. I don't say this because the technology itself isn't ready; it's that humans aren't ready for it.
Having stated my bias: Speech recognition systems are actually not that complex at their core. It's a blending of statistical models. Getting good data is a problem. You need a good acoustic model that's adapted to your users and the environment in which they will be using your application. Everything from the fluency of speakers, to physical environment, to the characteristics of the channel over which the speech is sent needs to be considered.
If you have a good acoustic model, now you have to worry about your language model. Are you going to try to accept all words in a language, or just restrict your users to a particular domain of language? If you have a good language model, then you need to worry about the dialog management. How do you keep context in a conversation? It's not an easy problem.
The primary problem with speech recognition systems is that human beings set their expectations of them too high. It's a psychological factor. When those expectations are not met, the user is frustrated and angry. Consider this. Whenever you call AT&T, your health insurance company, or credit card company, do you enjoy the experience of the IVR system that routes your call? Probably not. You probably don't even talk to it and resort to pressing the buttons instead. Unfortunately that's the experience most people have with speech recognition. I think it's the worst possible application of it.
If you're making a small, toy application whose vocabulary is pretty restricted and whose functionality set is small, then you're probably okay. If you venture into full dialog/anything-goes type applications, the chances are high that your app will be a bomb.
These researchers can swap out all the lower-level statistical models they want, but it won't fundamentally improve the technology. There are systems out there with word error rates very close to that of humans, but the systems higher up in the stack that interpret what is recognized are still very crude.