Tags: algorithm, theory, speech-recognition

Why is speech recognition difficult?


Why is speech recognition so difficult? What are the specific challenges involved? I've read through a question on speech recognition, which partially answered some of my questions, but the answers were largely anecdotal rather than technical. It also didn't really answer why we can't just throw more hardware at the problem.

I've seen tools that perform automated noise reduction using neural nets and ambient FFT analysis with excellent results, so I can't see a reason why we're still struggling with noise except in difficult scenarios like ludicrously loud background noise or multiple speech sources.
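For reference, the "ambient FFT" style of noise reduction is conceptually simple when the background noise is roughly stationary. Below is a minimal sketch of classic spectral subtraction; the function name, parameters, and the assumption that the first few frames contain only background noise are mine, not taken from any particular tool.

```python
# A rough sketch of classic spectral subtraction, the idea behind "ambient FFT"
# noise reduction. Assumptions (not from the question): `noisy` is a mono float
# signal and the first `noise_frames` frames contain only background noise.
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10, floor=0.02):
    """Suppress stationary background noise by subtracting an averaged noise spectrum."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop

    # Slice the signal into overlapping, windowed frames and move to the frequency domain.
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    mags, phases = np.abs(spectra), np.angle(spectra)

    # Estimate the noise magnitude spectrum from the leading "noise-only" frames.
    noise_mag = mags[:noise_frames].mean(axis=0)

    # Subtract it from every frame, keeping a small spectral floor to limit musical noise.
    cleaned_mag = np.maximum(mags - noise_mag, floor * noise_mag)
    cleaned_frames = np.fft.irfft(cleaned_mag * np.exp(1j * phases), axis=1)

    # Overlap-add the frames back into a single time-domain signal.
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += cleaned_frames[i]
    return out
```

The catch is that this only works while the noise stays statistically similar over time; competing speakers or sudden sounds break that assumption, which is exactly the difficult scenarios mentioned above.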

Beyond this, isn't it just a case of using very large, complex, well-trained neural nets to do the processing, then throwing hardware at it to make it work fast enough?

I understand that strong accents are a problem and that we all have our colloquialisms, but these recognition engines still get basic things wrong when the person is speaking in a slow and clear American or British accent.

So, what's the deal? What technical problems are there that make it still so difficult for a computer to understand me?


Solution

  • Some technical reasons:

    • You need lots of tagged training data, which can be difficult to acquire once you take into account all the different accents, sounds, etc.
    • Neural networks and similar gradient descent algorithms don't scale that well - just making them bigger (more layers, more nodes, more connections) doesn't guarantee that they will learn to solve your problem in a reasonable time. Scaling up machine learning to solve complex tasks is still a hard, unsolved problem.
    • Many machine learning approaches require normalised data (e.g. a defined start point, a standard pitch, a standard speed), and they don't work well once you move outside these parameters. There are techniques such as convolutional neural networks to tackle these problems, but they all add complexity and require a lot of expert fine-tuning (see the normalisation sketch after this list).
    • Speech data can be quite large, which makes the engineering problems and computational requirements more challenging.
    • Speech data usually needs to be interpreted in context for full understanding - the human brain is remarkably good at "filling in the blanks" based on understood context. Missing information and alternative interpretations are resolved with the help of other modalities (like vision). Current algorithms don't "understand" context, so they can't use this to help interpret the speech data. This is particularly problematic because many sounds/words are ambiguous unless taken in context (see the toy language-model example after this list).

    Overall, speech recognition is a complex task. Not unsolvably hard, but hard enough that you shouldn't expect any sudden miracles, and it will certainly keep many researchers busy for many more years.
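To make the normalisation point concrete, here is a toy sketch of per-utterance mean/variance normalisation of acoustic features. The random stand-in data and array shapes are assumptions for illustration; a real front end would produce MFCC or filterbank features here, and harder variations such as pitch and speaking rate need much more than this.

```python
# A toy illustration of the normalisation point: per-utterance mean and variance
# normalisation (CMVN) of an acoustic feature matrix. The random stand-in data and
# shapes are assumptions; real systems normalise MFCC or filterbank features.
import numpy as np

def cmvn(features):
    """Normalise each feature dimension to zero mean and unit variance.

    `features` has shape (num_frames, num_coeffs). Without a step like this, a model
    trained on one microphone/channel/level sees test inputs in a completely
    different numeric range.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8   # avoid division by zero for flat dimensions
    return (features - mean) / std

# Two "recordings" of the same utterance at very different gains end up on the
# same scale after normalisation.
quiet = np.random.randn(200, 13) * 0.1
loud = quiet * 25.0
print(np.allclose(cmvn(quiet), cmvn(loud)))   # True
```

Note that this only handles the easy part (level and channel differences); normalising away pitch and speaking-rate variation is much harder, which is where the extra complexity and fine-tuning mentioned above come in.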
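And to illustrate the context/ambiguity point, here is a deliberately tiny example of how recognisers rescore acoustically similar hypotheses with a language model. The corpus, candidate sentences and smoothing constant are all made up for the sake of the example.

```python
# A deliberately tiny sketch of the "context" point: an acoustic model alone cannot
# separate near-homophones, so recognisers rescore candidate transcriptions with a
# language model. The corpus, candidates and smoothing constant are all made up.
from collections import defaultdict

corpus = "it is hard to recognize speech in a noisy room".split()

# Count bigrams and unigram histories from the (hypothetical) text corpus.
bigrams, unigrams = defaultdict(int), defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[(w1, w2)] += 1
    unigrams[w1] += 1

def score(sentence, alpha=0.5):
    """Crude add-alpha smoothed bigram score: higher means more plausible in context."""
    words = sentence.split()
    vocab = len(set(corpus)) + 1
    s = 1.0
    for w1, w2 in zip(words, words[1:]):
        s *= (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)
    return s

# Two acoustically similar hypotheses; only linguistic context separates them.
print(score("recognize speech"))       # clearly higher
print(score("wreck a nice beach"))     # clearly lower
```

Even with this toy corpus the grammatical hypothesis wins, but this is statistical pattern-matching over text rather than genuine understanding of what was said, which is the limitation the last bullet describes.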