A recent article from GeekWire caught my attention. It seems that Microsoft, a pioneer in speech recognition, has reached a record-low word error rate. In one year, this rate has fallen from 5.9% to 5.1%. It seems impressive. IBM has announced an improvement to its speech recognition engine, too, down from 6.9% to 5.5%. Alexa from Amazon is also improving. Siri from Apple is better than ever. The same goes for Google. Competition is healthy because it drives innovation and paves the way to breakthroughs. Yet, today, everyone is using the same magic. Could it be the wrong magic?
The magic under the hood
Today, most, if not all, speech engines use what is called a neural network. Basically, the machine tries to imitate the human brain. And the misconception about neural networks is the following: there are about 100 billion neurons in the human brain, each with 100 to 10,000 connections. Those connections are extremely important to human intelligence. So, by the numbers, there are between 10 trillion and 1 quadrillion connections.
A big number, but after all, just a number. All we would need is 1 quadrillion processors, or something equivalent, and the system would be as smart as a human. Well, something has been omitted here. Yes, there are that many connections in the human brain, but the part that is considered intelligent has far less 'smart material'. If the human brain is a ball 16 centimeters in diameter, the 'intelligent' part of it is an outer layer less than 3 mm thick. It is the cerebral cortex. The rest of the brain is the animal part. Somehow, the intelligent layer of the brain has a quality that makes us smarter.
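For readers who like to check the numbers, here is a minimal sketch of that multiplication. It only reuses the figures quoted above (100 billion neurons, 100 to 10,000 connections each); the function and variable names are mine, chosen for illustration.

```python
# Back-of-the-envelope check of the connection count quoted in the text.
# All three figures are the article's own round numbers, not measurements.
NEURONS = 100_000_000_000          # 100 billion neurons
MIN_CONNECTIONS_PER_NEURON = 100
MAX_CONNECTIONS_PER_NEURON = 10_000


def connection_range(neurons, low_per_neuron, high_per_neuron):
    """Return the (lowest, highest) total number of connections."""
    return neurons * low_per_neuron, neurons * high_per_neuron


low_total, high_total = connection_range(
    NEURONS, MIN_CONNECTIONS_PER_NEURON, MAX_CONNECTIONS_PER_NEURON
)

# 10**12 is a trillion and 10**15 is a quadrillion (short scale).
print(f"Lowest estimate:  {low_total:,} connections")
print(f"Highest estimate: {high_total:,} connections")
```

The lowest estimate comes out at 10 trillion and the highest at 1 quadrillion, which matches the range given above.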
The real challenge
IBM claims that a human listener misses one word out of 20. While I don't agree with the claim, one thing is certain. People speak differently:
- different speeds;
- different volumes;
- different vocabularies;
- different pronunciations
and so on. All these differences add up to the challenge of understanding speech. By some counts, the English language has about 1 million words. 5% of a million is 50,000 words, roughly as many as the vocabulary of an ordinary speaker. Imagine 20 people in the countryside. Only one of them knows how to get to the king's castle; the other 19 give directions that mislead. With the current state of the art, no speech recognition system can guarantee to bring you to the king. And if such a system were to be part of a self-driving car, well, I don't even want to imagine.
The true breakthrough
A good speech system should be much closer to Six Sigma, and the reason is that it should be able to infer which word it missed, make a correct guess and ask clarifying questions. For those who are not aware, Six Sigma means about 3.4 defects per million opportunities.
Don't misunderstand me. 5% is a great improvement. I remember when, 20 years ago, I used Microsoft's experimental speech recognition system, and each time I said 'iexplore' it understood 'Netscape'. Yes, such was the case. Things have changed since then, but 5% is not good enough for me. Not if I want to put the system in a place where people's lives depend on it.
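To get a feel for how far apart these two targets are, here is a rough sketch that puts the 5.1% error rate quoted at the start of this post on the same per-million scale as Six Sigma. Treating one misrecognized word as one "defect" is my own simplification, and the names below are only illustrative.

```python
# Compare a word error rate with the Six Sigma target on one scale.
# Assumption: one misrecognized word counts as one "defect".
SIX_SIGMA_DEFECTS_PER_MILLION = 3.4
WORD_ERROR_RATE_PERCENT = 5.1      # the Microsoft figure quoted earlier


def errors_per_million(error_rate_percent):
    """Convert a percentage error rate to errors per million words."""
    return error_rate_percent / 100 * 1_000_000


current = errors_per_million(WORD_ERROR_RATE_PERCENT)
gap = current / SIX_SIGMA_DEFECTS_PER_MILLION

print(f"Errors per million words today: {current:,.0f}")
print(f"Six Sigma target per million:   {SIX_SIGMA_DEFECTS_PER_MILLION}")
print(f"Improvement factor still needed: about {gap:,.0f}x")
```

Under that simplification, today's best engines would have to get better by a factor of roughly fifteen thousand.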
The potential of IoT
While I am still a bit skeptical, there is a huge potential for speech recognition. By embedding Alexa or Siri into a small device like a temperature controller or a water tap controller, we could interact with our environment in a more human way. So there is hope. A new hope.
So keep working, Microsoft, IBM, Amazon, Google and all the other teams. The road is not a pleasant walk, but at the end of it, there is such a big reward …