What is holding us back from seamlessly communicating with machines using our voice?

The main hurdle on our way towards a seamless voice interaction experience

Ohad Shemen-Ariely
VP Business Development

A lot has been written about the future of human-machine communication. In his book "The Singularity Is Near", Ray Kurzweil claims that the next logical step is to give humans the ability to communicate with machines directly from our minds: to go from hand control, to voice control, to mind control.

But before we reach the desired phase of mind control, spooky as it sounds, there are some hurdles we first need to address in the previous phase: voice control.

As with most technologies, innovation sprouts from demand, and demand grows when adoption of a technology is relatively high. Take the automotive industry, which is constantly bursting with innovative technologies: would you assume Tesla would exist today if humans were still riding horses to get from one point to another? Presumably not. The same applies to voice control, which suffers from a bad reputation that results in relatively low adoption rates.

There’s a “glass ceiling” over voice technologies

So what is it that prevents us from using voice control more commonly? Why don't we use our voice assistants on a regular basis, instead of doing so only in limited scenarios and pretty much for the same specific purposes (like asking our infotainment system for the time or to call someone)? What puts a "glass ceiling" above voice control technologies, slowing down adoption rates and preventing us from reaching a broader range of communication by voice?

The answer to that question is simple: trust. Although trust has several facets, I will focus on what I believe is the most important one: voice interfaces simply don't hear us well. The overall experience of speaking to a voice user interface is somewhat frustrating in typical acoustical conditions. We haven't reached the point where we can trust voice user interfaces to "do their job" properly and act as we expect them to.

Here's a quick question: would you ask Alexa or Siri to call someone while you're driving your car with the windows open or while the radio is on? Chances are that before you try, you will intuitively work to create a quiet environment and only then address your voice assistant. That is friction, and friction affects trust, and trust affects demand.

In order to achieve extensive adoption of voice command solutions, humans need to be able to comfortably communicate with voice user interfaces in all sorts of environments and be confident that the machine will do exactly as it is told. This is no less than a prerequisite.

Can you be more specific?

One of the first and most important components in the voice recognition process is the "audio frontend", which is responsible for delivering a high-quality signal to the automatic speech recognition (ASR) engine. As of today, the ASR's ability to properly convert the speaker's voice signal into text is highly influenced by the acoustical conditions in the space in which the voice was captured: the lower the ambient noise and the interfering speech signals, the better the conversion outcome.
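
To make that division of labor concrete, here is a minimal Python sketch of the capture-to-text pipeline. The frontend placeholder, the `snr_db` helper, and the `asr_engine` object are illustrative assumptions, not any particular product's API:

```python
# A minimal sketch of the microphones -> audio frontend -> ASR pipeline.
# `asr_engine` is a hypothetical transcriber object, not a real library API.
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB: higher means an easier job for the ASR."""
    return 10.0 * np.log10(np.sum(speech ** 2) / np.sum(noise ** 2))

def frontend(mic_signals: np.ndarray) -> np.ndarray:
    """Audio frontend: turn raw multi-microphone input (n_mics, n_samples)
    into one clean speech signal. Placeholder only: a real frontend would
    apply beamforming or source separation here."""
    return mic_signals.mean(axis=0)

def recognize(mic_signals: np.ndarray, asr_engine) -> str:
    # The ASR only ever sees what the frontend delivers, so the frontend's
    # output quality caps the accuracy of everything downstream.
    return asr_engine.transcribe(frontend(mic_signals))
```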

Most companies today are trying to tackle this hurdle using the same old audio frontend beamforming infrastructure, which falls short of providing a high-quality speech signal to the ASR in typical conditions, where the desired speech is corrupted by environmental noise and competing speakers.
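
For reference, this is roughly what that classic approach looks like: a minimal delay-and-sum beamformer sketch, assuming a uniform linear microphone array and a known look direction (both simplifications of real deployments):

```python
# A minimal delay-and-sum beamformer: time-align the microphones toward one
# look direction and average, so speech from that direction adds coherently
# while off-axis noise does not. A uniform linear array with spacing
# `mic_spacing` (meters) and a far-field source at `angle_deg` are assumed.
import numpy as np

def delay_and_sum(mics: np.ndarray, fs: int, mic_spacing: float,
                  angle_deg: float, c: float = 343.0) -> np.ndarray:
    """mics: (n_mics, n_samples) array of simultaneously sampled signals."""
    n_mics, n_samples = mics.shape
    # Relative arrival delay, in samples, of the look direction at each mic.
    delays = np.arange(n_mics) * mic_spacing * np.sin(np.deg2rad(angle_deg)) / c * fs
    freqs = np.fft.rfftfreq(n_samples)  # normalized frequency (cycles/sample)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spectrum = np.fft.rfft(mics[m])
        # Undo each mic's delay via a frequency-domain phase shift, so the
        # look-direction signal lines up across microphones before summing.
        spectrum *= np.exp(2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spectrum, n=n_samples)
    return out / n_mics
```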

A Humanlike example…

Humans are able to conduct conversations in noisy coffee shops thanks to our ability to cluster a complex acoustical scene, comprised of background noises and several competing speakers, into several simplified ones, each comprised of a single speech source. Upon clustering, our brain can focus on one of the scenes and ignore the others. The audio frontend technology should perform in a similar manner: it should cluster the acquired mixture of speech signals into individual speech components and provide the ASR with the ability to focus on each of them separately.

Such clustering functionality can be achieved by applying more sophisticated source separation algorithms, compared to a simplified beamformer, which falls short of accurately modeling the acoustic scene and is thus unable to perform up to expectations.
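
As a rough illustration of that separate-then-transcribe idea, here is a sketch using FastICA as a stand-in separation algorithm. Note the hedges: ICA assumes an instantaneous mixture, whereas real rooms add reverberation, and the `asr` object is again a hypothetical transcriber:

```python
# Separate the multi-microphone mixture into individual speech streams, then
# let the ASR transcribe each stream on its own, mirroring how our brain
# attends to a single speaker in a noisy room. FastICA is a stand-in for a
# production-grade separation algorithm; `asr` is a hypothetical transcriber.
import numpy as np
from sklearn.decomposition import FastICA

def separate_and_transcribe(mics: np.ndarray, asr) -> list[str]:
    """mics: (n_mics, n_samples). Returns one transcript per separated source."""
    ica = FastICA(n_components=mics.shape[0], random_state=0)
    sources = ica.fit_transform(mics.T).T  # -> (n_sources, n_samples)
    return [asr.transcribe(src) for src in sources]
```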

[Diagram: the voice recognition process]


No wonder we are lingering in our search for a "breakthrough" that will lead humanity into a seamless voice command era (c'mon guys, don't you want to order pizza using your imagination only…?). In order to win this challenge, a profound change needs to be made, as we cannot expect to overcome it using decades-old technology.

Don't get me wrong, it's not that the other components of the voice recognition process, like STT and NLP, are not important for a successful and seamless humanlike experience. Rather, try to look at the audio frontend as the most fundamental foundation upon which everything else relies. In simple words, if the foundation is weak, everything else will collapse.

To sum things up

Facilitating a seamless voice control experience, regardless of the acoustical conditions surrounding the speaker, should be the main focus of voice technology companies today.

Once achieved, humanity will experience an exponential surge in voice control applications and will benefit from the amazing value propositions they have to offer, and perhaps lead the way toward the next challenge, if you know what I mean…