The Crucial Role of the Audio Front End in AR Glasses
Voice control is a critical feature of augmented reality (AR) glasses, enabling users to interact with the digital world hands-free. However, the capabilities of the audio technology used are critical to their usability and, thus, widespread adoption.
Examples of voice-controlled AR glasses include Magic Leap 2, built for emergency response training and as a real-time mixed reality collaboration platform for enterprises, and Vuzix for the medical, manufacturing, and warehousing sectors. While Apple’s AR glasses won’t be available for several years, it has released its mixed reality headset Vision Pro, which uses eyes, hands, and voice for control.
While these companies have already integrated speech recognition into their AR glasses, the capabilities of the audio technology used are critical to their usability and, thus, widespread adoption.
The audio front end in smart AR glasses captures and processes the user's voice. It filters out background noise and transmits the signal to the voice ID or communication modules. With accurate voice control, users can operate the glasses and make phone calls and video recordings, all hands-free.
Beamforming—the Limiting Factor When Using Voice in AR Glasses
Until now, AR glasses and other voice user interfaces have relied on beamforming-based technology to reduce environmental noise and isolate the speaker’s voice. A beamformer separates signals based on the direction from which they arrive at the microphone array. Beamforming solutions are available from many sources, including Qualcomm, NXP, MediaTek, and DSP.
However, beamforming suffers from a few inherent limitations. First, performance degrades as the microphones are placed closer together, that is, as the array aperture shrinks. In AR glasses, the aperture is bounded by the frames' width or length. As a rule of thumb, beamforming can provide ~N^2 dB of noise reduction for N microphones in the array without adding distortions.
Another limitation of beamforming is its inability to handle echoes effectively or situations where the noise and desired speech come from the same direction. Further, some solutions, such as Fluence by Qualcomm, are limited by the number of microphones they can support; in its case, up to three.
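The aperture and same-direction limitations can be illustrated with a minimal sketch of a narrowband delay-and-sum beamformer, the simplest textbook beamformer (not any vendor's implementation). It assumes a far-field source and a uniform linear array of ideal microphones; the function names, the 1 kHz test tone, and the 14 cm and 2 cm spacings are hypothetical values chosen for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_vector(num_mics, spacing_m, freq_hz, angle_deg):
    """Phase of a far-field plane wave at each mic of a uniform linear array.

    angle_deg is measured from broadside (0 = straight ahead of the array).
    """
    sin_angle = math.sin(math.radians(angle_deg))
    return [
        cmath.exp(-2j * math.pi * freq_hz * (m * spacing_m * sin_angle / SPEED_OF_SOUND))
        for m in range(num_mics)
    ]


def das_response_db(num_mics, spacing_m, freq_hz, look_deg, source_deg):
    """Gain in dB of a delay-and-sum beamformer steered at look_deg
    for a source arriving from source_deg."""
    look = steering_vector(num_mics, spacing_m, freq_hz, look_deg)
    source = steering_vector(num_mics, spacing_m, freq_hz, source_deg)
    # Align the channels to the look direction, then average them.
    response = sum(w.conjugate() * a for w, a in zip(look, source)) / num_mics
    return 20 * math.log10(max(abs(response), 1e-12))


if __name__ == "__main__":
    for spacing in (0.14, 0.02):  # assumed wide vs. narrow microphone spacing, in meters
        interferer = das_response_db(2, spacing, 1000.0, look_deg=0.0, source_deg=90.0)
        same_dir = das_response_db(2, spacing, 1000.0, look_deg=0.0, source_deg=0.0)
        print(f"spacing {spacing} m: interferer {interferer:.2f} dB, same direction {same_dir:.2f} dB")
```

Under these assumptions, anything arriving from the look direction passes at 0 dB, so noise that shares a direction with the speech is not attenuated at all, and the 2 cm pair suppresses the side interferer far less than the 14 cm pair does.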
Kardome Spatial Hearing Software for AR Glasses
With these challenges in mind, Kardome developed a unique spotforming technology using a 3D neural network-based model that leverages reverberation to separate sound sources (speech) at different locations.
Kardome’s Spatial Hearing software is a holistic voice stack based on our patented spotformer. It equips AR glasses and other devices with superior noise reduction compared to beamforming-based solutions, provides source separation and audio zooming capabilities, improves voice recognition accuracy, facilitates wake-up word capabilities, and enables highly accurate biometric identification, all done directly on the glasses’ processor, with no connectivity required. These features unlock AR glasses’ potential for enhanced voice-user experience and functionality.
Voice AI: Breaking Away From Beamforming
Kardome’s AI-based approach improves speech recognition performance in always-changing, noisy, and reverberant environments. Kardome’s Voice AI does this by constantly analyzing and adapting to the acoustic profile of all environmental noise sources, an approach referred to as “spotforming.”
One can think of spotforming as creating a virtual bubble around the desired sound source or sources. Kardome’s Spatial Hearing software captures sounds from direct and multiple paths, which lets it focus the audio on the preferred source's location in space.
Consequently, the output signal-to-noise ratio (SNR) increases significantly. Kardome dramatically improves performance and attenuates interfering signals by up to ~35 dB without adding noticeable distortions.
Kardome’s AI-driven spotforming technology also substantially improves speech recognition performance for SNRs below 10 dB. It is worth mentioning that applying Kardome in a noisy environment can be the difference between a non-functioning automatic speech recognition (ASR) engine and a seamless user experience, even in a challenging scenario of SNR ≈ -15 dB.
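To put these decibel figures in perspective, here is a small worked conversion. It assumes an idealized case in which the interference is attenuated by the full amount while the speech passes unchanged; the function names are illustrative, and real-world results depend on the acoustic scene.

```python
def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in decibels to a linear power ratio."""
    return 10 ** (db / 10)


def idealized_output_snr_db(input_snr_db: float, attenuation_db: float) -> float:
    """Output SNR if interference drops by attenuation_db and the speech is untouched.

    This is a best-case assumption, not a measured result.
    """
    return input_snr_db + attenuation_db


if __name__ == "__main__":
    # 35 dB of attenuation is a power reduction by a factor of roughly 3,000.
    print(f"35 dB -> power ratio {db_to_power_ratio(35.0):.0f}")
    # An input SNR of -15 dB would become +20 dB in this idealized case.
    print(f"idealized output SNR: {idealized_output_snr_db(-15.0, 35.0):.1f} dB")
```

An SNR of -15 dB means the noise power is about 32 times the speech power, which is why such scenes are hard for a recognizer to handle unaided.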
Three Benefits of Kardome’s Voice AI for AR Glasses
Voice Communications
AR glasses must simultaneously support several voice use cases: hands-free telephony for making calls, speaking to a speech recognition engine to interact with the AR glasses' interface, and recording videos while excluding extraneous voices and noise.
As a rule of thumb, the human ear prefers better noise reduction even at the price of more significant speech distortion. In contrast, ASRs typically prefer distortionless speech even if some background noise remains.
Optimizing for each requires different system settings in the audio front end and the ability to operate simultaneously, especially if the device is always listening.
Kardome solves the problem of unwanted noise and voices interfering with a device's user interface by attenuating interfering signals by up to 35 dB. Kardome's core technology, including speech separation, echo cancellation, and noise reduction, enables distortionless speech recognition in AR glasses in any challenging acoustic environment.
Security
Any device that uses voice technology must prevent unwanted access to its interface. There are two complementary ways to achieve this. The first is attenuating outside speech, so someone not using AR glasses cannot be a valid audio source. The second is using voice biometrics to identify an authorized user accurately.
However, in the first case, reducing outside noise is difficult to do with beamforming as external sounds may come from any direction. In the second case, voice biometrics must accurately identify a speaker and do this within a few seconds.
Kardome’s technology provides highly accurate voice biometric identification of who is speaking, even in noisy environments. A recent study shows that Kardome’s Spatial Voice Biometrics delivers 95% accuracy for utterances as short as 1 second in length in any acoustic environment.
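The second approach, voice biometrics, is commonly implemented by comparing a fixed-length "voiceprint" embedding of the incoming utterance against one stored at enrollment. The sketch below shows that generic gating step only. It is not Kardome's method: the embeddings, the function names, and the 0.7 threshold are hypothetical placeholders, and producing the embeddings would require a separate speaker model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_authorized(enrolled_embedding, utterance_embedding, threshold=0.7):
    """Accept the speaker only if the utterance is close enough to the enrolled voiceprint.

    The threshold is an assumed value; in practice it is tuned to balance
    false accepts against false rejects.
    """
    return cosine_similarity(enrolled_embedding, utterance_embedding) >= threshold


if __name__ == "__main__":
    enrolled = [1.0, 0.0, 0.0]   # toy 3-dimensional embedding of the owner
    owner = [0.9, 0.1, 0.0]      # similar direction, so it should be accepted
    bystander = [0.0, 1.0, 0.0]  # orthogonal direction, so it should be rejected
    print(is_authorized(enrolled, owner), is_authorized(enrolled, bystander))
```

A gate like this is only as reliable as the audio it receives, which is why attenuating outside speech and biometric identification are complementary.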
Recording Video
Another use for AR glasses is recording and sharing a video of what the user sees for remote assistance, training, etc. When the user is focused on a specific area, for example, when trying to diagnose a problem, it is helpful to have the audio focus on where the user is looking, whether it be a machine or a person speaking. This capability is referred to as Audio Zooming and requires the audio front end to synchronize between the focal point of the glasses and the sound coming from it.
Audio Zoom works best with a clear sound source, like a single speaker. Multiple people talking can make it challenging to isolate a single voice. This scenario can result in leakage into the speech processing.
Kardome’s Audio Zoom uses patented spatial hearing technology to home in on the desired speaker’s voice, eliminating background noise and other people talking to provide clear audio to accompany the video recording.
Conclusion
Overall, the audio front end in smart AR glasses plays a critical role in ensuring the voice user has a positive and productive experience. The audio front end can help make smart AR glasses more functional, secure, and user-friendly by processing speech without errors, attenuating unimportant sounds, and focusing on the important ones.
Kardome's Spatial Hearing Technology overcomes the technical challenges facing manufacturers, providing a better audio experience in their AR glasses and benefits across the user interface, recording, security, and voice communications.
The need for a powerful audio front end will increase as AR glasses address more and more use cases. Kardome's Spatial Hearing Technology is well-positioned to meet this need.