I’ve spent a lot of time over the last year or so with Google’s AIY Projects Voice Kit. That included investigating how well TensorFlow ran locally on the Raspberry Pi, where I tried to use models built on the initial data release of Google’s Open Speech Recording project to customise the offline “wake word” for my voice-controlled Magic Mirror.
Back at the start of last year this was a hard thing to do; it was really pushing the Raspberry Pi to its limits. However, as machine learning software such as TensorFlow Lite has matured, we’ve seen models being run successfully on much more minimal hardware.
With the privacy concerns raised by cloud-connected voice devices, as well as the sometimes inconvenient need for a network connection, it’s inevitable that we’ll start to see more offline devices.
We’ve seen a number of “wake word” engines: a piece of code and a trained network that monitors for the special word, like “Alexa” or “OK Google”, that activates your voice assistant. But these, like pretty much all modern voice recognition engines, need training data, and the availability of that sort of data has really held smaller players back.
Realistically, most people won’t be able to gather enough audio samples to train a network for a custom wake word. The success of machine learning has relied heavily on the corpus of training data that companies like Google have managed to build up. For the most part these training datasets are the secret sauce, closely held by the companies, and people, that have them. Although there are a number of open sourced collections of visual data for training object recognition algorithms, there are far fewer speech datasets available. One of the few is the Open Speech Recording project from Google, and while they’ve made an initial dataset release, it’s still fairly limited.
In practice it’s never going to be feasible for most people to build the required large datasets, and while people are investigating transfer learning it’s generally regarded as not being quite ready.
However, the leading indicator for this sort of rollout is going to be the appearance of toolkits like Picovoice, which sits on top of a new wake word engine called Porcupine and, interestingly, a “speech-to-intent” engine called Rhino.
“A significant number of use-cases when building voice-enabled products revolves around understanding spoken commands within a specific domain. Smart home, appliances, infotainment systems, command and control for mobile applications, etc are a few examples. The current solutions use a domain-specific natural language understanding (NLU) engine on top of a generic speech recognition system. This approach is computationally expensive and if not delegated to cloud services requires significant CPU and memory for an on-device implementation.
Rhino solves this problem by providing a tightly-coupled speech recognition and NLU engine that are jointly optimised for a specific domain (use case). Rhino is quite lean and can even run on small embedded processors (think Arm Cortex-M or fixed-point DSPs) with very limited RAM (as low as 100 KB) making it ideal for resource-constrained IoT applications.”
Picovoice claims to wrap both of these wake word and speech-to-intent engines, along with a speech-to-text service, and has some interesting results for wake word performance compared to Snowboy and Pocketsphinx. Running on the NXP i.MX RT1050, an Arm Cortex-M7 with 512 KB of RAM, the toolkit looks rather intriguing.
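To make the division of labour between the two kinds of engine a bit more concrete, here is a minimal sketch of how a wake word detector can gate a speech-to-intent engine. This is not the Picovoice API: `run_pipeline`, `Intent`, `fake_wake_word` and `fake_intent` are all hypothetical names I've made up for illustration, and the "audio" is just a list of toy frames rather than real microphone data.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Optional, Sequence

Frame = Sequence[int]  # stand-in for one block of PCM audio samples


@dataclass
class Intent:
    """Hypothetical result type: an intent name plus any extracted slots."""
    name: str
    slots: dict = field(default_factory=dict)


def run_pipeline(
    frames: Iterable[Frame],
    wake_word_detected: Callable[[Frame], bool],
    infer_intent: Callable[[Frame], Optional[Intent]],
) -> Iterator[Intent]:
    """Yield an Intent each time a wake word is followed by a recognised command.

    The cheap wake word check runs on every frame; the intent engine only
    sees audio after the wake word has fired, then the loop goes idle again.
    """
    listening = False
    for frame in frames:
        if not listening:
            # Idle: only the wake word detector looks at the audio.
            listening = wake_word_detected(frame)
            continue
        # Awake: hand frames to the intent engine until it returns a result.
        result = infer_intent(frame)
        if result is not None:
            yield result
            listening = False


# Toy stand-ins for the real engines (assumptions, for the demo only):
# a frame starting with 1 is "the wake word", one starting with 2 is "a command".
def fake_wake_word(frame: Frame) -> bool:
    return frame[0] == 1


def fake_intent(frame: Frame) -> Optional[Intent]:
    if frame[0] == 2:
        return Intent("turn_on_light", {"room": "kitchen"})
    return None


demo_frames = [[0], [2], [1], [0], [2], [2]]
intents = list(run_pipeline(demo_frames, fake_wake_word, fake_intent))
print([intent.name for intent in intents])
```

Note that the first and last "command" frames in the demo are ignored, because no wake word came immediately before them. That gating is the point of the design: the always-on part stays small, and the heavier engine only runs when asked.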
There doesn’t seem to be any pricing yet, and the two engines are only “partially” open sourced, with the bits that are available published under the Apache 2.0 license. Source, demo applications, and other information are available on the Picovoice site, but I’d expect it to be joined by other frameworks real soon. We’re sort of at that point in the cycle.