A Magic Mirror with Added TensorFlow
This is the second post in a series of three posts building a simple voice controlled Magic Mirror. The first post in the series showed how to put the mirror together, while this post looks at how to use Machine Learning locally on the device to do custom hotword recognition without using the cloud. The final post in the series looks at integrating the Voice and Vision Kits together to build a mirror that recognises when you’re looking at it.
At the start of the month the new Google AIY Projects Voice Kits finally hit the shelves at Micro Center. However I’d been lucky enough to have my hands on the hardware from just ahead of the pre-order availability back in August and, along with a couple of other projects, I built a voice controlled magic mirror using the Voice Kit and a Raspberry Pi.
Towards the end of the build we ran into some problems using the Cloud Speech API. The Cloud Speech API is not free, and since we were using it to detect our custom hotword we were running it all the time. That meant that our mirror was costing around $25 a day to run. Even ignoring the cost, streaming everything in the room to the cloud all the time and searching through it for a hotword or phrase is probably going to be unacceptable to most people.
So we need a less expensive, more reasonable, way to add a custom hotword to our mirror. It turns out we can actually use TensorFlow on our device — on the Raspberry Pi — to look for and recognise a custom hotword entirely locally without talking to the cloud at all.
What is TensorFlow?
Google’s TensorFlow is an open source software library for numerical computation using data-flow graphs. If you’re not familiar with the platform, Amy Unruh has written a really good introductory article to the popular machine learning platform. So go read that. Don’t worry, I’ll be here waiting when you get back.
Installing TensorFlow on the Raspberry Pi
It’s never been particularly easy to get TensorFlow installed onto the Raspberry Pi, but a couple of months back Pete Warden finally managed to successfully get it to cross-compile on an x86 Linux box. From there he set up nightly builds running as part of Google’s TensorFlow Jenkins project. This simplified installation, a lot.
Unfortunately Pete’s original build instructions, and nightly builds, were compiled for Python 2, while the AIY Projects Voice Kit uses Python 3.4. Fortunately, as well as now having a Python 3.4 nightly build going, he’s provided an alternative wheel to allow us to install it alongside the Voice Kit SDK.
Go ahead and download the tensorflow-1.4.0-cp34-none-any.whl wheel file and drop it onto your Raspberry Pi in the home directory. Then install TensorFlow as follows,
$ cd ~/AIY-voice-kit-python
$ source env/bin/activate
$ sudo apt-get install libblas-dev liblapack-dev python3-dev libatlas-base-dev gfortran python3-setuptools
$ sudo pip install ~/tensorflow-1.4.0-cp34-none-any.whl
This can take quite a while to complete. However once it’s done, TensorFlow should be installed and accessible using the Python 3 installation used by the AIY Projects environment.
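As a quick sanity check before going any further, here is a minimal sketch that asks the active Python interpreter whether it can see a `tensorflow` module, without actually importing it (the import itself can be slow on a Pi). The helper name `module_available` is my own and not part of the Voice Kit SDK.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the active interpreter can locate the module `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    if module_available("tensorflow"):
        print("tensorflow is visible to this interpreter")
    else:
        print("tensorflow not found; is the AIY virtualenv active?")
```

Run it from the same terminal session in which you activated the AIY environment, so that it checks the interpreter the Voice Kit code will actually use.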
Testing our Installation
To test out our installation we need some TensorFlow voice recognition models. Fortunately Google provides some pre-trained models as part of the initial release of their Speech Commands Dataset. The words covered by the model data are: yes, no, up, down, left, right, on, off, stop, and go.
Go ahead and download the models and unzip them into the home directory on your Pi. Inside the zip file you should find both a graph file and a labels file, named conv_actions_frozen.pb and conv_actions_labels.txt.
Then copy the example script into
and run the script passing the graph and label files on the command line.
$ ./listen.py --graph ~/conv_actions_frozen.pb --labels ~/conv_actions_labels.txt
After the script starts, go ahead and say a few of the words supported by the model: yes, no, up, down, left, right, on, off, stop, and go.
If all goes well you should see the words being recognised by the TensorFlow models. Just to prove to yourself that everything is being done locally, you can unplug the Raspberry Pi from the Ethernet—if you have it plugged into a wired network—and shut down the wireless networking, and run the script again. It’ll continue to work.
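To make the recognition step a little more concrete, here is a rough sketch of the post-processing a script like this has to do once the network has produced one score per label: read the labels file, then report the best-scoring word if it clears a threshold. This is not the example script itself. The names `load_labels` and `best_label`, the 0.5 threshold, and the sample scores are all illustrative assumptions.

```python
def load_labels(path):
    """Read one label per line from a text file, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def best_label(scores, labels, threshold=0.5):
    """Return (label, score) for the top score, or None if it is too low."""
    if len(scores) != len(labels):
        raise ValueError("expected exactly one score per label")
    index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[index] < threshold:
        return None
    return labels[index], scores[index]


if __name__ == "__main__":
    labels = ["yes", "no", "go"]   # made-up subset, for illustration only
    scores = [0.05, 0.10, 0.85]    # made-up scores, not real model output
    print(best_label(scores, labels))
```

The threshold is what stops the script from reporting a word every time the room is merely noisy; raising it trades missed words for fewer false triggers.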
Integrating TensorFlow into AIY Projects
Now we have local voice recognition working on our Raspberry Pi, albeit with a very limited vocabulary, we need to integrate it into our existing AIY Projects code by picking one of the ten supported words in our models, let’s choose “Go” for now, as our custom hotword.
We can reuse a good deal of the existing code base, adding the TensorFlow code as an additional processor to the Voice HAT recorder code.
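As a rough illustration of what "an additional processor" could look like, the sketch below assumes the recorder hands each chunk of raw audio to every registered processor through an `add_data` method. The class name `HotwordProcessor`, the `classify` callable (standing in for the TensorFlow model), and the window size are hypothetical, so treat this as the shape of the idea rather than the actual code.

```python
class HotwordProcessor:
    """Buffers raw audio and fires a callback when the hotword is heard.

    Assumption: the recorder calls add_data(data) with bytes of audio.
    """

    def __init__(self, classify, hotword, on_hotword, window_bytes=32000):
        self._classify = classify          # stand-in for the TensorFlow model
        self._hotword = hotword
        self._on_hotword = on_hotword
        self._window_bytes = window_bytes  # 1 s of 16 kHz, 16-bit mono audio
        self._buffer = bytearray()

    def add_data(self, data):
        """Append new audio and classify once a full window has arrived."""
        self._buffer.extend(data)
        if len(self._buffer) < self._window_bytes:
            return
        window = bytes(self._buffer[-self._window_bytes:])
        self._buffer.clear()
        if self._classify(window) == self._hotword:
            self._on_hotword()


if __name__ == "__main__":
    heard = []
    fake_model = lambda audio: "go"        # always "hears" the hotword
    processor = HotwordProcessor(fake_model, "go", lambda: heard.append(True))
    processor.add_data(b"\x00" * 32000)
    print("hotword fired:", bool(heard))
```

Keeping the classifier behind a plain callable means the buffering logic can be exercised with a fake model on a laptop before the real network is wired in on the Pi.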
A Working Mirror
Go ahead and start the Mirror Software from the desktop and then SSH into your mirror from your laptop and then type the following,
$ cd ~/AIY-voice-kit-python
$ source env/bin/activate
This will configure our SSH session in the same fashion as the dev terminal that we normally open by clicking on the desktop icon. Then go ahead and copy the new script into ~/AIY-voice-kit-python/src/ and run it.
If all goes well saying “Go… weather” should trigger the mirror to display the weather in much the same fashion as before.
Unfortunately we’re really operating at the top of the performance envelope of the Raspberry Pi. Running the mirror software, a TensorFlow network, and connecting to the cloud with Google’s Cloud Speech API will flatline the Raspberry Pi. Things are going to be somewhat sluggish. Recognition of the initial hotword will also be somewhat hit or miss, because the local TensorFlow model was trained on a much more limited dataset than the models running in the cloud.
Training Our Own Models
It would be nice to use local speech recognition with TensorFlow to recognise a more appropriate custom hotword, like “mirror” for instance. Realistically however we won’t be able to gather enough audio samples to train our network for our own custom hotword. While in the future it might be possible to use transfer learning to achieve this, right now that isn’t really feasible.
However, if you’re interested in how you could go about training the network, there’s an excellent walkthrough of how to do this using the Speech Commands dataset. That dataset consists of 65,000 WAV audio files of people saying thirty different words, which unfortunately illustrates the size of the training data set we’d need to accumulate to train the network for our own hotword.
Open Speech Recording
The real secret behind the recent successes of machine learning isn’t the algorithms. My first job back in the very early 90’s was writing neural networks for data compression and image analysis, and this stuff has been lurking in the background for decades waiting for computing to catch up. Instead, the success of machine learning has relied heavily on the corpus of training data that companies, like Google, have managed to build up. Data that is, in many cases, based on crowdsourced data uploaded by us to the Internet.
For the most part these training datasets are the secret sauce, and closely held by the companies, and people, that have them. While there are a number of open sourced collections of visual data to train object recognition algorithms, there is far less speech data available.
The Open Speech Recording project from Google is trying to change that.
“There are very few open source collections of speech data where individual words are spoken by a large number of different people. This makes it tough to build good open source examples of detecting spoken keywords such as ‘Yes’, ‘No’, ‘On’, or ‘Off’. To fill this gap, the AIY team is hoping to gather single spoken words from several hundred people and then release the resulting data set under an open license, together with an example of how to use it to create simple speech command classifiers.”
The goal is to capture crowd-sourced speech clips to be used as training data for TensorFlow voice recognition models. Google intends to capture the speech then train the models, open sourcing both the data and the eventual models. We made use of the initial release of the Speech Commands Dataset to drive our mirror, but the Open Speech Recording project is aiming to push beyond the ten words in the initial release. If you have 15 minutes you should go and contribute your voice to the project.
Where now with our mirror?
Probably our most successful mirror implementation was our previous version using just the Cloud Speech API. Our TensorFlow driven mirror is technically interesting, but the lack of a real custom hotword makes it a bit impractical to deploy into the wild.
What we really need is something hands free, other than a hotword, to replace the button that would normally trigger our voice recognition code, and I think looking into computer vision to do facial detection is probably the way forward. Having the mirror ‘wake up’ and start streaming audio to the cloud when someone faces it will probably actually be far more magical than using a hotword, and combining voice and vision machine learning could be a fun project.
Now on Shelves
The new Voice Kits are being produced by Google, and arrived on shelves at Micro Center at the start of the month. The AIY Voice Kit is priced at $25 on its own, but you can pick one up for free if you order a Raspberry Pi 3 at $35 for in-store pickup from Micro Center.
If you like this build I’ve also written other posts on building a retro-rotary phone Voice Assistant with the Raspberry Pi and the AIY Projects Voice Kit, and a face-tracking cyborg dinosaur called “Do-you-think-he-saurs” with the Raspberry Pi and the AIY Projects Vision Kit.
This post was sponsored by Google.