Growing up, the free toys on the covers of magazines were made of plastic. They were cheap and cheerful. Yet the last thirty years have reduced the price of computing to the point where those cheap and cheerful plastic toys have been replaced by other things.
Around this time last year Google and Raspberry Pi did something rather intriguing. Together they packaged machine learning — the ability for your Raspberry Pi to think and reason — as a kit, and made it available free on the cover of a magazine.
Perhaps unsurprisingly, the print run of the magazine sold out in hours. To be fair, it wasn’t exactly on the cover. They had to put it, and the magazine, into a box. But I guess it’s the thought that counts?
Based around a Raspberry Pi HAT, the kit enabled you to add voice interaction to your Raspberry Pi. Later in the year, Google made the same kit available through retail channels.
Google called it AIY Projects, that would be AIY as in “do-it-yourself artificial intelligence,” and the kit came with almost all the bits and pieces you’d need to build a Google Home style voice assistant using a Raspberry Pi. There was even a cardboard case for the project build which, after Google Cardboard, has become almost synonymous with Google’s in-house prototyping efforts.
Then, towards the end of last year, they announced the second AIY Projects kit. This time it was a Vision rather than a Voice Kit. The contents looked familiar to anyone who’s played with the original Voice Kit, but this time the kit was based around a Raspberry Pi pHAT, better known as a Bonnet.
Designed to work with the lower powered Raspberry Pi Zero, instead of relying on the horsepower of the Raspberry Pi 3’s faster processor, the new kit moved a lot of the processing it needs onto the Vision Bonnet itself. The Intel Movidius chip on top of the board hints at the biggest departure from the original Voice Kit.
Unlike the Voice Kit, the Vision Kit is designed to run all the machine learning locally — on the device — rather than talk to the cloud.
Only a very limited quantity—around 2,000 units—of the Vision Kit made it onto shelves before Christmas, and then things went ominously quiet. That is, until just three weeks ago, when Google launched updated versions of both the Voice and Vision Kits.
This time the kits really did have everything, including a Raspberry Pi Zero W with pre-soldered headers, and an SD card with a pre-burned image. All you needed to add was a USB power supply.
Unlike the original Voice Kit, which was built around a full-sized HAT, both of the latest releases use the smaller Raspberry Pi Bonnet form factor.
The new Voice Bonnet has a few fewer pin outs than the original Voice HAT, so if you’re really interested in hacking around with the kits and attaching them to external hardware — rather than just building them — you might want to think about picking up one of the original Voice HAT based kits while they’re still on shelves. You can still find them if you look.
Opening up the Vision Kit, we see far fewer changes from the original pre-Christmas version of the hardware than we do with the Voice Kit. However, here too the ‘missing parts’ — in this case a Raspberry Pi Zero and the Raspberry Pi Camera Module — are now included.
Again, just add a power supply.
The Vision Bonnet is dominated by the Intel Movidius chip. This is a vision processing unit, and it’s a powerful piece of silicon that probably accounts for most of the cost of the board.
However, for those of you who might have worked with the Movidius chip before: Google seems to have totally rewritten the software stack, and it bears no resemblance to the Intel SDK that ships with the Movidius Neural Compute Stick. That said, after tinkering around for a while it does look like you can use the Intel SDK with the Google board, but it really doesn’t seem necessary.
So, the Voice Kit is designed mostly for the cloud, while the Vision Kit is designed almost exclusively for local operation.
Before we talk about how to use it, let’s talk about building it.
Especially with the two newest versions of the kits, it really is just plug-and-go; the Voice Kit requires a small watchmaker’s screwdriver, but that’s it. So if you can put Ikea flat-packed furniture together without too much death and bloodshed, you can probably put the kit together, and Google’s latest instructions on how to physically assemble the kit are actually really quite good.
No soldering required.
The Voice Kit SDK — which is written in Python — gives you programmatic access to Google’s Voice Assistant and also their Cloud Speech API.
It’s also possible to run TensorFlow models locally on the Raspberry Pi, although that’s harder. The Google Voice Assistant is the ‘out of the box’ way to use the kit, and I think how Google expects most people to make use of the hardware.
However, for those of us not already deeply involved in Google’s Cloud Platform, it does take some setting up. Start by going to the Google Cloud Platform Console.
Even if you’ve never used Google Cloud Platform before, you can use your normal Google account to sign in, so you don’t have to create a new account.
Clicking on the “Select a Project ▾” menu to the right of the Google Cloud platform logo didn’t give me a drop down menu as expected. Instead I was presented with a popup window, and hitting the + button let me create a new project.
Hitting “Create” returned me to the home screen, where clicking on the “Select a Project ▾” menu item showed the same popup, but this time with my new project.
From there, clicking on “AIY Project” took me to my project page, which showed the name and associated resources.
At this point, now that I had a working project, I needed to enable the “Google Assistant API,” which is the service that lies behind the Voice Kit.
Once the Google Assistant API was enabled, I then needed to create some credentials so that the Voice Kit could talk to it.
Clicking on the “Credentials” menu item on the left brought me back to a screen I recognised from the instructions, and from here onwards what I was seeing looked the same as Google’s documentation.
So I went ahead and selected “OAuth Client ID” as my credential type.
However since this was the first time creating a client ID I needed to configure the application’s Consent Screen.
Hitting “Configure consent screen” brings you to a page that asks for details about your application. The only thing you need to fill in here is the project’s name — the one presented to users of the application in the authorisation step — although you can optionally add other metadata like associated URLs and logos.
Saving the Consent Screen details brings us back to the credential creation screen. Select “Other” as the application type, add a reasonably memorable name, and hit “Create.”
A popup window will then appear with your credentials; don’t panic when it disappears, as this isn’t your only chance to grab them. Dismissing the popup by clicking “OK” leaves you in a credentials list with your newly generated credentials.
Click on the down arrow with the line underneath to download the credentials as a JSON file. Find the JSON file you just downloaded — it’ll be named client_secrets_XXXX.json — and rename it to assistant.json. Then move it to your home directory on the Raspberry Pi.
You’ll now need to go to your Google account’s Activity Controls panel. This is where you can configure the information that Google stores about you, and you need to enable “Web & App Activity,” “Location History,” “Device Information,” and “Voice & Audio Activity.”
Note that under “Web & App Activity” you must also tick the additional box to “Include Chrome browsing history and activity from websites and apps that use Google services.”
If you’re doing this from a different browser make sure you’re logged into the same Google account as when you were configuring the application.
Logging back into the Raspberry Pi and running the default voice recogniser code for the first time, you should see the authentication popup.
Hit “Allow,” and things should be ready to go.
Let’s take a brief look at the code behind that.
The Python code behind the Google Assistant in a cardboard box very lightly wraps the Google Voice Kit SDK, and starts the Assistant running.
It hands almost all of the events to a single _process_event() method, where the real meat of things happens. Here we wait for the Assistant to be initially ready, then go ahead and set up our button callback, and handle the various button LED states — depending on what the kit is doing at the time — by calling the status_ui() method in the SDK.
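A minimal sketch of that structure, based on the assistant library demo that shipped with the Voice Kit SDK at the time of writing, looks something like the following. The exact module names and the Assistant constructor arguments have shifted between SDK releases, so treat it as illustrative rather than definitive.

```python
import threading

import aiy.assistant.auth_helpers
import aiy.voicehat
from google.assistant.library import Assistant
from google.assistant.library.event import EventType


class MyAssistant:
    """Lightly wraps the Google Assistant library, as the kit's demo does."""

    def __init__(self):
        self._assistant = None
        self._task = threading.Thread(target=self._run_task)

    def start(self):
        self._task.start()

    def _run_task(self):
        # Picks up the assistant.json credentials we generated earlier.
        credentials = aiy.assistant.auth_helpers.get_assistant_credentials()
        with Assistant(credentials) as assistant:
            self._assistant = assistant
            for event in assistant.start():
                self._process_event(event)

    def _process_event(self, event):
        status_ui = aiy.voicehat.get_status_ui()
        if event.type == EventType.ON_START_FINISHED:
            # The Assistant is ready: light the button LED, and wire up
            # the button so that a press starts a conversation.
            status_ui.status('ready')
            aiy.voicehat.get_button().on_press(self._on_button_pressed)
        elif event.type == EventType.ON_CONVERSATION_TURN_STARTED:
            status_ui.status('listening')
        elif event.type == EventType.ON_CONVERSATION_TURN_FINISHED:
            status_ui.status('ready')

    def _on_button_pressed(self):
        if self._assistant:
            self._assistant.start_conversation()


if __name__ == '__main__':
    MyAssistant().start()
```

Pressing the button then kicks off a conversation without needing the “Ok, Google” hotword.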
However, what the Voice Assistant can offer is very much fenced in by what Google wants it to do. If you need a bit more flexibility around how your voice controlled project responds, you need to look at the Cloud Speech API.
Unfortunately for those of us based in the European Union, the Cloud Speech API is not available unless you’re signed up to the Google Cloud Platform as a business. This isn’t a technical issue, it’s a legal one, so it’s not one you can work around at this point.
Assuming you can make use of it, you can use the API Library to find the Cloud Speech API and enable it for your project.
However unlike the Google Assistant API which is free to use, at least until you’ve reached the (fairly generous) daily quota, the Google Cloud Speech API is not free.
You’ll need to enable billing to support it in your project.
However, adding billing to your Google Cloud Platform developer account is actually pretty easy, and when signing up for billing for the first time Google will give you $300 of credit to spend over the next 12 months. That will at least let you sit down and test your project before deciding whether you want to commit to it.
You’ll need to create a payment profile and provide some credit card details.
Once you’ve created a billing account you can go ahead and enable Google Cloud Speech API for your project.
Once it’s enabled we need to go ahead and create a Service Account Key for the Cloud Speech API. Click on the ‘Credentials’ tab on the right hand side of your screen.
Then in the ‘Create credentials’ drop down select ‘Service account key.’
This takes you to the next page, where you can create the Service Account.
Fill in the project details and click ‘Create.’ A popup window will then appear with your credentials; don’t panic when it disappears, as this isn’t your only chance to grab them. Dismissing the popup by clicking “OK” leaves you in a credentials list with your newly generated credentials, which you can then go ahead and download to your device.
The example code for using the Cloud Speech API is even simpler than the Voice Assistant’s; it’s half the length.
You just give the recognizer a list of phrases to expect, and then all you need is some sort of trigger for it to know when to start listening. The simplest trigger is a button push.
Alternatively, you can use a hotword. The code will then sit and listen — continuously — for the hotword, and only then start listening for its expected phrases.
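As a rough sketch, using the aiy.cloudspeech API from the original Voice Kit SDK (so the function names may differ if you’re on a newer image), the button-triggered version looks something like this:

```python
import aiy.audio
import aiy.cloudspeech
import aiy.voicehat


def main():
    # Set up the Cloud Speech recognizer, and tell it which
    # phrases we're expecting to improve recognition accuracy.
    recognizer = aiy.cloudspeech.get_recognizer()
    recognizer.expect_phrase('turn on the light')
    recognizer.expect_phrase('turn off the light')

    button = aiy.voicehat.get_button()
    aiy.audio.get_recorder().start()

    while True:
        print('Press the button and speak')
        button.wait_for_press()  # the trigger: wait for a button push
        print('Listening...')
        text = recognizer.recognize()
        if text is None:
            print('Sorry, I did not hear you.')
        elif 'turn on the light' in text:
            print('Turning the light on...')
        elif 'turn off the light' in text:
            print('Turning the light off...')


if __name__ == '__main__':
    main()
```

Swapping the button for a hotword just means calling recognize() in a loop and checking each result for the trigger word — with the cost and privacy caveats we’ll get to shortly.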
That makes it possible to build something like a magic mirror that knows it’s a mirror, rather than a generic voice assistant.
However, there are two huge elephants in the room when using the Cloud Speech API the way we just did. Firstly, money. The Cloud Speech API is not free, and running that mirror costs around $25 a day.
But for most people the second elephant is a lot larger and a lot scarier, and that’s privacy. Ignoring the cost, using the Cloud Speech API to listen for a hotword means that you’re streaming everything in the room to the cloud all the time and searching through it for the hotword.
That’s probably not going to be acceptable to most people in most situations. There’s a reason why the Voice Assistant uses a locally recognized “Ok, Google” hotword.
Which brings us to running TensorFlow models locally.
It’s never been particularly easy to get TensorFlow installed onto the Raspberry Pi, but a couple of months back Pete Warden finally managed to get it to cross-compile successfully on an x86 Linux box.
From there he set up nightly builds running as part of Google’s TensorFlow Jenkins project. This simplified installation, a lot.
But to run TensorFlow locally, we need a model.
The real secret behind the recent successes of machine learning isn’t the algorithms. My first job, back in the very early 90s, was writing neural networks for data compression and image analysis; this stuff has been lurking in the background for decades, waiting for computing to catch up.
Instead, the success of machine learning has relied heavily on the corpus of training data that companies — like Google — have managed to build up. Data that is, in many cases, based on crowdsourced data uploaded by us to the Internet.
For the most part these training datasets are the secret sauce, and closely held by the companies, and people, that have them.
While there are a number of open sourced collections of visual data for training object recognition algorithms, there is far less speech data available.
Fortunately, the Open Speech Recording project from Google provides one.
The goal is to capture crowd-sourced speech clips — open source examples of detecting spoken keywords such as ‘Yes’, ‘No’, ‘On’, or ‘Off’ — from several hundred people and then release the resulting data set under an open license, together with an example of how to use it to create simple speech command classifiers.
There are already some pre-trained models as part of the initial release of their Speech Commands Dataset. The words covered by the initial release model are: yes, no, up, down, left, right, on, off, stop, and go.
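To give a flavour of what running one of those models locally involves, here’s a minimal sketch along the lines of the label_wav.py example from the TensorFlow Speech Commands tutorial. I’m assuming the file names from the pretrained model Google released alongside the dataset, and a frozen graph with an input tensor named wav_data:0 and an output named labels_softmax:0, as the tutorial’s exports used at the time; yours may differ.

```python
import tensorflow as tf  # TensorFlow 1.x API

GRAPH_FILE = 'conv_actions_frozen.pb'    # frozen speech commands model
LABELS_FILE = 'conv_actions_labels.txt'  # one label per line
WAV_FILE = 'test.wav'                    # one second of 16kHz mono audio

# Load the frozen graph into the default graph.
with tf.gfile.FastGFile(GRAPH_FILE, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

labels = [line.rstrip() for line in tf.gfile.GFile(LABELS_FILE)]

with tf.Session() as sess:
    with open(WAV_FILE, 'rb') as wav:
        wav_data = wav.read()
    # Feed the raw WAV bytes in, and read back the softmax scores.
    softmax = sess.graph.get_tensor_by_name('labels_softmax:0')
    (predictions,) = sess.run(softmax, {'wav_data:0': wav_data})
    # Print the three most likely words.
    for i in predictions.argsort()[-3:][::-1]:
        print('%s (score = %.5f)' % (labels[i], predictions[i]))
```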
Unfortunately, while running TensorFlow locally is possible, it really pushes the Pi to its limits, and right now it’s not well integrated into the kit’s SDK. So unless you have a really good use case, it’s unlikely that this is how you want to use the kit. The code is also a lot more involved than our previous examples.
Both the Voice and Vision Bonnets offer extra pin outs that let you integrate external hardware with the kits. The Voice HAT from the original kit had a lot of external breakout pins, which made it really easy to do serial, SPI, and add a number of servos. Unfortunately, the move to the Bonnet form factor has reduced the number of exposed pins quite a bit compared to the original HAT. However, there are still 4 PWM-capable pins, as well as the button connector, which would let you repurpose the GPIO pins connected to the button and the kit’s LEDs for other purposes.
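The nice thing is that those Bonnet pins are exposed through the standard gpiozero API via the aiy.pins module. A minimal sketch — assuming the current AIY image, where the PWM-capable pins show up as PIN_A through PIN_D — might look like this:

```python
from time import sleep

from aiy.pins import PIN_A  # one of the Bonnet's four PWM-capable pins
from gpiozero import LED

# Treat PIN_A as a simple digital output, and blink
# whatever is wired to it (an LED via a resistor, say).
led = LED(PIN_A)

while True:
    led.on()
    sleep(1)
    led.off()
    sleep(1)
```

The same pins can drive a servo through gpiozero’s Servo class, which is exactly what the dinosaur build later in this article does.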
Again, let’s start by looking at how to put it together. This one really is just plugging things together; you don’t even need a screwdriver this time around.
Now, interestingly, because all the model inference is done locally, unlike with the Voice Kit, once it’s built we’re done. The kit “just works”; the SD card included with the kit even automatically starts what Google is calling the “Joy Detector” service, which changes the colour of the top LED depending on whether you’re happy or sad, and sounds a buzzer if you score more than 75% either way.
However, it’s fairly simple to shut the demo off and write your own code.
Here’s a really simple face detector example that draws bounding boxes around all the faces detected in the current camera image. It uses the same model as the default Joy Detector demo, which returns a bounding box and a joy score for every face detected in the current camera frame. It won’t quite keep up with real time, but it comes pretty close.
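Here’s a minimal sketch along the lines of the face detection example that ships with the Vision Kit SDK. It assumes you’ve stopped the Joy Detector first so the camera and Bonnet are free (on Google’s image it runs as a systemd service, so something like sudo systemctl stop joy_detection_demo, though the service name may differ between images), and the module names may have shifted between SDK releases:

```python
from picamera import PiCamera

from aiy.vision.inference import CameraInference
from aiy.vision.models import face_detection

# The Vision Bonnet runs inference on frames streamed straight
# from the camera; sensor_mode 4 gives the full field of view
# at a resolution the Bonnet can keep up with.
with PiCamera(sensor_mode=4, resolution=(1640, 1232), framerate=30) as camera:
    camera.start_preview()

    with CameraInference(face_detection.model()) as inference:
        for result in inference.run():
            faces = face_detection.get_faces(result)
            for face in faces:
                # Each face has a bounding box (x, y, width, height)
                # and a joy score between 0.0 and 1.0.
                print('Face at %s, joy score %.2f' %
                      (str(face.bounding_box), face.joy_score))

    camera.stop_preview()
```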
With the recent update to the two kits, Google has made model availability much more explicit, putting their own models online and also putting out a call for contributions. Right now all the models on their site belong to Google, but you can see that changing; perhaps in the future it might even evolve into a real marketplace.
Right now there are six models available. If you want to learn more about the details of the models Google has provided with the kit:
- The “Face Detector” model is the default model used by the Joy Detector Demo that runs when you boot the kit for the first time.
- The Dog / Cat / Human model can identify whether there’s a dog, cat, or person in an image and draw a box around the identified objects. It’s based on the MobileNet model.
- The Dish Classifier model is designed to identify food in an image. Again it’s based on the MobileNet model.
- The Google Image Classifier is a general-purpose model designed to recognize and identify a number of common objects. It’s based on the MobileNet ImageNet classifier model (there’s a short sketch of using it after this list).
- The Image Classifier model is also designed to identify objects in an image. However this one is based on the SqueezeNet model.
- The Nature Explorer is a set of three machine learning models, based on MobileNet, trained on photos contributed by the iNaturalist community. The models are built to recognize 4,080 different species (~960 birds, ~1,020 insects, ~2,100 plants). It’s only included in the most recent SD card image; if you’re using a card image older than March, you’ll need to update it.
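Switching between models mostly means swapping which model class you hand to CameraInference. For instance, a sketch using the general-purpose image classifier, again assuming the module names from the AIY Vision SDK at the time of writing, might look like:

```python
from picamera import PiCamera

from aiy.vision.inference import CameraInference
from aiy.vision.models import image_classification

with PiCamera(sensor_mode=4, framerate=30) as camera:
    # Run the MobileNet-based image classifier on the camera feed,
    # and print the three most likely labels for each frame.
    with CameraInference(image_classification.model()) as inference:
        for result in inference.run():
            classes = image_classification.get_classes(result, top_k=3)
            for label, score in classes:
                print('%s (%.2f)' % (label, score))
```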
Like the Voice Kit, the Vision Kit exposes some GPIO to play with; and again, like the new Voice Bonnet, we have 4 PWM-capable GPIO pins, as well as the button connector. Hacking around with the button connector, however, is somewhat constrained by the fact that it’s a custom — and rather small and fiddly — connector.
While it’s reasonably well documented, it’s still a pain.
Anyway, I’ve been playing with the Vision Kit a lot over the last few weeks.
Here I’ve wired up a small micro-servo to one of the pins on the Vision Bonnet, and then mounted the Camera Module on the servo.
What I’m then doing is looking at the position of the face detected in the image, and seeing whether it is to the right or left of centre. I then turn the servo to centre up the face in the image. Once the servo reaches maximum deflection, or the face moves out of the field of view, there’s a pause and the camera tracks back to the centre position.
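The core of that build is only a few lines. Here’s a rough sketch of the idea; the pin choice, step size, and resolution here are just the ones I’d reach for, not anything canonical:

```python
from picamera import PiCamera
from gpiozero import Servo

from aiy.pins import PIN_B
from aiy.vision.inference import CameraInference
from aiy.vision.models import face_detection

IMAGE_WIDTH = 1640  # matches the camera resolution below
STEP = 0.05         # how far to nudge the servo each frame

servo = Servo(PIN_B)  # micro-servo wired to one of the Bonnet's PWM pins
position = 0.0        # start centred
servo.value = position

with PiCamera(sensor_mode=4, resolution=(1640, 1232)) as camera:
    with CameraInference(face_detection.model()) as inference:
        for result in inference.run():
            faces = face_detection.get_faces(result)
            if not faces:
                position = 0.0  # lost the face: track back to centre
            else:
                # Find the horizontal centre of the first face...
                x, y, width, height = faces[0].bounding_box
                face_centre = x + width / 2
                # ...and nudge the servo towards it, clamping at the
                # servo's maximum deflection of -1.0 to 1.0.
                if face_centre < IMAGE_WIDTH / 2:
                    position = min(1.0, position + STEP)
                else:
                    position = max(-1.0, position - STEP)
            servo.value = position
```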
Here’s everything stuffed inside the plushie dinosaur.
Now obviously, all those electronics stuffed inside a very small dinosaur are going to generate a lot of heat. As it stands, the build is a bit of a fire hazard, and you should definitely not even think about charging the battery in-situ. That’s almost certainly going to be a step too far. But, I wanted something small, portable, and cute that I could use for five minutes at a time during conference talks. So for me, this was perfect. But, just this time, you should not follow along. If you do, you’re probably going to burn your house down.
If you do want to cyborg a plushie, find a larger one, put the electronics inside a proper enclosure inside the toy, and provide some decent ventilation.
You have been warned. Do not replicate this build, instead make a better one.
Integrating the two kits is also pretty easy. You don’t even have to run a wire or a cable, because they both share (or at least can share) the same WiFi network.
Here I’ve used a simple webserver running on the Vision Kit which serves the current ‘view’ of the world, and whether it can see any faces in the view.
The Voice Kit polls the server, and waits for someone to walk in front of the mirror, before listening for commands.
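A minimal sketch of the Vision Kit side, using Python’s built-in http.server rather than whatever webserver a particular build actually runs, and serving a simple JSON face count, might look like this:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

from picamera import PiCamera

from aiy.vision.inference import CameraInference
from aiy.vision.models import face_detection

state = {'faces': 0}  # shared between the inference loop and the server


class FaceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current face count as JSON so the Voice Kit can poll it.
        body = json.dumps(state).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)


def detect_faces():
    with PiCamera(sensor_mode=4, resolution=(1640, 1232)) as camera:
        with CameraInference(face_detection.model()) as inference:
            for result in inference.run():
                state['faces'] = len(face_detection.get_faces(result))


threading.Thread(target=detect_faces, daemon=True).start()
HTTPServer(('0.0.0.0', 8080), FaceHandler).serve_forever()
```

On the Voice Kit side, a few lines with urllib.request in a loop are enough to poll that endpoint and decide when someone is standing in front of the mirror.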
The Bigger Picture
I’m actually quite intrigued by how these kits fit into the bigger picture.
Watching people interact with the retro rotary phone build was intriguing. Adding simple sounds to imitate a ‘real’ phone — like a dial tone, a hang-up noise, and a simple greeting by a fake operator — left enough room that people were… no longer quite sure whether they were talking to a machine, or a human. It lent a curious hesitancy to the interactions that I haven’t seen with other voice controlled objects. As with the retro-phone build, I also deliberately made use of sound to try to make the magic mirror more magical.
Technology is never really mature until it is invisible, and while voice interfaces are a step towards that I still feel that these interfaces need to be further embedded, hidden, into the environment.
The ability to run these trained networks “at the edge” nearer the data — without the cloud support that seems necessary to almost every task these days, or even in some cases without even a network connection — could help reduce barriers to developing, tuning, and deploying machine learning applications.
It could potentially help make “smart objects” actually smart, rather than just network connected clients for machine learning algorithms running in remote data centres.
It could, in fact, be the start of a sea change in how we think about machine learning and how the Internet of Things might be built.
Because now there is — at least the potential — to allow us to put the smarts on the smart device, rather than in the cloud.
The recent scandals and hearings around the misuse of data harvested from social networks have surfaced long standing problems around data privacy and misuse, while the GDPR in Europe has tightened restrictions around data sharing.
However the new generation of embedded devices, and the arrival of the Internet of Things, may cause the demise of large scale data harvesting entirely.
In its place, smart devices will allow us to process data at the edge, making use of machine learning to interpret the most flexible sensor we have: the camera.
Interpreting camera data in real time, and abstracting it to signal rather than imagery, will allow us to extract insights without storing data that could infringe privacy, or the GDPR.
Processing imagery using machine learning models at the edge, on potentially non-networked embedded devices, will allow us to feed back into the environment in real time, closing the loop without the large scale data harvesting that has become so prevalent.
In the end we never wanted the data anyway, we wanted the actions that the data could generate. Insights into our environment are more useful than write-only data collected and stored for a rainy day.
Voice Kit Links
Links to the work I’ve done with the Voice Kit.
- The Google AIY Projects Voice Kit Is Now Available for Pre-Order from Micro Center: In May, issue 57 of The MagPi came bundled with a Google project kit that enabled you to add voice interaction to your…
- Hands on with the AIY Projects Voice Kit: Machine Learning on a Raspberry Pi is now available for pre-order!
- A Retro Rotary Phone Powered by AIY Projects and the Raspberry Pi: Machine Learning inside a classic 1970s GPO telephone
- A Magic Mirror Powered by AIY Projects and the Raspberry Pi: Machine Learning imprisoned behind a sheet of glass
Vision Kit Links
Links to the work I’ve done with the Vision Kit.
- Announcing the AIY Projects Vision Kit: The second Google AIY Projects kit has just been announced.
- Teething Troubles for the New AIY Projects Vision Kit? I’ve now gotten “hands on” with the retail version of the kit, and have written up a full walkthrough from unboxing…
- Hands on with the AIY Projects Vision Kit: A Do-It-Yourself Intelligent Camera for the Raspberry Pi