Benchmarking TensorFlow and TensorFlow Lite on the Raspberry Pi
Update (24 Jun): Some benchmarks for the new Raspberry Pi 4.
I recently sat down to benchmark the new accelerator hardware that is now appearing on the market intended to speed up machine learning inferencing on the edge. But, so I’d have a rough yardstick for comparison, I also ran the same benchmarks on the Raspberry Pi.
However a lot of people complained that I should have used TensorFlow Lite for those benchmarks rather than TensorFlow. Enough people said it in fact, that I felt I really should see how much faster TensorFlow Lite was on the Raspberry Pi than ‘vanilla’ TensorFlow.
So, here goes…
Headline results from benchmarking
Using TensorFlow Lite we see a considerable speed increase when compared with the original results from our previous benchmarks using full TensorFlow.
We see an approximately ×2 increase in inferencing speed between the original TensorFlow figures and the new results using TensorFlow Lite.
Part I—Benchmarking
A more detailed analysis of the results
Benchmarking was done using both TensorFlow and TensorFlow Lite on a Raspberry Pi 3, Model B+ without any accelerator hardware. Inferencing was carried out with the MobileNet v2 SSD and MobileNet v1 0.75 depth SSD models, both trained on the Common Objects in Context (COCO) dataset and converted to TensorFlow Lite.
A single 3888×2916 pixel test image was used containing two recognisable objects in the frame, a banana🍌 and an apple🍎. The image was resized down to 300×300 pixels before presenting it to the model, and each model was run 10,000 times before an average inferencing time was taken. The first inferencing run, which takes longer due to loading overheads, was discarded.
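The measurement procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the benchmark script itself: `benchmark` and the `run_inference` callable are hypothetical names, standing in for whatever function invokes the model on the resized image.

```python
import time


def benchmark(run_inference, runs=10000, warmup=1):
    """Return the mean time per call in milliseconds, ignoring warm-up calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls absorb one-off loading overheads and are not timed.
    for _ in range(warmup):
        run_inference()

    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        total += time.perf_counter() - start

    # Convert the average from seconds to milliseconds.
    return (total / runs) * 1000.0
```

Timing each call separately, rather than timing the whole loop, keeps the loop bookkeeping out of the measurement.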
Comparing our new result with our previously obtained benchmark figures we see that using TensorFlow Lite for inferencing on an unaccelerated Raspberry Pi brings inferencing times very roughly into line with those seen from the NVIDIA Jetson Nano when using normal TensorFlow models before optimisation using NVIDIA’s TensorFlow with TensorRT library.
This is really rather suggestive that unoptimised ‘vanilla’ TensorFlow models are mostly running on the NVIDIA Jetson Nano’s processor, a 64-bit Quad-core ARM A57, rather than being offloaded to the GPU as you’d expect.
While it’s still extremely early days, TensorFlow Lite has recently introduced support for GPU acceleration for inferencing, and running models using TensorFlow Lite with GPU support should reduce the time needed for inferencing on the Jetson Nano. Taking our new results here on the Raspberry Pi as a yardstick, we should expect the gap between the Jetson Nano and Google’s Coral hardware to close significantly at that point.
Heating and Cooling
As we observed last time the Raspberry Pi reached a high enough temperature during benchmarking that it suffered from thermal throttling of the CPU. This time we observed external temperatures in excess of those previously seen.
External temperatures were measured using a laser infrared thermometer, which has an accuracy of ±2°C for temperatures ≤100°C, after an extended test run of 50,000 inferences was completed.
The CPU temperatures were as reported by the operating system using the following command line invocation.
$ paste <(cat /sys/class/thermal/thermal_zone*/type) <(cat /sys/class/thermal/thermal_zone*/temp) | column -s $'\t' -t | sed 's/\(.\)..$/.\1°C/'
cpu-thermal 77.3°C
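If you would rather log the temperature from inside a Python script, the same reading can be taken directly from sysfs. What follows is a small sketch, not part of the original benchmark: the function names are my own, and it assumes the kernel reports non-negative temperatures in thousandths of a degree, which is what the `sed` expression above relies on.

```python
from pathlib import Path


def format_millidegrees(raw):
    """Turn a sysfs reading such as '77300' into '77.3°C'.

    Truncates to one decimal place, like the sed expression, and assumes
    a non-negative reading.
    """
    millidegrees = int(raw.strip())
    return "{}.{}°C".format(millidegrees // 1000, (millidegrees % 1000) // 100)


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone type: formatted temperature} for each thermal zone under base."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        zone_type = (zone / "type").read_text().strip()
        readings[zone_type] = format_millidegrees((zone / "temp").read_text())
    return readings
```

On the Raspberry Pi, calling `read_thermal_zones()` should return a dictionary with a single `cpu-thermal` entry.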
Last time the Raspberry Pi reached a temperature of 74°C during extended testing, which meant that it suffered from thermal throttling of the CPU and came close to the 80°C point where additional incremental throttling would occur. This time we saw increased temperatures, peaking around 78°C.
As before I’d recommend that, if you intend to run inferencing for extended periods using the Raspberry Pi, you should add at least a passive heatsink to avoid throttling the CPU. It’s even possible that a small fan might also be a good idea. Because let’s face it, CPU throttling can spoil your day.
Summary
Adding TensorFlow Lite on the Raspberry Pi to our benchmarks hasn’t changed the overall result: the Coral Dev Board and USB Accelerator still have a clear lead, with MobileNet models running between ×3 and ×4 faster than the direct competitors. Even so, it’s really interesting to see that using TensorFlow Lite, and accepting the restrictions that the lightweight framework is going to place on you, increases performance this much.
While I was expecting things to run faster, a factor of ×2 is pretty impressive.
Both here and with our previous benchmarks I felt that approaching things in a relatively direct way, and trying as much as possible to keep the playing field level between platforms, was the best approach to get a baseline for how they all performed with respect to each other.
However there is obviously a great deal you can do with optimisation, both of the model you’re running and how you go about running it, to improve the inferencing speeds I talked about here and in my original benchmarking piece. I’m not unaware of that, and I’ll be interested to see how others can improve on the work I’ve done here.
Yes, you can get these models to run faster. Now show us how to do that.
Part II—Methodology
Installing TensorFlow Lite on the Raspberry Pi
Installing TensorFlow on the Raspberry Pi used to be a difficult process; however, towards the middle of last year everything became a lot easier. Fortunately, thanks to the community, installing TensorFlow Lite isn’t that much harder. We aren’t going to have to resort to building it from source.
Go ahead and download the latest release of Raspbian Lite and set up your Raspberry Pi. Unless you’re using wired networking, or have a display and keyboard attached to the Raspberry Pi, at a minimum you’ll need to put the Raspberry Pi on to your wireless network, and enable SSH.
Once you’ve set up your Raspberry Pi go ahead and power it on, and then open up a Terminal window on your laptop and SSH into the Raspberry Pi.
% ssh pi@raspberrypi.local
Fortunately, while the official TensorFlow binary distribution does not include a build of TensorFlow Lite, there is an unofficial distribution which does, and that means we don’t have to resort to building and installing from source.
Once you’re logged into your Raspberry Pi go ahead and update and install our build tools,
$ sudo apt-get update
$ sudo apt-get install build-essential
$ sudo apt-get install git
then go ahead and install TensorFlow Lite.
$ sudo apt-get install libatlas-base-dev
$ sudo apt-get install python3-pip
$ git clone https://github.com/PINTO0309/Tensorflow-bin.git
$ cd Tensorflow-bin
$ pip3 install tensorflow-1.13.1-cp35-cp35m-linux_armv7l.whl
It’ll take some time to install. So you might want to take a break and get some coffee. Once it has finished installing you can test the installation as follows.
$ python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
Now that TensorFlow has been successfully installed we’ll also need to go ahead and install OpenCV along with all its many dependencies,
$ sudo apt-get install libwebp6 libwebp-dev
$ sudo apt-get install libtiff5 libtiff5-dev
$ sudo apt-get install libjasper1 libjasper-dev
$ sudo apt-get install libilmbase12 libilmbase-dev
$ sudo apt-get install libopenexr22 libopenexr-dev
$ sudo apt-get install libgstreamer0.10-0 libgstreamer0.10-dev
$ sudo apt-get install libgstreamer1.0-0 libgstreamer1.0-dev
$ sudo apt-get install libavcodec-dev
$ sudo apt-get install libavformat57 libavformat-dev
$ sudo apt-get install libswscale4 libswscale-dev
$ sudo apt-get install libqtgui4
$ sudo apt-get install libqt4-test
$ pip3 install opencv-python
as we’ll need OpenCV for our benchmarking script later. For the same reason we need to install the Pillow fork of the Python Imaging Library (PIL) and the NumPy library.
$ pip3 install Pillow
$ pip3 install numpy
We should now be ready to run our benchmarking scripts.
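Before starting a long benchmark run it can be worth confirming that everything installed above is actually visible to Python. The sketch below is my own addition, and `missing_modules` is a hypothetical helper name. It looks each package up without importing it, so it only shows that the package is on the import path, not that it loads cleanly.

```python
import importlib.util

# Import names for the packages installed above. The pip package names
# (opencv-python, Pillow) differ from the names used in import statements.
REQUIRED_MODULES = ["tensorflow", "cv2", "PIL", "numpy"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from `missing_modules()` means all four packages were found.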
The benchmarking code
Our TensorFlow Lite benchmark script is slightly different than the version we used when running full TensorFlow on the Raspberry Pi during our previous benchmark inferencing runs.
The script is written to take pre-converted .tflite files.
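For readers who haven’t used the TensorFlow Lite interpreter from Python, the core of such a script looks roughly like this. Treat it as a sketch under stated assumptions rather than the benchmark script itself: the function names are hypothetical, the input is assumed to be a single 300×300 RGB image in unsigned 8-bit form, and the output order (boxes, classes, scores, count) is assumed to match the four arrays produced by the converted SSD models.

```python
def load_interpreter(model_path):
    """Load a .tflite file and allocate its tensors."""
    import tensorflow as tf  # imported lazily so the other helpers work without it

    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def load_image(path, size=(300, 300)):
    """Read an image, resize it, and add a batch dimension: shape (1, 300, 300, 3)."""
    import numpy as np
    from PIL import Image

    image = Image.open(path).convert("RGB").resize(size)
    return np.expand_dims(np.asarray(image, dtype=np.uint8), axis=0)


def detect(interpreter, image):
    """Run one inference and return the four detection outputs for the first image."""
    input_index = interpreter.get_input_details()[0]["index"]
    outputs = interpreter.get_output_details()

    interpreter.set_tensor(input_index, image)
    interpreter.invoke()

    # Assumed output order: boxes, classes, scores, number of detections.
    boxes, classes, scores, count = (
        interpreter.get_tensor(detail["index"])[0] for detail in outputs[:4]
    )
    return boxes, classes, scores, int(count)
```

Loading the interpreter once and calling `detect` repeatedly keeps model loading out of the timed loop.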
Converting models to TensorFlow Lite format
As before the benchmark run used the MobileNet v2 SSD and MobileNet v1 SSD models, both trained on the Common Objects in Context (COCO) dataset. However before we can use these models both of them need to be converted to TensorFlow Lite format.
Let’s start out by grabbing the quantised version of our MobileNet SSD V1 model from the Coral Model Zoo along with the associated labels file.
If you don’t already have TensorFlow installed on your laptop you should go do that now, then download the model and uncompress.
$ cd ~
$ wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_18.tar.gz
$ tar -zxvf ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_18.tar.gz
To convert the model from TensorFlow to TensorFlow Lite you’ll need to know what the input and output nodes of the model are called. The easiest way to figure this out is to use the summarize_graph tool to inspect the model and provide guesses about likely input and output nodes. Unfortunately if you’ve previously installed TensorFlow using pip then this tool isn’t going to be available; you’ll have to go back and install it from source to have access to the C++ tools.
⚠️Warning If you have LittleSnitch running you may have to temporarily turn the network monitor off if you get ‘Host is down’ errors during installation.
Then, from the TensorFlow source directory, you can go ahead and build the summarize_graph tool using bazel,
$ bazel build tensorflow/tools/graph_transforms:summarize_graph
and run it on the quantised version of our MobileNet v1 SSD model.
$ bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=/Users/aa/Downloads/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/tflite_graph.pb
After running the summarize_graph tool you should see something like this,
Found 1 possible inputs: (name=normalized_input_image_tensor, type=float(1), shape=[1,300,300,3])
No variables spotted.
Found 1 possible outputs: (name=TFLite_Detection_PostProcess, op=TFLite_Detection_PostProcess)
Found 4137705 (4.14M) const parameters, 0 (0) variable parameters, and 0 control_edges
Op types used: 451 Const, 389 Identity, 105 Mul, 94 FakeQuantWithMinMaxVars, 70 Add, 35 Sub, 35 Relu6, 35 Rsqrt, 34 Conv2D, 25 Reshape, 13 DepthwiseConv2dNative, 12 BiasAdd, 2 ConcatV2, 1 RealDiv, 1 Sigmoid, 1 Squeeze, 1 Placeholder, 1 TFLite_Detection_PostProcess
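If you’re converting more than one model it may be handy to pull the node names out of this report automatically. Here is a rough sketch, assuming the report keeps the "possible inputs" and "possible outputs" wording shown above. The `guess_io_names` helper is my own invention and not part of TensorFlow.

```python
import re


def guess_io_names(summary):
    """Extract candidate input and output node names from summarize_graph text."""

    def names_after(label):
        # Take the text between the label and the next "No variables" or
        # "Found N" marker, then collect every name=... entry inside it.
        section = re.search(
            label + r":(.*?)(?=No variables|Found \d+|$)", summary, re.DOTALL
        )
        return re.findall(r"name=([^,)\s]+)", section.group(1)) if section else []

    return names_after("possible inputs"), names_after("possible outputs")
```

The two lists it returns correspond to the values you would pass to the converter as input and output arrays.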
From here we can use the TensorFlow Lite Optimizing Converter (TOCO) to convert the quantised frozen graph to the TensorFlow Lite flat buffer format.
$ bazel run tensorflow/lite/toco:toco -- --input_file=/Users/aa/Downloads/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/tflite_graph.pb --output_file=/Users/aa/Downloads/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/tflite_graph.tflite --input_shapes=1,300,300,3 --input_arrays=normalized_input_image_tensor --output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' --inference_type=QUANTIZED_UINT8 --mean_values=128 --std_values=128 --change_concat_input_ranges=false --allow_custom_ops
This command takes the input tensor normalized_input_image_tensor after resizing each camera image frame to 300×300 pixels. The outputs of the quantised model represent four arrays: detection_boxes, detection_classes, detection_scores, and num_detections.
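The toco invocation above is long enough that it is easy to mistype, so here is the same set of flags assembled one per line. This is just a sketch: `build_toco_args` is a hypothetical helper, and the flags are copied from the command above rather than taken from any other reference.

```python
def build_toco_args(input_file, output_file):
    """Assemble the toco flags used above as a list suitable for subprocess.run."""
    # The post-processing op exposes four outputs: the bare name plus :1 to :3.
    output_arrays = ",".join(
        ["TFLite_Detection_PostProcess"]
        + ["TFLite_Detection_PostProcess:{}".format(i) for i in (1, 2, 3)]
    )
    return [
        "bazel", "run", "tensorflow/lite/toco:toco", "--",
        "--input_file=" + input_file,
        "--output_file=" + output_file,
        "--input_shapes=1,300,300,3",
        "--input_arrays=normalized_input_image_tensor",
        "--output_arrays=" + output_arrays,
        "--inference_type=QUANTIZED_UINT8",
        "--mean_values=128",
        "--std_values=128",
        "--change_concat_input_ranges=false",
        "--allow_custom_ops",
    ]
```

You could then run it from the TensorFlow source directory with `subprocess.run(build_toco_args(graph_path, tflite_path), check=True)`, where `graph_path` and `tflite_path` are placeholders for your own paths. The command above uses absolute paths, and I would do the same here.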
You can follow a similar process for the quantised version of the MobileNet SSD V2 model from the Coral Model Zoo, and invoke the same toco command line to convert it to a TensorFlow Lite model.
In closing
As I really tried to make clear in my previous article putting these platforms on an even footing and directly comparing them is actually not a trivial task. Hopefully this goes some way to proving that.
Links to getting started guides
If you’re interested in getting started with any of the accelerator hardware I used during my first benchmark I’ve put together getting started guides for the Google, Intel, and NVIDIA hardware I used there.