A while ago I made an example of how to use TensorFlow models in Unity with the TensorFlow Sharp plugin: TFClassify-Unity. Image classification worked well enough, but object detection had poor performance. Still, I figured it could be a good starting point for someone who needs this kind of functionality in a Unity app.

Unfortunately, Unity stopped supporting TensorFlow and moved to their own inference engine, code-named Barracuda. You can still use the example above, but the latest plugin was built with TensorFlow 1.7.1 in mind, and anything trained with a higher version might not work at all. The good news is that Barracuda should support ONNX models out of the box, and you can convert your TensorFlow model to the supported format easily enough. The bad news is that Barracuda is still in preview and there are some caveats.

Differences

With the TensorFlow Sharp plugin, my idea was to take the TensorFlow example for Android and make a similar one for Unity using the same models: inception_v1 for image classification and ssd_mobilenet_v1 for object detection. I had successfully tried the mobilenet_v1 architecture as well - it's not in the example, but all you need to do is replace the input/output names and the std/mean values.

With Barracuda, things are a bit more complicated. There are three ways to try a certain architecture in Unity: use an ONNX model that you already have, convert a TensorFlow model using the TensorFlow to ONNX converter, or convert it to Barracuda format using the TensorFlow to Barracuda script provided by Unity (you'll need to clone the whole repo to use this converter, or install it with pip install mlagents).

None of those approaches worked for me with the inception and ssd-mobilenet models. There were either problems with conversion, or Unity wouldn't load the model, complaining about unsupported tensors, or it would crash or return nonsensical results during inference. The pre-trained inception ONNX model seemed like it really wanted to get there, but kept crapping out along the way with either weird errors or weird results (perhaps someone else will have more luck with that one).

But some things did work.

Image classification

MobileNet is a great architecture for mobile inference since, as its name suggests, it was created exactly for that. It's small and fast, and there are different versions that provide a trade-off between size/latency and accuracy. I didn't try the latest mobilenet_v3, but v1 and v2 work great both as ONNX and after tf-barracuda conversion. If you have an .onnx model, you're set, but if you have a .pb file (a TensorFlow model in protobuf format), the conversion is easy enough using the tensorflow-to-barracuda converter:

$ python tensorflow_to_barracuda.py ../mobilenet_v2_1.4_224_frozen.pb ../mobilenet_v2.nn

Converting ../mobilenet_v2_1.4_224_frozen.pb to ../mobilenet_v2.nn
Sorting model, may take a while...... Done!
IN: 'input': [-1, -1, -1, -1] => 'MobilenetV2/Conv/BatchNorm/FusedBatchNorm'
OUT: 'MobilenetV2/Predictions/Reshape_1'
DONE: wrote ../mobilenet_v2.nn file.

The converter figures out the inputs/outputs itself. Those are good to keep around in case you need to reference them in code later. Here I have the input name 'input' and the output name 'MobilenetV2/Predictions/Reshape_1'. You can also see those in the Unity editor when you select this model for inspection. One thing to note: with mobilenet_v2, the converter and the Unity inspector show the wrong input dimensions - it should be [1, 224, 224, 3] - but this doesn't seem to matter in practice.
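If you'd rather double-check the names at runtime instead of copying them from the converter output, the loaded model exposes them. A minimal sketch, assuming Barracuda's Model class with its inputs/outputs lists:

var model = ModelLoader.Load(this.modelFile);

// Log every input name and shape baked into the converted model.
foreach (var input in model.inputs)
{
    Debug.Log($"Input: {input.name}, shape: [{string.Join(", ", input.shape)}]");
}

// Output names are plain strings.
foreach (var outputName in model.outputs)
{
    Debug.Log($"Output: {outputName}");
}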

Then you can load and run this model using Barracuda as described in the documentation:

public NNModel modelFile;
...
var model = ModelLoader.Load(this.modelFile);
var worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, model);
...
var inputs = new Dictionary<string, Tensor>();
inputs.Add(INPUT_NAME, tensor);  
worker.Execute(inputs);
var output = worker.PeekOutput(OUTPUT_NAME);
New MobileNet classification example

The important thing that has to be done to the input image for inference to work is normalization, which means shifting and scaling pixel values so that they go from the [0;255] range to [-1;1]:

public static Tensor TransformInput(Color32[] pic, int width, int height)
{
    float[] floatValues = new float[width * height * 3];

    for (int i = 0; i < pic.Length; ++i)
    {
        var color = pic[i];

        // e.g. IMAGE_MEAN = IMAGE_STD = 127.5f maps [0;255] to [-1;1]
        floatValues[i * 3 + 0] = (color.r - IMAGE_MEAN) / IMAGE_STD;
        floatValues[i * 3 + 1] = (color.g - IMAGE_MEAN) / IMAGE_STD;
        floatValues[i * 3 + 2] = (color.b - IMAGE_MEAN) / IMAGE_STD;
    }

    // Barracuda tensors are laid out as NHWC: batch, height, width, channels.
    return new Tensor(1, height, width, 3, floatValues);
}

Barracuda actually has a constructor to create a tensor from a Texture2D, but it doesn't accept parameters for scaling and bias. That's odd, since this is often a necessary step before running inference on an image. Be careful when trying your own models, though - some of them might already have scaling and bias layers as part of the model itself, so be sure to inspect the model in Unity before using it.
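Putting the classification pieces together, here is a minimal sketch of a single inference pass with cleanup. INPUT_NAME, OUTPUT_NAME and TransformInput come from the snippets above; camTexture is a hypothetical Texture2D already resized to the model's 224x224 input, and the Dispose() call reflects the fact that Barracuda tensors hold native resources:

// camTexture: a readable Texture2D scaled to 224x224 (assumption for this sketch).
Color32[] pixels = camTexture.GetPixels32();

using (var tensor = TransformInput(pixels, 224, 224))
{
    var inputs = new Dictionary<string, Tensor> { { INPUT_NAME, tensor } };

    worker.Execute(inputs);
    var output = worker.PeekOutput(OUTPUT_NAME);

    // MobileNet outputs a probability vector over the class labels;
    // the index with the highest score is the predicted class.
    // Note: reading values right after Execute() waits for the GPU,
    // which is exactly the blocking behaviour discussed further below.
    float best = 0f;
    int bestIndex = 0;
    for (int i = 0; i < output.length; i++)
    {
        if (output[i] > best) { best = output[i]; bestIndex = i; }
    }

    // PeekOutput returns a tensor owned by the worker, so it is not disposed here.
}

// When the model is no longer needed: worker.Dispose();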

Object detection

There seem to be two object detection architectures that are currently used most often: SSD-MobileNet and YOLO. Unfortunately, SSD is not yet supported by Barracuda (as stated in this issue). I had to settle on YOLO v2, but YOLO is originally implemented in Darknet, so to get either a TensorFlow or an ONNX model you'll need to convert the Darknet weights to the necessary format first.

Fortunately, already-converted ONNX models exist; however, the full network seemed way too big for mobile inference, so I chose the Tiny-YOLO v2 model available here (opset version 7 or 8). But if you already have a TensorFlow model, the tensorflow-to-barracuda converter works just as well - in fact, there is one in my repository that also works, which you can try.

Funnily enough, the ONNX model already has layers for normalizing image pixels, except that they don't appear to actually do anything, because this model doesn't require normalization and works with pixels in the [0;255] range just fine.

The biggest pain with YOLO is that its output requires much more interpretation than SSD-MobileNet's. Here is the description of the Tiny-YOLO output from the ONNX repository:

"The output is a (125x13x13) tensor where 13x13 is the number of grid cells that the image gets divided into. Each grid cell corresponds to 125 channels, made up of the 5 bounding boxes predicted by the grid cell and the 25 data elements that describe each bounding  box (5x25=125)."

Yeah. Fortunately, Microsoft has a good tutorial on using ONNX models for object detection with .NET that we can steal a lot of code from, although with some modifications.
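To make that description more concrete, here is a rough sketch of how this output is usually decoded. It's an illustration of the math, not the exact code from my example or Microsoft's: the anchor values are the ones commonly quoted for Tiny YOLOv2 trained on VOC, and the indexing assumes Barracuda exposes the output in NHWC order as output[0, cy, cx, channel] - verify both against your converted model:

// Commonly quoted Tiny YOLOv2 (VOC) anchors: 5 pairs of (width, height) in grid-cell units.
private static readonly float[] Anchors = { 1.08f, 1.19f, 3.42f, 4.41f, 6.63f, 11.38f, 9.42f, 5.11f, 16.62f, 10.52f };

private const int BoxesPerCell = 5;  // 5 boxes predicted per grid cell
private const int ClassCount = 20;   // VOC classes
private const float CellSize = 32f;  // 416 / 13 pixels per cell

private float Sigmoid(float x)
{
    return 1f / (1f + Mathf.Exp(-x));
}

private void DecodeCell(Tensor output, int cx, int cy)
{
    for (int box = 0; box < BoxesPerCell; box++)
    {
        int offset = box * (5 + ClassCount); // tx, ty, tw, th, objectness, 20 class scores

        float tx = output[0, cy, cx, offset + 0];
        float ty = output[0, cy, cx, offset + 1];
        float tw = output[0, cy, cx, offset + 2];
        float th = output[0, cy, cx, offset + 3];
        float objectness = Sigmoid(output[0, cy, cx, offset + 4]);

        // Box center and size in pixels of the 416x416 input image.
        float x = (cx + Sigmoid(tx)) * CellSize;
        float y = (cy + Sigmoid(ty)) * CellSize;
        float width = Mathf.Exp(tw) * Anchors[box * 2] * CellSize;
        float height = Mathf.Exp(th) * Anchors[box * 2 + 1] * CellSize;

        // The 20 class scores go through softmax; keep the box only if
        // objectness * bestClassScore exceeds your confidence threshold.
    }
}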

YOLO detection.
Tiny-YOLO only has 20 classes, so don't mind that the PC monitor is recognized as a TV monitor.

Let's not block things

My TensorFlow Sharp example was pretty dumb in terms of parallelism, since I simply ran inference once a second on the main thread, blocking the camera from playing. There were other examples showing a more reasonable approach, like running the model in a separate thread (thanks MatthewHallberg).

However, running Barracuda in a separate thread simply didn't work, producing some ugly-looking crashes. Judging by the documentation, Barracuda should be asynchronous by default, scheduling inference on the GPU automatically (if available), so you simply call Execute() and then query the result sometime later. In reality, there are still caveats.

The Barracuda worker class has three methods that you can run inference with: Execute(), ExecuteAsync() and an extension method ExecuteAndWaitForCompletion(). The last one is obvious: it blocks. Don't use it unless you want your app to freeze during the process. The Execute() method works asynchronously, so you should be able to do something like this:

worker.Execute(inputs);
yield return new WaitForSeconds(0.5f);
var output = worker.PeekOutput(OUTPUT_NAME);

...or query the output in a different method entirely. However, I've noticed that there is still a slight delay even if you just call Execute() and nothing else, causing the camera feed to jitter slightly. This might be less noticeable on newer devices, so try before you buy.
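One way to "query the output in a different method" is to fire off Execute() in one place and poll for the result a few frames later. A minimal sketch, reusing the same worker, inputs and OUTPUT_NAME as above (the 30-frame delay is an arbitrary choice):

private bool inferenceRunning;
private int framesSinceExecute;

private void RunInference(Dictionary<string, Tensor> inputs)
{
    worker.Execute(inputs);  // schedules the work and returns quickly
    inferenceRunning = true;
    framesSinceExecute = 0;
}

private void Update()
{
    if (!inferenceRunning) return;

    // Give the GPU some frames before reading the result back.
    if (++framesSinceExecute >= 30)
    {
        var output = worker.PeekOutput(OUTPUT_NAME);
        inferenceRunning = false;
        // ...interpret output here...
    }
}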

ExecuteAsync() seems like a very nice option for running inference asynchronously: it returns an enumerator which you can run with StartCoroutine(worker.ExecuteAsync(inputs)). However, internally this method does yield return null after each layer, which means executing one layer per frame. Depending on the number of layers in your model and the complexity of the operations in them, that might just be too often and cause the model to execute much slower than it could (as I found out is the case with the mobilenet model for image classification). The YOLO model does seem to work better with ExecuteAsync() than with the other methods, although the number of devices I can test it on is quite limited.
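If you go the ExecuteAsync() route, the natural pattern is a coroutine that yields until the enumerator finishes and only then reads the output. A small sketch, again reusing the names from above:

private IEnumerator Detect(Dictionary<string, Tensor> inputs)
{
    // Runs one layer per frame until the whole model has executed.
    yield return StartCoroutine(worker.ExecuteAsync(inputs));

    var output = worker.PeekOutput(OUTPUT_NAME);
    // ...interpret output here...
}

// Kick it off from anywhere on the main thread:
// StartCoroutine(Detect(inputs));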

Playing around with different methods to run a model, I found another possibility: since ExecuteAsync() is an IEnumerator, you can iterate it manually, executing as many layers per frame as you want:

var enumerator = this.worker.ExecuteAsync(inputs);
int i = 0;

while (enumerator.MoveNext())
{
    i++;
    if (i >= 20)
    {
        i = 0;
        // Hand control back to Unity every 20 layers so the frame rate doesn't tank.
        yield return null;
    }
}

That's a bit hacky and totally not platform-independent - so judge for yourself - but I found it actually works better for the mobilenet image classification model, causing minimal lag.

Conclusion

Once again, Barracuda is still in early development, and a lot of things change often and radically. For example, my project was tested with the 0.4.0-preview version of the plugin, and 0.5.0-preview already breaks it, making object detection produce wrong results (so make sure to install 0.4.0 if you're going to try the example; I'll be looking into newer versions later). But I think that an inference engine that works cross-platform without all the hassle of TensorFlow versions, supports ONNX out of the box and is baked into Unity is a great development.

So does the Barracuda example give better performance than the TensorFlow Sharp one? It's hard to compare, especially for object detection. Different architectures are used, Tiny-YOLO has a lot fewer labels than SSD-MobileNet, the TFSharp example is faster with OpenGL ES while Barracuda works better with Vulkan, the async strategies are different, and it's not clear how to get the best async results from Barracuda yet. But comparing image classification with MobileNet v1 on my Galaxy S8, I got the following inference times running it synchronously: ~400ms for TFSharp with OpenGL ES 2/3 vs ~110ms for Barracuda with Vulkan. But what's more important is that Barracuda will likely be developed, supported and improved for years to come, so I fully expect even bigger performance gains in the future.

I hope this article and example will be useful for you and thanks for reading!

Check out complete code on my github: https://github.com/Syn-McJ/TFClassify-Unity-Barracuda