HoloLens Object Detection

We will explore running object detection on-device with HoloLens, using the Unity game engine as our development platform.

AR Academy is a great introduction to all aspects of HoloLens development. It has 300-level tutorials that demonstrate how to connect the device to Azure Cognitive Services to perform machine learning tasks. What it lacks are samples of performing those tasks completely on-device. The application described here fills this gap.

While this is not a step-by-step tutorial, the source code, general knowledge of Unity development, and some basic understanding of the Mixed Reality Toolkit (AR Academy is a great place to get started) should be enough to follow along.

The Application

We walk around wearing HoloLens and capture images of the surrounding world. These images go through an object detection neural net and turn into holograms, which are placed back into our world as close as possible to the real objects captured in them, with lasers pointing from the camera (our eyes) towards the centers of the detected objects.

The objects are detected using Tiny YOLOv2, the only object detection model available from the ONNX model zoo at the time of developing this app.
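
For orientation, here is a minimal sketch of how a single box can be decoded from the model's raw output. It assumes the published Tiny YOLOv2 layout: a 125 x 13 x 13 channels-first tensor, 5 anchor boxes per cell, and 20 Pascal VOC classes. The class and method names are hypothetical and are not taken from the application's source.

```csharp
using System;

// Hypothetical helper, not part of the application's source.
public static class TinyYoloDecoder
{
    public const int GridSize = 13;      // 416 / 32 cells per side
    public const int BoxesPerCell = 5;
    public const int ClassCount = 20;    // Pascal VOC classes
    public const int CellSize = 32;      // pixels covered by one grid cell
    private const int FeaturesPerBox = ClassCount + 5; // x, y, w, h, confidence, classes

    // Anchor (width, height) pairs in grid-cell units, as published with the model.
    private static readonly float[] Anchors =
        { 1.08f, 1.19f, 3.42f, 4.41f, 6.63f, 11.38f, 9.42f, 5.11f, 16.62f, 10.52f };

    public struct Box
    {
        public float X, Y, Width, Height; // center and size in 416x416 pixel space
        public int ClassIndex;
        public float Score;               // objectness * best class probability
    }

    private static float Sigmoid(float v)
    {
        return 1f / (1f + (float)Math.Exp(-v));
    }

    // Reads one value from the flattened channels-first tensor.
    private static float At(float[] output, int channel, int row, int col)
    {
        return output[(channel * GridSize + row) * GridSize + col];
    }

    public static Box Decode(float[] output, int row, int col, int box)
    {
        int ch = box * FeaturesPerBox;
        var result = new Box();
        result.X = (col + Sigmoid(At(output, ch, row, col))) * CellSize;
        result.Y = (row + Sigmoid(At(output, ch + 1, row, col))) * CellSize;
        result.Width = (float)Math.Exp(At(output, ch + 2, row, col)) * Anchors[2 * box] * CellSize;
        result.Height = (float)Math.Exp(At(output, ch + 3, row, col)) * Anchors[2 * box + 1] * CellSize;
        float confidence = Sigmoid(At(output, ch + 4, row, col));

        // Numerically stable softmax over the class logits.
        float max = float.NegativeInfinity;
        for (int c = 0; c < ClassCount; c++)
            max = Math.Max(max, At(output, ch + 5 + c, row, col));

        float sum = 0f, best = 0f;
        for (int c = 0; c < ClassCount; c++)
        {
            float e = (float)Math.Exp(At(output, ch + 5 + c, row, col) - max);
            sum += e;
            if (e > best) { best = e; result.ClassIndex = c; }
        }
        result.Score = confidence * best / sum;
        return result;
    }
}
```

A caller would typically loop over every row, column, and box, keep the results whose Score exceeds a threshold, and then apply non-maximum suppression to drop overlapping boxes.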

Sources and Setup

This is a Unity 2017.4.16 .NET 4.6 app, using Mixed Reality Toolkit 2017.4.3.0. Here are the source code and setup instructions for the completed application.

The application workflow can be separated into three sections:

  1. Capturing scenes for detection
  2. Interfacing with model
  3. Interacting with holograms

Capturing Video

We are using a slightly modified version of the Vulcan Software video capture plugin. The plugin not only neatly wraps all the code needed for capturing video with HoloLens, but also has a few helper methods for converting coordinates between the camera projection and the world. We need these for placing the holograms correctly and for shooting lasers at them.

We make a slight modification to the plugin: in file VideoCaptureSample.cs we make VideoFrame public. We are going to need the VideoFrame structure to send it to the object detector.

Interfacing with Model

The steps below have already been performed in the source code; they are listed here for reference:

The ONNX frozen graph should be unzipped (7zip is good for this) and dropped under the StreamingAssets subfolder under Assets.

Use mlgen from the Visual Studio Developer Command Prompt to generate C# code from the ONNX file in the StreamingAssets folder. Its parameters are self-explanatory. The file containing the generated code should be dropped under Scripts. The only change I needed to make was:

public sealed class TinyYoloV2O12Input
{
    public TensorFloat grid; // shape(-1,3,416,416)
}

replaced with:

public sealed class TinyYoloV2O12Input
{
    public ImageFeatureValue grid; // shape(-1,3,416,416)
}

This code is dropped into the Scripts folder of the Assets subdirectory.

The Visual Studio Tools for AI extension should be disabled for this project, because every time the solution is rebuilt from Unity it will attempt to regenerate classes for model interaction. It is annoying to have to remove them from the project every time.

Interacting with Holograms

Just to test-drive some of the Unity and Mixed Reality Toolkit functionality, I have created a toolbar, attached it to a speech source, and added a Receiver element to the scene with a CommandDispatcher component to process the toolbar commands.

The TapToPlaceWithCamera component subclasses the TapToPlace component from the MR Toolkit so holograms may be tapped, moved around with gaze (turning one’s head), then placed somewhere else in the world with another tap. This component is on the Capture element.

A Few Details

Singleton Class

A very convenient wrapper for singleton components in the Mixed Reality Toolkit.

using HoloToolkit.Unity;

public class RaycastLaser : Singleton<RaycastLaser>

CPU vs GPU

The video capture plugin captures data to system memory, not to the GPU. It should not be that hard to modify the plugin to capture data to the GPU, or to make this optional, since Windows ML should support GPU detections on HoloLens. I have not tried this. (As they write in textbooks: “This is left as an exercise”).

VideoFrame wraps system memory and GPU data in the same structure. If the data is in system memory, it is stored in the SoftwareBitmap property of VideoFrame; otherwise it is stored in the Direct3DSurface property.

While I have not experimented with the GPU, I am still trying to respect the Windows ML memory type abstraction through the VideoFrame structure. So, for instance, the GetDimensionsFromVideoFrame function handles both the system memory and GPU memory cases described above:

private Tuple<int, int> GetDimensionsFromVideoFrame(VideoFrame videoFrame)
{
    int width = 0, height = 0;

    if (videoFrame.SoftwareBitmap != null)
    {
        // Frame data lives in system memory
        width = videoFrame.SoftwareBitmap.PixelWidth;
        height = videoFrame.SoftwareBitmap.PixelHeight;
    }
    else if (videoFrame.Direct3DSurface != null)
    {
        // Frame data lives in GPU memory
        width = videoFrame.Direct3DSurface.Description.Width;
        height = videoFrame.Direct3DSurface.Description.Height;
    }
    return new Tuple<int, int>(width, height);
}

Asynchronous Processing

Unity applications are single-threaded, while the video capture and detection functions are asynchronous, following the async/await pattern (asynchronous tasks), and may be executed on separate threads. So we need to be careful when creating or interacting with our holograms: all such interactions need to happen on the main thread.

UnityEngine.WSA.Application.InvokeOnAppThread(() =>
{
    picture = HoloPicture.CreateHologram(s.data, _resolution, s.camera2WorldMatrix, s.projectionMatrix);
}, true);
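
The underlying pattern can be sketched without Unity: background threads enqueue actions, and the main thread drains the queue. This is a generic illustration using a hypothetical MainThreadQueue class. It is not how InvokeOnAppThread is implemented, and it is not part of the application's source.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper: collects work from any thread and runs it
// on whichever thread calls Drain (the main thread, in a Unity app).
public class MainThreadQueue
{
    private readonly ConcurrentQueue<Action> _pending = new ConcurrentQueue<Action>();

    // Safe to call from any thread.
    public void Enqueue(Action action)
    {
        if (action == null) throw new ArgumentNullException("action");
        _pending.Enqueue(action);
    }

    // Runs every queued action on the calling thread and
    // returns how many actions were executed.
    public int Drain()
    {
        int executed = 0;
        Action action;
        while (_pending.TryDequeue(out action))
        {
            action();
            executed++;
        }
        return executed;
    }
}
```

In a Unity project, Drain would be called from the Update method of a MonoBehaviour, so that queued hologram work runs once per frame on the main thread. Unlike InvokeOnAppThread with waitUntilDone set to true, this sketch does not block the caller until the action has finished.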

Coordinate Transformation

When we send a captured frame to the object detector, we receive the results back in projected coordinates. We need the projection matrix as well as the camera2world (world2camera) matrix for interacting with the holograms: finding the bounding rectangles of detected objects and shooting lasers to their centers. The LocatableCameraUtils class from the HoloLensCameraStream plugin provides handy utilities for coordinate transformations.
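
As an illustration of the first step in that chain, the sketch below maps a pixel position in the captured frame to the [-1, 1] projected range, the form in which a point is usually fed to the inverse of the projection matrix. It assumes the image origin is the top-left corner with y pointing down. The helper name is hypothetical; the application relies on LocatableCameraUtils for these conversions.

```csharp
// Hypothetical helper illustrating the first step of un-projection.
public static class ProjectedCoords
{
    // Maps a pixel position (origin top-left, y pointing down) to the
    // [-1, 1] projected range (origin at the image center, y pointing up).
    // Returns a two-element array: { x, y }.
    public static float[] FromPixel(float px, float py, int width, int height)
    {
        float x = px / width * 2f - 1f;
        float y = (1f - py / height) * 2f - 1f;
        return new[] { x, y };
    }
}
```

With this convention the center of the frame maps to (0, 0) and the top-left corner to (-1, 1). The remaining steps apply the inverse projection to get a camera-space direction and then the camera2world matrix to get a world-space ray.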

Custom Shader

The AR/HolographicImageBlend shader is used to render the captured frame on top of an empty Quad. This shader uses the world2camera and projection matrices to render the frame, so when moving things around with gaze we need to be careful to update those correctly from the main camera. I extend TapToPlace for that purpose:

public class TapToPlaceWithCamera : TapToPlace
{
    protected override void Update()
    {
        base.Update();
        // Only if we are placing (assumed: TapToPlace's IsBeingPlaced flag)
        if (!IsBeingPlaced) return;

        // Assumed type argument; it was lost from the original listing
        var renderer = gameObject.GetComponent<Renderer>();
        renderer.sharedMaterial.SetMatrix("_WorldToCameraMatrix", Camera.main.cameraToWorldMatrix.inverse);
        renderer.sharedMaterial.SetMatrix("_CameraProjectionMatrix", Camera.main.projectionMatrix);
    }
}

Recorder Distortion

When recording a video of interactions with captured holograms, the holograms get distorted, and I am not sure why. For some reason this only happens when interactions are recorded (must be a quantum effect). My best guess is that the code for updating the custom shader above is no longer valid while recording. Maybe converting to a regular shader would fix this, but I have not tried it.


Unfortunately there is still no native F# support in Unity. One day, perhaps!
