We will explore running object detection on-device on HoloLens, using the Unity game engine as our development platform.
AR Academy is a great introduction to all aspects of HoloLens development. It has 300-level tutorials that demonstrate how to connect the device to Azure Cognitive Services to perform machine learning tasks, but there are no samples that perform those tasks entirely on-device. The application described here fills that gap.
While this is not a step-by-step tutorial, the source code, general knowledge of Unity development, and some basic understanding of the Mixed Reality Toolkit (AR Academy is a great place to get started) should be enough to follow along.
The Application
We walk around wearing the HoloLens and capture images of the surrounding world. These images go through an object detection neural net and turn into holograms, which are placed back into our world as close as possible to the real objects captured in them, with lasers pointing from the camera (our eyes) toward the centers of the detected objects.
The objects are detected using Tiny YOLOv2, the only object detection model available in the ONNX model zoo at the time this app was developed.
Sources and Setup
This is a Unity 2017.4.16, .NET 4.6 app using Mixed Reality Toolkit 2017.4.3.0. Here are the source code and setup instructions for the completed application.
The application workflow can be separated into three sections:
- Capturing scenes for detection
- Interfacing with model
- Interacting with holograms
Capturing Video
We are using a slightly modified version of the Vulcan Software video capture plugin. The plugin not only nicely wraps up all the code you would otherwise have to write for capturing video on HoloLens, but also has a few helper methods for converting coordinates between the camera projection and the world, which we need both to place the holograms correctly and to shoot lasers at them.
We make a slight modification to the plugin: in the file VideoCaptureSample.cs we make VideoFrame public. We are going to need the VideoFrame structure to send it to the object detector.
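To show where a frame leaves the plugin and enters the detector, here is a minimal sketch of a frame handler. It assumes a VideoCapture instance has already been created and started (the plugin ships example scripts for that), that the sample exposes the now-public VideoFrame, and that the detector call shown in the comment is a hypothetical stand-in for the model wrapper:

using HoloLensCameraStream;

public class FrameForwarder
{
    public void Attach(VideoCapture videoCapture)
    {
        // The plugin raises this event for every captured sample.
        videoCapture.FrameSampleAcquired += OnFrameSampleAcquired;
    }

    private void OnFrameSampleAcquired(VideoCaptureSample sample)
    {
        // The camera matrices travel with the sample; we will need them later
        // to place the hologram and unproject the detections.
        float[] cameraToWorld;
        float[] projection;
        sample.TryGetCameraToWorldMatrix(out cameraToWorld);
        sample.TryGetProjectionMatrix(out projection);

        // videoFrame is the member made public in VideoCaptureSample.cs;
        // the detector call below is a placeholder, not the app's actual API.
        // await _detector.DetectAsync(sample.videoFrame);

        sample.Dispose();
    }
}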
Interfacing with Model
The steps here have already been performed in the source code; here they are:
The ONNX frozen graph should be unzipped (7zip is good for this) and dropped under the StreamingAssets subfolder of Assets.
Use mlgen from the Visual Studio Developer Command Prompt to generate C# code from the ONNX file in the StreamingAssets folder. Its parameters are self-explanatory. The only change I needed to make to the generated code was:
public sealed class TinyYoloV2O12Input
{
    public TensorFloat grid; // shape(-1,3,416,416)
}
replaced with:
public sealed class TinyYoloV2O12Input
{
    public ImageFeatureValue grid; // shape(-1,3,416,416)
}
The file containing the generated code is dropped into the Scripts folder of the Assets subdirectory.
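Under the hood, the generated wrapper boils down to the standard Windows ML calls. The sketch below only illustrates that pattern, it is not the generated file itself: the model file name, the helper class, and the loading path from StreamingAssets are assumptions, while the "grid" input name matches the generated code above.

using System;
using System.Threading.Tasks;
using Windows.AI.MachineLearning;
using Windows.Media;
using Windows.Storage;

// Hedged sketch of the Windows ML pattern behind the mlgen-generated classes.
public class TinyYoloDetector
{
    private LearningModel _model;
    private LearningModelSession _session;

    public async Task InitAsync()
    {
        // In a UWP build of a Unity app, StreamingAssets content ends up under
        // Data/StreamingAssets; the .onnx file name here is a placeholder.
        var uri = new Uri("ms-appx:///Data/StreamingAssets/TinyYolo2_model.onnx");
        var file = await StorageFile.GetFileFromApplicationUriAsync(uri);
        _model = await LearningModel.LoadFromStorageFileAsync(file);
        _session = new LearningModelSession(_model);
    }

    public async Task<LearningModelEvaluationResult> EvaluateAsync(VideoFrame frame)
    {
        var binding = new LearningModelBinding(_session);
        // Binding an ImageFeatureValue lets Windows ML convert and resize the frame,
        // which is why the generated input type was switched to ImageFeatureValue.
        binding.Bind("grid", ImageFeatureValue.CreateFromVideoFrame(frame));
        return await _session.EvaluateAsync(binding, string.Empty);
    }
}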
The Visual Studio Tools for AI extension should be disabled for this project: every time the solution is rebuilt from Unity, the extension attempts to regenerate the model interaction classes, and it is annoying to have to remove them from the project each time.
Interacting with Holograms
Just to test-drive some of the Unity and Mixed Reality Toolkit functionality, I have created a toolbar, attached it to a speech source, and added a Receiver element to the scene with a CommandDispatcher component to process the toolbar commands.
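The real CommandDispatcher lives in the repo; as a rough illustration of the pattern, a component like the sketch below can receive recognized keywords through the toolkit's ISpeechHandler interface (the keyword strings and handlers here are made up):

using HoloToolkit.Unity.InputModule;
using UnityEngine;

// Illustrative sketch of a speech-driven command dispatcher,
// not the application's actual command set.
public class CommandDispatcher : MonoBehaviour, ISpeechHandler
{
    public void OnSpeechKeywordRecognized(SpeechEventData eventData)
    {
        switch (eventData.RecognizedText.ToLower())
        {
            case "detect":
                // kick off capture and detection
                break;
            case "clear":
                // remove the current holograms
                break;
        }
    }
}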
The TapToPlaceWithCamera component subclasses the TapToPlace component from the MR Toolkit, so holograms may be tapped, moved around with gaze (by turning one's head), and then placed somewhere else in the world with another tap. This component lives on the Capture element.
A Few Details
Singleton Class
The Mixed Reality Toolkit has a very convenient wrapper for singleton components:
using HoloToolkit.Unity;

public class RaycastLaser : Singleton<RaycastLaser>
{
    ...
}
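Any component derived this way can then be reached from anywhere through its static Instance property, e.g. RaycastLaser.Instance.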
CPU vs GPU
The video capture plugin captures data to system memory, not to the GPU. It would not be that hard to modify the plugin to capture to the GPU, or to make that optional, since Windows ML should support GPU detections on HoloLens, but I have not tried this. (As they write in textbooks: “This is left as an exercise”.)
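If you do want to try the GPU path, the evaluation device is chosen when the Windows ML session is created. A minimal, untested sketch (the SessionFactory helper is made up for illustration):

using Windows.AI.MachineLearning;

public static class SessionFactory
{
    // Create an evaluation session on the requested device.
    // LearningModelDeviceKind.Default lets Windows ML decide;
    // DirectX requests the GPU (untested on HoloLens in this project).
    public static LearningModelSession CreateSession(LearningModel model, bool useGpu)
    {
        var kind = useGpu ? LearningModelDeviceKind.DirectX : LearningModelDeviceKind.Default;
        return new LearningModelSession(model, new LearningModelDevice(kind));
    }
}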
VideoFrame wraps system memory and GPU data in the same structure: if the data is in system memory, it is stored in the SoftwareBitmap property of VideoFrame; otherwise it is stored in the Direct3DSurface property.
While I have not experimented with the GPU, I am still trying to respect the Windows ML memory-type abstraction provided by the VideoFrame structure. So, for instance, the GetDimensionsFromVideoFrame function handles both the system memory and GPU cases:
private Tuple<int, int> GetDimensionsFromVideoFrame(VideoFrame videoFrame)
{
    int width = 0, height = 0;

    if (videoFrame.SoftwareBitmap != null)
    {
        width = videoFrame.SoftwareBitmap.PixelWidth;
        height = videoFrame.SoftwareBitmap.PixelHeight;
    }
    else if (videoFrame.Direct3DSurface != null)
    {
        width = videoFrame.Direct3DSurface.Description.Width;
        height = videoFrame.Direct3DSurface.Description.Height;
    }

    return new Tuple<int, int>(width, height);
}
Asynchronous Processing
Unity applications are single-threaded, while the video capture and detection functions are asynchronous, follow the async/await pattern (asynchronous tasks), and may execute on separate threads. So we need to be careful when creating or interacting with our holograms. All such interactions need to happen on the main thread:
UnityEngine.WSA.Application.InvokeOnAppThread(() =>
{
    picture = HoloPicture.CreateHologram(s.data, _resolution, s.camera2WorldMatrix, s.projectionMatrix);
}, true);
Coordinate Transformation
When we send a captured frame to the object detector, we receive the results back in projected (camera) coordinates. We need the projection matrix as well as the camera-to-world (and world-to-camera) matrix to interact with the holograms: to find the bounding rectangles of detected objects and to shoot lasers at their centers. The LocatableCameraUtils class from the HololensCameraStream plugin provides handy utilities for these coordinate transformations.
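Stripped of the plugin helpers, the gist of the unprojection looks roughly like the hand-rolled sketch below: convert the pixel at the center of a detected box into normalized device coordinates, undo the projection, and push the resulting camera-space direction through the camera-to-world matrix. This only illustrates the idea; it is not the code of LocatableCameraUtils.

using UnityEngine;

public static class Unproject
{
    // Turns a pixel coordinate from the captured frame into a world-space ray
    // direction, given the matrices that came with the frame.
    public static Vector3 PixelToWorldDirection(
        Vector2 pixel, int width, int height,
        Matrix4x4 cameraToWorldMatrix, Matrix4x4 projectionMatrix)
    {
        // Pixel -> normalized device coordinates in [-1, 1].
        var ndc = new Vector2(
            (pixel.x / width) * 2f - 1f,
            1f - (pixel.y / height) * 2f); // image y grows downward

        // Undo the projection to get a camera-space direction (camera looks down -z).
        var viewDir = new Vector3(
            ndc.x / projectionMatrix.m00,
            ndc.y / projectionMatrix.m11,
            -1f);

        // Camera space -> world space (direction only, so no translation).
        return cameraToWorldMatrix.MultiplyVector(viewDir).normalized;
    }
}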
Custom Shader
The AR/HolographicImageBlend shader is used to render the captured frame on top of an empty Quad. The shader uses the world-to-camera and projection matrices to render the frame, so when moving things around with gaze we need to be careful to update those matrices from the main camera. I extend TapToPlace for that purpose:
public class TapToPlaceWithCamera : TapToPlace
{
    protected override void Update()
    {
        base.Update();

        // Only if we are placing
        if (!IsBeingPlaced) { return; }

        var renderer = gameObject.GetComponent<Renderer>();
        renderer.sharedMaterial.SetMatrix("_WorldToCameraMatrix", Camera.main.cameraToWorldMatrix.inverse);
        renderer.sharedMaterial.SetMatrix("_CameraProjectionMatrix", Camera.main.projectionMatrix);
    }
}
Recorder Distortion
When recording a video of interactions with the captured holograms, the holograms get distorted, and I am not sure why. For some reason this only happens while the interactions are being recorded (must be a quantum effect). My best guess is that the code above for updating the custom shader is no longer valid during recording. Maybe converting to a regular shader would fix this, but I have not tried it.
F#
Unfortunately there is still no native F# support in Unity. One day, perhaps!