Supercharging Object Detection in Video: Optimizing Decoding and Graph Feeding

In the previous post we validated our install and ran a simple detector in C++. It is now time to start optimizing it. Source code for the finished project is here.

Optimizing Video Decoding

If we build and run the video_reader.cpp OpenCV sample, we will observe a staggering performance improvement available in OpenCV for decoding and reading video.

It is somewhat tricky to make the actual sample work, so I summarized the necessary steps gleaned from some wise folks on GitHub Issues in this repo.

As the screenshot above shows, decoding the video on the GPU and leaving the frames there gives an order of magnitude performance improvement. This is where our potential gain lies: keeping frames on the GPU lets us skip unnecessary and expensive memory copies, and it sets the stage for TensorRT, which consumes data already on the GPU.

The first step towards this goal is to optimize feeding the Tensorflow graph.

Feeding Tensorflow Graph from the GPU

We are now working with the final version of this application from this repo. The first thing to do is to allocate a GPU tensor and fill it with decoded data, which, at this point, is also residing on the GPU in a GpuMat structure. Let’s deal with this copy first. Here we are just as lucky as we were with bridging Mat with Tensorflow tensors.

Status readTensorFromGpuMat(const cv::cuda::GpuMat& g_mat, Tensor& outTensor) {
    tensorflow::uint8 *p = outTensor.flat<tensorflow::uint8>().data();
    cv::cuda::GpuMat fakeMat(g_mat.rows, g_mat.cols, CV_8UC3, p);  // header over tensor memory, no copy

    // comes in with 4 channels -> 3 channels
    cv::cuda::cvtColor(g_mat, fakeMat, cv::COLOR_BGRA2RGB);

    return Status::OK();
}

A noteworthy bit here is on line 6: the decoded frame has 4 channels, so we use cvtColor to drop the alpha (transparency) channel, which our network does not use.
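
To make the channel handling concrete, here is a minimal CPU-side sketch of the per-pixel work a BGRA to RGB conversion performs when it writes into memory the caller already owns. This is an illustration only: bgraToRgb is a hypothetical helper, not an OpenCV or Tensorflow function, and it assumes tightly packed 8-bit pixels. The real code above does this on the GPU through cv::cuda::cvtColor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration (not part of OpenCV or the project).
// Converts tightly packed BGRA pixels to RGB, writing into a buffer the
// caller already owns, which is the role the tensor memory plays above.
void bgraToRgb(const std::vector<std::uint8_t>& bgra,
               std::uint8_t* rgb_out, std::size_t pixel_count) {
    assert(bgra.size() == pixel_count * 4);
    for (std::size_t i = 0; i < pixel_count; ++i) {
        const std::uint8_t* src = &bgra[i * 4];
        std::uint8_t* dst = rgb_out + i * 3;
        dst[0] = src[2];  // R comes from the third input byte
        dst[1] = src[1];  // G stays in the middle
        dst[2] = src[0];  // B comes from the first input byte
        // src[3] (alpha) is dropped
    }
}
```

The output buffer must hold pixel_count * 3 bytes. Writing straight into the destination avoids an intermediate image, which is the same reason the GpuMat header is wrapped around the tensor's memory.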

Allocating CUDA Tensor

Carefully studying the Tensorflow code shows how to allocate a tensor on the GPU:

// GPU allocator
#include "tensorflow/core/common_runtime/gpu/gpu_id.h"
#include "tensorflow/core/common_runtime/gpu/gpu_id_utils.h"
#include "tensorflow/core/common_runtime/gpu/gpu_init.h"
#include "tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.h"

const string gpu_device_name = GPUDeviceName(session.get());

// allocate tensor on the GPU
tensorflow::TensorShape shape = tensorflow::TensorShape({1, height, width, 3});

tensorflow::PlatformGpuId platform_gpu_id(0);

tensorflow::GPUMemAllocator *sub_allocator =
    new tensorflow::GPUMemAllocator(
        platform_gpu_id, false /*use_unified_memory*/, {}, {});

tensorflow::GPUBFCAllocator *allocator =
    new tensorflow::GPUBFCAllocator(
        sub_allocator, shape.num_elements() * sizeof(tensorflow::uint8), "GPU_0_bfc");

inputTensor = Tensor(allocator, tensorflow::DT_UINT8, shape);

To confirm the tensor is indeed residing on the GPU:

bool IsCUDATensor(const Tensor &t) {
    cudaPointerAttributes attributes;
    cudaError_t err =
        cudaPointerGetAttributes(&attributes, t.tensor_data().data());
    // an invalid-value error means the pointer is not known to CUDA
    if (err == cudaErrorInvalidValue)
        return false;
    CHECK_EQ(cudaSuccess, err) << cudaGetErrorString(err);
#if CUDART_VERSION >= 10000
    return (attributes.type == cudaMemoryTypeDevice);
#else
    return (attributes.memoryType == cudaMemoryTypeDevice);
#endif
}

CUDA 10 deprecates the memoryType attribute, so the conditional compilation avoids compiler warnings.

Feeding Tensorflow Graph from the GPU

Feeding a graph directly from GPU memory is not standard; see a long discussion on GitHub.
An experimental API exists for it, so things will probably change, but as of release 1.12 it still works. This is a sample from Google.

In our case this works:

CallableOptions opts;
std::unique_ptr<tensorflow::Session> session;
Session::CallableHandle feed_gpu_fetch_cpu;

const string inputLayer = "image_tensor:0";
const vector<string> outputLayer = {"detection_boxes:0", "detection_scores:0", "detection_classes:0", "num_detections:0"};

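// register the feed and every fetch with the callable
opts.add_feed(inputLayer);
for (auto const &value : outputLayer)
    opts.add_fetch(value);

// tell Tensorflow the input tensor already lives on the GPU
const string gpu_device_name = GPUDeviceName(session.get());
opts.mutable_feed_devices()->insert({inputLayer, gpu_device_name});

auto runStatus = session->MakeCallable(opts, &feed_gpu_fetch_cpu);
if (!runStatus.ok())
    LOG(ERROR) << "Failed to make callable";

std::vector<Tensor> outputs;
runStatus = session->RunCallable(feed_gpu_fetch_cpu, {inputTensor}, &outputs, nullptr);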

We can compare the results by looking at NVIDIA Profiler results for our previous app and the current one:

Feeding from the CPU

Feeding from the GPU

(See the regions framed in deep pink on images above, marking large chunks of memory moved from host to device in the top snapshot)

The profile was taken over 20 seconds, and we can see the difference in bytes moved back and forth. We can also see individual bursts of 2.76 MB moved from host to device on the “CPU” profile that do not appear on the “GPU” one. It is easy enough to calculate that 2.76 MB is the size of a decoded frame.
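
That calculation can be sketched as follows. The post does not state the video resolution, so 1280x720 is an assumption here, chosen because it reproduces the 2.76 figure with 3 bytes per pixel. frameBytes is a hypothetical helper, and it computes the same product as shape.num_elements() * sizeof(tensorflow::uint8) for a {1, height, width, 3} tensor.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration: bytes in one tightly packed
// frame with 8 bits (one byte) per channel.
std::size_t frameBytes(std::size_t width, std::size_t height,
                       std::size_t channels) {
    assert(width > 0 && height > 0 && channels > 0);
    return width * height * channels;
}
```

Under that assumption, frameBytes(1280, 720, 3) is 2,764,800 bytes, or about 2.76 MB in decimal megabytes, which matches the size of the bursts in the profile.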

Performance Gain

So, how much did we gain through all this? A whopping 10%. We did expect more for all this work; however, we will use what we learned to enable bigger gains down the line. It is now time to move to TensorRT.
