MediaPipe Box Tracking

Overview

MediaPipe Box Tracking has been powering real-time tracking in Motion Stills, YouTube’s privacy blur, and Google Lens for several years, leveraging classic computer vision approaches.

The box tracking solution consumes image frames from a video or camera stream, together with timestamped starting box positions that mark the 2D regions of interest to track, and computes the tracked box positions for each frame. In this specific use case, the starting box positions come from object detection, but they can also be provided manually by the user or by another system. Our solution consists of three main components: a motion analysis component, a flow packager component, and a box tracking component. Each component is encapsulated as a MediaPipe calculator, and the box tracking solution as a whole is represented as a MediaPipe subgraph.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer.

In the box tracking subgraph, the MotionAnalysis calculator extracts features (e.g. high-gradient corners) across the image, tracks those features over time, classifies them into foreground and background features, and estimates both local motion vectors and the global motion model. The FlowPackager calculator packs the estimated motion metadata into an efficient format. The BoxTracker calculator takes this motion metadata from the FlowPackager calculator, along with the positions of the starting boxes, and tracks the boxes over time. Using solely the motion data produced by the MotionAnalysis calculator (without the need for the RGB frames), the BoxTracker calculator tracks individual objects or regions while distinguishing them from others. Please see Object Detection and Tracking using MediaPipe on the Google Developers Blog for more details.

An advantage of our architecture is that, by separating motion analysis into a dedicated MediaPipe calculator and tracking features over the whole image, we enable great flexibility and a constant computation cost that is independent of the number of regions tracked. Because tracking does not rely on the RGB frames, our solution can cache the motion metadata across a batch of frames. Caching enables tracking of regions both backwards and forwards in time, or even syncing directly to a specified timestamp for tracking with random access.
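The idea of tracking from cached motion metadata alone can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the BoxTracker implementation: it assumes the cached metadata is a single hypothetical (dx, dy) translation per frame transition, whereas the real calculators work with feature-level motion vectors and a global motion model. The names `Box`, `Motion`, and `track_to` are invented for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its top-left corner and size (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


# Assumption: motions[i] is the (dx, dy) translation that carries image
# content from frame i to frame i + 1. No RGB frames are stored.
Motion = Tuple[float, float]


def track_to(box: Box, start: int, target: int, motions: List[Motion]) -> Box:
    """Move `box` from frame `start` to frame `target` using cached motion only."""
    x, y = box.x, box.y
    if target >= start:
        # Forward in time: apply each cached transition in order.
        for i in range(start, target):
            dx, dy = motions[i]
            x, y = x + dx, y + dy
    else:
        # Backward in time: undo the cached transitions in reverse order.
        for i in range(start - 1, target - 1, -1):
            dx, dy = motions[i]
            x, y = x - dx, y - dy
    return Box(x, y, box.w, box.h)


# Four frames, hence three cached transitions.
motions = [(2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]
start_box = Box(10.0, 10.0, 5.0, 5.0)

forward = track_to(start_box, 0, 3, motions)   # frame 0 -> frame 3
backward = track_to(forward, 3, 0, motions)    # frame 3 -> frame 0
```

Because `track_to` reads only the cached list, the same box can be moved to any frame index in either direction, which is the property the paragraph above describes as random access.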

Object Detection and Tracking

MediaPipe Box Tracking can be paired with ML inference, resulting in valuable and efficient pipelines. For instance, box tracking can be paired with ML-based object detection to create an object detection and tracking pipeline. With tracking, this pipeline offers several advantages over running detection per frame (e.g., MediaPipe Object Detection):

- It provides instance-based tracking, i.e. the object ID is maintained across frames.
- Detection does not have to run on every frame. This makes it possible to run heavier, more accurate detection models while keeping the pipeline lightweight and real-time on mobile devices.
- Object localization is temporally consistent with the help of tracking, meaning less jitter is observable across frames.


Fig 1. Box tracking paired with ML-based object detection.

The object detection and tracking pipeline can be implemented as a MediaPipe graph, which internally utilizes an object detection subgraph, an object tracking subgraph, and a renderer subgraph.

In general, the object detection subgraph (which performs ML model inference internally) runs only upon request, e.g. at an arbitrary frame rate or triggered by specific signals. More specifically, in this particular graph a PacketResampler calculator temporally subsamples the incoming video frames to 0.5 fps before they are passed into the object detection subgraph. This frame rate can be configured differently as an option in PacketResampler.
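The effect of temporal subsampling can be sketched in a few lines of Python. This is an illustrative simplification, not the PacketResampler algorithm: it assumes frame timestamps in microseconds and keeps a frame whenever at least one target period has elapsed since the last kept frame. The function name `subsample` and its parameters are hypothetical.

```python
from typing import Iterable, List


def subsample(timestamps_us: Iterable[int], target_fps: float) -> List[int]:
    """Return the timestamps of frames that would be forwarded to detection."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    period_us = 1_000_000 / target_fps  # time between forwarded frames
    kept: List[int] = []
    next_due = None  # earliest timestamp at which the next frame is kept
    for ts in timestamps_us:
        if next_due is None or ts >= next_due:
            kept.append(ts)
            next_due = ts + period_us
    return kept


# Five seconds of 30 fps input, timestamps in microseconds.
frames = [round(i * 1_000_000 / 30) for i in range(150)]

# At 0.5 fps, detection sees one frame every two seconds.
detection_frames = subsample(frames, target_fps=0.5)
```

In this example only 3 of the 150 input frames reach the detector, while every frame would still be available to the tracker.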

The object tracking subgraph runs in real-time on every incoming frame to track the detected objects. It expands the box tracking subgraph with additional functionality: when new detections arrive, it uses IoU (Intersection over Union) to associate the currently tracked objects/boxes with the new detections, so that obsolete or duplicated boxes can be removed.
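As a rough illustration of IoU-based association, the Python sketch below matches each new detection to the tracked box it overlaps most. It rests on several assumptions that do not come from the MediaPipe source: boxes are (xmin, ymin, xmax, ymax) tuples, the match threshold is 0.5, and the helper names `iou` and `associate` are hypothetical. It covers only the duplicate-avoidance part; how the actual subgraph decides that a box is obsolete is not modeled here.

```python
from typing import Dict, List, Tuple

BoxXYXY = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracked: Dict[int, BoxXYXY],
    detections: List[BoxXYXY],
    next_id: int,
    threshold: float = 0.5,  # assumed value, chosen for illustration
) -> Tuple[Dict[int, BoxXYXY], int]:
    """Merge new detections into the tracked set without creating duplicates.

    A detection that overlaps an existing track refreshes that track's box
    and keeps its ID; a detection with no sufficient overlap gets a new ID.
    """
    updated = dict(tracked)
    for det in detections:
        best_id, best_score = None, threshold
        for track_id, box in tracked.items():
            score = iou(det, box)
            if score >= best_score:
                best_id, best_score = track_id, score
        if best_id is None:
            updated[next_id] = det
            next_id += 1
        else:
            updated[best_id] = det
    return updated, next_id


tracked = {0: (0.0, 0.0, 10.0, 10.0)}
detections = [(1.0, 1.0, 11.0, 11.0), (50.0, 50.0, 60.0, 60.0)]
updated, next_id = associate(tracked, detections, next_id=1)
```

Here the first detection overlaps track 0 with an IoU of 81/119 (about 0.68), so it keeps ID 0 rather than spawning a duplicate, while the second detection has no overlap and receives the new ID 1.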

Example Apps

Please first see general instructions for Android, iOS and desktop on how to build MediaPipe examples.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For more information on how to visualize its associated subgraphs, please see visualizer documentation.

Mobile

Note: Object detection runs with TensorFlow Lite on the GPU, while tracking runs on the CPU.

- Graph: mediapipe/graphs/tracking/object_detection_tracking_mobile_gpu.pbtxt
- Android target (or download the prebuilt ARM64 APK): mediapipe/examples/android/src/java/com/google/mediapipe/apps/objecttrackinggpu:objecttrackinggpu
- iOS target: Not available

Desktop

Running on CPU (both for object detection using TensorFlow Lite and tracking):
- Graph: mediapipe/graphs/tracking/object_detection_tracking_desktop_live.pbtxt
- Target: mediapipe/examples/desktop/object_tracking:object_tracking_cpu
Running on GPU: Not available

Resources

Google Developers Blog: Object Detection and Tracking using MediaPipe
Google AI Blog: Get moving with the new Motion Stills
YouTube Creator Blog: Blur moving objects in your video with the new Custom blurring tool on YouTube