3D Multi-Object Tracking Using Graph Neural Networks with Cross-Edge Modality Attention

3D multi-object tracking (MOT) is an essential component of the scene understanding pipeline of autonomous robots. It aims at inferring associations between occurrences of object instances at different time steps in order to predict plausible 3D trajectories. These trajectories are then used in various downstream tasks such as trajectory prediction or navigation. Owing to recent advances in LiDAR-based object detection [1], the 3D tracking task has also seen significant performance improvements. Real-world deployment of these online methods in areas such as autonomous driving poses several challenges. When regulatory approval is required, their robust behavior must be demonstrated on large sets of reference data, which are arduous to obtain due to the lack of extensive ground truth. Therefore, performing high-quality offline labeling of real-world traffic scenes provides the means to test online methods at a larger scale and further sets a benchmark for what is within the realm of possibility for online methods.

Offline 3D Multi-Object Tracking

We present Batch3DMOT, an offline 3D tracking framework that follows the tracking-by-detection paradigm and utilizes multiple sensor modalities (camera, LiDAR, radar) to solve a multi-frame, multi-object tracking objective. Sets of 3D object detections per frame are first turned into attributed nodes. In order to learn offline 3D tracking, we employ a graph neural network (GNN) that performs time-aware neural message passing with intermediate frame-wise attention-weighted neighborhood convolutions. Different from popular Kalman-based approaches, which essentially track objects of different semantic categories independently, our method uses a single model that operates on category-disjoint graph components. As a consequence, it leverages inter-category similarities to improve tracking performance.


When examining typically used modalities such as LiDAR, we can make a striking observation: on the one hand, detection features such as bounding box size or orientation are consistently available across time. A similar observation holds for camera features, even if the object is (partially) occluded. On the other hand, sensor modalities such as LiDAR or radar do not necessarily share this availability. Due to their inherent sparsity, constructing a feature, e.g., for faraway objects, is often impractical because it does not yield a discriminative representation usable for tracking. This potential modality intermittence translates to sparsity in the graph domain, which we tackle in this work using our proposed cross-edge modality attention. This enables an edge-wise agreement on whether to include a particular modality in node similarity finding.

Technical Approach

Feature Encoding and Graph Construction

In this approach, we turn detections into nodes on a category-disconnected tracking graph spanning multiple frames. The node attributes are inferred from the 3D pose and motion information of the respective 3D object detection proposal, while relative pose differences constitute the edge attributes. Furthermore, we use multiple feature encoders to construct lower-dimensional feature representations of the involved modalities (camera, LiDAR, and radar). Depending on the object-specific availability of each modality, we perform cross-edge modality attention to construct a modality edge feature. In order to minimize the number of edges contained in a graph, we only connect kinematically similar objects using a k-NN scheme with forward-time directed edges.
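As a rough illustration, the forward-time k-NN edge construction can be sketched in a few lines of NumPy. The box-center distance used here is only a stand-in for the kinematic similarity criterion, and the function and variable names are illustrative, not taken from the actual implementation:

```python
import numpy as np

def build_tracking_graph(frames, k=3):
    """Connect detections across consecutive frames with forward-time
    directed k-NN edges, using box-center distance as a simple proxy
    for kinematic similarity. `frames` is a list of (N_t, 3) arrays of
    box centers, one per frame; nodes are numbered consecutively."""
    offsets = np.cumsum([0] + [len(f) for f in frames])
    edges = []
    for t in range(len(frames) - 1):
        src, dst = frames[t], frames[t + 1]
        if len(src) == 0 or len(dst) == 0:
            continue
        # pairwise distances between detections in frame t and frame t+1
        d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
        for i in range(len(src)):
            for j in np.argsort(d[i])[:k]:  # k nearest successors
                edges.append((offsets[t] + i, offsets[t + 1] + int(j)))
    return edges
```

Restricting edges to the k nearest candidates in the next frame keeps the graph sparse, which is what makes multi-frame batch inference tractable.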

Batch3DMOT Architecture and Trajectory Inference

The core learning paradigm used in Batch3DMOT follows a time-aware neural message passing scheme executed on disconnected graph components. This is extended by performing frame-wise attention-weighted k-NN neighborhood convolutions (GAT) across graph components to allow information exchange between different semantic categories. In order to substantiate the process of node similarity finding, we introduce a novel edge-wise modality attention module that takes the modality features of the involved objects and performs cross-attention. The GNN is trained using a class-balanced edge loss, where class frequencies are inferred from the number of ground-truth annotations contained in the training set. The network outputs sigmoid-scaled probabilities denoting the respective edge activation scores.
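The gating effect of the modality attention can be illustrated with a deliberately simplified sketch: instead of the learned cross-attention between the two objects' features described above, a single scoring vector ranks the modalities available on an edge, and unavailable ones are masked out before the softmax. All names and the scoring scheme are illustrative assumptions, not the paper's module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_edge_modality_attention(feats, available, w_score):
    """Fuse per-modality edge features into one modality edge feature.
    `feats` maps modality name -> feature vector, `available` marks
    which modalities could be extracted for both endpoints, and
    `w_score` is a stand-in for a learned scoring vector. Unavailable
    modalities are excluded entirely, so intermittent LiDAR/radar
    features cannot dilute the fused representation."""
    names = [m for m in feats if available[m]]
    scores = np.array([feats[m] @ w_score for m in names])
    weights = softmax(scores)  # attention over available modalities only
    fused = sum(w * feats[m] for w, m in zip(weights, names))
    return fused, dict(zip(names, weights))
```

The key point the sketch captures is the edge-wise agreement on modality usage: masking happens per edge, so the same object can contribute its radar feature on one edge and omit it on another.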


Batch3DMOT architecture
Figure: Overview of our Batch3DMOT architecture. A cross-edge modality attention mechanism fuses the features of the involved objects to construct an edge feature (left). Message passing including inter-category neighborhood attention propagates information. Blue arrows denote time-aware message passing, and red arrows denote frame-wise information propagation (middle). Predicted edge scores are turned into trajectory hypotheses using agglomerative trajectory clustering (right).

In order to turn the predicted edge scores into plausible trajectories, a set of tracking constraints (e.g., the in-flow and out-flow per node is at most 1) is enforced using our agglomerative trajectory clustering paradigm, which tries to include as many edges as possible until a certain lower threshold is reached. Based on the assumption that the predicted edge scores exhibit some inherent order, i.e., false-positive edges receive lower scores than true-positive edges within local neighborhoods of the graph, we propose a descending score-based handling. Empty (ordered) clusters are initialized that will later hold the output trajectories. We then loop through all edges from the highest to the lowest score (until a certain category-specific lower threshold is reached) and check whether each edge is constrained or unconstrained. If constrained, we check whether the edge would add time-wise leading or trailing nodes to one of the temporary clusters, or whether it joins two clusters; in the case of joining two clusters, an additional score-wise threshold needs to be met. Otherwise, the edge does not violate any tracking constraints and a new cluster is initialized.
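Stripped of the category-specific thresholds and the additional cluster-joining check, the descending-score clustering reduces to the following greedy sketch (a single global threshold and illustrative names are assumptions of this simplification):

```python
def cluster_trajectories(edges, score_thresh=0.5):
    """Greedy agglomerative clustering: traverse edges by descending
    score and accept an edge only if it keeps in- and out-degree <= 1
    per node. Forward-time directed edges guarantee acyclicity.
    `edges` is a list of (src, dst, score); returns node-id lists."""
    succ, pred = {}, {}
    for u, v, s in sorted(edges, key=lambda e: -e[2]):
        if s < score_thresh:
            break                      # all remaining edges score lower
        if u not in succ and v not in pred:
            succ[u], pred[v] = v, u    # link u -> v
    # walk each chain from its head node to emit a trajectory
    heads = [n for n in succ if n not in pred]
    tracks = []
    for h in heads:
        track = [h]
        while track[-1] in succ:
            track.append(succ[track[-1]])
        tracks.append(track)
    return tracks
```

Because edges are consumed in descending score order, a rejected high-scoring edge can only lose to an even higher-scoring competitor, which is exactly the inherent-order assumption stated above.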

Video

Code

A software implementation of this project based on PyTorch and PyTorch Geometric, including trained models, is available in our GitHub repository for academic usage and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publication

Martin Büchner and Abhinav Valada,
"3D Multi-Object Tracking Using Graph Neural Networks with Cross-Edge Modality Attention"
IEEE Robotics and Automation Letters (RA-L), 2022.

(PDF) (BibTeX)

People