AV2 2024 Scene Flow Challenge Announcement

TL;DR

The AV2 2024 Scene Flow Challenge is focused on the long tail of scene flow. As part of this year’s challenge, we are announcing a new scene flow evaluation protocol.

As part of this announcement, we are releasing:

  • BucketedSceneFlowEval, a pip-installable dataloader and evaluation suite for scene flow with out-of-the-box support for Argoverse 2 and Waymo Open.
  • SceneFlowZoo, a complete codebase based on ZeroFlow [1] for training and evaluating scene flow methods on Argoverse 2 and Waymo Open using our eval protocol and dataloaders.
  • A new EvalAI leaderboard using our new evaluation protocol.


This is a two-track challenge: a supervised track and an unsupervised track. Unsupervised methods must not use any human labels from the Argoverse 2 Sensor dataset. Winning methods will be highlighted at the CVPR 2024 Workshop on Autonomous Driving and will be eligible for prizes (prizes TBD). See the challenge details for more information.

Why are we releasing a new evaluation protocol?

Pedestrians matter in scene flow

Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. High-quality scene flow methods should be able to describe the 3D motion of any object because they have an understanding of geometry and motion.

To ensure scene flow estimators live up to this promise, they must be evaluated across a diversity of objects — it’s just as important to be able to describe the motion of pedestrians, bicyclists, and motorcyclists as it is to describe the motion of cars, trucks, and buses. However, current state-of-the-art scene flow estimators are very far from robustly describing the motion of small but important objects.

Current methods fail on non-car objects

As a qualitative example, consider Argoverse 2’s val sequence 0bae3b5e-417d-3b03-abaa-806b433233b8. The ego vehicle attempts to turn left at an intersection, but pauses to let two pedestrians cross the street. We select this sequence because the two pedestrians are moving close to the ego vehicle, providing unusually high density lidar observations and thus ample point structure for scene flow methods to detect motion. Somewhat surprisingly, current methods fail to estimate scene flow for non-car (e.g. pedestrian) objects.

Figure 1. Accumulated point cloud of AV2’s val 0bae3b5e-417d-3b03-abaa-806b433233b8 sequence. The ego vehicle is preparing to make a left turn (shown in increasing red over time) and slows to allow two pedestrians (shown in blue) to cross the street.
[Figure 2 panels, top to bottom: Ground Truth, NSFP, ZeroFlow, FastFlow3D (supervised)]
Figure 2. We visualize a cherry-picked example of a pedestrian with unusually high density lidar returns. We expect that state-of-the-art scene flow methods should work particularly well on this instance, but find that all methods consistently fail.
  • Ground Truth: Side view of the pedestrians crossing the street. The ground truth flow vectors, shown in red, depict the flow of the green points at time t to their positions at t+1 relative to the blue t+1 point cloud.
  • Neural Scene Flow Prior (NSFP) does not estimate any flow for the pedestrians. NSFP is a popular test-time optimization method for scene flow due to its simplicity and good Average EPE performance. Many recent methods use NSFP as their workhorse flow estimator (Chodosh et al. [3], NSFP++ [4], Vacek et al. [5]).
  • ZeroFlow [1], which uses NSFP pseudo-labels to train a student feed forward network that produces superior quality flow to its NSFP teacher, also does not capture any of the pedestrian motion between these two frames.
  • FastFlow3D, a fully-supervised scene flow method, similarly fails to capture meaningful motion.

Figure 2 qualitatively demonstrates the systematic failures on small objects seen across various state-of-the-art scene flow methods. These failures are present in all types of scene flow methods (e.g. supervised, self-supervised, optimization-based).

Current metrics are biased towards objects with many points

The most common scene flow evaluation metric is Average Endpoint Error (Average EPE): the L2 distance between the endpoints of the estimated and ground truth flow vectors, averaged over all points. Because this average is computed on a per-point basis, not a per-object basis, it is dominated by background points; pedestrian instances are fairly common, but they contribute only a tiny fraction of the total points. On real-world AV datasets, over 80% of lidar points come from the background (Figure 3).
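
For concreteness, a minimal sketch of Average EPE, assuming the predicted and ground truth flows are N x 3 numpy arrays (the function and argument names are illustrative, not the library's API):

import numpy as np

def average_epe(pred_flow: np.ndarray, gt_flow: np.ndarray) -> float:
    # Mean L2 distance between predicted and ground truth flow endpoints,
    # computed per point over N x 3 arrays.
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())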

Chodosh et al. [3] present Threeway EPE, an alternative evaluation metric which partially addresses the foreground / background imbalance. Specifically, Average EPE is computed for three mutually exclusive buckets: Background Points, Foreground Static (points moving < 0.5 m/s), and Foreground Dynamic (points moving > 0.5 m/s). As described in their paper, this prevents background points from dominating the resulting metric; however, it is still biased towards objects with many points. Foreground Dynamic and Foreground Static are dominated by points from CAR and OTHER_VEHICLES (Figure 3), owing to their larger per-instance object size and thus larger number of point returns.
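
A rough sketch of this bucketing, under the assumption that per-point foreground masks and ground truth speeds are available as numpy arrays (names are illustrative, not the reference implementation):

import numpy as np

def threeway_epe(pred_flow, gt_flow, is_foreground, gt_speed_mps, threshold=0.5):
    # Average EPE over three mutually exclusive buckets: Background,
    # Foreground Static (< threshold m/s), Foreground Dynamic (>= threshold m/s).
    epe = np.linalg.norm(pred_flow - gt_flow, axis=1)
    buckets = {
        "Background": ~is_foreground,
        "Foreground Static": is_foreground & (gt_speed_mps < threshold),
        "Foreground Dynamic": is_foreground & (gt_speed_mps >= threshold),
    }
    return {name: float(epe[mask].mean()) for name, mask in buckets.items() if mask.any()}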

Figure 3. Plot of number of points from each semantic meta-class for AV2 val. Although PEDESTRIAN instances are common, they are a small fraction of the total number of points owing to their small size relative to CAR and OTHER_VEHICLES. Number of points (Y axis) shown on a log scale.

Our new metric: Bucketed Scene Flow Eval

Our evaluation metric, Bucketed Scene Flow Eval, addresses these issues by breaking down the point distribution across semantic class and speed in order to directly evaluate performance across the long-tail of points. Our metric is computed as follows:

First, we assign all points in each frame-pair to a bucket (defined by the class-speed matrix, e.g. Table 1). We then compute the average EPE and average ground truth speed per bucket. We compute Normalized EPE for each of the dynamic buckets (the second speed column onwards is considered dynamic) by dividing the average EPE by the average speed.

Class          | 0-0.4 m/s | 0.4-0.8 m/s | 0.8-1.2 m/s | ... | 20-∞ m/s
---------------|-----------|-------------|-------------|-----|----------
BACKGROUND     |     -     |      -      |      -      | ... |    -
CAR            |     -     |      -      |      -      | ... |    -
OTHER_VEHICLES |     -     |      -      |      -      | ... |    -
PEDESTRIAN     |     -     |      -      |      -      | ... |    -
WHEELED_VRU    |     -     |      -      |      -      | ... |    -
Table 1. Example of the structure of the class-speed matrix for Bucketed Scene Flow Eval.
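
A minimal sketch of the per-bucket computation described above, assuming 0.4 m/s speed buckets up to 20 m/s as in Table 1 (the bucket edges, function names, and array layout are assumptions; see BucketedSceneFlowEval for the actual implementation):

import numpy as np

# Illustrative speed bucket edges in m/s: bucket 0 (0-0.4 m/s) is static,
# the remaining buckets (up to 20-inf) are dynamic.
SPEED_EDGES = np.concatenate([np.linspace(0.0, 20.0, 51), [np.inf]])

def bucketed_errors(pred_flow, gt_flow, gt_speed_mps, class_ids, num_classes):
    # Returns {(class, bucket): (average EPE, average GT speed)} for non-empty buckets.
    epe = np.linalg.norm(pred_flow - gt_flow, axis=1)
    bucket_ids = np.digitize(gt_speed_mps, SPEED_EDGES[1:-1])  # 0 = static bucket
    results = {}
    for cls in range(num_classes):
        for b in range(len(SPEED_EDGES) - 1):
            mask = (class_ids == cls) & (bucket_ids == b)
            if mask.any():
                results[(cls, b)] = (float(epe[mask].mean()),
                                     float(gt_speed_mps[mask].mean()))
    return results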

Unlike existing metrics, Normalized EPE lets us answer the question: what fraction of the total speed of the points in a bucket was not described by the estimated flow? A method that predicts only ego motion (or zero vectors, if ego motion is compensated for) will have a Normalized EPE of 1.0 for all dynamic buckets, and a method that perfectly describes all motion will have a Normalized EPE of 0.0 for all dynamic buckets. Methods can exceed 1.0 by producing errors whose magnitude is greater than the average speed; for example, a method that predicts the negative of the true motion will get exactly 2.0 Normalized EPE for every non-empty dynamic bucket (its bucket EPE is exactly twice the average speed). Normalized EPE thus ranges from 0 (perfect) to ∞ (arbitrarily bad), and is undefined for buckets without any points.

We summarize the performance of each meta-class with two numbers:

1) We summarize performance on static objects with the Average EPE of the static bucket (the first speed column). We do not use Normalized EPE here, as dividing by very small or zero ground truth motion is noisy or undefined.
2) We summarize performance on dynamic objects with the Normalized EPE of the dynamic buckets (the second speed column onwards).
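
Continuing the sketch above, the per-class summary might look like the following. One plausible reading of (2) is an unweighted average of the per-bucket Normalized EPEs; the exact aggregation is defined by BucketedSceneFlowEval, so treat this as an assumption:

def summarize_class(results, cls, num_buckets):
    # Returns (static Average EPE, dynamic Normalized EPE) for one meta-class,
    # given the {(class, bucket): (avg EPE, avg speed)} dict from bucketed_errors.
    static_epe = results.get((cls, 0), (float("nan"), 0.0))[0]  # bucket 0 = static
    norm_epes = []
    for b in range(1, num_buckets):  # dynamic buckets: second speed column onwards
        if (cls, b) in results:
            avg_epe, avg_speed = results[(cls, b)]
            if avg_speed > 0:
                norm_epes.append(avg_epe / avg_speed)  # Normalized EPE per bucket
    dynamic_norm_epe = sum(norm_epes) / len(norm_epes) if norm_epes else float("nan")
    return static_epe, dynamic_norm_epe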

We present the results of the ZeroFlow 3x model at 35m evaluated with our proposed metric in Table 2.

Class          | Static (Avg EPE) | Dynamic (Norm EPE)
---------------|------------------|-------------------
BACKGROUND     | 0.01             | —
CAR            | 0.01             | 0.22
OTHER_VEHICLES | 0.02             | 0.43
PEDESTRIAN     | 0.01             | 0.91
WHEELED_VRU    | 0.01             | 0.59

Table 2. Example Bucketed Scene Flow Eval results for the ZeroFlow 3x model from Vedder et al. [1], limited to a 35m box around the ego vehicle. BACKGROUND does not contain any moving points, so its Dynamic error is recorded as “—”. Lower is better.

As we can see in Table 2, ZeroFlow 3x performs well on static points and fairly well on CAR (capturing over 75% of motion), but it performs extremely poorly on PEDESTRIAN (capturing less than 10% of motion). Thus, Bucketed Scene Flow Eval enables us to quantitatively measure the qualitative issues we discovered above.

To summarize overall method performance, we take a mean over the different meta-classes for each column, giving ZeroFlow 3x a mean Average Bucketed Error (mABE) of (0.011, 0.526). In the context of our challenge, methods are ranked by their mean dynamic normalized EPE (i.e. second component of mABE).
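
As a small worked sketch of this aggregation, using the rounded per-class values from Table 2 (the rounded inputs only approximate the published (0.011, 0.526), which comes from unrounded per-bucket values):

# Rounded (Static Avg EPE, Dynamic Norm EPE) values per meta-class from Table 2.
per_class = {
    "BACKGROUND":     (0.01, None),  # no moving points, so no dynamic error
    "CAR":            (0.01, 0.22),
    "OTHER_VEHICLES": (0.02, 0.43),
    "PEDESTRIAN":     (0.01, 0.91),
    "WHEELED_VRU":    (0.01, 0.59),
}

static_vals = [s for s, _ in per_class.values()]
dynamic_vals = [d for _, d in per_class.values() if d is not None]
mabe = (sum(static_vals) / len(static_vals), sum(dynamic_vals) / len(dynamic_vals))
print(mabe)  # ~(0.012, 0.54) with rounded inputs; the published mABE is (0.011, 0.526)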

Figure 4. Plot of mean Dynamic Normalized EPE for various configurations of unsupervised ZeroFlow [1] and supervised FastFlow3D [6]. Supervised methods shown with hatching. Lower is better.

The relative ordering of methods (e.g. ZeroFlow 3x XL vs ZeroFlow 1x) is preserved between traditional metrics like Threeway EPE and our mean Dynamic Normalized EPE; however, Bucketed Scene Flow Eval highlights that state-of-the-art scene flow estimators have enormous room for improvement before they make good on the promise of general motion detection: ZeroFlow 3x XL, which achieves state-of-the-art Threeway EPE, describes on average only 50% of motion per meta-class (Figure 4).

Challenge Details

Supervised and Unsupervised Tracks

Our challenge has two tracks: a supervised track and an unsupervised track. If you submit to the unsupervised track, your method must not use any of the human labels provided in the Argoverse 2 Sensor dataset; however, labels from other datasets or from shelf-supervised methods (e.g. Segment Anything [2]) are allowed as part of developing your method.

CVPR 2024 Workshop on Autonomous Driving Highlighted Methods

To be considered highlight-eligible at the CVPR 2024 Workshop on Autonomous Driving, a submission must include a whitepaper describing its methodology in a reasonable amount of detail. Methods that are particularly innovative, novel, or suitable for real-world usage may be highlighted with additional recognition outside of the standard leaderboard ranking.

Submission Format

Submissions should be in the form of a zip file containing each sequence’s estimated flow. The details of the format are described in the BucketedSceneFlowEval repository and an automatic creation script is provided as part of the standalone SceneFlowZoo repository.

AV2 Meta Classes

As part of Bucketed Scene Flow Eval, we define the following meta-classes for the Argoverse 2 dataset evaluation:

BACKGROUND = ['BACKGROUND']

CAR = ['REGULAR_VEHICLE']

OTHER_VEHICLES = [
   'BOX_TRUCK', 'LARGE_VEHICLE', 'RAILED_VEHICLE', 'TRUCK', 'TRUCK_CAB',
   'VEHICULAR_TRAILER', 'ARTICULATED_BUS', 'BUS', 'SCHOOL_BUS'
]

PEDESTRIAN = [
   'PEDESTRIAN', 'STROLLER', 'WHEELCHAIR', 'OFFICIAL_SIGNALER'
]

WHEELED_VRU = [
   'BICYCLE', 'BICYCLIST', 'MOTORCYCLE', 'MOTORCYCLIST', 'WHEELED_DEVICE',
   'WHEELED_RIDER'
]
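
For illustration, a hypothetical helper (not part of the library's API) that maps an AV2 category string to its meta-class using the lists above:

# Hypothetical mapping from each AV2 category to its meta-class name.
METACLASSES = {
    "BACKGROUND": BACKGROUND,
    "CAR": CAR,
    "OTHER_VEHICLES": OTHER_VEHICLES,
    "PEDESTRIAN": PEDESTRIAN,
    "WHEELED_VRU": WHEELED_VRU,
}
CATEGORY_TO_METACLASS = {cat: name for name, cats in METACLASSES.items() for cat in cats}

assert CATEGORY_TO_METACLASS["BICYCLIST"] == "WHEELED_VRU"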

Organizers

The AV2 2024 Scene Flow Challenge is organized by:

Bibliography

[1] K. Vedder et al., “ZeroFlow: Scalable Scene Flow via Distillation,” in Twelfth International Conference on Learning Representations (ICLR), 2024.

[2] A. Kirillov et al., “Segment Anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[3] N. Chodosh, D. Ramanan, and S. Lucey, “Re-evaluating LiDAR Scene Flow,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.

[4] M. Najibi et al., “Motion inspired unsupervised perception and prediction in autonomous driving,” European Conference on Computer Vision (ECCV), 2022.

[5] P. Vacek, D. Hurych, K. Zimmermann, P. Perez, and T. Svoboda, “Regularizing self-supervised 3D scene flows with surface awareness and cyclic consistency,” arXiv, 2023.

[6] P. Jund, C. Sweeney, N. Abdo, Z. Chen, and J. Shlens, “Scalable scene flow from point clouds in the real world,” IEEE Robotics and Automation Letters (RAL / ICRA), 2022.