We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the model's ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement over prior methods on three standard benchmarks, KITTI, nuScenes, and Argoverse 2, as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By requiring neither fine-tuning nor camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.
Our method facilitates generalization via minimal and versatile image-based priors, integrated throughout our model structure. Given a pair of input images, our model computes a rich multimodal embedding through a transformer-based fusion module. The embedding is then passed to a two-branch decoder MLP that outputs real-world translation and rotation. Our architecture leverages cross-attention to fuse complementary cues, including flow, depth, camera intrinsics, and language-based features, in a geometry-aware manner. The language prior is first used to refine both the depth map and 2D flow estimates. The refined depth is then unprojected into 3D (using estimated camera parameters) to compute scene flow, which is further enhanced and fused with additional features before decoding. By embedding geometric reasoning and multimodal priors directly into the network structure, our model achieves strong zero-shot generalization across diverse and challenging settings.
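To make the data flow concrete, the sketch below shows one way this geometry-aware fusion could be wired up in PyTorch: depth is unprojected with the estimated intrinsics, a scene-flow signal is formed from the 2D flow, and a language feature queries the geometric tokens through cross-attention before the two-branch decoder. Module names, feature dimensions, and the attention layout are illustrative assumptions, not the released ZeroVO implementation.

```python
# Minimal sketch of the geometry-aware fusion pipeline (PyTorch).
# All module names, feature dimensions, and the attention layout are
# illustrative assumptions, not the released ZeroVO implementation.
import torch
import torch.nn as nn


def unproject(depth, K, pixel_offset=None):
    """Lift a depth map (B, H, W) to 3D points (B, H, W, 3) using intrinsics K (B, 3, 3).

    An optional pixel_offset (B, H, W, 2), e.g. optical flow, shifts the pixel grid
    before unprojection so that a simple scene-flow proxy can be formed.
    """
    B, H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, device=depth.device),
        torch.arange(W, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([u, v], dim=-1).float().unsqueeze(0).expand(B, -1, -1, -1)
    if pixel_offset is not None:
        pix = pix + pixel_offset
    homo = torch.cat([pix, torch.ones_like(pix[..., :1])], dim=-1)   # homogeneous pixels
    rays = torch.einsum("bij,bhwj->bhwi", torch.inverse(K), homo)    # back-projected rays
    return rays * depth.unsqueeze(-1)                                # scale rays by depth


class FusionVO(nn.Module):
    """Cross-attention fusion of flow, scene flow, and a language prior."""

    def __init__(self, dim=256, heads=8, lang_dim=512):
        super().__init__()
        self.flow_enc = nn.Linear(2, dim)       # 2D flow tokens
        self.scene_enc = nn.Linear(3, dim)      # 3D scene-flow tokens
        self.lang_proj = nn.Linear(lang_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Two-branch decoder MLP: metric translation and rotation (axis-angle here).
        self.trans_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))
        self.rot_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def forward(self, flow, depth_t, depth_t1, K, lang_feat):
        B, H, W, _ = flow.shape
        pts_t = unproject(depth_t, K)                       # 3D points at time t
        # Assumes depth_t1 has already been warped/sampled onto frame t's pixel grid.
        pts_t1 = unproject(depth_t1, K, pixel_offset=flow)  # matched points at time t+1
        scene_flow = pts_t1 - pts_t                         # per-pixel 3D displacement
        tokens = torch.cat(
            [self.flow_enc(flow.reshape(B, -1, 2)),
             self.scene_enc(scene_flow.reshape(B, -1, 3))],
            dim=1,
        )
        query = self.lang_proj(lang_feat).unsqueeze(1)      # (B, 1, dim) language query
        fused, _ = self.cross_attn(query, tokens, tokens)   # language attends to geometry
        fused = fused.squeeze(1)
        return self.trans_head(fused), self.rot_head(fused)
```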
Comparative Analysis Across Datasets. We compare ZeroVO variants with existing baselines using standard metrics: translation error, rotation error, absolute trajectory error, and scale error. All methods are provided with estimated camera intrinsics and metric depth. ZeroVO+ is our model trained on additional unlabeled data via semi-supervision, and LiteZeroVO+ is a smaller variant for resource-constrained settings. Our models demonstrate strong performance across metrics and datasets, particularly in metric translation estimation. As highlighted by the scale error, GTA and nuScenes contain challenging evaluation settings, including nighttime, weather variations, haze, and reflections. We note that the TartanVO and DPVO baselines (in gray) only predict up-to-scale motion and use privileged information, i.e., ground-truth scale alignment during evaluation.
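For reference, the sketch below illustrates the kind of scale alignment and trajectory error computation implied by this evaluation; the least-squares scale factor and RMSE formulation are common conventions and are assumptions here, not necessarily the paper's exact metric definitions.

```python
# Illustrative evaluation sketch (NumPy): scale alignment and absolute trajectory error.
# The least-squares scale factor and RMSE form are assumptions; the paper's exact
# metric definitions may differ.
import numpy as np


def align_scale(pred_xyz, gt_xyz):
    """Ground-truth scale alignment: least-squares s minimizing ||s * pred - gt||^2."""
    s = float(np.sum(pred_xyz * gt_xyz) / np.sum(pred_xyz * pred_xyz))
    return s * pred_xyz, s


def absolute_trajectory_error(pred_xyz, gt_xyz):
    """RMSE of per-frame position errors between two (N, 3) trajectories."""
    return float(np.sqrt(np.mean(np.sum((pred_xyz - gt_xyz) ** 2, axis=1))))


def scale_error(pred_xyz, gt_xyz):
    """Deviation of the recovered metric scale from 1 (0 means perfect scale)."""
    _, s = align_scale(pred_xyz, gt_xyz)
    return abs(1.0 - s)


# Metric-scale predictions (e.g., ZeroVO) are scored directly; up-to-scale baselines
# are first rescaled with align_scale, i.e., they receive privileged information.
```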
Ablation Analysis for Model and Training Components. We analyze the contribution of each component: Flow module (F), Depth module (D), Language prior (L), Semi-supervised training (S), and Pseudo-label Selection (P). Flow, depth, and language together correspond to the proposed supervised ZeroVO model. Results with additional semi-supervised training are shown as ZeroVO+, which achieves state-of-the-art performance by integrating all of our proposed components.
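As a rough illustration of the S and P components, the sketch below pairs one pseudo-labeling round with a simple cycle-consistency selection rule; the criterion, threshold, and loop structure are assumptions for illustration, not the exact training recipe.

```python
# Sketch of one semi-supervised round with pseudo-label selection (S + P).
# The cycle-consistency criterion and threshold are illustrative assumptions;
# `model` stands for any VO predictor returning (translation, rotation).
import torch


@torch.no_grad()
def pseudo_label_round(model, unlabeled_pairs, consistency_thresh=0.05):
    """Predict motion on unlabeled image pairs and keep only self-consistent estimates."""
    model.eval()
    selected = []
    for img_a, img_b in unlabeled_pairs:
        t_ab, r_ab = model(img_a, img_b)   # forward pair
        t_ba, r_ba = model(img_b, img_a)   # reversed pair
        # Crude check: for small inter-frame motion, the reversed estimate should
        # approximately negate the forward one; large deviations flag unreliable labels.
        cycle_err = torch.norm(t_ab + t_ba) + torch.norm(r_ab + r_ba)
        if cycle_err.item() < consistency_thresh:
            selected.append((img_a, img_b, t_ab, r_ab))
    return selected  # pseudo-labels for the next supervised fine-tuning round
```

In this kind of setup, the accepted pairs would be mixed with labeled data for another supervised pass, iterating so the model adapts to new scenes from unlabeled data.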
Qualitative Results on KITTI. We show trajectory prediction results for the four most complex driving sequences (00, 02, 05, and 08) from the KITTI dataset. Each subplot illustrates the trajectories generated by our proposed model and the baseline models alongside the ground-truth trajectory. The qualitative results demonstrate that our approach achieves the closest alignment with the ground truth, particularly in challenging turns and on extended straight segments. These findings highlight the robustness of our method in handling complex and diverse driving scenarios.
We introduce a newly generated simulated dataset derived from the high-fidelity GTA simulation. Our GTA dataset consists of 922 driving sequences captured within a simulated city environment, encompassing diverse weather conditions, driving speeds (including high-speed maneuvers not found in other public datasets), traffic scenarios, and times of day. Compared to other commonly used open-source simulation platforms such as CARLA, GTA offers several key advantages: (1) enhanced image realism through ReShade graphics settings that enable higher-quality rendering, and (2) a wider variety of road conditions across weather scenarios. For on-road driving, these conditions include significant uphill and downhill gradients, tunnels, and underground parking facilities; for off-road driving, the environment features mountains, deserts, snow-covered terrain, and forests, thereby enabling more complex and varied rotational dynamics throughout the map.
Example GTA scenarios: Off-Road Desert, Foggy Forest Trail, Mountain Cliff Path, Urban Intersection (Sunny), Highway in Rain, Nighttime Highway (Rain).
@inproceedings{lai2025zerovo,
  title={ZeroVO: Visual Odometry with Minimal Assumptions},
  author={Lai, Lei and Yin, Zekai and Ohn-Bar, Eshed},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={17092--17102},
  year={2025}
}