We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the model's ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement over prior methods on three standard benchmarks, KITTI, nuScenes, and Argoverse 2, as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By requiring neither fine-tuning nor camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.
Our method facilitates generalization via minimal and versatile image-based priors, integrated throughout our model structure. Given a pair of input images, our model computes a rich multimodal embedding through a transformer-based fusion module. The embedding is then passed to a two-branch decoder MLP that outputs real-world translation and rotation. Our architecture leverages cross-attention to fuse complementary cues, including flow, depth, camera intrinsics, and language-based features, in a geometry-aware manner. The language prior is first used to refine both the depth map and 2D flow estimates. The refined depth is then unprojected into 3D (using estimated camera parameters) to compute scene flow, which is further enhanced and fused with additional features before decoding. By embedding geometric reasoning and multimodal priors directly into the network structure, our model achieves strong zero-shot generalization across diverse and challenging settings.
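To make the data flow concrete, the sketch below shows one way this geometry-aware fusion could be wired up in PyTorch: depth is unprojected with the estimated intrinsics, a scene-flow signal is formed from the 2D flow, and a language feature queries the geometric tokens through cross-attention before the two-branch decoder. Module names, feature dimensions, and the attention layout are illustrative assumptions, not the released ZeroVO implementation.

```python
# Minimal sketch of the geometry-aware fusion pipeline (PyTorch).
# All module names, feature dimensions, and the attention layout are
# illustrative assumptions, not the released ZeroVO implementation.
import torch
import torch.nn as nn


def unproject(depth, K, pixel_offset=None):
    """Lift a depth map (B, H, W) to 3D points (B, H, W, 3) using intrinsics K (B, 3, 3).

    An optional pixel_offset (B, H, W, 2), e.g. optical flow, shifts the pixel grid
    before unprojection so that a simple scene-flow proxy can be formed.
    """
    B, H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, device=depth.device),
        torch.arange(W, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([u, v], dim=-1).float().unsqueeze(0).expand(B, -1, -1, -1)
    if pixel_offset is not None:
        pix = pix + pixel_offset
    homo = torch.cat([pix, torch.ones_like(pix[..., :1])], dim=-1)   # homogeneous pixels
    rays = torch.einsum("bij,bhwj->bhwi", torch.inverse(K), homo)    # back-projected rays
    return rays * depth.unsqueeze(-1)                                # scale rays by depth


class FusionVO(nn.Module):
    """Cross-attention fusion of flow, scene flow, and a language prior."""

    def __init__(self, dim=256, heads=8, lang_dim=512):
        super().__init__()
        self.flow_enc = nn.Linear(2, dim)       # 2D flow tokens
        self.scene_enc = nn.Linear(3, dim)      # 3D scene-flow tokens
        self.lang_proj = nn.Linear(lang_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Two-branch decoder MLP: metric translation and rotation (axis-angle here).
        self.trans_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))
        self.rot_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def forward(self, flow, depth_t, depth_t1, K, lang_feat):
        B, H, W, _ = flow.shape
        pts_t = unproject(depth_t, K)                       # 3D points at time t
        # Assumes depth_t1 has already been warped/sampled onto frame t's pixel grid.
        pts_t1 = unproject(depth_t1, K, pixel_offset=flow)  # matched points at time t+1
        scene_flow = pts_t1 - pts_t                         # per-pixel 3D displacement
        tokens = torch.cat(
            [self.flow_enc(flow.reshape(B, -1, 2)),
             self.scene_enc(scene_flow.reshape(B, -1, 3))],
            dim=1,
        )
        query = self.lang_proj(lang_feat).unsqueeze(1)      # (B, 1, dim) language query
        fused, _ = self.cross_attn(query, tokens, tokens)   # language attends to geometry
        fused = fused.squeeze(1)
        return self.trans_head(fused), self.rot_head(fused)
```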
Comparative Analysis Across Datasets. We compare ZeroVO variants with existing baselines using standard metrics: translation error, rotation error, absolute trajectory error, and scale error. All methods are provided with estimated camera intrinsics and metric depth. ZeroVO+ is our model trained on additional unlabeled data via semi-supervision, and LiteZeroVO+ is a smaller variant for resource-constrained settings. Our models demonstrate strong performance across metrics and datasets, particularly in metric translation estimation. As highlighted by the scale error, GTA and nuScenes contain challenging evaluation settings, including nighttime, weather variations, haze, and reflections. We note that the TartanVO and DPVO baselines (in gray) only predict up-to-scale motion and use privileged information, i.e., ground-truth scale alignment during evaluation.
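For reference, the sketch below illustrates the kind of scale alignment and trajectory error computation implied by this evaluation; the least-squares scale factor and RMSE formulation are common conventions and are assumptions here, not necessarily the paper's exact metric definitions.

```python
# Illustrative evaluation sketch (NumPy): scale alignment and absolute trajectory error.
# The least-squares scale factor and RMSE form are assumptions; the paper's exact
# metric definitions may differ.
import numpy as np


def align_scale(pred_xyz, gt_xyz):
    """Ground-truth scale alignment: least-squares s minimizing ||s * pred - gt||^2."""
    s = float(np.sum(pred_xyz * gt_xyz) / np.sum(pred_xyz * pred_xyz))
    return s * pred_xyz, s


def absolute_trajectory_error(pred_xyz, gt_xyz):
    """RMSE of per-frame position errors between two (N, 3) trajectories."""
    return float(np.sqrt(np.mean(np.sum((pred_xyz - gt_xyz) ** 2, axis=1))))


def scale_error(pred_xyz, gt_xyz):
    """Deviation of the recovered metric scale from 1 (0 means perfect scale)."""
    _, s = align_scale(pred_xyz, gt_xyz)
    return abs(1.0 - s)


# Metric-scale predictions (e.g., ZeroVO) are scored directly; up-to-scale baselines
# are first rescaled with align_scale, i.e., they receive privileged information.
```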
Ablation Analysis for Model and Training Components. We analyze the contribution of each component: Flow module (F), Depth module (D), Language prior (L), Semi-supervised training (S), and Pseudo-label Selection (P). Flow, depth, and language together correspond to the proposed supervised ZeroVO model. Results with additional semi-supervised training are shown as ZeroVO+, which achieves state-of-the-art performance by integrating all of our proposed components.
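As a rough illustration of the S and P components, the sketch below pairs one pseudo-labeling round with a simple cycle-consistency selection rule; the criterion, threshold, and loop structure are assumptions for illustration, not the exact training recipe.

```python
# Sketch of one semi-supervised round with pseudo-label selection (S + P).
# The cycle-consistency criterion and threshold are illustrative assumptions;
# `model` stands for any VO predictor returning (translation, rotation).
import torch


@torch.no_grad()
def pseudo_label_round(model, unlabeled_pairs, consistency_thresh=0.05):
    """Predict motion on unlabeled image pairs and keep only self-consistent estimates."""
    model.eval()
    selected = []
    for img_a, img_b in unlabeled_pairs:
        t_ab, r_ab = model(img_a, img_b)   # forward pair
        t_ba, r_ba = model(img_b, img_a)   # reversed pair
        # Crude check: for small inter-frame motion, the reversed estimate should
        # approximately negate the forward one; large deviations flag unreliable labels.
        cycle_err = torch.norm(t_ab + t_ba) + torch.norm(r_ab + r_ba)
        if cycle_err.item() < consistency_thresh:
            selected.append((img_a, img_b, t_ab, r_ab))
    return selected  # pseudo-labels for the next supervised fine-tuning round
```

In this kind of setup, the accepted pairs would be mixed with labeled data for another supervised pass, iterating so the model adapts to new scenes from unlabeled data.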
Qualitative Results on KITTI. We show trajectory prediction results for the four most complex driving sequences (00, 02, 05, and 08) from the KITTI dataset. Each subplot illustrates the trajectories generated by our proposed model and the baseline models alongside the ground-truth trajectory. The qualitative results demonstrate that our approach achieves the closest alignment with the ground truth, particularly in challenging turns and on extended straight segments. These findings highlight the robustness of our method in handling complex and diverse driving scenarios.
We introduce a newly generated simulated dataset derived from the high-fidelity GTA simulation. Our GTA dataset consists of 922 driving sequences captured within a simulated city environment, encompassing diverse weather conditions, driving speeds (including high-speed maneuvers not found in other public datasets), traffic scenarios, and times of day. Compared to other commonly used open-source simulation platforms such as CARLA, GTA offers several key advantages: (1) enhanced image realism through ReShade graphics settings that enable higher-quality rendering, and (2) a wider variety of road conditions across weather scenarios. For on-road driving, these conditions include significant uphill and downhill gradients, tunnels, and underground parking facilities; for off-road driving, the environment features mountains, deserts, snow-covered terrain, and forests, thereby enabling more complex and varied rotational dynamics throughout the map.
Example GTA scenarios: Off-Road Desert, Foggy Forest Trail, Mountain Cliff Path, Urban Intersection (Sunny), Highway in Rain, Nighttime Highway (Rain).
@inproceedings{lai2025zerovo,
  title={ZeroVO: Visual Odometry with Minimal Assumptions},
  author={Lai, Lei and Yin, Zekai and Ohn-Bar, Eshed},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={17092--17102},
  year={2025}
}