Given two images, we can estimate the relative camera pose between them by establishing image-to-image correspondences. Usually, correspondences are 2D-to-2D, and the pose we estimate is defined only up to scale. Some applications, aiming at instant augmented reality anywhere, require scale-metric pose estimates and hence rely on external depth estimators to recover the scale. We present MicKey, a keypoint matching pipeline that predicts metric correspondences in 3D camera space. By learning to match 3D coordinates across images, we can infer the metric relative pose without depth measurements. Depth measurements are not required for training either, nor are scene reconstructions or image overlap information. MicKey is supervised only by pairs of images and their relative poses. MicKey achieves state-of-the-art performance on the Map-Free Relocalisation benchmark while requiring less supervision than competing approaches.
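For intuition, once metric 3D-3D correspondences are available, the relative pose, including its metric translation, follows in closed form from rigid alignment. Below is a minimal sketch using the classical Kabsch algorithm; it is an illustration of the principle, not MicKey's actual solver, which embeds pose estimation in a robust, differentiable estimation loop. The function and variable names are ours.

import numpy as np

def kabsch_relative_pose(pts_ref, pts_query):
    """Closed-form rigid transform (R, t) mapping reference-frame points
    onto query-frame points (Kabsch algorithm).

    pts_ref, pts_query: (N, 3) metric 3D-3D correspondences.
    """
    c_ref = pts_ref.mean(axis=0)
    c_query = pts_query.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (pts_ref - c_ref).T @ (pts_query - c_query)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: enforce det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_query - R @ c_ref
    return R, t  # t is metric: no scale ambiguity remains

Because the correspondences are metric 3D points rather than 2D pixels, the translation comes out in metres directly, which is exactly what epipolar-geometry pipelines cannot provide without extra depth.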
We show examples of MicKey computing a metric relative pose on various scenes of the Map-Free Relocalisation benchmark. MicKey localizes the query image (blue camera) relative to the reference image (orange camera). MicKey does not require a scene map, e.g., a 3D point cloud computed from multiple images, but only a single reference image. We visualize 3D-3D correspondences and color-code them according to their position in the reference camera space.
Use the controls to switch between scenes.
We show examples comparing MicKey and SOTA matchers under extreme viewpoint changes. MicKey directly predicts metric 3D-3D correspondences from RGB images and camera intrinsics; no additional information is required to recover the scale of the scene. In contrast, SOTA matchers are paired with DPT-KITTI depth estimates to scale their pose estimates.
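For reference, a common recipe for such depth-assisted baselines is to decompose an essential matrix into an up-to-scale pose, triangulate the matched keypoints, and rescale the unit-norm translation with a robust ratio against the monocular depth estimates. The sketch below illustrates this idea; the names are ours and the exact baseline implementations may differ.

import numpy as np

def scale_from_depth(t_unit, depths_triangulated, depths_estimated):
    """Recover a metric translation from an up-to-scale two-view pose.

    t_unit: (3,) unit-norm translation from essential-matrix decomposition.
    depths_triangulated: (N,) keypoint depths triangulated in the arbitrary
        two-view scale (baseline length = 1).
    depths_estimated: (N,) metric depths for the same keypoints from a
        monocular depth network such as DPT.
    """
    # Robust per-point ratio between metric and up-to-scale depth.
    scale = np.median(depths_estimated / depths_triangulated)
    return scale * t_unit

MicKey removes this dependency entirely: the scale is baked into the predicted 3D coordinates.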
We visualize the reference image (orange camera), the ground-truth position of the query image (green camera), and the different pose estimates. We use Map-Free validation scenes so that we have access to the ground-truth query poses. In addition, for completeness, we also visualize a test scene (Test Scene ID: s00651) at the end, which displays two opposing views. Note that ground truth is not available for that scene.
Use the buttons to switch between methods and examples.
Alongside the relative pose estimates, MicKey also provides the confidence of its predictions. This is important for distinguishing solvable from unsolvable cases. The pose confidence is computed as a form of soft-inlier counting. To visualize MicKey's confidence, we color-code the 3D-3D correspondences according to their pose confidence.
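To make the idea concrete, here is a minimal sketch of soft-inlier counting: every correspondence contributes a value in (0, 1) through a sigmoid on its residual under the estimated pose, so the count is smooth and directly usable as a confidence. The residual choice, threshold, and sharpness below are illustrative assumptions, not MicKey's exact formulation.

import numpy as np

def soft_inlier_count(pts_ref, pts_query, R, t, threshold=0.15, beta=50.0):
    """Soft-inlier count for a pose hypothesis (R, t).

    Each 3D-3D correspondence contributes sigmoid(-beta * (r - threshold)),
    where r is its alignment residual; hard thresholding would instead
    count residuals below the threshold.

    threshold: residual (metres, here) at which a correspondence counts
        as half an inlier; beta controls the sigmoid sharpness.
    """
    # Residual of each correspondence under the hypothesized pose.
    residuals = np.linalg.norm((pts_ref @ R.T + t) - pts_query, axis=1)
    return np.sum(1.0 / (1.0 + np.exp(beta * (residuals - threshold))))

A high soft-inlier count means many correspondences agree with the pose; a low count flags an unreliable, possibly unsolvable, image pair.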
From a single input image, MicKey generates a depth map, 2D keypoint offsets and scores, and keypoint descriptors. See some examples below, where we show the input image (left), the generated depth map (center), and MicKey's keypoint scores (right). MicKey's keypoint scores and depth estimates are trained jointly alongside feature matching to optimize relative pose accuracy. Hence, MicKey learns to assign high scores (green areas) to positions where the depth is accurate. Note that the depth and score maps have a resolution 14 times smaller than the input image due to MicKey's feature encoder.
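As a rough illustration of how these outputs combine into metric 3D keypoints, the sketch below refines each coarse grid cell with its predicted 2D offset, attaches the predicted depth, and backprojects through the camera intrinsics into camera space. Shapes and conventions (e.g., offsets normalized to [0, 1] within a cell, stride 14 matching the encoder) are assumptions for illustration rather than MicKey's exact code.

import numpy as np

def backproject_keypoints(offsets, depth, K, stride=14):
    """Turn coarse per-cell predictions into metric 3D keypoints.

    offsets: (H, W, 2) sub-cell keypoint offsets in [0, 1] per grid cell.
    depth:   (H, W) metric depth predicted at the same coarse resolution.
    K:       (3, 3) intrinsics of the full-resolution image.
    """
    H, W = depth.shape
    gy, gx = np.mgrid[0:H, 0:W]
    # Absolute 2D keypoint position in full-resolution pixel coordinates.
    u = (gx + offsets[..., 0]) * stride
    v = (gy + offsets[..., 1]) * stride
    # Pinhole backprojection: p = z * K^-1 [u, v, 1]^T.
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)  # (H, W, 3) camera-space points

Matching these camera-space points across the two images is what yields the metric 3D-3D correspondences used by the pose solver.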
Finally, we also show the 2D-2D correspondences returned by MicKey and different state-of-the-art matchers. In the visualization, we show MicKey's 2D-2D correspondences on the left (blue box) and those of the selected matcher on the right. The correspondences displayed in these examples are those that agree with the final estimated relative pose (inlier correspondences). We focus our examples on image pairs where the viewpoint changes are extreme.
If you find this work useful for your research, please consider citing our paper:
@inproceedings{barroso2024mickey,
  title={Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences},
  author={Barroso-Laguna, Axel and Munukutla, Sowmya and Prisacariu, Victor and Brachmann, Eric},
  booktitle={CVPR},
  year={2024}
}