Matching 2D Images in 3D:

Metric Relative Pose from Metric Correspondences

CVPR 2024 (Oral)


Given two images, we can estimate the relative camera pose between them by establishing image-to-image correspondences. Usually, correspondences are 2D-to-2D and the pose we estimate is defined only up to scale. Some applications, aiming at instant augmented reality anywhere, require scale-metric pose estimates, and hence, they rely on external depth estimators to recover the scale. We present MicKey, a keypoint matching pipeline that is able to predict metric correspondences in 3D camera space. By learning to match 3D coordinates across images, we are able to infer the metric relative pose without depth measurements. Depth measurements are also not required for training, nor are scene reconstructions or image overlap information. MicKey is supervised only by pairs of images and their relative poses. MicKey achieves state-of-the-art performance on the Map-Free Relocalisation benchmark while requiring less supervision than competing approaches.
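To make the core idea concrete: once two images are matched with 3D-3D correspondences in metric camera space, the metric relative pose follows from a rigid alignment. Below is a minimal Kabsch-style solver sketch in NumPy; it ignores outliers and is not MicKey's actual solver (the pipeline uses a robust, differentiable estimation scheme), but it shows why 3D-3D matches determine the pose including scale.

```python
import numpy as np

def kabsch_metric_pose(P, Q):
    """Estimate the metric rigid transform (R, t) such that Q ~= R @ P + t.

    P, Q: (N, 3) arrays of matched 3D points, P in the reference camera
    space and Q in the query camera space. Outlier-free toy version.
    """
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```

Because the correspondences are metric, the translation `t` comes out in metres directly, with no external depth estimator needed to fix the scale.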

Metric Relative Pose

We show examples of MicKey computing a metric relative pose on various scenes of the Map-Free Relocalisation benchmark. MicKey localizes the query image (blue camera) relative to the reference image (orange camera). MicKey does not require a scene map, e.g., a 3D point cloud computed from multiple images, but only the single reference image. We visualize 3D-3D correspondences and color-code them according to their position in the reference camera space.

Use the controls to switch between scenes.

Extreme Viewpoint Cases

We show examples comparing MicKey and SOTA matchers under extreme viewpoint changes. MicKey directly predicts metric 3D-3D correspondences from RGB images and intrinsics, and needs no additional information to recover the scale of the scene. In contrast, SOTA matchers are paired with DPT-KITTI depth estimates to scale their pose estimates.

We visualize the reference image (orange camera), the ground-truth position of the query image (green camera), and the different pose estimates. We use Map-Free validation scenes to have access to the ground-truth query poses. For completeness, we also visualize a test scene (Test Scene ID: s00651) at the end that displays two opposing views. Note that ground truth is not available for that scene.

Use the buttons to switch between methods and examples.

Scene ID: s00468 Scene ID: s00476 Scene ID: s00464 Scene ID: s00651

Visualizing the Pose Confidence

Alongside the relative pose estimates, MicKey also provides the confidence of its predictions. This is important for distinguishing between solvable and unsolvable cases. The pose confidence is computed as a form of soft-inlier counting. To visualize MicKey's confidence, we color-code the 3D-3D correspondences according to their pose confidence.
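A soft-inlier count replaces the hard inlier/outlier decision of classical RANSAC with a smooth vote per correspondence. The sketch below illustrates the idea for 3D-3D residuals; the threshold `tau` and sharpness `beta` are assumed hyperparameters for illustration, not the values used by MicKey.

```python
import numpy as np

def soft_inlier_count(P_ref, P_qry, R, t, tau=0.1, beta=50.0):
    """Soft-inlier count for a candidate metric pose (R, t) that maps
    reference-space points P_ref onto their query-space matches P_qry.

    tau: inlier threshold in metres, beta: sigmoid sharpness
    (both illustrative assumptions, not the paper's exact values).
    """
    res = np.linalg.norm((R @ P_ref.T).T + t - P_qry, axis=1)   # 3D residuals
    votes = 1.0 / (1.0 + np.exp(beta * (res - tau)))            # ~1 inlier, ~0 outlier
    return votes.sum()
```

A high count means many correspondences agree with the estimated pose, so the count doubles as a confidence score; being differentiable, a score of this form can also carry gradients during training.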

Depth Maps and Keypoint Confidences

From a single input image, MicKey predicts a depth map, 2D keypoint offsets and scores, and keypoint descriptors. See some examples, where we show the input image (left), the predicted depth map (center), and MicKey's keypoint scores (right). MicKey's keypoint scores and depth estimates are trained jointly alongside feature matching to optimize relative pose accuracy. Hence, MicKey learns to assign high scores (green areas) to positions where the depth is accurate. Note that the depth and score maps have a resolution 14 times smaller than the input images due to MicKey's feature encoder.
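The predicted depth is what lifts 2D keypoints into metric 3D camera space. A minimal back-projection sketch, assuming pixel keypoint coordinates (e.g., coarse grid positions plus the predicted offsets), per-keypoint metric depths, and a 3x3 intrinsics matrix `K`:

```python
import numpy as np

def backproject(kpts, depths, K):
    """Lift 2D keypoints to metric 3D points in camera space.

    kpts:   (N, 2) pixel coordinates
    depths: (N,) metric depths, one per keypoint
    K:      (3, 3) camera intrinsics
    """
    ones = np.ones((kpts.shape[0], 1))
    rays = np.linalg.inv(K) @ np.concatenate([kpts, ones], axis=1).T  # (3, N)
    return (rays * depths).T  # scale each unit-depth ray by its metric depth
```

Applying this to the keypoints of both images yields the 3D-3D correspondences from which the metric relative pose is estimated.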

2D-2D Matches

Finally, we also show the 2D-2D correspondences returned by MicKey and different state-of-the-art matchers. In the visualization, on the left (blue box), we show MicKey's 2D-2D correspondences, and on the right, those of the selected matcher. The correspondences displayed in these examples are those that agree with the final estimated relative pose (inlier correspondences). We focus our examples on image pairs where the viewpoint changes are extreme.

Scene ID: s00556 Scene ID: s00598 Scene ID: s00523 Scene ID: s00653 Scene ID: s00606 Scene ID: s00528 Scene ID: s00569


If you find this work useful for your research, please consider citing our paper:

@inproceedings{barroso2024mickey,
    title={Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences},
    author={Barroso-Laguna, Axel and Munukutla, Sowmya and Prisacariu, Victor and Brachmann, Eric},
    booktitle={CVPR},
    year={2024}
}