Accelerated Coordinate Encoding:

Learning to Relocalize in Minutes using RGB and Poses

CVPR 2023 (Highlight)

Eric Brachmann1   Tommaso Cavallari1   Victor Adrian Prisacariu1,2

1Niantic   2University of Oxford   

Abstract


Learning-based visual relocalizers exhibit leading pose accuracy, but require hours or days of training. Since training needs to be repeated for each new scene, long training times make learning-based relocalization impractical for most applications, despite its promise of high accuracy. In this paper we show how such a system can actually achieve the same accuracy in less than 5 minutes. We start from the obvious: a relocalization network can be split into a scene-agnostic feature backbone and a scene-specific prediction head. Less obvious: using an MLP prediction head allows us to optimize across thousands of viewpoints simultaneously in each single training iteration. This leads to stable and extremely fast convergence. Furthermore, we substitute effective but slow end-to-end training using a robust pose solver with a curriculum over a reprojection loss. Our approach does not require privileged knowledge, such as depth maps or a 3D model, for speedy training. Overall, our approach is up to 300x faster in mapping than state-of-the-art scene coordinate regression, while keeping accuracy on par.

Results


We learn a scene map from a set of posed RGB images by minimizing the reprojection error. ACE compiles the scene into 4 MB of MLP weights in 5 minutes on a single consumer-grade GPU. To relocalise, we pass a query frame through the ACE network to obtain image-to-scene correspondences, followed by RANSAC+PnP.
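For illustration, the relocalisation step can be sketched in a few lines of Python with OpenCV. The function and variable names below (backbone, ace_head, K) are placeholders and not the actual ACE interface; this is a minimal sketch under those assumptions, not our implementation.

# Minimal sketch of the ACE relocalisation step (illustrative names, not the
# actual ACE API): predict scene coordinates for a query frame, then recover
# the camera pose with RANSAC+PnP.
import cv2
import numpy as np
import torch

def relocalise(query_image, backbone, ace_head, K):
    """query_image: 1x3xHxW tensor, K: 3x3 camera intrinsics (numpy, float64)."""
    with torch.no_grad():
        features = backbone(query_image)       # 1 x C x h x w feature map
        scene_coords = ace_head(features)      # 1 x 3 x h x w scene coordinates

    # Pixel positions of the (sub-sampled) feature grid in the query image.
    _, _, h, w = scene_coords.shape
    stride = query_image.shape[-1] // w
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pixels_2d = torch.stack((xs, ys), dim=-1).float() * stride + stride / 2

    obj_pts = scene_coords[0].permute(1, 2, 0).reshape(-1, 3).cpu().double().numpy()
    img_pts = pixels_2d.reshape(-1, 2).double().numpy()

    # Robust pose estimation on the 2D-3D correspondences.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None,
        reprojectionError=10.0, iterationsCount=1000)
    return ok, rvec, tvec, inliers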

Below we show a sample of scenes from the datasets in our paper.

ACE is deployed as part of Niantic's Lightship Visual Positioning System (VPS) to offer relocalisation at almost 200,000 places worldwide.

Comparison to DSAC*


ACE builds on the previously leading learning-based relocaliser DSAC* but maps a scene up to 300x faster. ACE takes a process that took hours or days for each new scene, and makes it happen in minutes. In terms of relocalisation accuracy, ACE is on par with DSAC*, if not slightly superior.

Why Is ACE So Much Faster?


DSAC* optimises the scene reprojection error one mapping image at a time. One image provides a lot of pixels for learning, but their reprojection errors are highly correlated. The network predictions for a pixel and its neighbour are almost the same, and so are their losses and their gradients.

ACE revolves around the idea of optimising the map across all mapping images simultaneously. We split the regression network into a convolutional feature backbone and an MLP head. The backbone is pre-trained and extracts high-dimensional features for all mapping images. Since the MLP head needs no spatial context, we can form training batches from a random selection of features across all images. This de-correlates gradients within a batch and allows for stable optimization with very high learning rates.
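To make the idea concrete, here is a minimal PyTorch sketch of this training scheme, assuming the backbone features of all mapping images have already been dumped into one buffer. Buffer names, network sizes, and hyperparameters are illustrative and differ from the actual implementation, which also applies a curriculum over the reprojection loss rather than the plain clamp used here.

# Sketch of the ACE training idea (illustrative, not the released code):
# features from ALL mapping images live in one buffer; each batch samples
# random entries from it, so gradients within a batch are de-correlated.
import torch
import torch.nn as nn

class ACEHead(nn.Module):
    """Scene-specific MLP head: backbone feature -> 3D scene coordinate."""
    def __init__(self, c_in=512, c_hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_in, c_hidden), nn.ReLU(),
            nn.Linear(c_hidden, c_hidden), nn.ReLU(),
            nn.Linear(c_hidden, 3))

    def forward(self, x):
        return self.mlp(x)

def train_head(head, feat_buffer, pix_buffer, pose_buffer, K,
               iters=25000, batch_size=5120, lr=3e-3):
    # feat_buffer: N x C features of ALL mapping images (backbone run once).
    # pix_buffer:  N x 2 pixel position of each feature in its source image.
    # pose_buffer: N x 3 x 4 camera-to-world pose of each feature's image.
    # K:           3 x 3 camera intrinsics (torch tensor).
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    N = feat_buffer.shape[0]
    for _ in range(iters):
        # Random features drawn across all images -> de-correlated gradients.
        idx = torch.randint(0, N, (batch_size,))
        coords = head(feat_buffer[idx])               # B x 3 scene coordinates

        # Reproject predicted scene coordinates into their source cameras.
        R = pose_buffer[idx, :, :3].transpose(1, 2)   # world-to-camera rotation
        t = -(R @ pose_buffer[idx, :, 3:])            # world-to-camera translation
        cam = R @ coords.unsqueeze(-1) + t            # B x 3 x 1
        proj = (K @ cam).squeeze(-1)
        proj = proj[:, :2] / proj[:, 2:].clamp(min=0.1)

        # Robust reprojection loss; the actual training uses a curriculum over
        # this loss, for which the clamp is a stand-in.
        loss = (proj - pix_buffer[idx]).norm(dim=1).clamp(max=50.0).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()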

In fact, the resulting optimization is so stable that we can train ACE for considerably less than 5 minutes and still obtain usable results. Below, we show a run of ACE with 10 seconds of training time (excluding 20 s of data preparation).

ACE on Larger Scenes


ACE fares fairly well in larger outdoor scenes. Naturally, the 4 MB memory footprint and short mapping time impose some restrictions on what it can do. As a simple strategy, one can split a larger scene into smaller chunks and train one ACE model per chunk. For relocalisation, we let each ACE model estimate a pose independently and choose the one with the highest inlier count.

Using an ACE ensemble naturally increases map size and mapping time. Conveniently, models can be trained in parallel if multiple GPUs are available.
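A sketch of that selection step, reusing the illustrative relocalise function from above; again, the names are placeholders rather than the shipped pipeline.

# Ensemble relocalisation sketch (illustrative): each chunk's ACE model
# estimates a pose independently; keep the pose with the most RANSAC inliers.
def relocalise_ensemble(query_image, models, K):
    best = None
    for backbone, ace_head in models:
        ok, rvec, tvec, inliers = relocalise(query_image, backbone, ace_head, K)
        n_inliers = len(inliers) if ok and inliers is not None else 0
        if best is None or n_inliers > best[0]:
            best = (n_inliers, rvec, tvec)
    return best  # (inlier count, rotation, translation) of the winning chunk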

The Wayspots Dataset


To demonstrate the advantages of ACE, we curated a new dataset of 10 small outdoor scenes stemming from Niantic's Map-Free dataset. Each scene was scanned twice, independently by two users. Posed mapping images come from real-time visual odometry on the users' phones. Neither depth maps nor 3D point clouds are available. The evaluation ground truth comes from SfM.

The dataset follows DSAC*'s data convention. You can find more information in our code repository.
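As a rough orientation only (the repository is the authoritative reference), the DSAC* convention stores per-frame poses as 4x4 camera-to-world matrices in plain text files, which can be read as in the sketch below; the path is hypothetical.

# Rough sketch of reading a DSAC*-style pose file (a 4x4 camera-to-world
# matrix stored as plain text, one file per frame); the path is illustrative.
import numpy as np

pose = np.loadtxt("wayspot_example/train/poses/frame_00000.txt")  # hypothetical path
assert pose.shape == (4, 4)
R_cam_to_world = pose[:3, :3]
t_cam_to_world = pose[:3, 3]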

Citation


If you use parts of our code or our dataset, please consider citing our paper.

@inproceedings{brachmann2023ace,
    title={Accelerated Coordinate Encoding: Learning to Relocalize in Minutes using RGB and Poses},
    author={Brachmann, Eric and Cavallari, Tommaso and Prisacariu, Victor Adrian},
    booktitle={CVPR},
    year={2023},
}