Time | Event |
---|---|
09:05 - 09:20 | Welcome and introduction by Victor Adrian Prisacariu, Niantic and Oxford University |
09:25 - 09:55 | Jakob Engel, Meta Spatial AI for Contextual AI This talk focuses on Spatial AI for Contextual AI: how localization and 3D spatial scene understanding will enable the next generation of smart wearables and contextually grounded AI models. I will talk about recent research, including map-free and vision-free localization for AR and smart glasses, as well as semantic scene and user understanding, and discuss some of the challenges that remain. |
10:00 - 10:30 | Eduard Trulls, Google Beyond visual positioning: how localization turned out to provide efficient training of neural, semantic maps What do you do if you have built a system that incorporates the largest ground-level corpus of imagery into a large-scale localization system serving thousands of queries per second? You add overhead and semantic data and simultaneously re-think the problem from the ground up. In SNAP, localization becomes merely a side-task and the resulting maps start to encode scene semantics, motivating an entirely new set of applications and research directions. In this talk we will give a brief overview of how we moved away from further iterations on Google's Visual Positioning Service to break with the classic visual-localization approaches. SNAP is trained only using camera poses over tens of millions of Street View images. The resulting algorithm can resolve the location of challenging image queries beyond the reach of traditional methods, outperforming the state of the art in localization by a large margin. More interestingly though, our neural maps encode not only geometry and appearance but also high-level semantics, discovered without explicit supervision. |
10:30 - 11:00 | Coffee break and poster session |
11:00 - 11:45 | Vincent Leroy, Naver Labs Europe, and Single Frame challenge track winner talk Grounding Image Matching in 3D with MASt3R The journey from CroCo to MASt3R exemplifies a significant paradigm shift in 3D vision technologies. This presentation will delve into the methodologies, innovations, and synergistic integration of these frameworks, demonstrating their impact on the field and potential future directions. The discussion aims to highlight how these advancements unify and streamline the processing of 3D visual data, offering new perspectives and capabilities in map-free visual relocalization, robotic navigation and beyond. |
11:50 - 12:20 | Torsten Sattler, CTU Prague Scene Representations for Visual Localization Visual localization is the problem of estimating the exact position and orientation from which a given image was taken. Traditionally, localization approaches either used a set of images with known camera poses or a sparse point cloud, obtained from Structure-from-Motion, to represent the scene. In recent years, the list of available scene representations has grown considerably. In this talk, we review a subset of the available representations. |
12:25 - 12:55 | Shubham Tulsiani, CMU Rethinking Camera Parametrization for Pose Prediction Every student of projective geometry is taught to represent camera matrices via an extrinsic and an intrinsic matrix, and learning-based methods that seek to predict viewpoints given a set of images typically adopt this (global) representation. In this talk, I will advocate for an over-parametrized local representation which represents cameras via rays (or endpoints) associated with each image pixel. Leveraging a diffusion-based model that allows handling uncertainty, I will show that such representations are more suitable for neural learning and lead to more accurate camera prediction. |
12:55 - 13:00 | Closing Remarks and award photos |
Talk title: Spatial AI for Contextual AI
Summary: This talk focuses on Spatial AI for Contextual AI: how localization and 3D spatial
scene understanding will enable the next generation of smart wearables and contextually
grounded AI models. I will talk about recent research, including map-free and vision-free
localization for AR and smart glasses, as well as semantic scene and user understanding, and
discuss some of the challenges that remain.
Bio: Jakob Engel is a Director of Research at Meta Reality Labs, where he is leading egocentric machine perception research as part of Meta's Project Aria. He has 10+ years of experience working on SLAM, 3D scene understanding and user/environment interaction tracking, leading research projects as well as shipping core localization technology into Meta's MR and VR product lines. Dr. Engel received his Ph.D. in Computer Science from the Computer Vision Group at the Technical University of Munich in 2016, where he pioneered direct methods for SLAM through DSO and LSD-SLAM.
Talk title: Grounding Image Matching in 3D with MASt3R
Summary: The journey from CroCo to MASt3R exemplifies a
significant paradigm shift in 3D vision technologies. This presentation will delve into the
methodologies, innovations, and synergistic integration of these frameworks, demonstrating their
impact on the field and potential future directions. The discussion aims to highlight how these
advancements unify and streamline the processing of 3D visual data, offering new perspectives
and capabilities in map-free visual relocalization, robotic navigation and beyond.
Bio: Vincent is a research scientist in Geometric Deep Learning at Naver Labs Europe. He joined 5 years ago, in 2019, after completing his PhD on Multi-View Stereo Reconstruction for dynamic shapes at INRIA Grenoble-Alpes under the supervision of E. Boyer and J-S. Franco. Other than that, he likes hiking in the mountains and finding simple solutions to complex problems. Interestingly, the latter usually comes with the former.
Talk title: Scene Representations for Visual Localization
Summary: Visual localization is the problem of estimating the
exact position and orientation from which a given image was taken. Traditionally, localization
approaches either used a set of images with known camera poses or a sparse point cloud, obtained
from Structure-from-Motion, to represent the scene. In recent years, the list of available scene
representations has grown considerably. In this talk, we review a subset of the available
representations.
Bio: Torsten Sattler is a Senior Researcher at CTU. Before, he was a tenured associate professor at Chalmers University of Technology. He received a PhD in Computer Science from RWTH Aachen University, Germany, in 2014. From Dec. 2013 to Dec. 2018, he was a post-doctoral and senior researcher at ETH Zurich. Torsten has worked on feature-based localization methods [PAMI'17], long-term localization [CVPR'18, ICCV'19, ECCV'20, CVPR'21] (see also the benchmarks at visuallocalization.net), localization on mobile devices [ECCV'14, IJRR'20], and using semantic scene understanding for localization [CVPR'18, ECCV'18, ICCV'19]. Torsten has co-organized tutorials and workshops at CVPR ('14, '15, '17-'20), ECCV ('18, '20), and ICCV ('17, '19), and was/is an area chair for CVPR ('18, '22, '23), ICCV ('21, '23), 3DV ('18-'21), GCPR ('19, '21), ICRA ('19, '20), and ECCV ('20). He was a program chair for DAGM GCPR'20, a general chair for 3DV'22, and will be a program chair for ECCV'24.
Talk title: Beyond visual positioning: how localization turned
out to provide efficient training of neural, semantic maps
Summary: What do you do if you have built a system that incorporates
the largest ground-level corpus of imagery into a large-scale localization system serving
thousands of queries per second? You add overhead and semantic data and simultaneously re-think
the problem from the ground up. In SNAP,
localization becomes merely a side-task and the resulting maps start to encode scene semantics,
motivating an entirely new set of applications and research directions. In this talk we will
give a brief overview of how we moved away from further iterations on Google's Visual
Positioning Service to break with the classic visual-localization approaches. SNAP is trained
only using camera poses over tens of millions of Street View images. The resulting algorithm can
resolve the location of challenging image queries beyond the reach of traditional methods,
outperforming the state of the art in localization by a large margin. More interestingly though,
our neural maps encode not only geometry and appearance but also high-level semantics,
discovered without explicit supervision.
Bio: Eduard Trulls is a Research Scientist at Google Zurich, working on Machine Learning for visual recognition. Before that he was a post-doc at the Computer Vision Lab at EPFL in Lausanne, Switzerland, working with Pascal Fua. He obtained his PhD from the Institute of Robotics in Barcelona, Spain, co-advised by Francesc Moreno and Alberto Sanfeliu. Before his PhD he worked in mobile robotics.
Talk title: Rethinking Camera Parametrization for Pose
Prediction
Summary: Every student of projective geometry is taught to
represent camera matrices via an extrinsic and an intrinsic matrix, and learning-based methods that
seek to predict viewpoints given a set of images typically adopt this (global) representation.
In this talk, I will advocate for an over-parametrized local representation which represents
cameras via rays (or endpoints) associated with each image pixel. Leveraging a diffusion-based
model that allows handling uncertainty, I will show that such representations are more suitable
for neural learning and lead to more accurate camera prediction.
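To make the contrast with the global (K, R, t) parametrization concrete, the sketch below shows one common way to turn a pinhole camera into a per-pixel ray map using Plücker coordinates (a unit direction plus a moment vector). This is an illustrative assumption of ours, not code from the talk; the function name and conventions are placeholders.

```python
import numpy as np

def camera_to_ray_map(K, R, t, H, W):
    """Illustrative sketch: convert a pinhole camera into one ray per pixel.

    Assumes the world-to-camera convention x_cam = R @ x_world + t, with R a 3x3
    rotation and t a length-3 vector. Each ray is stored as Pluecker coordinates
    (direction d, moment m = c x d), an over-parametrized, local alternative to
    the global (K, R, t) matrices.
    """
    c = -R.T @ t                                    # camera centre in world coordinates
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    d = (np.linalg.inv(K) @ pix.T).T @ R            # rows are R^T K^-1 [u, v, 1]^T
    d /= np.linalg.norm(d, axis=-1, keepdims=True)  # unit viewing direction per pixel
    m = np.cross(np.broadcast_to(c, d.shape), d)    # moment pins down the ray's position
    return np.concatenate([d, m], axis=-1).reshape(H, W, 6)
```

In the line of work the talk builds on, a network predicts such a ray map directly, and a conventional camera can then be recovered by fitting (K, R, t) to the predicted rays.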
Bio: Shubham Tulsiani is an Assistant Professor at Carnegie Mellon University in the Robotics Institute. Prior to this, he was a research scientist at Facebook AI Research (FAIR). He received a Ph.D. in Computer Science from UC Berkeley in 2018, where his work was supported by the Berkeley Fellowship. He is interested in building perception systems that can infer the spatial and physical structure of the world they observe. He was the recipient of the Best Student Paper Award at CVPR 2015.
Talk title: TBC (opening remarks)
Bio: Professor Victor Adrian Prisacariu received the Graduate
degree (with first class hons.) in computer engineering from Gheorghe Asachi Technical
University, Iasi, Romania, in 2008, and the D.Phil. degree in engineering science from the
University of Oxford in 2012.
He continued at Oxford, first as an EPSRC prize Postdoctoral
Researcher and then as a Dyson Senior Research Fellow, before being appointed an Associate
Professor in 2017.
He also co-founded 6D.ai, where he built
APIs to help developers augment reality in ways that users would find meaningful, useful and
exciting. The 6D.ai SDK used a standard built-in smartphone camera to build a cloud-based,
crowdsourced three-dimensional semantic map of the world, all in real time and in the background.
6D.ai was acquired by Niantic in March 2020. He is now Chief Scientist with Niantic.
Victor's research interests include semantic visual tracking, 3-D reconstruction, and SLAM.
The Map-free Visual Relocalization workshop investigates topics related to
metric visual relocalization relative to a single reference image instead of
relative to a map.
This problem is of major importance to many higher-level applications, such as
Augmented/Mixed Reality, SLAM and 3D reconstruction.
It is important now because both industry and academia are debating whether and
how to build HD maps of the world for those tasks. Our community is working to
reduce the need for such maps in the first place.
We host the first Map-free Visual Relocalization Challenge 2024 competition with
two tracks:
map-free metric relative pose from a single image to a single image (proposed by
Arnold et al. in ECCV
2022) and from a query sequence to a single image (new).
While the former is the more challenging and thus more interesting research topic, the
latter represents a more realistic relocalization scenario, where the system making
the queries may fuse information from query images and tracking poses over a short
time window and baseline.
We invite papers to be submitted to the workshop.
$6000 in prizes will be divided between the top submissions of the two tracks.
Niantic is also seeking partners from the growing community to co-fund and co-judge
the prizes.
We have extended the Map-free benchmark for the challenge with a sequence-based
scenario, based on feedback from senior community members.
Therefore, the challenge consists of two tracks:
1. The original single-query-frame to single-map-frame task, published
with the ECCV 2022 paper;
2. A new task with multiple query frames (9) and their metric tracking poses
provided by the mobile device.
To recap, the task in the first track is to predict the metric relative pose from a single query image to a single map image, without any further auxiliary information.
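For illustration only, the sketch below shows the shape of a simple single-frame baseline in the spirit of the matching-plus-depth baselines of the original paper: an up-to-scale relative pose from an essential matrix, with the metric scale fixed by a monocular depth estimate. It assumes both images share the intrinsics `K`, and `match_features` and `estimate_metric_depth` are placeholders for whatever matcher and metric depth network a participant plugs in; this is not the official baseline or submission code.

```python
import cv2
import numpy as np

def metric_relative_pose(img_q, img_m, K, match_features, estimate_metric_depth):
    """Illustrative single-frame sketch: up-to-scale pose from an essential
    matrix, metric scale from a monocular depth estimate of the query image."""
    # 2D-2D correspondences as float arrays of shape (N, 2); the matcher is user-supplied.
    pts_q, pts_m = match_features(img_q, img_m)
    E, _ = cv2.findEssentialMat(pts_q, pts_m, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t_unit, mask = cv2.recoverPose(E, pts_q, pts_m, K)      # |t_unit| == 1
    # Triangulate inliers with the query camera as the reference frame.
    P_q = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_m = K @ np.hstack([R, t_unit])
    inl = mask.ravel() > 0
    X = cv2.triangulatePoints(P_q, P_m, pts_q[inl].T, pts_m[inl].T)
    depth_sfm = (X[:3] / X[3])[2]                                 # depths in arbitrary units
    # Metric depth predicted for the same query pixels (depth network is user-supplied).
    depth_metric = estimate_metric_depth(img_q, pts_q[inl])       # metres, shape (N_inl,)
    ok = depth_sfm > 0
    scale = np.median(depth_metric[ok] / depth_sfm[ok])           # robust metres-per-unit factor
    return R, scale * t_unit                                      # metric relative pose
```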
The second track is motivated by the observation that a burst of images, capturing
small motion, can be recorded while staying true to the map-free scenario: No
significant data capture or exploration of the environment.
At the same time, the burst of images allows the application of multi-frame depth
estimation and contains strong hints about the scene scale from the IMU sensor on
the device.
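Purely as an illustration of how the tracking poses might be used (this is not a prescribed pipeline or the submission format), one simple strategy is to run a single-frame estimator on each of the nine query frames and fuse the per-frame estimates through the metric tracking poses. The sketch below assumes 4x4 rigid transforms with the convention x_target = T @ x_source; all names are placeholders.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def fuse_burst(rel_poses, track_poses):
    """Illustrative fusion sketch for the sequence track.

    rel_poses[i]:   4x4 estimate mapping map-image coordinates into query-frame-i
                    coordinates (e.g. from a per-frame single-image method).
    track_poses[i]: 4x4 metric tracking pose mapping query-frame-i coordinates
                    into the device's tracking frame.
    Returns a single fused map-to-last-query-frame transform.
    """
    T_last_from_world = np.linalg.inv(track_poses[-1])
    # Re-express every per-frame estimate in the last query frame's coordinates.
    candidates = [T_last_from_world @ T_w_i @ T_i_map
                  for T_w_i, T_i_map in zip(track_poses, rel_poses)]
    # Robust average: median translation, mean of sign-aligned quaternions.
    t_fused = np.median([T[:3, 3] for T in candidates], axis=0)
    quats = np.stack([Rotation.from_matrix(T[:3, :3]).as_quat() for T in candidates])
    quats[quats[:, 3] < 0] *= -1                                  # resolve the q / -q ambiguity
    R_fused = Rotation.from_quat(quats.mean(axis=0)).as_matrix()  # from_quat renormalises
    T_fused = np.eye(4)
    T_fused[:3, :3], T_fused[:3, 3] = R_fused, t_fused
    return T_fused
```

A more careful solution might weight the candidates by matching inliers or solve a small pose-graph problem; the point is only that the tracking poses let all nine estimates be combined in a common metric frame.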
We created a second version of the test set and leaderboard
for this track.
Please register your interest here, so we can keep you notified about news and updates!
We invite submissions of workshop papers and extended abstracts to the ECCV Map-free Visual Relocalization Workshop & Challenge 2024. This workshop aims to advance the field of visual relocalization without relying on pre-built maps. The following topics and related areas are of interest:
Workshop paper and extended abstract submission deadline: 2nd August, 2024. See also Important Dates.
Sign up through the contact form to stay up to date with future announcements.
We look forward to your contributions advancing the field of map-free visual
relocalization.
For any questions or clarifications, please contact the workshop
organizers.