SimpleRecon: 3D Reconstruction Without 3D Convolutions

ECCV 2022

Mohamed Sayed2*   John Gibson1   Jamie Watson1   Victor Adrian Prisacariu1,3   Michael Firman1   Clément Godard4*

1Niantic   2University College London   3University of Oxford    4Google
*Work done while at Niantic, during Mohamed’s internship.



Traditionally, 3D indoor scene reconstruction from posed images happens in two phases: per-image depth estimation, followed by depth merging and surface reconstruction. Recently, a family of methods has emerged that perform reconstruction directly in a final 3D volumetric feature space. While these methods have shown impressive reconstruction results, they rely on expensive 3D convolutional layers, limiting their application in resource-constrained environments. In this work, we instead go back to the traditional route, and show how focusing on high-quality multi-view depth prediction leads to highly accurate 3D reconstructions using simple off-the-shelf depth fusion. We propose a simple state-of-the-art multi-view depth estimator with two main contributions: 1) a carefully designed 2D CNN which utilizes strong image priors alongside a plane-sweep feature volume and geometric losses, combined with 2) the integration of keyframe and geometric metadata into the cost volume, which allows informed depth-plane scoring. Our method achieves a significant lead over the current state of the art for depth estimation, and close or better results for 3D reconstruction on ScanNet and 7-Scenes, while still allowing for online, real-time, low-memory reconstruction.
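The plane-sweep feature volume mentioned above can be sketched as follows. This is a minimal NumPy illustration with nearest-neighbour sampling and a dot-product matching cost; the function and variable names are our own, not the paper's, and a real system would use learned features and bilinear sampling.

```python
import numpy as np

def plane_sweep_cost(ref_feat, src_feat, K, R, t, depths):
    """Build a plane-sweep cost volume by warping source-view features to the
    reference view at each hypothesised fronto-parallel depth plane, then
    scoring each pixel with a dot product (one common matching cost)."""
    C, H, W = ref_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], 0).reshape(3, -1).astype(np.float64)
    K_inv = np.linalg.inv(K)
    cost = np.zeros((len(depths), H, W))
    for d_idx, d in enumerate(depths):
        # Back-project reference pixels to 3D at depth d, then project into src.
        pts = (K_inv @ pix) * d                    # 3 x N points in ref camera frame
        proj = K @ (R @ pts + t[:, None])          # 3 x N homogeneous src-image coords
        u = (proj[0] / proj[2]).reshape(H, W)
        v = (proj[1] / proj[2]).reshape(H, W)
        # Nearest-neighbour sampling of source features (bilinear in practice).
        ui = np.clip(np.round(u).astype(int), 0, W - 1)
        vi = np.clip(np.round(v).astype(int), 0, H - 1)
        warped = src_feat[:, vi, ui]               # C x H x W warped features
        cost[d_idx] = (ref_feat * warped).sum(0)   # per-pixel dot product
    return cost
```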

SimpleRecon is fast: at batch size one, it runs at 70ms per frame. This makes accurate reconstruction via fast depth fusion possible!
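As a rough illustration of the off-the-shelf depth-fusion step, here is a minimal NumPy sketch of classic TSDF integration of one posed depth map into a voxel grid (Curless & Levoy style). All names, the nearest-pixel lookup, and the axis conventions are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def fuse_depth(tsdf, weights, depth, K, cam_from_world, origin, voxel_size, trunc):
    """Integrate one posed depth map into a truncated signed-distance volume.
    tsdf/weights are updated in place; voxel index axes are treated as x, y, z."""
    idx = np.indices(tsdf.shape).reshape(3, -1).astype(np.float64)
    pts_w = origin[:, None] + (idx + 0.5) * voxel_size        # voxel centres, world frame
    pts_c = cam_from_world[:3, :3] @ pts_w + cam_from_world[:3, 3:]
    z = pts_c[2]
    proj = K @ pts_c
    u = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d_obs = np.zeros_like(z)
    d_obs[valid] = depth[v[valid], u[valid]]                  # observed depth per voxel
    sdf = d_obs - z                                           # signed distance along ray
    valid &= (d_obs > 0) & (sdf > -trunc)                     # skip voxels far behind surface
    tsdf_obs = np.clip(sdf / trunc, -1.0, 1.0)
    t, w = tsdf.reshape(-1), weights.reshape(-1)              # flat views into the volumes
    t[valid] = (t[valid] * w[valid] + tsdf_obs[valid]) / (w[valid] + 1.0)
    w[valid] += 1.0
```

With accurate per-frame depths, repeating this update over keyframes and extracting the zero level set (e.g. with marching cubes) yields the final mesh.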




Our key contribution is the injection of cheaply available metadata into the feature volume. Each volumetric cell is then reduced in parallel by an MLP into a feature map, which is fed into a 2D cost-volume encoder-decoder. We also use a dedicated image encoder to enforce a strong image prior, helping the cost-volume encoder-decoder propagate and correct depth estimates from the cost volume across the frame.
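The reduction step can be sketched as below. This is our own minimal NumPy illustration, not the actual architecture: the layer sizes, channel counts, and random weights are made up, and in practice the MLP weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: D depth planes, HxW feature-map resolution,
# F_in channels per cell (matching features plus metadata channels).
D, H, W, F_in = 8, 6, 6, 24
volume = rng.normal(size=(D, H, W, F_in))        # metadata-infused feature volume

# Tiny shared two-layer MLP (weights would normally be learned).
W1, b1 = rng.normal(size=(F_in, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)) * 0.1, np.zeros(1)

def mlp_reduce(vol):
    """Reduce every cell's feature vector to a scalar plane score; broadcasting
    applies the same MLP in parallel over all (depth, row, col) locations."""
    hidden = np.maximum(vol @ W1 + b1, 0.0)      # ReLU hidden layer
    return (hidden @ W2 + b2)[..., 0]            # D x H x W plane scores

scores = mlp_reduce(volume)                      # input to the 2D encoder-decoder
```

Because the MLP sees metadata alongside matching features, it can weigh each depth plane's evidence in an informed way before any 2D processing happens.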

Metadata Infused Cost Volume


Metadata insertion. Typical MVS systems predict depth from warped features or from comparisons between features, e.g. dot products. We additionally include cheaply available metadata for improved performance. Indices (k, i, j) are omitted for clarity.


ScanNetv2 Depths


ScanNetv2 Reconstructions

ScanNetv2 Depth Video Visualization









If you find this work useful for your research, please cite:

@inproceedings{sayed2022simplerecon,
      title={SimpleRecon: 3D Reconstruction Without 3D Convolutions},
      author={Sayed, Mohamed and Gibson, John and Watson, Jamie and Prisacariu, Victor and Firman, Michael and Godard, Cl{\'e}ment},
      booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
      year={2022},
}

We thank Aljaž Božič of TransformerFusion, Jiaming Sun of NeuralRecon, and Arda Düzçeker of DeepVideoMVS for quickly providing useful information to help with baselines and for making their codebases readily available, especially on short notice.

The tuple-generation scripts make heavy use of a modified version of DeepVideoMVS's keyframe buffer (thanks again Arda and co!).

The PyTorch point cloud fusion module is borrowed from 3DVNet's repo. Thanks Alexander Rich!

We'd also like to thank Niantic's infrastructure team for quick actions when we needed them. Thanks folks!

Mohamed is funded by a Microsoft Research PhD Scholarship (MRL 2018-085).

This webpage was in part inspired by this template.