PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

1Niantic Spatial, 2KAUST, 3UCL.
*Work done during an internship at Niantic.

Language-guided 3D object placement: Our new task involves finding a valid placement for an asset according to a text prompt. This task requires semantic and geometric understanding of the scene, knowledge of the asset's geometry, and reasoning about object relationships and occlusions. The colored dots represent the positions of the objects mentioned in the prompt (provided only for visualization purposes and not given to the model), while the yellow arrow indicates the predicted frontal direction of the asset.

Abstract

We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes, such as grounding, this task has specific challenges: it is ambiguous, with multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLMs.
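To make the task interface concrete, the following is a minimal sketch of the inputs and the expected output; all names and fields are hypothetical and only illustrate the setup described above, not the paper's code.

from dataclasses import dataclass
import numpy as np

@dataclass
class PlacementTask:
    """Inputs to language-guided object placement (illustrative fields)."""
    scene_points: np.ndarray   # (N, 3) point cloud of the real 3D scene
    asset_points: np.ndarray   # (M, 3) point cloud of the 3D asset to place
    prompt: str                # e.g. "place the plant on the table, facing the sofa"

@dataclass
class Placement:
    """One valid solution; the task is ambiguous, so many placements can be valid."""
    position: np.ndarray       # (3,) asset location in scene coordinates
    rotation_deg: float        # yaw of the asset's frontal direction about the up axis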

Training Dataset Creation

Given a scene and an asset as input (a), the goal is to create a prompt (f) and a corresponding mask M of valid placements (e). We start by finding the set of points that are physically plausible placements, shown in red in (b). We consider eight equally spaced rotation angles, which condition the valid placements: in this example, the 0° rotation has more valid placements than 45°. To generate the language constraints, we use the ground-truth scene graph (c). Object anchors are selected from the scene graph and combined with relationship types to create a constraint and its corresponding validity mask (d). The different placement constraints are combined by intersecting their validity masks, yielding the final dense mask of valid placements (e). From the selected anchors and constraint relationships, a natural language prompt is created using templates (see the supplemental material for details).
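As a concrete illustration of step (e), the sketch below intersects a physical-plausibility mask with per-constraint validity masks; the function and variable names are hypothetical and this is not the released pipeline.

import numpy as np

def combine_constraint_masks(physical_mask, constraint_masks):
    """Intersect the physical-plausibility mask (b) with the per-constraint
    validity masks (d) to obtain the final mask of valid placements (e).
    All masks are (N,) boolean arrays over candidate scene points."""
    valid = physical_mask.copy()
    for mask in constraint_masks:
        valid &= mask          # a placement must satisfy every constraint
    return valid

# Toy example with six candidate points.
physical = np.array([1, 1, 0, 1, 1, 1], dtype=bool)      # plausible support
near_table = np.array([1, 0, 1, 1, 0, 1], dtype=bool)    # "near the table"
facing_sofa = np.array([1, 1, 1, 0, 1, 1], dtype=bool)   # "facing the sofa"
print(combine_constraint_masks(physical, [near_table, facing_sofa]))
# -> [ True False False False False  True]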

Method

A point encoder extracts features from the 3D scene, which are complemented with positional embeddings. Spatial pooling reduces the number of scene features, and a Q-Former merges the pooled features with trainable queries \( \mathbf{Q} \). The asset is encoded into a single vector using a pretrained asset encoder followed by max-pooling. This vector, together with a size embedding, is passed to a projection layer that aligns the features with the LLM space. The LLM takes as input (i) the output of the Q-Former, (ii) the text prompt, and (iii) the projected asset features, and predicts three special tokens \( \mathrm{[ANC]} \), \( \mathrm{[LOC]} \), and \( \mathrm{[ROT]} \). A transformer-based decoder takes as input the features associated with the three special tokens and the pooled scene features, and applies a series of self- and cross-attention operations. Three heads produce the final outputs: \( \mathcal{M}_{\mathrm{loc}} \), the valid placement mask; \( \mathcal{M}_{\mathrm{anc}} \), an auxiliary mask that localizes the object anchors; and \( \mathcal{M}_{\mathrm{rot}} \), a mask indicating which rotation angles are valid at each location.
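For illustration, the following is a minimal PyTorch-style sketch of the decoder stage described above. The layer counts, dimensions, and dot-product mask-decoding scheme are assumptions made for the example, not the paper's exact architecture.

import torch
import torch.nn as nn

class PlacementDecoder(nn.Module):
    """Sketch of the transformer-based decoder: the [ANC], [LOC] and [ROT] token
    features attend to the pooled scene features, and three heads produce the
    anchor mask, the placement mask, and the per-location rotation mask.
    Dimensions, depths, and the mask-decoding scheme are illustrative."""

    def __init__(self, dim: int = 256, num_rotations: int = 8, num_layers: int = 2):
        super().__init__()
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_layers))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(num_layers))
        self.anc_proj = nn.Linear(dim, dim)             # -> M_anc via dot product
        self.loc_proj = nn.Linear(dim, dim)             # -> M_loc via dot product
        self.rot_head = nn.Linear(dim, num_rotations)   # -> M_rot per scene point

    def forward(self, token_feats: torch.Tensor, scene_feats: torch.Tensor):
        # token_feats: (B, 3, D) LLM features of the [ANC], [LOC], [ROT] tokens.
        # scene_feats: (B, P, D) pooled scene features.
        x = token_feats
        for sa, ca in zip(self.self_attn, self.cross_attn):
            x = x + sa(x, x, x)[0]                       # tokens exchange information
            x = x + ca(x, scene_feats, scene_feats)[0]   # tokens attend to the scene
        anc_tok, loc_tok, rot_tok = x[:, 0], x[:, 1], x[:, 2]
        # Per-point mask logits as dot products between token and scene features.
        m_anc = torch.einsum('bd,bpd->bp', self.anc_proj(anc_tok), scene_feats)
        m_loc = torch.einsum('bd,bpd->bp', self.loc_proj(loc_tok), scene_feats)
        # Per-point rotation validity, conditioned on the [ROT] token.
        m_rot = self.rot_head(scene_feats + rot_tok.unsqueeze(1))   # (B, P, R)
        return m_anc, m_loc, m_rot

# Toy usage with random features.
dec = PlacementDecoder()
m_anc, m_loc, m_rot = dec(torch.randn(1, 3, 256), torch.randn(1, 1024, 256))
print(m_anc.shape, m_loc.shape, m_rot.shape)   # (1, 1024) (1, 1024) (1, 1024, 8)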

Qualitative Results

Colored highlights indicate the anchors referenced in the textual prompts (predictions are generated entirely from point clouds, with anchor information provided only as text). The asset position is marked with a yellow circle, and a yellow arrow denotes the frontal orientation. Our method successfully follows the language instructions and meets the specified constraints. The top-right example illustrates a placement that satisfies the constraints but slightly intersects the scene mesh. The bottom-right example shows a failure case where one constraint is not met (highlighted in red).

Citation

                    
      @article{abdelreheem2025Placeit3d,
        author  = {Abdelreheem, Ahmed and Aleotti, Filippo and Watson, Jamie and Qureshi, Zawar and Eldesokey, Abdelrahman and Wonka, Peter and Brostow, Gabriel and Vicente, Sara and Garcia-Hernando, Guillermo},
        title   = {PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes},
        journal = {arXiv},
        year    = {2025}
      }

Copyright © Niantic Spatial 2025. Patent Pending. All rights reserved.