TL;DR: We ground CLIP features in a set of 3D language Gaussians, yielding precise 3D language fields while being 199× faster than LERF.
Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By employing a tile-based splatting technique to render language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods suffer from imprecise and vague 3D language fields that fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, which eliminates both the need to query the language field extensively across various scales and the need for DINO feature regularization. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat significantly outperforms the previous state-of-the-art method, LERF. Notably, LangSplat is extremely efficient, achieving a 199× speedup over LERF.
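As a rough illustration of two ideas in the abstract, the sketch below blends per-Gaussian latent language features for a single pixel and compares the storage cost of full CLIP embeddings with compact latents. It is not the LangSplat implementation. The front-to-back alpha blending is the standard compositing rule used in 3D Gaussian splatting, assumed here to carry over to language features. The function names, the 512-dimensional embedding size, the 3-dimensional latent size, and the Gaussian count are all hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple


def composite_language_feature(
    gaussians: Sequence[Tuple[float, Sequence[float]]],
    min_transmittance: float = 1e-4,
) -> List[float]:
    """Blend latent language features for one pixel, front to back.

    `gaussians` is a depth-sorted sequence of (alpha, feature) pairs, where
    alpha is the opacity a Gaussian contributes at this pixel (hypothetical
    input format, not the authors' data layout).
    """
    if not gaussians:
        return []
    dim = len(gaussians[0][1])
    pixel = [0.0] * dim
    transmittance = 1.0  # fraction of light not yet absorbed
    for alpha, feature in gaussians:
        weight = alpha * transmittance
        for k in range(dim):
            pixel[k] += weight * feature[k]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # remaining Gaussians are effectively occluded
    return pixel


def feature_memory_bytes(
    num_gaussians: int, feature_dim: int, bytes_per_value: int = 4
) -> int:
    """Storage needed for one float feature vector per Gaussian."""
    return num_gaussians * feature_dim * bytes_per_value


# Two overlapping Gaussians with 3-dimensional latent features (assumed size).
pixel = composite_language_feature([(0.5, [1.0, 0.0, 0.0]), (0.5, [0.0, 1.0, 0.0])])

# Assumed sizes: one million Gaussians, 512-d embeddings vs. 3-d latents.
explicit_bytes = feature_memory_bytes(1_000_000, 512)
latent_bytes = feature_memory_bytes(1_000_000, 3)
```

Under these assumed sizes, storing full embeddings takes roughly 170 times more memory than storing latents, which is the kind of saving the scene-wise autoencoder is meant to provide. A decoder would then map each rendered latent back to CLIP space before comparing it with a text query.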
Visualization of the features learned by the previous SOTA method LERF and by our LangSplat. LangSplat grounds CLIP features in a set of 3D language Gaussians to construct a 3D language field. While LERF generates imprecise and vague 3D features, LangSplat accurately captures object boundaries and provides precise 3D language fields without any post-processing. Besides being effective, LangSplat is also 199× faster than LERF at a resolution of 1440 × 1080.
LangSplat focuses more precisely on the queried object.
Our method is 199× faster than LERF at 1440 × 1080 resolution.
Prior to any text query, our language field already exhibits precise 3D object boundaries.
@article{qin2023langsplat,
title={LangSplat: 3D Language Gaussian Splatting},
author={Qin, Minghan and Li, Wanhua and Zhou, Jiawei and Wang, Haoqian and Pfister, Hanspeter},
journal={arXiv preprint arXiv:2312.16084},
year={2023}
}