BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

Lan, Yuqing; Zhu, Chenyang; Gao, Zhirui; Zhang, Jiazhao; Cao, Yihan; Yi, Renjiao; Wang, Yijie; Xu, Kai

dc.contributor.author	Lan, Yuqing	en_US
dc.contributor.author	Zhu, Chenyang	en_US
dc.contributor.author	Gao, Zhirui	en_US
dc.contributor.author	Zhang, Jiazhao	en_US
dc.contributor.author	Cao, Yihan	en_US
dc.contributor.author	Yi, Renjiao	en_US
dc.contributor.author	Wang, Yijie	en_US
dc.contributor.author	Xu, Kai	en_US
dc.contributor.editor	Christie, Marc	en_US
dc.contributor.editor	Pietroni, Nico	en_US
dc.contributor.editor	Wang, Yu-Shuen	en_US
dc.date.accessioned	2025-10-07T05:02:41Z
dc.date.available	2025-10-07T05:02:41Z
dc.date.issued	2025
dc.description.abstract	Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient, real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection, coupled with CLIP to capture open-vocabulary semantics of the detected objects. To fuse the bounding boxes detected across views into unified per-instance boxes, we employ an association module that establishes multi-view correspondences and an optimization module that fuses the 3D bounding boxes of the same instance. The association module uses 3D Non-Maximum Suppression (NMS) and a box correspondence matching module. The optimization module uses an efficient IoU-guided random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on the CA-1M and ScanNetV2 datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this reconstruction-free paradigm for 3D object detection, our method generalizes well across various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.	en_US
dc.description.number	7
dc.description.sectionheaders	Detecting & Estimating from images and videos
dc.description.seriesinformation	Computer Graphics Forum
dc.description.volume	44
dc.identifier.doi	10.1111/cgf.70254
dc.identifier.issn	1467-8659
dc.identifier.pages	11 pages
dc.identifier.uri	https://doi.org/10.1111/cgf.70254
dc.identifier.uri	https://diglib.eg.org/handle/10.1111/cgf70254
dc.publisher	The Eurographics Association and John Wiley & Sons Ltd.	en_US
dc.subject	CCS Concepts: Computing methodologies → Scene understanding
dc.title	BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion	en_US
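
The abstract names 3D Non-Maximum Suppression as part of the association module. As a rough illustration only, the sketch below shows a generic greedy 3D NMS over axis-aligned boxes. It is not the authors' implementation: the axis-aligned box layout, the function names `iou_3d` and `nms_3d`, and the 0.5 threshold are all assumptions made for this example, and the paper's boxes may well be oriented.

```python
from typing import List, Sequence, Tuple

# Assumed box layout (not from the paper): (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    """Volume of an axis-aligned box."""
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    return inter / (volume(a) + volume(b) - inter)


def nms_3d(boxes: Sequence[Box], scores: Sequence[float],
           iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [
        (0.0, 0.0, 0.0, 1.0, 1.0, 1.0),
        (0.1, 0.0, 0.0, 1.1, 1.0, 1.0),  # near-duplicate of the first box
        (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # separate object
    ]
    scores = [0.9, 0.8, 0.7]
    print(nms_3d(boxes, scores))
```

In this toy input the second box overlaps the first heavily and is suppressed, while the distant third box is kept. A pipeline like the one described in the abstract would then pass the surviving boxes on to correspondence matching and fusion, which this sketch does not attempt.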

Files

Original bundle

Name: cgf70254.pdf
Size: 19.33 MB
Format: Adobe Portable Document Format

Name: paper1156_mm1.mp4
Size: 28.91 MB
Format: Video MP4

Name: paper1156_mm2.pdf
Size: 140.36 KB
Format: Adobe Portable Document Format

Collections

44-Issue 7