BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

dc.contributor.author: Lan, Yuqing [en_US]
dc.contributor.author: Zhu, Chenyang [en_US]
dc.contributor.author: Gao, Zhirui [en_US]
dc.contributor.author: Zhang, Jiazhao [en_US]
dc.contributor.author: Cao, Yihan [en_US]
dc.contributor.author: Yi, Renjiao [en_US]
dc.contributor.author: Wang, Yijie [en_US]
dc.contributor.author: Xu, Kai [en_US]
dc.contributor.editor: Christie, Marc [en_US]
dc.contributor.editor: Pietroni, Nico [en_US]
dc.contributor.editor: Wang, Yu-Shuen [en_US]
dc.date.accessioned: 2025-10-07T05:02:41Z
dc.date.available: 2025-10-07T05:02:41Z
dc.date.issued: 2025
dc.description.abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module. The optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on CA-1M and ScanNetV2 datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters. [en_US]
dc.description.number: 7
dc.description.sectionheaders: Detecting & Estimating from images and videos
dc.description.seriesinformation: Computer Graphics Forum
dc.description.volume: 44
dc.identifier.doi: 10.1111/cgf.70254
dc.identifier.issn: 1467-8659
dc.identifier.pages: 11 pages
dc.identifier.uri: https://doi.org/10.1111/cgf.70254
dc.identifier.uri: https://diglib.eg.org/handle/10.1111/cgf70254
dc.publisher: The Eurographics Association and John Wiley & Sons Ltd. [en_US]
dc.subject: CCS Concepts: Computing methodologies → Scene understanding
dc.subject: Computing methodologies → Scene understanding
dc.title: BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion [en_US]
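
Illustrative note: the abstract names 3D Non-Maximum Suppression (NMS) as part of the association module. The Python sketch below shows the generic idea only and is not the authors' implementation. It assumes axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax), whereas the paper's boxes may be oriented. The function names `iou_3d_axis_aligned` and `nms_3d` and the 0.5 threshold are hypothetical choices made for this example.

```python
from typing import List, Sequence

# Assumed box layout for this sketch: (xmin, ymin, zmin, xmax, ymax, zmax).
Box = Sequence[float]


def box_volume(b: Box) -> float:
    """Volume of an axis-aligned box."""
    return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Volume intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def nms_3d(boxes: Sequence[Box], scores: Sequence[float],
           iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at most the threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou_3d_axis_aligned(boxes[i], boxes[k]) <= iou_threshold
               for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    demo_boxes = [
        (0.0, 0.0, 0.0, 1.0, 1.0, 1.0),
        (0.1, 0.0, 0.0, 1.1, 1.0, 1.0),  # heavily overlaps the first box
        (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # far away from both
    ]
    demo_scores = [0.9, 0.8, 0.7]
    print(nms_3d(demo_boxes, demo_scores))
```

In the demo, the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the distant third box is kept. A pipeline like the one in the abstract would then pass the surviving boxes to a correspondence and fusion stage, which this sketch does not attempt to model.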
Files
Original bundle
cgf70254.pdf (19.33 MB, Adobe Portable Document Format)
paper1156_mm1.mp4 (28.91 MB, Video MP4)
paper1156_mm2.pdf (140.36 KB, Adobe Portable Document Format)