UPM Institutional Repository

VS-BIM: a cognitive map-driven framework enhancing MLLM for automatic safety inspection in construction


Citation

Wang, Lei and Liu, Yu and Wang, Cunrui and An, Hongda and Li, Yiting (2026) VS-BIM: a cognitive map-driven framework enhancing MLLM for automatic safety inspection in construction. Advanced Engineering Informatics, 69. art. no. 103985. pp. 1-18. ISSN 1474-0346

Abstract

The rise of Multimodal Large Language Models (MLLMs) offers new potential for automated construction safety inspection. However, current discriminative vision-language alignment approaches struggle with spatial understanding and complex reasoning, limiting proactive risk detection. To address this, we propose VS-BIM, a generative zero-shot framework driven by cognitive maps. We reconstruct 3D scenes from panoramic video and Building Information Models (BIM), and align visual and semantic information into a queryable 3D cognitive map that serves as the spatial working memory for MLLMs. VS-BIM includes three modules: (1) panoramic video and BIM-based 3D reconstruction; (2) pixel-level semantic embedding using segmentation and pretrained vision-language models; and (3) cognitive map generation via multi-view consistency constraints. We evaluate VS-BIM using public datasets and custom construction environments. In addition, we introduce the CER-QA benchmark to test its performance in configuration recognition, spatial reasoning, spatiotemporal inference, and risk assessment. Results demonstrate VS-BIM excels in 3D object detection, achieves near-human spatial reasoning, and surpasses human averages in spatial estimation. Overall, the cognitive-map-based zero-shot paradigm endows MLLMs with stronger spatial reasoning while greatly reducing the need for large-scale annotation and retraining. By adapting seamlessly to complex, dynamic sites, it shifts safety inspection from passive checks to proactive risk prediction, enabling more efficient construction-site management.
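Illustrative note: the abstract does not give implementation details, so the following is only a minimal sketch of what a "queryable 3D cognitive map" could look like in code. Everything in it is an assumption made for illustration: the class name `CognitiveMap`, the helper `fuse_views`, the toy 3-dimensional embeddings, and in particular the fusion rule. Plain averaging of normalised per-view embeddings is used here as a simple stand-in for the paper's multi-view consistency constraints and should not be read as the authors' method.

```python
import numpy as np


def fuse_views(view_embeddings):
    """Fuse per-view embeddings of one object into a single unit vector.

    Assumption: simple averaging of L2-normalised vectors, used here as a
    stand-in for the multi-view consistency step named in the abstract.
    """
    v = np.asarray(view_embeddings, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    mean = v.mean(axis=0)
    return mean / np.linalg.norm(mean)


class CognitiveMap:
    """Hypothetical store of 3D positions with fused semantic embeddings."""

    def __init__(self):
        self.labels = []
        self.positions = []
        self.embeddings = []

    def add(self, label, position, view_embeddings):
        # One entry per object: a 3D position plus its fused embedding.
        self.labels.append(label)
        self.positions.append(np.asarray(position, dtype=float))
        self.embeddings.append(fuse_views(view_embeddings))

    def query(self, text_embedding, top_k=1):
        # Rank stored objects by cosine similarity to the query embedding.
        q = np.asarray(text_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.embeddings) @ q
        order = np.argsort(-sims)[:top_k]
        return [(self.labels[i], self.positions[i], float(sims[i])) for i in order]

    def distance(self, label_a, label_b):
        # Euclidean distance between two stored objects, in map units.
        a = self.positions[self.labels.index(label_a)]
        b = self.positions[self.labels.index(label_b)]
        return float(np.linalg.norm(a - b))


# Toy data: two objects, each observed from two views.
cmap = CognitiveMap()
cmap.add("worker", (0.0, 0.0, 0.0), [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
cmap.add("excavator", (3.0, 4.0, 0.0), [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]])

best_label, best_position, best_score = cmap.query([0.0, 1.0, 0.0])[0]
gap = cmap.distance("worker", "excavator")
```

In a setup like this, an MLLM would not see raw pixels for spatial questions; it would be handed retrieved entries (label, position, similarity) and derived quantities such as `gap`, which is one plausible reading of the map acting as "spatial working memory".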


Additional Metadata

Item Type: Article
Subject: Information Systems
Subject: Artificial Intelligence
Divisions: Faculty of Computer Science and Information Technology
DOI Number: https://doi.org/10.1016/j.aei.2025.103985
Publisher: Elsevier
Keywords: 3D realistic geospatial scene; Computer vision; Construction site safety; Object detection; Spatial temporal relationship
Depositing User: Mr. Mohamad Syahrul Nizam Md Ishak
Date Deposited: 13 Jan 2026 00:30
Last Modified: 13 Jan 2026 00:35
URI: http://psasir.upm.edu.my/id/eprint/122268