UPM Institutional Repository

Efficient NASNetMobile-enhanced Vision Transformer for weakly supervised video anomaly detection


Citation

Arif Mohamad, Muhammad Luqman and Abd Rahman, Mohd Amiruddin and Mohd Shah, Nurisya and Kumar Sangaiah, Arun (2026) Efficient NASNetMobile-enhanced Vision Transformer for weakly supervised video anomaly detection. IEEE Internet of Things Journal, 13 (1). pp. 536-548. ISSN 2327-4662

Abstract

Current video anomaly detection (VAD) methods struggle to prioritize informative frames and lack effective mechanisms for capturing both local and global video context. Traditional methods rely on deep 3D convolutional neural network (CNN) backbones that still fail to capture the intricate spatiotemporal dynamics of surveillance footage. In addition, current Inflated 3D ConvNet (I3D)- and contrastive language-image pre-training (CLIP)-based models impose substantial processing demands and require supplementary training data, making them impractical for real-time use in Internet of Things (IoT) edge deployments. This research presents NASNetMobile–EViT, a lightweight framework that addresses both the computational burden and the contextual information deficit that hinder weakly supervised anomaly detection. First, we implemented a frame motion selector module based on a Gaussian mixture model (GMM) that ranks and samples motion-rich frames, ensuring that downstream processing focuses on the most informative content. Second, we employed a pretrained, low-parameter NASNetMobile backbone that efficiently extracts fine-grained local spatial details. Third, we enhanced the Vision Transformer (ViT) with root-mean-square normalization (RMSNorm) and extended query-adaptive pooling (QAP) from 2D images to temporal token maps, dynamically weighting frames to capture long-range temporal relations. The proposed framework outperforms existing models on four public benchmarks, achieving 91.61% AUC (UCF-Crime), 91.65% AP (XD-Violence), 30.36% accuracy (ActivityNet-VAD), and 77.00% accuracy (NREF). Achieved with only 7.9M parameters and 0.71 giga floating-point operations (GFLOPs), these results show that NASNetMobile–EViT delivers high accuracy with edge-level efficiency for autonomous, weakly supervised surveillance.
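
The abstract gives no implementation details for the GMM-based frame motion selector. The following is a minimal sketch under stated assumptions: motion is scored by the mean absolute difference between consecutive frames, and scikit-learn's GaussianMixture separates high- from low-motion frames. The scoring rule, component count, and function name select_motion_frames are all illustrative, not the paper's design.

import numpy as np
from sklearn.mixture import GaussianMixture

def select_motion_frames(frames, top_k=16):
    # frames: (T, H, W) grayscale video array.
    # Score each transition by the mean absolute difference between
    # consecutive frames (a simple, illustrative motion cue).
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    scores = diffs.mean(axis=(1, 2))                  # shape (T-1,)
    # Fit a 2-component GMM and treat the component with the larger
    # mean as the "high-motion" cluster.
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    high = int(np.argmax(gmm.means_.ravel()))
    candidates = np.flatnonzero(labels == high)
    # Keep the top_k highest-scoring candidates, returned in
    # temporal order (score i describes frame i+1).
    ranked = candidates[np.argsort(scores[candidates])[::-1][:top_k]]
    return np.sort(ranked) + 1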
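
RMSNorm itself is a standard layer: each token is rescaled by the root mean square of its features and multiplied by a learned gain, with no mean subtraction. Below is a minimal PyTorch sketch of that standard formulation; where exactly the paper places it inside the ViT blocks is not specified here.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Standard RMSNorm: rescale by the root mean square of the
    # feature dimension, then apply a learned per-feature gain.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * (x / rms)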
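
The extension of QAP from 2D images to temporal token maps is described only at a high level. One plausible reading is attention-style pooling in which a learned query scores each frame token; the sketch below is that reading, and the module name TemporalQueryPool and its parameterization are assumptions rather than the authors' published layer.

import torch
import torch.nn as nn

class TemporalQueryPool(nn.Module):
    # Illustrative attention pooling over frame tokens: a learned
    # query scores every frame, and the clip embedding is the
    # softmax-weighted sum of the frame tokens.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.key = nn.Linear(dim, dim)

    def forward(self, tokens):                 # tokens: (B, T, D)
        keys = self.key(tokens)                # (B, T, D)
        scores = keys @ self.query / keys.shape[-1] ** 0.5   # (B, T)
        weights = scores.softmax(dim=-1)       # per-frame weights
        return (weights.unsqueeze(-1) * tokens).sum(dim=1)   # (B, D)

Under this reading, a (B, T, D) stack of per-frame backbone embeddings collapses to a single (B, D) clip descriptor whose weights emphasize the frames the query finds most informative.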


Download File

Full text not available from this repository.
Official URL: https://ieeexplore.ieee.org/document/11220260/

Additional Metadata

Item Type: Article
Subjects: Signal Processing; Information Systems
Divisions: Faculty of Science
DOI Number: https://doi.org/10.1109/JIOT.2025.3625045
Publisher: Institute of Electrical and Electronics Engineers Inc.
Keywords: Enhanced Vision Transformer (ViT); Neural Architecture Search Network Mobile (NASNetMobile); Query-adaptive pooling (QAP); Root-mean-square normalization (RMSNorm); UCF-Crime; Video anomaly detection (VAD); Weakly supervised
Depositing User: Ms. Che Wa Zakaria
Date Deposited: 19 Mar 2026 02:43
Last Modified: 19 Mar 2026 02:43
Altmetrics: http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.1109/JIOT.2025.3625045
URI: http://psasir.upm.edu.my/id/eprint/122672
