Abstract
Current video anomaly detection (VAD) methods struggle to prioritize informative frames and lack effective mechanisms to collect both local and global video contexts. Traditional methods rely on deep 3-D convolutional neural network (CNN) backbones that still fail to capture the intricate spatiotemporal dynamics of surveillance footage. In addition, current inflated 3D convnet (I3D)- and contrastive language-image pre-training (CLIP)-based models exhibit substantial processing demands and require supplementary training data. Consequently, these models are impractical for real-time use in internet of things (IoT)–edge deployment. This research presents NASNetMobile–EViT, a lightweight framework that addresses both the computational burden and the contextual information deficiency that hinder weakly supervised anomaly detection. First, we implemented a frame motion selector module utilizing a Gaussian mixture model (GMM) that ranks and samples frames rich in motion cues, ensuring that downstream processing focuses on the most informative content. Second, we employed a pretrained, low-parameter NASNetMobile backbone that efficiently extracts fine-grained local spatial details. Third, we enhanced the Vision Transformer (ViT) with root-mean-square normalization and extended query-adaptive pooling (QAP) from 2-D images to temporal token maps, dynamically weighting frames to capture long-range temporal relations. The proposed framework outperforms existing models on four public benchmarks, achieving 91.61% AUC (UCF-Crime), 91.65% AP (XD-Violence), 30.36% accuracy (ActivityNet-VAD), and 77.00% accuracy (NREF). The results, achieved with 7.9M parameters and 0.71 giga floating-point operations (GFLOPs), show that NASNetMobile–EViT provides high accuracy and edge-level efficiency for autonomous, weakly supervised surveillance.
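The GMM-based frame motion selector described in the abstract can be sketched as follows. This is a minimal illustration of the general idea, not the paper's implementation: it assumes a per-frame motion score given by the mean absolute inter-frame difference, fits a two-component Gaussian mixture (via scikit-learn) to separate a low-motion mode from a high-motion mode, and ranks frames by their posterior under the high-motion component. The function name `select_motion_frames` and all parameter choices are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_motion_frames(frames, k=4, seed=0):
    """Rank frames by motion energy using a 2-component GMM (illustrative sketch).

    frames : (T, H, W) array of grayscale frames.
    Returns the indices of the k frames with the strongest motion cues.
    """
    # One motion score per frame transition: mean |frame[i+1] - frame[i]|.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    scores = diffs.mean(axis=(1, 2))
    # Fit a two-component GMM: one mode for static background, one for motion.
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(scores[:, None])
    high = int(np.argmax(gmm.means_))  # index of the high-motion component
    resp = gmm.predict_proba(scores[:, None])[:, high]
    order = np.argsort(resp)[::-1]     # most motion-rich transitions first
    return order[:k] + 1               # transition i corresponds to frame i+1

# Synthetic clip: 20 identical frames, with noise injected into frames 10-14.
rng = np.random.default_rng(0)
clip = np.tile(rng.random((1, 32, 32)), (20, 1, 1))
clip[10:15] += rng.random((5, 32, 32))
picked = select_motion_frames(clip, k=4)
```

On this synthetic clip, the selected indices fall inside the motion burst (frames 10-15), illustrating how such a selector would let downstream feature extraction skip static footage.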
Download File
Full text not available from this repository.
Official URL or Download Paper: https://ieeexplore.ieee.org/document/11220260/
Additional Metadata
| Field | Value |
|---|---|
| Item Type: | Article |
| Subject: | Signal Processing |
| Subject: | Information Systems |
| Divisions: | Faculty of Science |
| DOI Number: | https://doi.org/10.1109/JIOT.2025.3625045 |
| Publisher: | Institute of Electrical and Electronics Engineers Inc. |
| Keywords: | Enhanced Vision Transformer (ViT); Neural architecture search network mobile (NASNetMobile); Query-adaptive pooling (QAP); RMS normalization (RMSNorm); UCF-Crime; Video anomaly detection (VAD); Weakly supervised |
| Depositing User: | Ms. Che Wa Zakaria |
| Date Deposited: | 19 Mar 2026 02:43 |
| Last Modified: | 19 Mar 2026 02:43 |
| Altmetrics: | http://www.altmetric.com/details.php?domain=psasir.upm.edu.my&doi=10.1109/JIOT.2025.3625045 |
| URI: | http://psasir.upm.edu.my/id/eprint/122672 |