Show simple item record

dc.contributor.author     Nguyen, H. Tran
dc.contributor.author     Ta, Duc Huy
dc.contributor.author     Duong, T. M. Soan
dc.contributor.author     Nguyen, Phan
dc.contributor.author     Dao, Huu Hung
dc.contributor.author     Nguyen, D. Tr. Chanh
dc.contributor.author     Bui, Trung
dc.contributor.author     Truong, Q. H. Steven
dc.date.accessioned       2025-03-23T17:08:37Z
dc.date.available         2025-03-23T17:08:37Z
dc.date.issued            2022
dc.identifier.uri         https://vinspace.edu.vn/handle/VIN/599
dc.description.abstract   Vision Transformer (ViT) variants have demonstrated state-of-the-art performance on many computer vision benchmarks, including crowd counting. Although Transformer-based models have achieved breakthroughs in crowd counting, existing methods have some limitations. Global embeddings extracted from ViTs do not encapsulate fine-grained local features and are therefore prone to errors in crowded scenes with diverse human scales and densities. In this paper, we propose LoViTCrowd with the argument that LOcal features with spatial information from relevant regions, obtained via the attention mechanism of ViT, can effectively reduce the crowd-counting error. To this end, we divide each image into a cell grid. Considering patches of 3 × 3 cells, in which the main parts of the human body are captured, the surrounding cells provide meaningful cues for crowd estimation. ViT is adapted to each patch, employing the attention mechanism across the 3 × 3 cells to count the number of people in the central cell. The number of people in the image is obtained by summing the counts of its non-overlapping cells. Extensive experiments on four public datasets of sparse and dense scenes, i.e., Mall, ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF, demonstrate our method's state-of-the-art performance. Compared to TransCrowd, LoViTCrowd reduces the root mean square error (RMSE) and the mean absolute error (MAE) by an average of 14.2% and 9.7%, respectively. The source code is available at https://github.com/nguyen1312/LoViTCrowd.   en_US
dc.language.iso           en_US   en_US
dc.title                  Improving local features with relevant spatial information by vision transformer for crowd counting   en_US
dc.type                   Article   en_US
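The counting scheme described in the abstract (grid of non-overlapping cells, a count regressed per cell from its surrounding 3 × 3-cell patch, image count as the sum over cells) can be sketched as follows. This is a minimal illustration based only on the abstract: the `cell_counter` callable stands in for the paper's ViT-based regressor, and the function name and zero-padding at image borders are assumptions, not the authors' implementation.

```python
import numpy as np

def count_crowd(image, cell_size, cell_counter):
    """Sum per-cell counts over a grid of non-overlapping cells.

    Each cell's count is predicted from the 3x3-cell patch centred on it
    (zero-padded at the borders so every cell sees a full patch), as in
    the scheme the abstract describes. `cell_counter` is a placeholder
    for the ViT-based patch regressor used in the paper.
    """
    h, w = image.shape[:2]
    rows, cols = h // cell_size, w // cell_size
    # Pad by one cell on every side so border cells get a full 3x3 patch.
    padded = np.pad(image, cell_size, mode="constant")
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # Top-left corner of the 3x3 patch in padded coordinates.
            y0, x0 = r * cell_size, c * cell_size
            patch = padded[y0:y0 + 3 * cell_size, x0:x0 + 3 * cell_size]
            total += cell_counter(patch)
    return total
```

With a ground-truth density map as input and a counter that integrates the central cell of each patch, the returned total equals the density map's integral, matching the abstract's claim that the image count is the sum of its non-overlapping cell counts.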




Vin University Library
Da Ton, Gia Lam
Vinhomes Oceanpark, Ha Noi, Viet Nam
Phone: +84-2471-089-779 | 1800-8189
Contact: library@vinuni.edu.vn