TY - JOUR
T1 - Patch and Model Size Characterization for On-Device Efficient-ViTs on Small Datasets Using 12 Quantitative Metrics
AU - Park, Jurn Gyu
AU - Amangeldi, Aidar
AU - Fakhrutdinov, Nail
AU - Karzhaubayeva, Meruyert
AU - Zorbas, Dimitrios
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Vision transformers (ViTs) have emerged as a successful alternative to convolutional neural networks (CNNs) in deep learning (DL) applications for computer vision (CV), particularly excelling in accuracy on large-scale datasets within high-performance computing (HPC) or cloud domains. However, for resource-constrained mobile and edge-AI devices, there is a lack of systematic and comprehensive investigation into the challenging optimization of both device-agnostic (e.g., accuracy and model size) and device-related (e.g., latency, memory usage, and power/energy consumption) multi-objectives. To address this problem, we 1) introduce five device-agnostic (DA) and seven device-related (DR) quantitative metrics, 2) use these metrics to thoroughly characterize the effects of the ViT patch-size and model-size hyper-parameters on small datasets, and 3) propose a simple yet effective optimization technique, the hierarchical and local (HelLo) tuning method, for efficient ViTs. The results show that our method achieves significant improvements of up to 85% in MACs, 67.2% in inference latency, 77.7% in training latency/time, 63.3% in GPU memory, 73.8% in energy consumption, and 263.0% in FoM, with minimal accuracy degradation (up to 2%).
KW - characterization
KW - deep learning
KW - edge-AI
KW - efficient vision transformers (ViTs)
KW - embedded systems
KW - mobile devices
KW - multi-objective optimization
KW - on-device ML
UR - http://www.scopus.com/inward/record.url?scp=85217040651&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217040651&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2025.3536471
DO - 10.1109/ACCESS.2025.3536471
M3 - Article
AN - SCOPUS:85217040651
SN - 2169-3536
VL - 13
SP - 25704
EP - 25722
JO - IEEE Access
JF - IEEE Access
ER -