Patch and Model Size Characterization for On-Device Efficient-ViTs on Small Datasets Using 12 Quantitative Metrics

Jurn Gyu Park, Aidar Amangeldi, Nail Fakhrutdinov, Meruyert Karzhaubayeva, Dimitrios Zorbas

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Vision transformers (ViTs) have emerged as a successful alternative to convolutional neural networks (CNNs) in deep learning (DL) applications for computer vision (CV), excelling in accuracy particularly on large-scale datasets in high-performance computing (HPC) or cloud settings. For resource-constrained mobile and edge AI devices, however, there is a lack of systematic and comprehensive studies of the challenging multi-objective optimization across both device-agnostic objectives (e.g., accuracy and model size) and device-related objectives (e.g., latency, memory usage, and power/energy consumption). To address this gap, we 1) introduce five device-agnostic (DA) and seven device-related (DR) quantitative metrics, 2) use them to thoroughly characterize the effects of ViT hyper-parameters, namely patch size and model size, on small datasets, and 3) propose a simple yet effective optimization technique for efficient ViTs called the hierarchical and local (HelLo) tuning method. The results show that our method achieves significant improvements of up to 85% in multiply-accumulate operations (MACs), 67.2% in inference latency, 77.7% in training latency/time, 63.3% in GPU memory, 73.8% in energy consumption, and 263.0% in FoM, with minimal accuracy degradation (up to 2%).
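
As a rough illustration of why patch size and model size are such strong levers on MACs, the sketch below counts tokens and estimates encoder MACs for a plain ViT. It is not the paper's measurement procedure or the HelLo method: the function names, the 32x32 input, the embedding width of 192, the depth of 12, and the MLP ratio of 4 are all assumptions chosen for illustration, and the estimate ignores layer norms, softmax, biases, and the classification head.

```python
def num_tokens(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens a plain ViT sees for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if class_token else 0)


def approx_encoder_macs(image_size: int, patch_size: int, dim: int, depth: int,
                        mlp_ratio: int = 4, channels: int = 3) -> int:
    """Back-of-the-envelope MAC count (hypothetical helper, matmuls only)."""
    n = num_tokens(image_size, patch_size)
    n_patches = n - 1  # the class token is not produced by the patch embedding
    # Linear patch embedding: each patch of P*P*C values is projected to dim.
    patch_embed = n_patches * (patch_size * patch_size * channels) * dim
    # Q, K, V, and output projections: four (dim x dim) matmuls per token.
    attn_proj = 4 * n * dim * dim
    # Attention scores (Q K^T) and the weighted sum over V: quadratic in tokens.
    attn_mix = 2 * n * n * dim
    # Two-layer MLP with hidden width mlp_ratio * dim.
    mlp = 2 * n * dim * (mlp_ratio * dim)
    return patch_embed + depth * (attn_proj + attn_mix + mlp)


if __name__ == "__main__":
    # Assumed configuration: 32x32 inputs, width 192, 12 layers.
    for patch in (2, 4, 8, 16):
        tokens = num_tokens(32, patch)
        macs = approx_encoder_macs(32, patch, dim=192, depth=12)
        print(f"patch={patch:2d} tokens={tokens:4d} approx_MACs={macs / 1e6:.1f}M")
```

Doubling the patch size cuts the patch count by a factor of four, which shrinks the per-token terms linearly and the attention-mixing term quadratically. Device-related metrics such as latency, memory, and energy still have to be measured on the target hardware.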

Original language: English
Pages (from-to): 25704-25722
Number of pages: 19
Journal: IEEE Access
Volume: 13
Publication status: Published - 2025

Keywords

  • characterization
  • deep learning
  • edge-AI
  • efficient vision transformers (ViTs)
  • embedded systems
  • mobile devices
  • multi-objective optimization
  • on-device ML

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
