
A Decade's Battle on the Bias of Vision Backbone and Optimizer

1AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China 2DAMO Academy, Hangzhou, China
*Equal Contribution Corresponding author

Introduction

🌟[NEW!] This paper introduces the first comprehensive benchmark exploring the intricate interplay between backbone architectures and optimizer selection in computer vision.

🌟[NEW!] We unveil the phenomenon of backbone-optimizer coupling bias (BOCB), elucidating the limitations it can impose on vision backbones, such as the additional fine-tuning time and effort required for downstream tasks.

🌟[NEW!] Through an in-depth analysis, we unravel the underlying rationale behind diverse network designs and their susceptibility to BOCB, thereby providing valuable guidelines for future vision backbone architecture engineering. Furthermore, our benchmarking results and publicly released codebase serve as practical resources for user-friendly deployment and evaluation.

🤔So, what is BOCB, and where does it come from?



Q1: Does any dependency exist between existing vision backbones and various optimizers?

Q2: If such dependencies exist, (how) do they affect the practice of vision backbones?

Q3: In the case of adverse impacts, can we identify any correspondence between these negative dependencies and specific network architecture elements to minimize potential risks?

🗣️APPEAL:
We aim to inspire the computer vision community to rethink the relationship between backbones and optimizers, consider BOCB in future studies, and thus contribute to more systematic future advancements.

Abstract

The past decade has witnessed rapid progress in vision backbones and an evolution of deep optimizers from SGD to Adam variants. This paper, for the first time, delves into the relationship between vision network design and optimizer selection. We conduct comprehensive benchmarking studies on mainstream vision backbones and widely-used optimizers, revealing an intriguing phenomenon termed backbone-optimizer coupling bias (BOCB). Notably, classical ConvNets, such as VGG and ResNet, exhibit a marked co-dependency with SGD, while modern architectures, including ViTs and ConvNeXt, demonstrate a strong coupling with optimizers with adaptive learning rates like AdamW. More importantly, we uncover the adverse impacts of BOCB on popular backbones in real-world practice, such as additional tuning time and resource overhead, which indicates the remaining challenges and even potential risks. Through in-depth analysis and apples-to-apples comparisons, however, we surprisingly observe that specific types of network architecture can significantly mitigate BOCB, which might serve as promising guidelines for future backbone design. We hope this work as a kick-start can inspire the community to further question the long-held assumptions on vision backbones and optimizers, consider BOCB in future studies, and thus contribute to more robust, efficient, and effective vision systems. It is time to go beyond those usual choices and confront the elephant in the room. The source code and models are publicly available.

Visual Insights

Backbone Roadmap


General Algorithm of Optimizer for DNNs

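As a rough sketch of this general algorithm (the function and variable names below are illustrative, not the paper's code), most benchmarked optimizers combine three components in one update: a gradient estimate (momentum), a per-parameter learning-rate estimate (second moment), and an optional decoupled weight decay:

```python
import math

def generic_update(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.0, adaptive=True):
    """One step of a generic first-order optimizer (sketch, scalar case).

    Varies the two components the taxonomy above distinguishes:
    - gradient estimation: momentum as an EMA of gradients,
    - learning-rate estimation: an EMA of squared gradients (adaptive)
      or a fixed scalar learning rate (non-adaptive).
    Weight decay is applied in the decoupled (AdamW-style) form.
    """
    m = state.get("m", 0.0)
    v = state.get("v", 0.0)
    t = state.get("t", 0) + 1

    m = beta1 * m + (1 - beta1) * grad          # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # 2nd moment

    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)

    step = m_hat / (math.sqrt(v_hat) + eps) if adaptive else m_hat
    theta = theta - lr * (step + weight_decay * theta)  # decoupled decay

    state.update(m=m, v=v, t=t)
    return theta
```

Setting `adaptive=False` recovers an SGD-M-style step, while the default path behaves like an AdamW-style update; the benchmarked optimizers differ mainly in how these estimators are constructed.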

Four categories of optimizers

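As a sketch, one standard representative update rule per category (textbook forms from the literature; assigning each rule to a category here follows the grouping above and is our reading, not a formula from this page):

```latex
% g_t: gradient, \theta_t: parameters, \eta: learning rate
\begin{aligned}
&\text{(a) SGD-M:} && m_t = \mu\, m_{t-1} + g_t, \qquad
  \theta_t = \theta_{t-1} - \eta\, m_t \\
&\text{(b) AdamW:} && m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \quad
  v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
& && \theta_t = \theta_{t-1}
  - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  + \lambda\, \theta_{t-1} \right) \\
&\text{(c) LARS (layer } l\text{):} &&
  \theta_t^{(l)} = \theta_{t-1}^{(l)}
  - \eta\, \frac{\lVert \theta_{t-1}^{(l)} \rVert}
  {\lVert g_t^{(l)} \rVert + \lambda \lVert \theta_{t-1}^{(l)} \rVert}\,
  m_t^{(l)} \\
&\text{(d) RMSProp:} && v_t = \alpha\, v_{t-1} + (1-\alpha)\, g_t^2, \qquad
  \theta_t = \theta_{t-1} - \eta\, \frac{g_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
```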

Hyper-parameter consistency across optimizers for a given backbone


Optimizer generality across backbones


Model parameter patterns


CIFAR100 Benchmark

Top-1 accuracy (%) of representative vision backbones with 20 popular optimizers on CIFAR-100. Torch-style training settings are used for AlexNet, VGG-13, R-50/101 (ResNet-50/101), and MobV2 (MobileNet.V2), while the other backbones adopt modern training recipes, including Eff-B0 (EfficientNet-B0). IF-12, PF-12, CF-12, AF-12, and CA-12 abbreviate MetaFormer S12 variants. The blue and gray regions denote the top and outlier (trivial) results, while the others are inliers.


| Optimizer | Alex | VGG | R-50 | R-101 | MobV2 | Eff-B0 | DeiT-S | MLP-S | Swin-T | CX-T | Moga-S | IF-12 | PF-12 | CF-12 | AF-12 | CA-12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SGD-M | 66.78 | 77.08 | 78.76 | 84.87 | 77.16 | 79.41 | 63.39 | 72.64 | 78.95 | 60.09 | 75.06 | 77.40 | 77.70 | 83.46 | 83.02 | 81.21 |
| SGDP | 66.54 | 77.56 | 79.25 | 85.30 | 77.50 | 79.55 | 63.53 | 69.24 | 80.56 | 61.25 | 80.86 | 77.55 | 77.53 | 83.54 | 82.88 | 81.56 |
| LION | 62.11 | 73.87 | 75.28 | 82.79 | 75.35 | 76.97 | 74.57 | 74.19 | 81.84 | 82.29 | 85.03 | 78.65 | 79.66 | 84.62 | 82.41 | 79.59 |
| Adam | 65.29 | 73.41 | 74.10 | 83.34 | 74.56 | 76.48 | 71.04 | 72.84 | 80.71 | 82.03 | 84.92 | 78.39 | 79.18 | 84.81 | 81.54 | 82.18 |
| Adamax | 67.30 | 73.80 | 75.21 | 83.27 | 74.60 | 78.37 | 73.31 | 73.07 | 81.28 | 80.25 | 84.51 | 78.02 | 79.55 | 84.31 | 81.42 | 82.50 |
| AdamP | 60.27 | 75.56 | 78.17 | 84.64 | 77.79 | 78.65 | 71.55 | 73.66 | 80.91 | 84.47 | 86.45 | 79.20 | 81.70 | 85.15 | 82.12 | 83.40 |
| AdamW | 62.71 | 73.90 | 75.56 | 84.01 | 76.88 | 78.77 | 72.15 | 73.59 | 81.30 | 83.52 | 86.19 | 79.39 | 80.55 | 85.46 | 82.24 | 83.60 |
| Adan | 63.98 | 74.90 | 77.08 | 84.96 | 77.73 | 78.43 | 76.33 | 79.94 | 83.35 | 84.65 | 86.45 | 80.59 | 83.23 | 85.58 | 83.51 | 84.89 |
| LAMB | 66.90 | 75.55 | 77.19 | 85.05 | 77.49 | 78.77 | 75.39 | 74.98 | 83.47 | 84.13 | 86.04 | 80.21 | 80.01 | 85.40 | 83.16 | 83.74 |
| NAdam | 60.49 | 73.96 | 74.56 | 82.78 | 75.69 | 77.06 | 72.75 | 73.77 | 81.80 | 82.26 | 85.23 | 78.37 | 80.32 | 84.81 | 81.82 | 82.83 |
| RAdam | 61.69 | 74.64 | 75.19 | 81.85 | 75.62 | 77.08 | 72.41 | 72.11 | 79.84 | 82.18 | 84.95 | 78.46 | 79.71 | 84.93 | 81.44 | 82.35 |
| AdaBelief | 62.98 | 75.09 | 80.53 | 85.47 | 75.78 | 78.48 | 70.66 | 73.30 | 80.98 | 83.31 | 84.80 | 78.55 | 81.00 | 85.03 | 83.21 | 83.56 |
| AdaBound | 66.59 | 77.00 | 78.11 | 84.45 | 78.76 | 79.88 | 68.59 | 70.31 | 80.67 | 49.18 | 78.48 | 75.03 | 77.62 | 82.73 | 83.08 | 82.38 |
| AdaFactor | 63.91 | 74.49 | 75.41 | 84.42 | 75.38 | 77.83 | 74.02 | 71.16 | 80.36 | 82.82 | 85.17 | 78.78 | 78.81 | 84.90 | 81.94 | 82.36 |
| LARS | 64.35 | 75.71 | 78.25 | 84.45 | 76.23 | 72.43 | 71.36 | 72.64 | 81.29 | 61.40 | 75.93 | 77.66 | 78.78 | 82.98 | 81.00 | 82.05 |
| NovoGrad | 64.24 | 76.09 | 79.36 | 85.23 | 74.83 | 74.23 | 73.13 | 67.03 | 81.82 | 79.99 | 82.86 | 77.16 | 80.42 | 83.51 | 81.28 | 82.98 |
| Sophia | 64.30 | 74.18 | 75.19 | 82.54 | 76.60 | 78.95 | 71.47 | 72.74 | 80.61 | 83.76 | 85.39 | 77.67 | 78.90 | 84.58 | 81.67 | 82.96 |
| AdaGrad | 45.79 | 71.29 | 73.30 | 81.81 | 33.87 | 77.93 | 67.24 | 67.50 | 75.83 | 83.03 | 83.03 | 32.28 | 44.40 | 79.67 | 78.71 | 38.09 |
| AdaDelta | 66.72 | 74.14 | 75.07 | 83.58 | 75.32 | 77.88 | 65.44 | 71.32 | 80.25 | 74.25 | 81.06 | 75.91 | 76.40 | 84.05 | 82.62 | 82.08 |
| RMSProp | 59.33 | 73.30 | 74.25 | 79.38 | 73.94 | 76.83 | 70.71 | 71.63 | 77.52 | 82.29 | 85.17 | 77.40 | 77.14 | 84.01 | 79.72 | 81.83 |

Transfer Learning to Object Detection and 2D Pose Estimation

Transfer learning to object detection (Det.) with RetinaNet and 2D pose estimation (Pose.) with TopDown on COCO, evaluated by mAP (%) and AP50 (%). We employ VGG, ResNet-50 (R-50), Swin-T, and ConvNeXt-T (CX-T) pre-trained under different settings: 100-epoch pre-training with SGD, LARS, or RSB A3 (LAMB); 300-epoch pre-training with AdamW or RSB A2 (LAMB); and 600-epoch pre-training with RSB A1 (LAMB).

| Optimizer | Pose: VGG (SGD) | Pose: R-50 (SGD) | Pose: Swin-T (AdamW) | Det: VGG (SGD) | Det: R-50 (SGD) | Det: R-50 (A3) | Det: R-50 (A2) | Det: R-50 (A1) | Det: Swin-T (AdamW) | Det: CX-T (AdamW) |
|---|---|---|---|---|---|---|---|---|---|---|
| SGD-M | 47.5 | 45.6 | 38.4 | 36.6 | 27.5 | 28.7 | 23.7 | 34.6 | 37.2 | 38.5 |
| SGDP | 47.3 | 41.2 | 38.9 | 36.6 | 17.6 | 18.5 | 26.8 | 26.7 | 37.2 | 22.5 |
| LION | 69.5 | 71.5 | 71.3 | 32.1 | 35.8 | 35.4 | 37.6 | 34.6 | 41.9 | |
| Adam | 69.8 | 71.6 | 72.7 | 36.2 | 36.2 | 35.8 | 38.3 | 38.4 | 41.9 | 43.1 |
| Adamax | 69.0 | 71.2 | 72.4 | 36.8 | 36.8 | 36.4 | 38.3 | 38.4 | 41.5 | 42.0 |
| AdamP | 69.7 | 71.5 | 72.8 | 36.5 | 37.2 | 36.5 | 38.5 | 38.9 | 41.7 | 43.3 |
| AdamW | 70.0 | 72.0 | 72.8 | 37.1 | 37.1 | 36.7 | 38.4 | 39.5 | 41.8 | 43.4 |
| Adan | 69.7 | 72.1 | 72.8 | 37.7 | 37.0 | 36.0 | 38.6 | 39.0 | 42.0 | 43.2 |
| LAMB | 68.5 | 71.5 | 71.7 | 36.7 | 37.5 | 37.7 | 38.6 | 38.9 | 41.8 | 42.6 |
| NAdam | 69.7 | 71.8 | 71.9 | 36.0 | 36.6 | 36.1 | 38.2 | 38.4 | 41.9 | 43.4 |
| RAdam | 69.8 | 71.8 | 72.6 | 36.6 | 36.5 | 36.0 | 38.2 | 38.4 | 41.6 | 43.3 |
| AdaBelief | 69.6 | 67.0 | 61.8 | 36.2 | 34.4 | 33.1 | 36.4 | 38.2 | 40.0 | 41.4 |
| AdaBound | 34.0 | 44.9 | 28.4 | 35.9 | 34.2 | 31.9 | 37.0 | 35.0 | 38.8 | 41.2 |
| AdaFactor | 72.8 | 71.7 | 72.8 | 35.6 | 37.0 | 36.4 | 38.5 | 37.8 | 40.5 | 43.1 |
| LARS | 54.4 | 63.4 | 47.6 | 35.8 | 28.9 | 28.8 | 34.7 | 36.9 | 34.6 | 40.5 |
| NovoGrad | 64.2 | 70.7 | 69.8 | 35.6 | 27.2 | 26.3 | 35.2 | 28.6 | 40.4 | 39.0 |
| Sophia | 69.7 | 71.6 | 72.3 | 36.4 | 35.8 | 35.3 | 38.0 | 38.7 | 40.4 | 42.5 |
| AdaGrad | 66.0 | 61.2 | 48.4 | 26.4 | 21.9 | 28.3 | 32.7 | 27.1 | 32.9 | |
| AdaDelta | 44.3 | 49.3 | 52.0 | 34.9 | 32.7 | 32.7 | 35.9 | 33.9 | 40.0 | |
| RMSProp | 68.8 | 71.6 | 72.5 | 35.3 | 36.2 | 35.6 | 37.8 | 38.3 | 41.5 | 43.1 |

ImageNet-1K Benchmark

Top-1 accuracy (%) of DeiT-S and ResNet-50 trained for 300 epochs with popular optimizers, using the DeiT and RSB A2 training recipes on ImageNet-1K.

| Optimizer | DeiT-S (DeiT) | R-50 (A2) |
|---|---|---|
| SGD-M | 75.35 | 78.82 |
| SGDP | 76.34 | 78.02 |
| LION | 78.78 | 78.92 |
| Adam | 78.44 | 78.16 |
| Adamax | 77.71 | 78.05 |
| AdamP | 79.26 | 79.28 |
| AdamW | 80.38 | 79.88 |
| Adan | 80.81 | 79.91 |
| LAMB | 80.23 | 79.84 |
| NAdam | 78.26 | 78.97 |
| RAdam | 78.54 | 78.75 |
| AdaBelief | 75.32 | 78.25 |
| AdaBound | 72.96 | 75.37 |
| AdaFactor | 79.98 | 79.36 |
| LARS | 73.18 | 79.66 |
| NovoGrad | 71.26 | 76.83 |
| Sophia | 79.65 | 79.13 |
| AdaGrad | 54.96 | 74.92 |
| AdaDelta | 74.14 | 77.40 |
| RMSProp | 78.03 | 78.04 |

Implementation Details (Vision Backbones)

Three categories of typical vision backbones proposed in the past decade.

| Backbone | Date | Stage-wise design | Block-wise design | Operator (token mixer) | Resolution | Training setting |
|---|---|---|---|---|---|---|
| AlexNet | NIPS'2012 | - | - | Conv | 224 | PyTorch |
| VGG-13 | ICLR'2015 | - | - | Conv | 224 | PyTorch |
| ResNet-50/101 | CVPR'2016 | Hierarchical | Bottleneck | Conv | 32 | PyTorch |
| ResNet-101 | CVPR'2016 | Hierarchical | Bottleneck | Conv | 32 | DeiT |
| MobileNet.V2 | CVPR'2018 | Hierarchical | Inv-bottleneck | Conv | 224 | PyTorch |
| EfficientNet-B0 | ICML'2019 | Hierarchical | Inv-bottleneck | Conv & SE | 224 | RSB A2 |
| DeiT-S (ViT) | ICML'2021 | Patchify & Isotropic | MetaFormer | Attention | 224 | DeiT |
| MLP-Mixer-S | NIPS'2021 | Patchify & Isotropic | MetaFormer | MLP | 224 | DeiT |
| Swin-T | ICCV'2021 | Patchify & Hierarchical | MetaFormer | Attention | 224 | ConvNeXt |
| ConvNeXt-T | CVPR'2022 | Patchify & Hierarchical | MetaFormer | Conv | 32 | ConvNeXt |
| MogaNet-S | ICLR'2024 | Patchify & Hierarchical | MetaFormer | Conv & Gating | 32 | ConvNeXt |
| IdentityFormer-S12 | TPAMI'2024 | Patchify & Hierarchical | MetaFormer | Identity | 224 | RSB A2 |
| PoolFormerV2-S12 | TPAMI'2024 | Patchify & Hierarchical | MetaFormer | Pooling | 224 | RSB A2 |
| ConvFormer-S12 | TPAMI'2024 | Patchify & Hierarchical | MetaFormer | Conv | 224 | RSB A2 |
| AttentionFormer-S12 | TPAMI'2024 | Patchify & Hierarchical | MetaFormer | Attention | 224 | RSB A2 |

Implementation Details (Optimizer)

Four categories of typical optimizers and their components. From top to bottom: (a) fixed learning rate with momentum gradient; (b) adaptive learning rate with momentum gradient; (c) estimated learning rate with momentum gradient; (d) adaptive learning rate with current gradient.

| Optimizer | Date | Learning rate | Gradient | Weight decay |
|---|---|---|---|---|
| SGD-M | TSMC'1971 | Fixed lr | Momentum | - |
| SGDP | ICLR'2021 | Fixed lr | Momentum | Decoupled |
| LION | NIPS'2023 | Fixed lr | Sign momentum | Decoupled |
| Adam | ICLR'2015 | Estimated second moment | Momentum | - |
| Adamax | ICLR'2015 | Estimated second moment | Momentum | - |
| AdamW | ICLR'2019 | Estimated second moment | Momentum | Decoupled |
| AdamP | ICLR'2021 | Estimated second moment | Momentum | Decoupled |
| LAMB | ICLR'2020 | Estimated second moment | Momentum | Decoupled |
| NAdam | ICLR'2018 | Estimated second moment | Nesterov momentum | - |
| RAdam | ICLR'2020 | Estimated second moment | Momentum | Decoupled |
| Adan | TPAMI'2023 | Estimated second moment | Nesterov momentum | Decoupled |
| AdaBelief | NIPS'2020 | Estimated second-moment variance | Momentum | Decoupled |
| AdaBound | ICLR'2019 | Estimated second moment | Momentum | Decoupled |
| AdaFactor | ICML'2018 | Estimated second moment (factorized) | Momentum | Decoupled |
| LARS | ICLR'2018 | L2-norm of gradient | Momentum | Decoupled |
| NovoGrad | arXiv'2019 | Layer-wise sum of estimated second moment | Momentum | Decoupled |
| Sophia | arXiv'2023 | Parameter-based estimator | Sign momentum | Decoupled |
| AdaGrad | JMLR'2011 | Accumulated second moment | Gradient | - |
| AdaDelta | arXiv'2012 | Estimated second moments of gradients and updates | Gradient | - |
| RMSProp | arXiv'2012 | Estimated second moment | Gradient | - |
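Many optimizers in the table differ only in whether weight decay is decoupled from the gradient step. A minimal sketch of the two forms (function names here are hypothetical, for illustration):

```python
def coupled_step(theta, grad, lr=0.1, wd=0.01):
    """L2 regularization: decay is folded into the gradient."""
    return theta - lr * (grad + wd * theta)

def decoupled_step(theta, grad, lr=0.1, wd=0.01):
    """AdamW-style decoupled decay: shrink the weights directly,
    independent of the gradient-based step."""
    return theta * (1 - lr * wd) - lr * grad

def coupled_adaptive_step(theta, grad, v, lr=0.1, wd=0.01, eps=1e-8):
    """With an adaptive preconditioner, coupled decay is rescaled by
    1/sqrt(v), so small second moments amplify the effective decay."""
    g = grad + wd * theta
    return theta - lr * g / (v ** 0.5 + eps)

def decoupled_adaptive_step(theta, grad, v, lr=0.1, wd=0.01, eps=1e-8):
    """Decoupled decay is applied unscaled, regardless of v."""
    return theta - lr * grad / (v ** 0.5 + eps) - lr * wd * theta
```

For plain SGD the two forms coincide; they diverge once an adaptive learning-rate term rescales the gradient, which is why the "Decoupled" column matters mostly for the adaptive categories.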

What is BOCB?

The backbone-optimizer coupling bias (BOCB) we observed during benchmarking arises from the intricate interplay between the design principles of vision backbones and the inherent properties of optimizers. In particular, traditional CNNs such as VGG and ResNet exhibit a marked coupling with SGD-family optimizers, whereas modern meta-architectures like ViTs and ConvNeXt correlate strongly with adaptive learning rate optimizers, particularly AdamW. As noted above, we assume this coupling bias stems from the increasing complexity of optimization as backbone architectures evolve. Concretely, recent backbones incorporate complex elements such as stage-wise hierarchical structures, advanced token mixers, and block-wise heterogeneous structures. These designs shape a more intricate and challenging optimization landscape, necessitating adaptive learning rate strategies and effective momentum handling. Modern backbones thus exhibit stronger couplings with optimizers that can navigate these complex landscapes. The BOCB phenomenon has several implications for the design and deployment of vision backbones:

  • User-Friendliness: Backbones with weaker coupling, typically traditional CNNs, offer greater flexibility as they can be effectively optimized with different optimizers. This makes them more user-friendly, especially for practitioners with limited resources for extensive hyper-parameter tuning. Conversely, modern architectures like ViTs and ConvNeXt, which exhibit strong coupling with adaptive learning rate optimizers, demand careful optimizer selection and meticulous hyper-parameter tuning for optimal performance.
  • Performance and Generalization: While weaker coupling offers more user-friendliness, stronger coupling can potentially lead to better performance and generalization. Tailoring the optimization process to certain architectural characteristics of modern backbones, such as stage-wise hierarchical structures and attention mechanisms for token mixing, can more effectively navigate complex optimization landscapes, unlocking superior performance and generalization capabilities.
  • Design Principles: The BOCB phenomenon highlights the need to consider the coupling between backbone designs and optimizer choices. When designing new backbone architectures, it is crucial to account for both the inductive bias introduced by the macro design principles (e.g., hierarchical structures, attention mechanisms) and the optimizer matching bias. A balanced approach that harmonizes the backbone design with the appropriate optimizer choice can lead to optimal performance and efficient optimization, enabling the full potential of the proposed architecture to be realized.
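The CIFAR-100 benchmark above suggests a simple, informal proxy for BOCB: the spread of Top-1 accuracy a backbone attains across optimizers. This is an illustrative measure only, not the paper's formal metric; the values below are copied from the CIFAR-100 table (a subset of optimizers):

```python
# Top-1 (%) on CIFAR-100, taken from the benchmark table above.
cifar100 = {
    "VGG-13": {"SGD-M": 77.08, "AdamW": 73.90, "AdaGrad": 71.29, "Adan": 74.90},
    "DeiT-S": {"SGD-M": 63.39, "AdamW": 72.15, "AdaGrad": 67.24, "Adan": 76.33},
    "CX-T":   {"SGD-M": 60.09, "AdamW": 83.52, "AdaGrad": 83.03, "Adan": 84.65},
}

def accuracy_range(scores):
    """Max-min Top-1 gap across optimizers: a larger gap indicates a
    stronger dependency on the optimizer choice (rough BOCB proxy)."""
    return max(scores.values()) - min(scores.values())

for name, scores in cifar100.items():
    print(f"{name}: spread = {accuracy_range(scores):.2f}")
```

On this subset, the classical VGG-13 shows a much smaller spread than DeiT-S or ConvNeXt-T, mirroring the coupling pattern discussed above.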

Where Does BOCB Come From?

To investigate the causes of BOCB, we first consider what matters most: optimizers or backbones. As shown in Figure 4 and Table 1, the four categories of optimizers show different extents of BOCB with vision backbones. Category (a) shows broader performance dispersion, necessitating meticulous hyper-parameter tuning for classical CNNs while demonstrating less adaptability to the optimization demands of advanced backbones. Categories (b) and (c) exhibit robust, hyper-parameter-insensitive performance peaks, adept at navigating the complex optimization landscapes of both primary CNNs and modern DNNs. Category (d) shows the worst performance and heavy BOCB. Meanwhile, the trajectory of vision backbone macro design has significantly sculpted the optimization landscape, progressing through distinct phases that reflect the intricate relationship between architectural complexity and training challenges.

  • Foundational Backbones: Primary CNNs like AlexNet and VGG established a fundamental paradigm in computer vision. These architectures featured a straightforward design of stacked convolutional and pooling layers, culminating in fully connected layers. This conventional paradigm was effective but set the stage for subsequent alterations of the optimization landscape.
  • Classical Backbone Advancements: The introduction of ResNet marked a pivotal shift towards stage-wise hierarchical designs, significantly enhancing feature extraction and representation learning. ResNet-50, in particular, demonstrated a well-balanced approach, exhibiting strong compatibility with SGD optimizers and relatively low BOCB compared to its contemporaries.
  • Modern Backbone Evolution: The transition to modern DNN backbones, such as ConvNeXt and MogaNet, introduced complex block-wise heterogeneous structures, increasing the optimization challenge and the degree of BOCB due to their sophisticated feature extraction mechanisms. Representing a pinnacle of this evolution, the MetaFormer architecture incorporates both stage-wise and block-wise heterogeneity into its design. This macro design refines the optimization landscape by harmonizing with optimizers, leading to reduced BOCB and enhanced performance.

For details, please refer to the full paper.

BibTeX


@article{li2024battle,
  title  = {A Decade's Battle on Bias of Visual Backbone and Optimizer},
  author = {Siyuan Li and Juanxi Tian and Zedong Wang and Luyuan Zhang and Zicheng Liu and Cheng Tan and Weiyang Jin and Lei Xin and Yang Liu and Baigui Sun and Stan Z. Li},
  year   = {2024},
}

Contribution

The main contributors are:

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


© 2024 BOCB. All rights reserved.