Omurice's Memorandum

Notes on mathematics, statistics, machine learning, and programming.

[Image Processing] Transformer #Summary

Applications to Images

This post summarizes methods that apply the Transformer to images.

Challenges

  • Converting 2-D image information into a 1-D sequence (see the sketch after this list)
  • The computational cost of self-attention (quadratic in the number of tokens)
  • Capturing local features
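
As a concrete illustration of the first point, here is a minimal sketch of flattening a 2-D image into a 1-D sequence of patch tokens, ViT-style (the function name and the 16-pixel patch size are illustrative choices):

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Flatten a batch of 2-D images into a 1-D sequence of patch tokens.

    images: (B, C, H, W) -> tokens: (B, N, patch_size * patch_size * C),
    where N = (H // patch_size) * (W // patch_size).
    """
    B, C, H, W = images.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be divisible by patch size"
    # (B, C, H/p, p, W/p, p) -> (B, H/p, W/p, C, p, p) -> (B, N, p*p*C)
    x = images.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // p) * (W // p), C * p * p)
    return x

tokens = patchify(torch.randn(1, 3, 224, 224))  # -> (1, 196, 768)
```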

Algorithms

Image Transformer / 2018

Set Transformer / 2018

  • Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

Axial Transformer / 2019

  • Axial Attention in Multidimensional Transformers

Vision Transformer / ViT / 2020 ★

DeepViT / 2021

Proposes Re-Attention: the attention maps of the individual heads are re-mixed with a learnable transformation matrix to counter attention collapse in deeper ViTs; a sketch follows.
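
A minimal sketch of Re-Attention under this formulation: a learnable (H × H) matrix theta re-mixes the per-head attention maps before they are applied to the values (the normalization the paper applies after mixing is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def re_attention(q, k, v, theta):
    """Re-Attention, a minimal sketch.

    q, k, v: (B, H, N, d) per-head queries / keys / values.
    theta:   (H, H) learnable matrix that mixes attention maps across heads.
    """
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, H, N, N)
    # Re-mix the per-head attention maps across the head dimension.
    attn = torch.einsum("hg,bgnm->bhnm", theta, attn)
    return attn @ v  # (B, H, N, d)
```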

DeiT / 2021 ★

Swin Transformer / 2021 ★

MetaFormer / PoolFormer / 2021

MLP-Mixer / 2021

ConvMixer / 2022

Pyramid Vision Transformer / 2021

  • Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

ToMe / 2022

Sequencer / 2022

Deformable Attention Transformer / DAT / 2022

U-ViT / 2022

DiT / 2022

Fourier Learner-Transformers / FLT / 2023

  • Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers

B-cos ViT / 2023

Reversible Vision Transformers / 2023

Applies the invertible NICE coupling (T(x, y) = (x, y + f(x))) to the self-attention and MLP sub-blocks of ViT.

In particular, when the number of tokens is unchanged, the entire network becomes invertible, so activations do not need to be stored during training.

Memory usage drops to roughly 1/15 with almost no loss in accuracy.
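
A minimal sketch of the resulting two-stream reversible block; `f` and `g` stand in for the (normalized) attention and MLP sub-blocks, and the class name is illustrative:

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-stream reversible ViT block built from the NICE coupling.

    Each sub-function only ever *adds* to one stream, so the input can be
    reconstructed exactly from the output and intermediate activations do
    not need to be stored for the backward pass.
    """

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # e.g. LayerNorm + self-attention
        self.g = g  # e.g. LayerNorm + MLP

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)   # T(x1, x2) = (x1 + f(x2), x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)   # recompute instead of storing activations
        x1 = y1 - self.f(x2)
        return x1, x2
```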

ViT-22B / 2023

Energy Transformer / 2023

Visual Atoms / 2023

StraIT / 2023

Image generation.

  • StraIT: Non-autoregressive Generation with Stratified Image Transformer

GHN-3 / 2023

  • Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?

ElasticViT / 2023

  • ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

FastViT / 2023

  • FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

SparseViT / 2023

  • SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Mixed-Resolution ViT / 2023

SparseFormer / 2023

  • SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

Slide-Transformer / 2023

  • Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

AutoTaskFormer / 2023

  • AutoTaskFormer: Searching Vision Transformers for Multi-task Learning

Tricks & Techniques

Dual PatchNorm / 2023

Inserts Layer Normalization (LN) before and after the patch embedding of ViT, improving accuracy.

Arguing that ViT's original pre-LN placement is already near-optimal, the paper proposes inserting LN around the patch embedding rather than inside the Transformer blocks.
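
A minimal sketch of the proposed layout, assuming LN is applied once to the raw patch vectors and once to the embedded tokens (class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class DualPatchNormEmbedding(nn.Module):
    """Patch embedding with LayerNorm before and after (Dual PatchNorm)."""

    def __init__(self, patch_size: int = 16, in_chans: int = 3, dim: int = 768):
        super().__init__()
        patch_dim = in_chans * patch_size * patch_size
        self.patch_size = patch_size
        self.pre_norm = nn.LayerNorm(patch_dim)   # LN on the raw patch pixels
        self.proj = nn.Linear(patch_dim, dim)     # the patch embedding itself
        self.post_norm = nn.LayerNorm(dim)        # LN on the embedded tokens

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        B, C, H, W = images.shape
        p = self.patch_size
        # Flatten non-overlapping patches into (B, N, C * p * p).
        x = images.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * p * p)
        return self.post_norm(self.proj(self.pre_norm(x)))
```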

Multi-Head Self-Attention

  • SpectFormer: Frequency and Attention is what you need in a Vision Transformer
    • [2023]
    • 2 Related Work
      • Quadratic Complexity of Attention Nets

ViT

DeiT

Spectral Layers

  • SpectFormer: Frequency and Attention is what you need in a Vision Transformer
    • [2023]
    • 2 Related Work
      • Spectral Transformers

FNet / 2021

  • FNet: Mixing Tokens with Fourier Transforms

GFNet / 2021

  • Global Filter Networks for Image Classification
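
As a concrete example of a spectral token mixer, here is a minimal sketch of a GFNet-style global filter layer: tokens laid out on their 2-D spatial grid are mixed by an elementwise product with a learnable complex filter in the frequency domain (names and initialization are illustrative):

```python
import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """GFNet-style global filter: x -> IFFT2(FFT2(x) * K) with learnable K."""

    def __init__(self, h: int = 14, w: int = 14, dim: int = 768):
        super().__init__()
        # Learnable complex filter stored as (real, imag) pairs; rfft2 keeps
        # only w // 2 + 1 frequencies along the last spatial axis.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) tokens arranged on their 2-D spatial grid.
        X = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        X = X * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(X, s=x.shape[1:3], dim=(1, 2), norm="ortho")
```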

AFNO / 2021

  • Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers

SpectFormer / 2023

  • SpectFormer: Frequency and Attention is what you need in a Vision Transformer

Applications to Other Domains

Applications to Video

3D

3D Patch / 2021

References

Books

Websites