Index
Swin Transformer
- Challenges in adapting to images (Vision)
Network architecture
Swin Transformer Block
References
- Web sites

Swin Transformer

A method that adapts the Transformer to image (Vision) tasks.

Shifted Window

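Challenges in adapting to images (Vision)

The challenge in carrying the Transformer over from natural language processing to image processing is that
the input changes from words to image patches, so the amount of input (the number of tokens) becomes much larger.

To address this problem, Swin Transformer introduces the Shifted Window.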

Network architecture

The network consists of the following four components.

Patch Partition
Linear Embedding
Swin Transformer Block
Patch Merging

Hierarchical Feature Map

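Compared with Vision Transformer (ViT), Swin Transformer introduces a hierarchical structure like that of FPN / U-Net,
aiming to improve accuracy on tasks other than Classification where local features matter (Detection etc).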
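
The hierarchy can be illustrated by tracking the feature-map shape stage by stage. The sketch below is only an illustration under assumptions that do not appear in this article: a 224 x 224 input, $C = 96$, four stages, and a Patch Merging step that halves the resolution and doubles the channels between stages. The function name `stage_shapes` is hypothetical.

```python
def stage_shapes(height, width, patch=4, embed_dim=96, num_stages=4):
    """Return (patches_high, patches_wide, channels) for each stage.

    All default values are assumptions made for this illustration.
    """
    h, w, c = height // patch, width // patch, embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch Merging: halve the resolution, double the channels.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes


for i, shape in enumerate(stage_shapes(224, 224), start=1):
    print(f"stage {i}: {shape}")
# stage 1: (56, 56, 96)
# stage 2: (28, 28, 192)
# stage 3: (14, 14, 384)
# stage 4: (7, 7, 768)
```

The early stages keep a fine grid of patches (useful for local detail), while the later stages hold coarse, high-channel features, much like the levels of an FPN.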

Vision Transformer
- yhayato1320.hatenablog.com
Feature Pyramid Network
- yhayato1320.hatenablog.com
U-Net
- yhayato1320.hatenablog.com

Patch Partition

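As in ViT, the image is split into patches.

Each patch corresponds to a token in NLP.

In the paper's implementation, a patch is $4\ \times\ 4$ pixels, so for an RGB image each patch holds $4\ \times\ 4\ \times 3\ =\ 48$ values.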
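
The partition step can be sketched with NumPy reshapes. This is a minimal illustration, not the paper's code: the function name `patch_partition` and the 224 x 224 input size are assumptions.

```python
import numpy as np


def patch_partition(image, patch=4):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, C)
    return x.reshape(h // patch, w // patch, patch * patch * c)


image = np.random.rand(224, 224, 3)  # assumed input size
tokens = patch_partition(image)
print(tokens.shape)  # (56, 56, 48)
```

Each of the 56 x 56 entries is one token whose 48 values are the raw RGB pixels of a $4\ \times\ 4$ patch.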

Linear Embedding

Each patch produced by the partition is converted into a $C$-dimensional vector.
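
As a rough sketch, the embedding is a single matrix multiplication applied to every patch. The value $C = 96$, the random weights, and the name `linear_embedding` are assumptions for illustration; in a real model the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear_embedding(tokens, weight, bias):
    """Project each flattened patch (last axis) to C dimensions."""
    return tokens @ weight + bias


patch_dim, C = 4 * 4 * 3, 96  # C = 96 is an assumed value
tokens = rng.random((56, 56, patch_dim))  # stand-in for partitioned patches
weight = rng.normal(scale=0.02, size=(patch_dim, C))  # learned in practice
bias = np.zeros(C)

embedded = linear_embedding(tokens, weight, bias)
print(embedded.shape)  # (56, 56, 96)
```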

Swin Transformer Block

The block that performs the Self Attention computation (details are given later).

Patch Merging

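To build a hierarchical representation, patches are merged as the network gets deeper,
so the number of patches decreases and the number of channels increases.

For example, the first merge concatenates each group of $2\ \times\ 2$ neighboring patches into a $4C$-dimensional vector (in the paper, a linear layer then reduces it to $2C$ dimensions).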
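
A minimal sketch of one merge step follows. It assumes the $2C$ output width used in the paper, random weights in place of learned ones, and $C = 96$; the name `patch_merging` is hypothetical, and any normalization applied around the projection is left out.

```python
import numpy as np


def patch_merging(x, weight):
    """Concatenate each 2x2 group of neighboring patches, then project."""
    h, w, c = x.shape
    assert h % 2 == 0 and w % 2 == 0, "h and w must be even"
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )  # (h/2, w/2, 4C)
    return merged @ weight  # (h/2, w/2, out_dim)


rng = np.random.default_rng(0)
C = 96  # assumed value
x = rng.random((56, 56, C))
weight = rng.normal(scale=0.02, size=(4 * C, 2 * C))  # learned in practice
print(patch_merging(x, weight).shape)  # (28, 28, 192)
```

The patch grid shrinks from 56 x 56 to 28 x 28 while the channel count doubles, which is the trade described above.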

Swin Transformer Block

The block improves on the Multi Head Self Attention (MSA) used in ViT.

Normalization

Layer Normalization is applied before each main operation (Pre Norm).
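
The Pre Norm ordering can be sketched as follows. The residual connections around each sub-layer are an assumption taken from the usual Transformer block layout, the Layer Normalization here omits its learned scale and shift, and `attention` / `mlp` are placeholder callables rather than real layers.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_norm_block(x, attention, mlp):
    """Normalize first, then apply each sub-layer with a residual add."""
    x = x + attention(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x


tokens = np.random.rand(49, 96)
# Placeholder sub-layers, chosen only so the example runs.
out = pre_norm_block(tokens, attention=lambda t: 0.5 * t, mlp=np.tanh)
print(out.shape)  # (49, 96)
```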

Pre Norm
- yhayato1320.hatenablog.com
Layer Normalization
- yhayato1320.hatenablog.com

MLP Layer

The MLP uses GELU as its nonlinear activation function.
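
For reference, GELU multiplies the input by the standard normal CDF, $x\ \Phi(x)$. A small sketch of the exact form using only the standard library (the function name `gelu` is chosen here for illustration):

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for v in (-2.0, 0.0, 2.0):
    print(v, round(gelu(v), 4))
```

Unlike ReLU, GELU is smooth and lets small negative values pass through slightly attenuated instead of cutting them to zero.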

GELU
- yhayato1320.hatenablog.com

W-MSA / SW-MSA

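As mentioned at the beginning, this part addresses the problem that arises when applying the Transformer to images
(more precisely, the problem that arises when the input becomes large).

The problem comes from how Self Attention works: the computational cost grows quadratically with the input size
(each patch (token), acting as a Query, has to be compared against the Keys / Values of all tokens).

Swin Transformer handles this by arranging the input so that attention can be computed efficiently.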

W-MSA : Window-based Multi-head Self-Attention
SW-MSA : Shifted Window-based Multi-head Self-Attention

Introducing windows

Patches are grouped together into units called Windows.

For example, suppose an image consists of $h\ \times\ w$ patches.

These patches are grouped into windows so that each window contains $M\ \times\ M$ patches.
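
The grouping can be sketched as a reshape over the patch grid, with each window holding $M\ \times\ M$ patches as in the paper. The sizes ($h = w = 56$, $C = 96$, $M = 7$) and the name `window_partition` are assumptions for illustration.

```python
import numpy as np


def window_partition(x, M):
    """Split an (h, w, C) patch grid into windows of M x M patches.

    Returns an array of shape (num_windows, M * M, C).
    """
    h, w, c = x.shape
    assert h % M == 0 and w % M == 0, "h and w must be multiples of M"
    x = x.reshape(h // M, M, w // M, M, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/M, w/M, M, M, C)
    return x.reshape(-1, M * M, c)


grid = np.random.rand(56, 56, 96)  # assumed sizes
windows = window_partition(grid, M=7)
print(windows.shape)  # (64, 49, 96)
```

Self Attention is then run separately on each of the 64 windows, so every token attends to 49 tokens instead of all 3136.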

Because Self Attention is computed only within each window, the amount of computation is reduced.
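
The saving can be made concrete with the complexity estimates given in the paper for an $h\ \times\ w$ patch grid with $C$ channels and windows of $M\ \times\ M$ patches: $4hwC^{2} + 2(hw)^{2}C$ for global MSA and $4hwC^{2} + 2M^{2}hwC$ for W-MSA. The sizes below are assumed values, and the function names are hypothetical.

```python
def msa_cost(h, w, C):
    """Global MSA: 4hwC^2 + 2(hw)^2 C (quadratic in the number of patches)."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_cost(h, w, C, M):
    """Window MSA: 4hwC^2 + 2 M^2 hw C (linear in the number of patches)."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


h, w, C, M = 56, 56, 96, 7  # assumed sizes
print(msa_cost(h, w, C) / wmsa_cost(h, w, C, M))  # about 13.8

# Doubling the grid side: W-MSA cost grows 4x, global MSA by much more.
print(wmsa_cost(2 * h, 2 * w, C, M) / wmsa_cost(h, w, C, M))  # 4.0
print(msa_cost(2 * h, 2 * w, C) / msa_cost(h, w, C))
```

With $M$ fixed, the window term depends on $hw$ only linearly, which is what keeps high-resolution inputs tractable.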

W-MSA : Window-based Multi-head Self-Attention

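The Shifted Window idea

W-MSA has a problem of its own: it carries no information about connections between windows.

SW-MSA is introduced to address this,
and W-MSA and SW-MSA are arranged alternately in the Swin Transformer Blocks.

SW-MSA builds its windows with a partition that differs from (is shifted relative to) the one used by W-MSA.

In the figure above, the windows fed to SW-MSA are created by shifting the windows created for W-MSA by $\frac{M}{2}$ patches.

Introducing SW-MSA gives the model "information about connections between windows".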
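
To see why the shift connects windows, the sketch below labels every patch with the window it belongs to under the regular partition and under a partition shifted by $\frac{M}{2}$. The shift is modeled as a cyclic roll of the grid, which is an assumed simplification: the masking of patches that wrap around the border is omitted, and the small sizes ($h = w = 8$, $M = 4$) and the name `window_ids` are chosen only for illustration.

```python
import numpy as np


def window_ids(h, w, M, shift=0):
    """Label each patch with the index of the window it falls into.

    A non-zero shift is equivalent to cyclically rolling the grid by
    -shift along both axes and then partitioning it regularly.
    """
    rows = (np.arange(h) - shift) % h // M
    cols = (np.arange(w) - shift) % w // M
    return rows[:, None] * (w // M) + cols[None, :]


h = w = 8
M = 4
regular = window_ids(h, w, M)                # W-MSA partition
shifted = window_ids(h, w, M, shift=M // 2)  # SW-MSA partition

# Regular windows that contribute patches to shifted window 0
# (the window covering rows 2..5 and columns 2..5).
members = regular[shifted == 0]
print(sorted(set(members.tolist())))  # [0, 1, 2, 3]
```

The shifted window straddles all four regular windows, so attention inside it mixes information that W-MSA alone keeps separate.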

Relative Position Bias

For the Self Attention score function, a Bias is added to the Scaled Dot Product Attention computation.

Scaled Dot Product Attention
- yhayato1320.hatenablog.com

$Attention(Q,\ K,\ V)\ =\ Softmax \left( \displaystyle \frac{Q\ K^{T}}{\sqrt{d}} \right) V$

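The above is the Scaled Dot Product Attention used as the score function in Transformer / Vision Transformer.

A bias $B\ \in\ R^{M^{2}\ \times\ M^{2}}$, whose values are taken from a learnable table of size $(2M-1)\ \times\ (2M-1)$ (one entry per relative offset between two patches in a window), is added as follows.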

$Attention(Q,\ K,\ V)\ =\ Softmax \left( \displaystyle \frac{Q\ K^{T}}{\sqrt{d}}\ +\ B \right) V$
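
The sketch below shows one way the bias can be built and applied for a single window and a single head. The learnable values live in a table with $(2M-1)\ \times\ (2M-1)$ entries, one per relative offset; indexing that table for every (Query, Key) pair yields the $M^{2}\ \times\ M^{2}$ matrix that is added to $Q\ K^{T} / \sqrt{d}$. The function names, the head dimension $d = 32$, $M = 7$, and the random initial values are assumptions for illustration.

```python
import numpy as np


def relative_position_index(M):
    """Map every (query, key) pair in an M x M window to an entry of
    the flattened (2M-1) x (2M-1) bias table."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    coords = coords.reshape(2, -1)                 # (2, M*M)
    rel = coords[:, :, None] - coords[:, None, :]  # offsets in [-(M-1), M-1]
    rel = rel + (M - 1)                            # shift to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]           # (M*M, M*M)


def window_attention(q, k, v, bias):
    """Softmax(Q K^T / sqrt(d) + B) V for one window and one head."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


M, d = 7, 32  # assumed sizes
rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(2 * M - 1) ** 2)  # learned in practice
B = table[relative_position_index(M)]                  # (M*M, M*M)
q, k, v = (rng.random((M * M, d)) for _ in range(3))
print(B.shape, window_attention(q, k, v, B).shape)  # (49, 49) (49, 32)
```

Because the bias depends only on the offset between two patches, the same table is shared by every window.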

References

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- [2021]
- Abstract
- 1 Introduction
- 3 Method
  - 3.1 Overall Architecture
  - 3.2 Shifted Window based Self-Attention
- arxiv.org

Web sites

SwinTransformerでCIFAR-10を一から訓練する (Training SwinTransformer on CIFAR-10 from scratch)
- blog.shikoan.com
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料 (Material for fully understanding Swin Transformer (ICCV'21 Best Paper))
- Yusuke Uchida
  www.slideshare.net

オムライスの備忘録 (Omurice's Memorandum)

Notes on mathematics, statistics, machine learning, and programming

[Deep Learning] Swin Transformer #Algorithm
