Index

MDETR
Architecture
Training / Loss
- Contrastive Alignment Loss
  - Object Part
  - Text Part
- Total Loss
References
- Websites
- Post

MDETR

Modulated DETR

DETR #Summary
- yhayato1320.hatenablog.com
Phrase Grounding
- yhayato1320.hatenablog.com

Architecture

MDETR concatenates text vectors onto the input of DETR.

Two Encoders

The image and the text are encoded separately. The text encoder uses a pretrained model.

Concat

The text and image feature vectors are concatenated.

The concatenated sequence is then fed into the Encoder.
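
The concatenation step can be sketched in plain Python as follows. This is only an illustration under assumptions not stated above: both modalities are taken to have already been mapped to the same feature dimension `d`, the image feature map is flattened into a sequence of `H * W` vectors, and the join happens along the sequence axis. The names `flatten_feature_map` and `concat_sequences` and all sizes are hypothetical.

```python
# Minimal sketch: join image and text features along the sequence axis.
# All names and sizes here are illustrative assumptions.

def flatten_feature_map(feature_map):
    """Flatten an H x W grid of d-dim vectors into a list of H*W vectors."""
    return [vec for row in feature_map for vec in row]

def concat_sequences(image_tokens, text_tokens):
    """Concatenate two token sequences that share one feature dimension."""
    dims = {len(v) for v in image_tokens + text_tokens}
    if len(dims) != 1:
        raise ValueError("image and text features must share one dimension")
    return image_tokens + text_tokens

d = 4              # shared feature dimension (assumed)
H, W, L = 2, 3, 5  # feature-map height/width and text length (assumed)
image_map = [[[0.1] * d for _ in range(W)] for _ in range(H)]
text_tokens = [[0.2] * d for _ in range(L)]

encoder_input = concat_sequences(flatten_feature_map(image_map), text_tokens)
print(len(encoder_input), len(encoder_input[0]))  # 11 4
```

The resulting sequence has `H * W + L` entries, so the Encoder sees image positions and text tokens side by side.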

Input to DETR

From this point on, the structure is the same as DETR.

Training / Loss

Contrastive Alignment Loss

$\text{Contrastive Alignment Loss}\ =\ \displaystyle \frac{l_{o}\ +\ l_{t}}{2}$

Object Part

$l_{o}\ =\ \displaystyle \sum_{i=0}^{N-1}\ \frac{1}{|T_{i}^{+}|}\ \sum_{j\ \in\ T_{i}^{+}}\ -\ \log\ \left( \frac{\exp(o_{i}^{T}\ t_{j} / \tau)}{\displaystyle \sum_{k=0}^{L-1}\ \exp(o_{i}^{T}\ t_{k} / \tau)} \right)$

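$N$ : number of Objects
$i$ : Object index
$T_{i}^{+}$ : set of Text tokens that Object $i$ is compared (aligned) with
$j$ : Text index
$o_{i}$, $t_{j}$ : embeddings of Object $i$ and Text token $j$
$\tau$ : temperature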
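
As a small worked example with assumed toy values (not taken from the paper), let $N = 1$, $L = 2$, $T_{0}^{+} = \{0\}$, $\tau = 1$, $o_{0}^{T}\ t_{0} = 1$ and $o_{0}^{T}\ t_{1} = 0$. The object part then reduces to a single softmax term:

$l_{o}\ =\ \displaystyle -\ \log\ \left( \frac{\exp(1)}{\exp(1)\ +\ \exp(0)} \right)\ =\ \log\ \left( 1\ +\ e^{-1} \right)\ \approx\ 0.313$

The loss shrinks as the similarity to the positive token grows relative to the other tokens.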

Text Part

$l_{t}\ =\ \displaystyle \sum_{i=0}^{L-1}\ \frac{1}{|O_{i}^{+}|}\ \sum_{j\ \in\ O_{i}^{+}}\ -\ \log\ \left( \frac{\exp(t_{i}^{T}\ o_{j} / \tau)}{\displaystyle \sum_{k=0}^{N-1}\ \exp(t_{i}^{T}\ o_{k} / \tau)} \right)$

$L$ : number of Text tokens
$i$ : Text index
$O_{i}^{+}$ : set of Objects that Text token $i$ is compared (aligned) with
$j$ : Object index
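
The two parts are symmetric, so both can be computed with one helper. The following is a minimal pure-Python sketch, not the authors' implementation. The function names, the toy embeddings, the value of `tau`, the rule for skipping entries with an empty positive set, and the way $O_{i}^{+}$ is obtained by inverting the object-to-token map are all assumptions made for illustration.

```python
import math

def directional_loss(queries, keys, positives, tau):
    """Sum over queries of the mean -log softmax over their positive keys.

    queries, keys: lists of float vectors with the same dimension
    positives[i]:  indices of the keys that query i should align with
    """
    total = 0.0
    for i, q in enumerate(queries):
        if not positives[i]:
            continue  # assumption: skip entries with no positives
        logits = [sum(a * b for a, b in zip(q, k)) / tau for k in keys]
        m = max(logits)  # log-sum-exp trick for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += sum(log_denom - logits[j] for j in positives[i]) / len(positives[i])
    return total

def contrastive_alignment_loss(objects, tokens, obj_to_tok, tau):
    # Assumption: token j aligns with exactly the objects whose set contains j.
    tok_to_obj = [[i for i, ts in enumerate(obj_to_tok) if j in ts]
                  for j in range(len(tokens))]
    l_o = directional_loss(objects, tokens, obj_to_tok, tau)
    l_t = directional_loss(tokens, objects, tok_to_obj, tau)
    return (l_o + l_t) / 2

# Toy data: 2 objects, 3 text tokens, 2-dim embeddings (all assumed).
objects = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
obj_to_tok = [[0], [1, 2]]  # T_i^+ for each object

loss = contrastive_alignment_loss(objects, tokens, obj_to_tok, tau=1.0)
print(loss)
```

Working in log space (`log_denom - logits[j]`) gives the same value as the formulas above while avoiding overflow when `tau` is small.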

Total Loss

Box Loss (L1 Loss + GIoU Loss) + Soft-Token Loss + Contrastive Alignment Loss
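
The box term can be illustrated with a short pure-Python sketch for one predicted box and one target box. It assumes boxes in `(x1, y1, x2, y2)` format with positive area, takes the GIoU loss to be `1 - GIoU`, and adds all terms without weights, following the formula above. The weights and box format used in practice may differ, and the soft-token and contrastive terms are passed in as precomputed numbers.

```python
def l1_loss(pred, target):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both a and b.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose

def box_loss(pred, target):
    """L1 loss plus GIoU loss (1 - GIoU), unweighted."""
    return l1_loss(pred, target) + (1.0 - giou(pred, target))

def total_loss(box, soft_token, contrastive):
    """Unweighted sum of the three terms (weights are an open assumption)."""
    return box + soft_token + contrastive

pred = (0.0, 0.0, 2.0, 2.0)
target = (1.0, 1.0, 3.0, 3.0)
print(box_loss(pred, target))
```

For identical boxes the GIoU is 1 and the box loss is 0. For disjoint boxes the GIoU becomes negative, so the loss still tells the model how far apart the boxes are.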

References

MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
- [2021]
- 2 Method
  - 2.2 MDETR
    - 2.2.1 Architecture
    - 2.2.2 Training
      - Soft token prediction
      - Contrastive alignment
      - Combining all the losses
- arxiv.org

Websites

About MDETR (MDETRについて)
- zenn.dev

Post

https://t.co/rVnDkUPwxz
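Proposes MDETR, which is trained on text and images so that objects can be detected from arbitrary text that refers to them. Built on DETR, it is trained so that each predicted object is aligned with the position of its corresponding text. Detection works with arbitrary text such as "a pink elephant". pic.twitter.com/gMDzzxtrD1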
— akira (@AkiraTOSEI) August 18, 2021


[Deep Learning] MDETR