2022-05-19

[Deep Learning] GLIP #Implementation #01


Index

GLIP
Preliminaries
Implementation
- Execution

GLIP

GLIP introduces pre-training to enrich how labels are represented in object detection.

Grounded Language-Image Pre-training

GLIP #Algorithm
- yhayato1320.hatenablog.com

Preliminaries

Task

Object detection from images.

Object Detection #Summary
- yhayato1320.hatenablog.com

We also check how flexibly the label set can be changed.

Dataset

We use the COCO dataset.

COCO
- yhayato1320.hatenablog.com

Implementation

Google Colab (2022/05/19)
Python 3.7.13

Execution

Setting up the environment takes a while.

2022-05-18

[Deep Learning] BLIP


Index

BLIP
Applications
- InstructBLIP / 2023
References
- Video

BLIP

Applications

InstructBLIP / 2023

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- [2023]
- arxiv.org

References

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- [2022]
- arxiv.org

Video

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding&Generation
- www.youtube.com

2022-05-17

[Deep Learning] GLIP #Algorithm


Index

GLIP
Framework
- Formulation
  - Object Detection
  - Object Detection + Phrase Grounding
- Deep Fusion
Training Datasets
Implementation
References
- Websites

GLIP

GLIP introduces pre-training to enrich how labels are represented in object detection.

Grounded Language-Image Pre-training

Natural Language Supervision

Image and object recognition algorithms usually fix the label set before training. CLIP instead introduced Natural Language Supervision, which uses raw natural language as the supervision signal.

CLIP
- yhayato1320.hatenablog.com

Phrase Grounding

Phrase Grounding is a multimodal task that estimates which region of an image each phrase or word in a sentence corresponds to.

Phrase Grounding
- yhayato1320.hatenablog.com

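GLIP applies Phrase Grounding to Object Detection, the task of identifying both the category and the location of objects.

Fusing Phrase Grounding and Object Detection

So what does the framework look like?

The inputs are an image and text.

The text input comes in two forms:

Label information
Text

Object regions are detected in the image, and features are extracted for each region.

Features are likewise extracted from the input text.

The relatedness between the two sets of features is then expressed as a dot product of vectors.

Unlike CLIP, the intermediate outputs of the two encoders are also tied together.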

Framework

Let's look at this mechanism in more detail.

Formulation

Let's write it out as equations.

Object Detection

First, plain Object Detection.

The object detection loss typically has the following form.

$L\ =\ L_{cls}\ +\ L_{loc}$

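Detecting objects from the image.

$O\ =\ Enc_{I}\ (Img)$

$Img$ : input image
$Enc_{I}$ : Image Encoder (e.g. a Vision Transformer)
$O$ : object detection output

Object features
$O\ \in\ R^{N\ \times\ d}$
$N$ : number of objects
$d$ : feature dimension

Compute the class scores for each object.

Two-stage case.

$S_{cls}\ = \ OW^{T}$

$W$ : weight parameters of the classifier $C$

$W\ \in\ R^{c\ \times\ d}$

$S_{cls}$ : classification scores (logits)

$S_{cls}\ \in\ R^{N\ \times\ c}$
$c$ : number of classes

Computing the classification loss.

$L_{cls}\ =\ loss(S_{cls};T)$

$T\ \in\ \{0,\ 1\}^{N\ \times\ c}$ : target labels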
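The classification step above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not GLIP's implementation: the helper names `matmul_transposed` and `bce_loss`, the toy sizes, and the choice of sigmoid binary cross-entropy for $loss$ are all assumptions made for the example.

```python
import math

def matmul_transposed(a, b):
    """Return a @ b^T for lists of row vectors (a: N x d, b: c x d -> N x c)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]

def bce_loss(scores, targets):
    """Mean sigmoid binary cross-entropy over all (object, class) pairs.

    Assumption: the post only writes loss(S; T); BCE is one common choice.
    """
    total, count = 0.0, 0
    for s_row, t_row in zip(scores, targets):
        for s, t in zip(s_row, t_row):
            p = 1.0 / (1.0 + math.exp(-s))  # sigmoid of the logit
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

# Toy example: N = 2 objects, d = 3 features, c = 2 classes.
O = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # object features, N x d
W = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]   # classifier weights, c x d
T = [[1, 0], [0, 1]]                     # one-hot targets, N x c

S_cls = matmul_transposed(O, W)          # S_cls = O W^T, shape N x c
L_cls = bce_loss(S_cls, T)
```

Each row of `S_cls` holds one object's logits over the $c$ classes, which is why `T` has the same $N \times c$ shape.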

Object Detection + Phrase Grounding

We now reformulate Object Detection with Phrase Grounding built in.

First, the classes.

[person, bicycle, car, ..., toothbrush]

Given classes like these, the prompt becomes:

Prompt = “Detect: person, bicycle, car, ... , toothbrush”

Next, the processing.

Detecting objects from the image.

$O\ =\ Enc_{I}\ (Img)$

$Img$ : input image
$Enc_{I}$ : Image Encoder (e.g. a Vision Transformer or DyHead)
$O$ : object detection output

Object features
$O\ \in\ R^{N\ \times\ d}$
$N$ : number of objects

Up to here, nothing has changed.

Next, extract information from the text.

$P\ =\ Enc_{L}\ (Prompt)$

$Prompt$ : input text (label information)
$Enc_{L}$ : Language Encoder (e.g. a Transformer)
$P$ : embedded text

Text (word / token) features
$P\ \in\ R^{M\ \times\ d}$
$M$ : number of words / tokens

Compute the (similarity) scores between objects and tokens.

$S_{ground}\ =\ OP^{T}$

$S_{ground}$ : (similarity) scores between objects and tokens

$S_{ground}\ \in\ R^{N\ \times\ M}$

The labels change as follows.

$T\ \in\ \{0,\ 1\}^{N\ \times\ c}$ : class-level labels
$T^{'}\ \in\ \{0,\ 1\}^{N\ \times\ M}$ : token-level labels

$T\ \longrightarrow\ T^{'}$

Then compute the loss.

$loss(S_{ground};T^{'})$
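The expansion $T \longrightarrow T^{'}$ can be illustrated with a small sketch. It assumes a hypothetical word-level tokenization of the prompt and a hand-written `token_to_class` map; a real tokenizer would split into sub-words, and the helper names `expand_targets` and `grounding_scores` are made up for this example.

```python
def grounding_scores(O, P):
    """S_ground = O P^T: one score per (object, token) pair, shape N x M."""
    return [[sum(o * p for o, p in zip(o_row, p_row)) for p_row in P]
            for o_row in O]

def expand_targets(T, token_to_class):
    """Expand class-level targets T (N x c) into token-level targets T' (N x M).

    token_to_class[m] is the class index that token m belongs to,
    or None for tokens such as "Detect" or punctuation.
    """
    return [[0 if cls is None else row[cls] for cls in token_to_class]
            for row in T]

# Hypothetical tokenization of the prompt "Detect: person, bicycle".
tokens = ["Detect", ":", "person", ",", "bicycle"]
token_to_class = [None, None, 0, None, 1]      # M = 5 tokens, c = 2 classes

T = [[1, 0], [0, 1]]                           # object 0 is a person, object 1 a bicycle
T_prime = expand_targets(T, token_to_class)    # N x M

O = [[1.0, 0.0], [0.0, 1.0]]                   # toy object features, N x d
P = [[0.1, 0.1], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]  # toy token features, M x d
S_ground = grounding_scores(O, P)              # N x M, same shape as T_prime
```

Because `S_ground` and `T_prime` share the shape $N \times M$, the same loss function used for $S_{cls}$ and $T$ can be applied unchanged.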

Deep Fusion

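In the formulation so far, the image and the text are processed by separate encoders and only meet at the very end, when the scores are computed.

Such a model is called a Late Fusion Model.

To share information between vision and language more effectively, GLIP introduces a better sharing mechanism.

The intermediate layers of each encoder output intermediate features.

$O^{i},\ P^{i},\ \ i\ \in\ \{ 0,\ 1,\ \cdots,\ L-1 \}$

$O^{i}$ : intermediate image (object) features from layer $i$
$P^{i}$ : intermediate text (phrase / token) features from layer $i$
$L$ : number of layers

The intermediate features exchange information.

$O_{t2i}^{i},\ P_{i2t}^{i}\ =$ X-MHA $(O^{i},\ P^{i})$

X-MHA : Cross-Modality Multi-Head Attention
$O_{t2i}^{i}$ : information passed from text to image
$P_{i2t}^{i}$ : information passed from image to text

Each set of intermediate features adds the information shared from the other modality, then proceeds to the next layer.

$\begin{align} O^{i+1}&\ =\ DyHeadModule (O^{i}\ +\ O_{t2i}^{i}) \\ P^{i+1}&\ =\ BERTLayer (P^{i}\ +\ P_{i2t}^{i}) \end{align}$

In the case above, DyHead extracts the image features and BERT extracts the text features.

The output of the last layer is the final output.

$\begin{align} O&\ =\ O^{L}\\ P&\ =\ P^{L} \end{align}$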
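The fusion loop can be sketched as follows. This is a heavily simplified illustration, not the GLIP architecture: it uses a single attention head with no learned projections, and the `image_layer` / `text_layer` arguments are identity placeholders standing in for DyHeadModule and BERTLayer. All function names are assumptions for the example.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def x_mha(O, P):
    """Simplified cross-modality attention (one head, identity projections)."""
    d = len(O[0])
    attn = [[v / math.sqrt(d) for v in row] for row in matmul(O, transpose(P))]
    O_t2i = matmul(softmax_rows(attn), P)             # text -> image, N x d
    P_i2t = matmul(softmax_rows(transpose(attn)), O)  # image -> text, M x d
    return O_t2i, P_i2t

def deep_fusion(O, P, num_layers, image_layer=lambda x: x, text_layer=lambda x: x):
    """Apply O^{i+1} = image_layer(O^i + O_t2i), P^{i+1} = text_layer(P^i + P_i2t)."""
    for _ in range(num_layers):
        O_t2i, P_i2t = x_mha(O, P)
        O, P = image_layer(add(O, O_t2i)), text_layer(add(P, P_i2t))
    return O, P

O0 = [[1.0, 0.0], [0.0, 1.0]]                # N = 2 object features
P0 = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]   # M = 3 token features
O_final, P_final = deep_fusion(O0, P0, num_layers=2)
```

The point of the sketch is the data flow: both modalities are updated inside every layer, so the final $O$ and $P$ already depend on each other before $S_{ground}$ is computed.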

Training Datasets

Flickr30K
VG Caption

Implementation

GLIP #Implementation
- yhayato1320.hatenablog.com

References

Grounded Language-Image Pre-training
- [2021]
- Abstract
- 1 Introduction
- 3 Grounded Language Image Pre-training
  - 3.1 Unified Formulation
  - 3.2 Language-Aware Deep Fusion
  - 3.3 Pre-training with Scalable Semantic-Rich Data
- arxiv.org
- paperswithcode.com

Websites

GLIP: Grounded Language-Image Pre-training
- sh-tsang.medium.com

2022-05-14

[Video Processing] Frame Sampling #Implementation


Index

Frame Sampling
Task
Dataset
Implementation
- Execution
Environment Setup

Frame Sampling

Sample images (frames) from video data to build a sequential (or non-sequential) image dataset.

Frame Sampling #Summary
- yhayato1320.hatenablog.com

Task

This time, we perform random sampling from a video.
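As a minimal sketch of random frame sampling, the function below picks frame indices uniformly at random without replacement. The function name `sample_frame_indices` and its parameters are assumptions for illustration; the post's own code is not shown here.

```python
import random

def sample_frame_indices(num_frames, num_samples, seed=None, keep_order=True):
    """Pick `num_samples` distinct frame indices out of `num_frames`.

    keep_order=True returns the indices sorted, giving a sequential subset;
    keep_order=False returns them in random order (non-sequential dataset).
    """
    if not 0 <= num_samples <= num_frames:
        raise ValueError("num_samples must be between 0 and num_frames")
    rng = random.Random(seed)  # local generator, so the seed is reproducible
    indices = rng.sample(range(num_frames), num_samples)
    return sorted(indices) if keep_order else indices

# Example: choose 10 of 600 frames, reproducibly.
picked = sample_frame_indices(600, 10, seed=0)
```

The returned indices can then be used to select frames, for example by seeking with OpenCV's `cv2.VideoCapture.set(cv2.CAP_PROP_POS_FRAMES, i)` for a video file, or by indexing a sorted list of image files when the dataset is stored as one image per frame.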

Dataset

This time, we use Multi Object Tracking 16 / MOT16.

Multi Object Tracking 16 / MOT16
- yhayato1320.hatenablog.com

Implementation

Google Colab (2022/05/14)
Python 3.7.13

Execution

Environment Setup

Build a Docker image so that the code can run anywhere.

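# Base image with Python 3 and OpenCV 4.0.1 preinstalled
FROM jjanzic/docker-python3-opencv:opencv-4.0.1

# Refresh the package index and upgrade installed packages
RUN apt update && apt -y upgrade

# Upgrade pip itself
RUN pip3 install -U pip

# Working directory inside the container
WORKDIR /home/work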

hub.docker.com
- The Docker image built from the Dockerfile above

2022-05-13

[Deep Learning] DETR #Summary


Index

DETR / 2020
- Implementation
Applications

DETR / 2020

An Object Detection algorithm built on the Transformer, published by Facebook in 2020.

Object Detection #Summary
- Uses Attention
- yhayato1320.hatenablog.com
Transformer #Summary
- Applications to images
- yhayato1320.hatenablog.com

DETR : Detection Transformer

DETR #Algorithm
- yhayato1320.hatenablog.com

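Implementation

Implementation #01
- Google Colab
- Runs object detection and computes where the model attends when detecting
- yhayato1320.hatenablog.com
Implementation #02
- Environment setup with Docker / serving as an API with TorchServe
- yhayato1320.hatenablog.com

Applications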

Deformable DETR / 2020

Deformable DETR
- yhayato1320.hatenablog.com

MDETR / 2021

MDETR
- DETR + RoBERTa
- yhayato1320.hatenablog.com

Conditional DETR / 2021

Conditional DETR for Fast Training Convergence
- [2021]
- [2108.06152] Conditional DETR for Fast Training Convergence

DAB-DETR / 2022

DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR
- [2022]
- arxiv.org

DN-DETR / 2022

DN-DETR: Accelerate DETR Training by Introducing Query DeNoising
- [2022]
- arxiv.org

DINO / 2022 -

DINO #Summary
- yhayato1320.hatenablog.com
  - DINO / 2022
  - DINO v2 / 2023
  - Stable-DINO / 2023
  - Grounding DINO / 2023

HDETR / 2022

DETRs with Hybrid Matching
- [2022]
- arxiv.org

KS-DETR / 2023

KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer
- [2023]
- arxiv.org

Real-Time DEtection TRansformer / RT-DETR / 2023

DETRs Beat YOLOs on Real-time Object Detection
- [2023]
- arxiv.org

2022-05-12

[Deep Learning] DETR #Algorithm


DETR #Summary
- yhayato1320.hatenablog.com

Index

DETR
Architecture / Design
References
- Websites

DETR

An Object Detection algorithm built on the Transformer, published by Facebook in 2020.

Object Detection with Attention #Summary
- yhayato1320.hatenablog.com

DETR : Detection Transformer

Differences from conventional object detection

DETR removes components that require hand-designed settings, such as Non-Maximum Suppression (NMS) and anchor-based box generation.

NMS
- yhayato1320.hatenablog.com
Anchor Base
- yhayato1320.hatenablog.com

Advantages of the Transformer

It can extract features from sequential data.

So what is the unit of the input sequence in Object Detection?

In language, it is the word / token.
In vision, it is the image patch.
In detection, it is the object / Object Query.

Applying it to Object Detection / Direct Set Prediction Problem

This algorithm solves the task by treating Object Detection as a Direct Set Prediction Problem.

Matching predicted BBs to GT labels

The loss is designed around Bipartite Matching.

Assignment / Matching Problems
- yhayato1320.hatenablog.com

Architecture / Design

Solving the Direct Set Prediction Problem relies on a few key ingredients.

Loss Design
Network Architecture

Loss

Some definitions before defining the loss.

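$N$ : number of elements in the predicted set of objects

$\hat{y}\ =\ \{ \hat{y}_{i} \}_{i=1}^{N}$ : the set of predicted objects $\hat{y}_{i}$, indexed by $i$

$y\ =\ \{y_{i}\}_{i=1}^{N}$ : the set of labeled (GT) objects $y_{i}$, indexed by $i$ and padded with $\phi$ (no object) up to size $N$

$y_{i}\ =\ (c_{i},\ b_{i})$ : each object consists of class information $c_{i}$ and box coordinates $b_{i}$

$b_{i}\ \in\ [0,\ 1]^{4}$ : the box coordinates are 4-dimensional, and each element satisfies $0 \leq b_{i}^{*} \leq 1$

$\hat{p}_{\sigma\ (i)} (c_{i})$ : the predicted probability (confidence) of class $c_{i}$ for the prediction matched to GT object $i$

$\phi$ : denotes no object

$\sigma\ (i)$ : a function mapping GT object $i$ to its corresponding predicted object

$S_{N}\ \ni\ \sigma$ : the set of all such mappings (permutations of $N$ elements)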

First, consider two losses.

$L_{iou}$ : Generalized Intersection over Union loss

$L_{L1}$ : L1 loss

$\begin{align} L_{iou} (b_{i},\ \hat{b}_{\sigma\ (i)})\ &=\ 1\ -\ GIoU (b_{i},\ \hat{b}_{\sigma\ (i)}) \\ L_{L1} (b_{i},\ \hat{b}_{\sigma\ (i)})\ &=\ || b_{i}\ -\ \hat{b}_{\sigma\ (i)} ||_{1} \end{align}$

Generalized Intersection over Union / GIoU

About $L_{iou}$
[2019]
arxiv.org

Bounding Box Loss

Expresses the distance between two boxes.

It is a weighted sum of the two losses above.

$L_{box} (b_{i},\ \hat{b}_{\sigma\ (i)})\ =\ \lambda_{iou}\ \cdot\ L_{iou} (b_{i},\ \hat{b}_{\sigma\ (i)})\ +\ \lambda_{L1}\ \cdot\ L_{L1} (b_{i},\ \hat{b}_{\sigma\ (i)})$

$\lambda_{iou},\ \lambda_{L1}\ \in\ R$ : weighting hyperparameters
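The box loss can be sketched in a few lines. Assumptions made here: boxes are given as normalized corners `(x1, y1, x2, y2)` with positive area, the IoU term uses the usual convention of 1 minus GIoU, and the default weights `lambda_iou=2.0` and `lambda_l1=5.0` are illustrative values only.

```python
def giou(a, b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest box enclosing both a and b.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def box_loss(b, b_hat, lambda_iou=2.0, lambda_l1=5.0):
    """L_box = lambda_iou * (1 - GIoU) + lambda_L1 * ||b - b_hat||_1."""
    return lambda_iou * (1.0 - giou(b, b_hat)) + lambda_l1 * l1_distance(b, b_hat)

same = box_loss((0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5))
far = box_loss((0.0, 0.0, 0.25, 0.25), (0.75, 0.75, 1.0, 1.0))
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: it becomes negative as the boxes move apart, so `far` is larger than the loss of any overlapping pair at the same L1 distance.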

Matching Cost / Loss

Expresses how well a labeled object matches the prediction assigned to it.

$L_{match} (y_{i},\ \hat{y}_{\sigma\ (i)})\ =\ -\ \mathbb{1}_{ \{c_{i}\ \neq\ \phi\} } \hat{p}_{\sigma\ (i)} (c_{i})\ +\ \mathbb{1}_{ \{c_{i}\ \neq\ \phi\} } L_{box} (b_{i},\ \hat{b}_{\sigma\ (i)})$

Among all matching functions $\sigma\ \in\ S_{N}$, take the one that minimizes the total cost, $\hat{\sigma}$.

$\hat{\sigma}\ =\ \underset{\sigma\ \in\ S_{N}}{\operatorname{arg\,min}}\ \displaystyle \sum_{i=1}^{N} L_{match} (y_{i},\ \hat{y}_{\sigma\ (i)})$
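To make the arg min concrete, the sketch below enumerates every permutation in $S_{N}$ and keeps the cheapest one. This brute force is only for illustration with tiny $N$; in practice the same problem is solved in polynomial time with the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`). The names `NO_OBJECT`, `match_cost`, and `best_matching`, and the use of a plain L1 distance as the box cost, are assumptions for the example.

```python
from itertools import permutations

NO_OBJECT = None  # stands in for the "no object" class

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(gt, pred, box_cost):
    """L_match for one (GT, prediction) pair; zero when the GT is 'no object'."""
    cls, box = gt
    if cls is NO_OBJECT:
        return 0.0
    probs, pred_box = pred
    return -probs[cls] + box_cost(box, pred_box)

def best_matching(gts, preds, box_cost=l1_distance):
    """Return (sigma_hat, cost): sigma_hat[i] is the prediction assigned to GT i."""
    n = len(preds)
    padded = list(gts) + [(NO_OBJECT, None)] * (n - len(gts))  # pad GT to size N
    best_sigma, best_cost = None, float("inf")
    for sigma in permutations(range(n)):
        cost = sum(match_cost(padded[i], preds[sigma[i]], box_cost) for i in range(n))
        if cost < best_cost:
            best_sigma, best_cost = sigma, cost
    return best_sigma, best_cost

# Two GT objects, three predictions (class probabilities, box).
gts = [(0, (0.1, 0.1, 0.3, 0.3)), (1, (0.6, 0.6, 0.9, 0.9))]
preds = [
    ([0.1, 0.8], (0.6, 0.6, 0.9, 0.9)),
    ([0.9, 0.05], (0.1, 0.1, 0.3, 0.3)),
    ([0.2, 0.2], (0.5, 0.5, 0.5, 0.5)),
]
sigma_hat, total_cost = best_matching(gts, preds)
```

Here GT 0 is matched to prediction 1 and GT 1 to prediction 0, while the padded no-object slot absorbs the leftover prediction 2.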

Hungarian Loss

Using the matching function $\hat{\sigma}$ determined above, which the Hungarian algorithm finds efficiently, compute the loss over all matched pairs.

Hungarian Algorithm
- yhayato1320.hatenablog.com

$L_{Hungarian}(y,\ \hat{y})\ =\ \displaystyle \sum_{i=1}^{N} \left[ - \log \hat{p}_{\hat{\sigma}\ (i)} (c_{i})\ +\ \mathbb{1}_{ \{c_{i}\ \neq\ \phi\} } L_{box} (b_{i},\ \hat{b}_{\hat{\sigma}\ (i)}) \right]$
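Given a matching $\hat{\sigma}$, the Hungarian loss is a direct sum over the matched pairs. The sketch below follows the formula term by term; it assumes the no-object class is represented as an ordinary class index (`no_object_index`) so that its log-probability can be taken, and the function and parameter names are hypothetical.

```python
import math

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hungarian_loss(gts, preds, sigma, no_object_index, box_loss=l1_distance):
    """Sum of -log p_hat(c_i) plus L_box for every GT slot that holds a real object.

    gts:   list of (class_index, box), already padded to len(preds) with
           (no_object_index, None) entries.
    preds: list of (class_probabilities, box).
    sigma: sigma[i] is the index of the prediction matched to GT slot i.
    """
    total = 0.0
    for i, (cls, box) in enumerate(gts):
        probs, pred_box = preds[sigma[i]]
        total += -math.log(probs[cls])   # classification term, applied to every slot
        if cls != no_object_index:       # box term only for real objects
            total += box_loss(box, pred_box)
    return total

# Classes: 0 and 1 are real classes, 2 is "no object".
gts = [(0, (0.1, 0.1, 0.3, 0.3)), (2, None)]
preds = [
    ([0.1, 0.1, 0.8], (0.0, 0.0, 0.0, 0.0)),
    ([0.5, 0.25, 0.25], (0.1, 0.1, 0.3, 0.3)),
]
loss = hungarian_loss(gts, preds, sigma=(1, 0), no_object_index=2)
```

Note the difference from the matching cost: the matching uses the raw probability, while this loss uses the negative log-probability and also penalizes the slots matched to no object.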

Aside

Auxiliary Decoding Loss

Character-Level Language Modeling with Deeper Self-Attention
- About Auxiliary Decoding Loss
- [2018]
- arxiv.org

Network Architecture

As the figure above also showed, DETR consists of three main components.

Backbone
Transformer
- Encoder
- Decoder
Head

Backbone

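Some definitions before going through the processing.

$x_{img}\ \in\ R^{3\ \times\ H_{0}\ \times\ W_{0}}$ : 3-channel input image

$H_{0}$ : height of the input image

$W_{0}$ : width of the input image

A CNN is used to obtain the feature map $f\ \in\ R^{C\ \times\ H\ \times\ W}$.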

$C\ =\ 2048,\ H,\ W\ =\ \displaystyle \frac{H_{0}}{32},\ \frac{W_{0}}{32}$

Transformer

Encoder

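The Encoder performs the following steps.

Dimensionality reduction (channel compression)
Positional Encoding
Encoding

The feature map $f$ is compressed with a $1\ \times\ 1$ convolution to obtain the feature map $z_{0}\ \in\ R^{d\ \times\ H\ \times\ W}$.

CNN
- yhayato1320.hatenablog.com

Because the Encoder expects sequential input, the spatial dimensions of $z_{0}$ are flattened into a $d\ \times\ HW$ feature map, i.e. a sequence of $HW$ vectors of size $d$.

Positional Encoding, as also used in the Transformer / Vision Transformer, is added to supply position information.

Encoding is then applied, yielding $HW$ vectors of size $d$.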
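The shape bookkeeping from the backbone through the Encoder input can be checked with a small helper. This is only a sketch: the function name `detr_encoder_shapes` is hypothetical, `d=256` is an assumed default, and the integer division assumes image sides that are multiples of 32.

```python
def detr_encoder_shapes(h0, w0, d=256, c=2048, stride=32):
    """Return the tensor shapes at each stage for an H0 x W0 input image."""
    h, w = h0 // stride, w0 // stride  # H = H0 / 32, W = W0 / 32
    return {
        "image": (3, h0, w0),                # x_img
        "backbone_feature_map": (c, h, w),   # f
        "projected_feature_map": (d, h, w),  # z_0 after the 1x1 convolution
        "encoder_sequence": (h * w, d),      # HW tokens, each of size d
    }

shapes = detr_encoder_shapes(640, 960)
```

For a 640 x 960 image this gives a 20 x 30 feature map, so the Encoder attends over 600 tokens regardless of the channel count.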

Decoder

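The Decoder takes $N$ Object Queries of size $d$ as input.

As in the Encoder, positional information is added: the Object Queries are themselves learned positional encodings.

They are then decoded, and $N$ vectors of size $d$ are output.

Head

The final predictions are output by an FFN consisting of a 3-layer MLP with activation functions.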

Parallel Decoder

Parallel Decoding
- yhayato1320.hatenablog.com

References

End-to-End Object Detection with Transformers
- [2020 Facebook AI]
- Abstract
- 2 Related work
  - 2.1 Set Prediction
  - 2.2 Transformers and Parallel Decoding
  - 2.3 Object detection
- 3 The DETR model
  - 3.1 Object detection set prediction loss
  - 3.2 DETR architecture
- arxiv.org
End-to-end object detection with Transformers
- ai.facebook.com
- Official Facebook blog

Websites

Transformerを採用した最新の物体検出手法「DETR」
- club.informatix.co.jp
Transformerを使った初めての物体検出「DETR」
- www.ogis-ri.co.jp
物体検出DETR （DEtection TRansformer）
- qiita.com
Transformer を物体検出に採用！話題のDETRを詳細解説！
- deepsquare.jp

2022-05-09

[Video Processing] Frame Sampling #Summary


Index

Frame Sampling
- Random Sampling
Algorithms

Frame Sampling

Sample images (frames) from video data to build a sequential (or non-sequential) image dataset.

Video Processing #Summary
- yhayato1320.hatenablog.com

Random Sampling

Frame Sampling #Implementation
- yhayato1320.hatenablog.com

Algorithms

SCSampler / 2019

SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition
- [2019 Facebook]
- arxiv.org

Adversarially Robust Frame Sampling / 2020

Adversarially Robust Frame Sampling with Bounded Irregularities
- [2020]
- arxiv.org

MGSampler / 2021

MGSampler: An Explainable Sampling Strategy for Video Action Recognition
- [2021]
- Applied to Action Recognition
- arxiv.org
