Vision Language
- yhayato1320.hatenablog.com

Index

ALIGN
- VSE
Dataset
Architecture
- Image Encoder
- Text Encoder
Pre Training
- Image to Text Classification
- Text to Image Classification
References
- Websites

ALIGN

A Large-scale ImaGe and Noisy-text embedding

VSE

ALIGN builds on the concept of VSE (visual-semantic embedding).

VSE
- yhayato1320.hatenablog.com

Dataset

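Datasets such as Conceptual Captions are cleaned up with filtering and post-processing. ALIGN does not take that route.

Instead, the effort those steps would require goes into building a much larger dataset.

The result is a dataset of 1.8 B image-text pairs.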

Architecture

A dual-encoder architecture made up of an Image Encoder and a Text Encoder.

Image Encoder

EfficientNet is used to extract the image features.

Text Encoder

The CLS token of BERT is used as the text embedding.
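To make the dual-encoder idea concrete, here is a minimal sketch of how the two embeddings can be compared once each encoder has produced a vector. It assumes the embeddings are L2-normalized so that a dot product equals the cosine similarity. The feature values and the names `l2_normalize` and `similarity_matrix` are hypothetical and only for illustration; they are not actual EfficientNet or BERT outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def similarity_matrix(image_feats, text_feats):
    """Return S where S[i][j] is the cosine similarity of image i and text j."""
    xs = [l2_normalize(v) for v in image_feats]
    ys = [l2_normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]


# Hypothetical stand-ins for encoder outputs (not real model features).
image_feats = [[3.0, 4.0], [1.0, 0.0]]
text_feats = [[6.0, 8.0], [0.0, 2.0]]
S = similarity_matrix(image_feats, text_feats)
```

Each row of `S` scores one image against every text, and each column scores one text against every image. The two encoders never see each other's input and only meet in this similarity matrix.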

Pre Training

$x_{i}$ : Image Embedding
$y_{j}$ : Text Embedding
$\sigma$ : Temperature (scales the logits)
$N$ : Number of image-text pairs in the batch

Image to Text Classification

$L_{i2t}\ =\ -\ \displaystyle \frac{1}{N}\ \displaystyle \sum_{i=1}^{N}\ \log\ \displaystyle \frac{\exp(x_{i}^{T}\ y_{i}\ /\ \sigma)}{\displaystyle \sum_{j=1}^{N}\ \exp(x_{i}^{T}\ y_{j}\ /\ \sigma)}$

Text to Image Classification

$L_{t2i}\ =\ -\ \displaystyle \frac{1}{N}\ \displaystyle \sum_{i=1}^{N}\ \log\ \displaystyle \frac{\exp(y_{i}^{T}\ x_{i}\ /\ \sigma)}{\displaystyle \sum_{j=1}^{N}\ \exp(y_{i}^{T}\ x_{j}\ /\ \sigma)}$
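The two losses can be sketched in plain Python as below. This is a minimal illustration of the formulas, not the paper's implementation. `contrastive_losses` and `log_softmax_row` are hypothetical helper names, and the embeddings are passed in as plain lists where `x[i]` and `y[i]` form a matching pair.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def contrastive_losses(x, y, sigma):
    """Return (L_i2t, L_t2i) for N paired embeddings.

    x[i] is the image embedding and y[i] the text embedding of pair i.
    sigma is the temperature that divides every dot product.
    """
    n = len(x)
    # logits[i][j] = x_i^T y_j / sigma
    logits = [
        [sum(a * b for a, b in zip(x[i], y[j])) / sigma for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: for each image i, softmax over all texts j (rows).
    l_i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image: for each text i, softmax over all images j (columns).
    cols = [[logits[j][i] for j in range(n)] for i in range(n)]
    l_t2i = -sum(log_softmax_row(cols[i])[i] for i in range(n)) / n
    return l_i2t, l_t2i


# Toy example: two perfectly aligned pairs (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0]]
y = [[1.0, 0.0], [0.0, 1.0]]
loss_i2t, loss_t2i = contrastive_losses(x, y, sigma=1.0)
```

Both losses are ordinary softmax cross-entropy over the same logit matrix, read row-wise for image-to-text and column-wise for text-to-image. A smaller `sigma` sharpens the softmax, so well-aligned pairs get a lower loss and mismatched pairs a higher one.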

References

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- [2021]
- 2 Related Work
- 3 A Large-Scale Noisy Image-Text Dataset
- 4 Pre-training and Task Transfer
  - 4.1 Pre-training on Noisy Image-Text Pairs
  - 4.2 Transferring to Image-Text Matching & Retrieval
  - 4.3 Transferring to Visual Classification
- arxiv.org

Websites

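ALIGN：ノイズの多い文章を教師に使って視覚と言語で共通する特徴表現を学習(1/3) (ALIGN: Learning feature representations shared by vision and language with noisy text as supervision, part 1 of 3)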
- webbigdata.jp
