Transformer #Summary
- yhayato1320.hatenablog.com

Who this article is for

Readers who want to learn about the Transformer, one of the methods in deep learning.

Keywords / helpful prerequisites

Attention
Memory Network
Attention #Summary
- yhayato1320.hatenablog.com

Index

What is the Transformer?
- Advantages
Architecture
- Encoder
- Decoder
Attention Layer
Position-Wise Fully Connected Feed Forward Network
Network input
- Embedding
- Positional Encoding
Training
References
- Websites
- Videos

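What is the Transformer?

A model based on Seq2Seq (the Encoder-Decoder model) that removes recurrent neural networks (RNNs, etc.) and is instead built from two kinds of layers: a Feed Forward Network (FFN) and Multi-Head Self-Attention.

Advantages

Removing RNN-style recurrent computation makes the computation parallelizable, which reduces training time.

In other words, with recurrent computation the next step cannot start until the previous one has finished, whereas the Transformer holds the whole sequence in a matrix and processes all positions at once with matrix operations.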

RNN #Summary
- yhayato1320.hatenablog.com

Architecture

It consists of two parts, an Encoder and a Decoder (Encoder-Decoder Model).

Encoder input sequence : $\{ x_1,\ \cdots,\ x_n \}$

Input sequence length : $n$

Encoder output (Decoder input) : $\{ z_1,\ \cdots,\ z_n \}$

Output (input) sequence length : $n$

Decoder input sequence : $\{ y_1,\ \cdots,\ y_m \}$

Input sequence length : $m$

Encoder

It is composed of several layers (6 in the original paper), each called an Encoder Layer.

Each Encoder Layer consists of two sub-layers.

Multi-Head Self-Attention Layer (Attention)
Position-Wise Fully Connected Feed-Forward Network (FFN)

In addition, a Residual Connection is placed around each sub-layer so that the information from before the sub-layer is carried forward.

ResNet
- yhayato1320.hatenablog.com

After that, Layer Normalization is applied.

Normalization
- yhayato1320.hatenablog.com
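
The residual connection followed by Layer Normalization described above amounts to computing LayerNorm(x + Sublayer(x)) around each sub-layer. The following is a minimal sketch for a single position vector, using plain Python lists to stay dependency-free. The learnable gain and bias of Layer Normalization are omitted, and the lambda standing in for the sub-layer is a hypothetical placeholder, not a real Attention or FFN.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one position's feature vector to zero mean, unit variance.

    The learnable gain and bias are left out of this sketch.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer that doubles every feature (stand-in for Attention / FFN).
out = residual_block([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * a for a in v])
print(out)
```

Because the input x is added back before normalization, the sub-layer only has to learn a correction to x rather than a full replacement for it.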

Decoder

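It is composed of several layers (6 in the original paper), each called a Decoder Layer.

Each Decoder Layer consists of three sub-layers.

Position-Wise Fully Connected Feed-Forward Network (FFN)
Multi-Head Self-Attention Layer (Attention)
Masked Multi-Head Self-Attention Layer (Attention)

What differs from the Encoder is the "Masked Multi-Head Self-Attention Layer". (The other two follow the Encoder, although the second Attention also reads the Encoder output, as noted under Self Attention.)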

Attention Layer

Attention Layers are built into both the Encoder and the Decoder.

Encoder : Multi-Head Self-Attention Layer
Decoder : Multi-Head Self-Attention Layer + Masked Multi-Head Self-Attention Layer
Attention
- yhayato1320.hatenablog.com

The properties, design choices, and building blocks of the Multi-Head Self-Attention Layer are as follows.

Self Attention
Scaled Dot Product Attention
Multi Head Attention

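The Masked Multi-Head Self-Attention Layer is Multi-Head Self-Attention with a mask that stops each position from attending to later positions, so the prediction for position $i$ depends only on the outputs already known at positions before $i$. (Masked Self Attention)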
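
As a rough illustration of the masking idea, the sketch below builds a lower-triangular mask and applies it inside a row-wise softmax, so that position i only receives weight from positions j <= i. The helper names (`causal_mask`, `masked_softmax`) are made up for this example. Real implementations usually add a large negative value to the masked scores before the softmax instead, which has the same effect.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out entries get zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        # Subtract the max over allowed entries for numerical stability.
        m = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # toy scores, all equal
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```

With equal scores, the first position can only attend to itself, the second splits its weight over two positions, and the third over all three.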

Self Attention

A mechanism that, for each position of the input, computes weights over which parts of the input itself to attend to.

Self Attention
- yhayato1320.hatenablog.com

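Strictly speaking, the second Attention in the Decoder takes its queries from the Decoder and its keys and values from the Encoder output. It therefore weights the Encoder's information rather than only the input itself, and is closer to Source-Target Attention than to Self Attention.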

Scaled Dot Product Attention

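An Attention that uses the dot product as the score function for measuring the similarity between Query and Key, scaled by $1 / \sqrt{d_k}$, where $d_k$ is the dimension of the keys.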

Scaled Dot Product Attention
- yhayato1320.hatenablog.com
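
The paper writes this as $Attention(Q, K, V)\ =\ softmax(QK^{T} / \sqrt{d_k})V$. A small sketch with plain Python lists follows. The matrices Q, K, V are toy values chosen for illustration, and the function names are this example's own.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of rows). Returns n x d_v."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy inputs: each query matches one key more strongly than the other.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

Each output row is a mixture of the value vectors, tilted toward the value whose key is most similar to the query.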

Multi Head Attention

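A mechanism that uses several sets of Attention weights (heads) to increase expressive power.

(Because this does not add layers, it does not add sequential computation; the heads can be computed in parallel with matrix operations.)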

Multi Head Attention
- yhayato1320.hatenablog.com
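
To make the idea concrete, here is a small sketch of multi-head self-attention: each head projects the input with its own weights, runs scaled dot-product attention, and the head outputs are concatenated and projected once more. The weights are random placeholders, and the sizes (3 positions, d_model = 8, 2 heads) are arbitrary choices for the example, not the paper's settings.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_self_attention(X, heads, W_O):
    """X: n x d_model. heads: list of (W_Q, W_K, W_V), one tuple per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs along the feature axis, then project.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, h = 3, 8, 2
d_k = d_model // h
X = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_k, d_model, rng)
Y = multi_head_self_attention(X, heads, W_O)
print(len(Y), len(Y[0]))  # rows = positions, columns = d_model
```

The loop over heads is written out for readability; since the heads do not depend on each other, a tensor library would compute them in one batched matrix operation.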

Position-Wise Fully Connected Feed Forward Network

The step that follows the Attention in both the Encoder and the Decoder.

It applies a fully connected layer, ReLU, and another fully connected layer.

$FFN(x)\ =\ \max(0, xW_{1}\ +\ b_{1})W_{2}\ +\ b_{2}$

The same Feed Forward Network is applied to every position (every word) independently.

Different weights are used in each layer.
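
The formula and the position-wise behaviour can be checked with a tiny example. The weights below are hypothetical toy values with d_model = 2 and an inner dimension of 3 (the paper uses 512 and 2048); the point is only that the same `ffn` is mapped over every position of the sequence.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Hypothetical toy weights: W1 is d_model x d_ff, W2 is d_ff x d_model.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.5, -0.5]

sequence = [[1.0, 2.0], [3.0, -1.0]]  # two positions
# The same weights are applied to each position independently.
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
print(outputs)  # [[1.5, 2.5], [6.0, 2.0]]
```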

Should a Dropout layer be added here?

Network input

Let's also look at the parts of the architecture outside the Encoder / Decoder (mainly the input side).

Two steps are performed.

Embedding
Positional Encoding

Embedding

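Learned embeddings (distributed representations) convert the input information for the Encoder / Decoder (input tokens and output tokens) into vectors of dimension $d_{model}$.

Distributed Representations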
- yhayato1320.hatenablog.com

The Embedding Layer of the Encoder, the Embedding Layer of the Decoder, and the linear transformation at the Decoder output all share the same weight matrix.

Reference : Using the Output Embedding to Improve Language Models.
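
A minimal sketch of this weight sharing, assuming a hypothetical 4-word vocabulary and d_model = 3: one matrix `E` serves both as the embedding lookup table and as the output projection that produces one logit per vocabulary entry. The scaling of the embeddings by $\sqrt{d_{model}}$ follows the paper; the numbers in `E` are made up.

```python
import math

# Hypothetical shared weight matrix: 4 vocabulary entries, d_model = 3.
E = [[0.1, 0.2, 0.3],
     [0.0, -0.1, 0.4],
     [0.5, 0.5, 0.0],
     [-0.2, 0.1, 0.1]]
d_model = len(E[0])

def embed(token_id):
    """Look up a token vector; the paper multiplies it by sqrt(d_model)."""
    return [v * math.sqrt(d_model) for v in E[token_id]]

def output_logits(hidden):
    """Pre-softmax linear layer that reuses E: one logit per vocabulary entry."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

print(embed(2))
print(output_logits([1.0, 0.0, 0.0]))
```

Sharing the matrix means a gradient update from the output side also moves the input embeddings, and it removes one vocabulary-sized parameter matrix from the model.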

Positional Encoding

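A step performed before the input reaches the Encoder and the Decoder (after Embedding).

Its purpose is to represent the positional (sequential) relationships in the input.

Because the Transformer drops the recurrent network, the order of the words is not represented anywhere in the model.

Order information is therefore added to the input so that the model can make use of it.

It is nothing more than a constant matrix added to the embedding at each (word) position so that the network can recognize where each word sits.

The functions used are sin / cos.

There are many possible choices for this function, and there is room for further exploration.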
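
In the paper the sin / cos functions are $PE_{(pos, 2i)}\ =\ \sin(pos / 10000^{2i / d_{model}})$ and $PE_{(pos, 2i+1)}\ =\ \cos(pos / 10000^{2i / d_{model}})$. The sketch below builds that constant matrix and adds it to a placeholder embedding matrix; the sizes (4 positions, d_model = 6) and the all-zero embeddings are assumptions made only for the example.

```python
import math

def positional_encoding(n_positions, d_model):
    """Return an n_positions x d_model matrix of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        # i walks over the even feature indices (2i in the paper's notation).
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(n_positions=4, d_model=6)

# Placeholder embeddings (all zeros) so the added encoding is easy to see.
embeddings = [[0.0] * 6 for _ in range(4)]
inputs = [[e + p for e, p in zip(e_row, p_row)]
          for e_row, p_row in zip(embeddings, pe)]
for row in inputs:
    print([round(v, 3) for v in row])
```

The matrix depends only on the position and the feature index, so it can be computed once and reused for every input sequence.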

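Training

Dataset

English <-> German translation dataset
- WMT2014
- About 4.5 million English-German sentence pairs.
- Tokenized with byte-pair encoding using a shared vocabulary of about 37000 tokens.
English <-> French translation dataset
- WMT2014
- About 36 million English-French sentence pairs.
- Tokenized into a word-piece vocabulary of about 32000 tokens.
Each batch holds sentence pairs containing about 25000 source tokens and about 25000 target tokens.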

Hardware and training time

NVIDIA P100 GPU x8
Base Model
- 0.4 s / step
- 100,000 steps
- 12 h
Big Model
- 1.0 s / step
- 300,000 steps
- 3.5 days

Optimizer

Adam is used.

Adam
- yhayato1320.hatenablog.com

$\beta_{1}\ =\ 0.9,\ \beta_{2}\ =\ 0.98,\ \epsilon\ =\ 10^{-9}$

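Regularization

Three kinds of regularization are applied.

Dropout on the output of each sub-layer in the Encoder / Decoder, before it is added to the sub-layer input and normalized (Layer Normalization)
Dropout on the sum of the Word Embedding and the Positional Encoding
Label Smoothing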

Dropout
- yhayato1320.hatenablog.com
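
The Label Smoothing item in the list above can be sketched as follows. This assumes one common formulation, in which the one-hot target is mixed with a uniform distribution over all classes using a smoothing value of 0.1 (the value reported in the paper); other variants spread the mass only over the incorrect classes. The function names and the 4-class example are hypothetical.

```python
import math

def smoothed_targets(true_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    return [(1.0 - eps) * (1.0 if k == true_index else 0.0) + eps / num_classes
            for k in range(num_classes)]

def cross_entropy(target, probs):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

target = smoothed_targets(true_index=1, num_classes=4)
probs = [0.1, 0.7, 0.1, 0.1]  # toy model prediction
print(target)
print(cross_entropy(target, probs))
```

Because every class keeps a small share of the target mass, the model is discouraged from pushing the predicted probability of the correct class all the way to 1.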

References

Attention Is All You Need
- [2017 Google]
- 1 Introduction
- 3 Model Architecture
  - 3.1 Encoder and Decoder Stacks
  - 3.2 Attention
    - 3.2.1 Scaled Dot-Product Attention
    - 3.2.2 Multi-Head Attention
  - 3.3 Position-wise Feed-Forward Networks
  - 3.4 Embeddings and Softmax
  - 3.5 Positional Encoding
- 5 Training
  - 5.1 Training Data and Batching
  - 5.2 Hardware and Schedule
  - 5.3 Optimizer
  - 5.4 Regularization
- arxiv.org

Websites

深層学習界の大前提Transformerの論文解説！
- qiita.com
大規模言語モデルの自然言語処理「Transformer」モデルの仕組み
- thinkit.co.jp

Videos

Attention Is All You Need
- www.youtube.com
