This post collects the articles I have written about "Attention", one of the techniques used in deep learning.

List of #Summary posts
- yhayato1320.hatenablog.com

Index
What is Attention
Data domains
Terminology
Score Function
Basic Attention
Applications of Attention
References
- Books
- Web Site
- Videos

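What is Attention

A mechanism that uses deep learning to extract features.

Deep Learning #Summary
- yhayato1320.hatenablog.com

Features are extracted by also learning which parts of the data to attend to.

Attention #Concepts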
- yhayato1320.hatenablog.com

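Data domains

"Attention" was proposed as an improvement to "generative models" for "sequence data" (seq2seq / Encoder-Decoder), but it has also been applied to "predictive models" for "time-series data" and to "image data".

Applications to image data

Image Processing
- yhayato1320.hatenablog.com

Applications to time-series data

Time Series Analysis
- yhayato1320.hatenablog.com

Multi Modal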
- yhayato1320.hatenablog.com

Terminology

Input sequences

(Encoder) input sequence : $\{ x_1,\ \cdots,\ x_I \}$

Length of the Encoder input sequence : $I$

Decoder input sequence : $\{ y_1,\ \cdots,\ y_J \}$

Length of the Decoder input sequence : $J$

Encoder - Decoder outputs

Encoder outputs (hidden states) : $\{ h_i^{(s)}\ |\ i=1, \cdots , I \}$

These correspond to the Memory

Decoder outputs (hidden states) : $\{ h_j^{(t)}\ |\ j=1, \cdots , J \}$

Attention

Importance / Attention Weight :

$\{ a_i\ |\ i=1, \cdots , I \}$

alignment score function

The function used to compute the importance
$score()$

Context vector / weighted average :

$c$ or $\bar{h}$

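Query : the input information (q)

(corresponds to, for example, the Decoder hidden state or the input sequence)

Memory : the information source

Key : the Memory used together with the Query to compute the "importance / relevance" (k)
Value : the Memory combined with the Attention Weights to obtain the context vector (v)
(corresponds to, for example, the Encoder hidden states at all time steps)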

Score Function

A function that computes how similar, or how related, a Query and a Key are.

The dot product is used in many cases, but several other functions have been proposed.

Score Function
- yhayato1320.hatenablog.com
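
As a rough illustration of the score function, the sketch below uses the dot product as $score()$ and turns the scores into Attention Weights $a_i$. Normalizing with softmax is an assumption made here (it is the common choice, but this post does not fix it), and the names `dot_score`, `softmax`, `attention_weights` and the toy vectors are hypothetical.

```python
import math

def dot_score(query, key):
    """Dot-product score: larger when query and key point in the same direction."""
    return sum(q * k for q, k in zip(query, key))

def softmax(scores):
    """Map raw scores to positive weights that sum to 1 (assumed normalization)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, score=dot_score):
    """Attention Weights a_i of a single Query over all Keys."""
    return softmax([score(query, key) for key in keys])

# Toy data (hypothetical): one Query and I = 3 Keys.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(query, keys)
print(weights)  # the Key most similar to the Query gets the largest weight
```

Passing a different function as `score` is one way to try the other score functions without changing the rest of the computation.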

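Basic Attention

Self vs Not Self

Whether information is extracted from the input itself (self) or from a different source.

Self Attention

Attention in which the input Query and the Memory being looked up are the same.

Self Attention
- yhayato1320.hatenablog.com

Source Target Attention

Attention in which the input Query and the Memory being looked up are different.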
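
The difference between the two comes down to what is passed in as Query and Memory. The following is a minimal sketch, assuming dot-product scores with softmax and using the same Memory as both Key and Value; the function `attend` and the toy hidden states are hypothetical.

```python
import math

def attend(queries, memory):
    """Soft attention of every Query over the Memory (Memory is both Key and Value)."""
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) for m in memory]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * m[d] for w, m in zip(weights, memory))
                        for d in range(dim)])
    return outputs

# Hypothetical hidden states: I = 3 Encoder states, J = 2 Decoder states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[0.5, 0.5], [2.0, 0.0]]

# Self Attention: Query and Memory are the same sequence.
self_out = attend(encoder_states, encoder_states)

# Source Target Attention: Query comes from the Decoder, Memory from the Encoder.
source_target_out = attend(decoder_states, encoder_states)
```

In both calls there is one output vector per Query, so `self_out` has $I$ rows and `source_target_out` has $J$ rows.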

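Soft vs Hard

How the context vector is computed.

Soft Attention

In particular, the method that uses a weighted average of multiple vectors is called Soft Attention.

(from 「深層学習による自然言語処理」)

Given the source vectors $\{ h_1^{(s)},\ \cdots,\ h_I^{(s)} \}$, Soft Attention computes the importance $\{ a_1,\ \cdots,\ a_I \}$ of each one with a separate network and uses the resulting weighted average. Because Soft Attention consists entirely of differentiable functions, its gradients can be computed with ordinary backpropagation.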
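
The weighted average itself, $c = \sum_{i=1}^{I} a_i h_i^{(s)}$, can be sketched as follows. The weights are given directly instead of coming from a network, and the function name `context_vector` and the numbers are hypothetical.

```python
def context_vector(weights, values):
    """Weighted average of the value vectors: c = sum_i a_i * v_i."""
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(weights, values)) for d in range(dim)]

# Hypothetical Attention Weights (they sum to 1) and source vectors h^(s).
weights = [0.7, 0.2, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

c = context_vector(weights, values)
print(c)  # approximately [0.8, 0.3]
```

Every step is a multiplication or an addition, which is why the whole computation stays differentiable with respect to both the weights and the vectors.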

Source Target Attention / Soft Attention applied to seq2seq (Encoder-Decoder Model).

(from 「ゼロから作るDeep Learning 2」)

seq2seq
- yhayato1320.hatenablog.com

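Hard Attention

Soft Attention uses the probabilities $a_i$ directly to take a weighted average, that is, the expected value.

Hard Attention instead selects a single vector (piece of information) according to the probabilities $a_i$.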
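
A minimal sketch of this selection step, assuming the vector is drawn by sampling an index with probabilities $a_i$. The function `hard_attention`, the fixed seed, and the toy data are hypothetical.

```python
import random

def hard_attention(weights, values, rng):
    """Pick exactly one value vector, sampling its index with probabilities a_i."""
    index = rng.choices(range(len(values)), weights=weights, k=1)[0]
    return index, values[index]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = [0.7, 0.2, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

index, selected = hard_attention(weights, values, rng)
print(index, selected)  # one of the original vectors, not a blend of them
```

Unlike the weighted average, the sampling step is not differentiable, so ordinary backpropagation cannot be applied to it directly.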

Local vs Global

Local Attention

Local Attention
- Attention that selects the range to attend to
- yhayato1320.hatenablog.com

Global Attention

Global Attention
- Attention that attends to the entire available range
- yhayato1320.hatenablog.com
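
One way to picture the difference is as a mask over positions before the weights are normalized. The sketch below assumes a fixed window around a center position for the local case; how the range is actually chosen depends on the method, and the names `masked_weights`, `global_mask`, `local_mask` and the scores are hypothetical.

```python
import math

def masked_weights(scores, allowed):
    """Softmax over the allowed positions only; all other positions get weight 0."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def global_mask(length):
    """Global Attention: every position may be attended to."""
    return [True] * length

def local_mask(length, center, half_width):
    """Local Attention (assumed form): only a window around `center` is attended to."""
    return [abs(i - center) <= half_width for i in range(length)]

scores = [0.1, 2.0, 0.3, 1.5, 0.2]  # hypothetical scores for I = 5 positions

global_w = masked_weights(scores, global_mask(len(scores)))
local_w = masked_weights(scores, local_mask(len(scores), center=3, half_width=1))
print(global_w)
print(local_w)  # positions 0 and 1 fall outside the window and get weight 0
```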

Multi Head Attention

Places multiple Attention mechanisms side by side.

Used in the Transformer.

Multi Head Attention
- yhayato1320.hatenablog.com
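
A simplified sketch of the idea: each vector is split into equal slices, every head runs its own attention on one slice, and the head outputs are concatenated. The learned projection matrices that the Transformer applies before and after the heads are omitted here, which is a deliberate simplification; `attend`, `split_heads`, `multi_head_attention` and the toy data are hypothetical names and values.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product soft attention for a single Query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]

def split_heads(vector, num_heads):
    """Cut a vector into num_heads equal slices (dimension must be divisible)."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attention(query, keys, values, num_heads):
    """Run one attention per slice and concatenate the results (no projections)."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        output.extend(attend(q_heads[h],
                             [k[h] for k in k_heads],
                             [v[h] for v in v_heads]))
    return output

# Hypothetical 4-dimensional vectors split into 2 heads of size 2.
query = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

out = multi_head_attention(query, keys, values, num_heads=2)
```

Because each head sees a different slice, the heads can put their weight on different positions of the same sequence.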

Applications of Attention

Transformer / 2017

Transformer
- yhayato1320.hatenablog.com

Staircase Attention / 2021

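Uses recurrence in both the time direction (across the whole sequence) and the depth direction (across layers). This makes it possible to solve state-tracking tasks that were difficult for a conventional Transformer. With the same number of parameters, it also achieves better language-modeling performance (lower perplexity) than the Transformer.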

Staircase Attention for Recurrent Processing of Sequences
- [2021]
- arxiv.org
Which one is more important: more parameters or more computation?
- parl.ai

FlashAttention / 2022

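Makes the attention computation more efficient, without approximation, by focusing on memory IO rather than FLOPs.

It identifies data movement between the GPU's SRAM and HBM (high bandwidth memory) as the bottleneck of the attention computation and devises a computation scheme that reduces it.

With this approach, the attention computation is reported to be 2-4x faster and to use 10-20x less memory.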
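
To give a feel for why the result can stay exact while the Keys and Values are processed block by block, the sketch below computes softmax attention one block at a time with a running maximum and a running sum, and compares it with the usual all-at-once computation. This is only a CPU illustration of blockwise softmax under assumed names (`naive_attention`, `blockwise_attention`) and toy data; it is not the actual GPU kernel and says nothing about its speed or memory figures.

```python
import math

def naive_attention(query, keys, values):
    """Reference: softmax attention that looks at all scores at once."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]

def blockwise_attention(query, keys, values, block_size):
    """Same result, but Keys/Values are visited one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(q * k for q, k in zip(query, key)) for key in k_block]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(exps)
        acc = [a * rescale + sum(e * v[d] for e, v in zip(exps, v_block))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical toy data: 5 Keys/Values processed in blocks of 2.
query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(query, keys, values)
tiled = blockwise_attention(query, keys, values, block_size=2)
```

The two results agree up to floating-point rounding, so working in blocks changes how much data must be held at once but not the value that is computed.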

NeurIPS 2022 参加報告後編
- Section: Large language models / Reducing computational cost
- blog.recruit.co.jp

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [2022]
- arxiv.org

Attention Manipulation / ATMAN / 2023

AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation
- [2023]
- arxiv.org

Exponential Signal Preserving Attention / E-SPA / 2023

Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
- [2023]
- arxiv.org

Intention / 2023

A new form of computation in the KVQ (Key-Value-Query) space.

Exploring the Space of Key-Value-Query Models with Intention
- [2023]
- arxiv.org

PagedAttention / 2023

Efficient Memory Management for Large Language Model Serving with PagedAttention
- [2023]
- arxiv.org
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- blog.vllm.ai
大規模言語モデルの出力スピードを最大24倍に高めるライブラリ「vLLM」が登場、メモリ効率を高める新たな仕組み「PagedAttention」とは？
- gigazine.net

Attention Sinks / 2023

Efficient Streaming Language Models with Attention Sinks
- [2023]
- arxiv.org
Efficient Streaming Language Models with Attention Sinks (Paper Explained)
- www.youtube.com

Griffin / 2024

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
- [2024]
- arxiv.org

References

Books

生成 Deep Learning / O'Reilly
- 7.2.2 Building an attention mechanism in Keras (covers Self Attention)
- 7.2.4 Attention in encoder-decoder networks (covers Source Target Attention / Soft Attention)
- 生成 Deep Learning ―絵を描き、物語や音楽を作り、ゲームをプレイする
  - Author: David Foster
  - オライリー・ジャパン
深層学習による自然言語処理
- 4.1 Attention mechanisms (4.1.1 Soft attention / 4.1.2 Hard attention)
- 深層学習による自然言語処理 (機械学習プロフェッショナルシリーズ)
  - Authors: 坪井祐太, 海野裕也, 鈴木潤
  - 講談社
ゼロから作るDeep Learning 2
- Chapter 8 Attention (covers Source Target Attention / Soft Attention)
- ゼロから作るDeep Learning ❷ ―自然言語処理編
  - Author: 斎藤康毅
  - オライリー・ジャパン
コンピュータビジョン最前線 Winter 2021
- 5 Introduction: Vision and Language
  - 5.3 Fundamental techniques behind V&L
    - 5.3.4 Modality fusion with attention mechanisms
- コンピュータビジョン最前線 Winter 2021
  - Authors: 井尻善久, 牛久祥孝, 片岡裕雄, 藤吉弘亘
  - 共立出版

Web Site

Attention? Attention!
- Overview of the types of Attention
- lilianweng.github.io
最近の深層学習におけるAttention機構 - 名古屋CVPRML勉強会 ver. -
- speakerdeck.com

Videos

Deep Learning入門：Attention（注意）
- Self Attention
- Attention for image data
- Attention for language (sequence) data
- www.youtube.com
【速習！】Attentionから始めるTransformer超入門
- www.youtube.com

オムライスの備忘録

Notes on mathematics, statistics, machine learning, and programming

[Deep Learning] Attention #Summary