オムライスの備忘録 (Omurice's Notes)

Notes on mathematics, statistics, machine learning, and programming

[Multimodal] Field Overview #Summary

List of #Summary posts
- yhayato1320.hatenablog.com

Index
Algorithms
Tasks
Tricks and Techniques
Datasets
Research Areas
Target Data
References
- Websites

Algorithms

Text-to-Table / 2021

Text-to-Table: A New Way of Information Extraction
- [2021]
- arxiv.org

Gato / 2022

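Gato, announced by DeepMind in May 2022, is a general-purpose multimodal agent: besides text-based outputs such as image captions and dialogue, it can also carry out a variety of actions, for example playing Atari games or controlling a robot arm.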

Gato
- yhayato1320.hatenablog.com

SpeechPainter / 2022

SpeechPainter: Text-conditioned Speech Inpainting
- [2022]
- arxiv.org

IM2WAV / 2022

Image to Audio.

I Hear Your True Colors: Image Guided Audio Generation
- [2022]
- arxiv.org

SadTalker / 2022

Audio to Video (Face Motion).

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
- [2022]
- arxiv.org
- github.com

Make-A-Video3D / MAV3D / 2023

Text-To-4D Dynamic Scene Generation
- [2023]
- arxiv.org

ConceptFusion / 2023

ConceptFusion: Open-set Multimodal 3D Mapping
- [2023]
- arxiv.org

MINOTAUR / 2023

MINOTAUR: Multi-task Video Grounding From Multimodal Queries
- [2023]
- arxiv.org

Video Localized Narratives / 2023

Connecting Vision and Language with Video Localized Narratives
- [2023]
- arxiv.org

Vid2Avatar / 2023

Video to 3D.

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition
- [2023]
- arxiv.org

KOSMOS-1 / 2023

Language Is Not All You Need: Aligning Perception with Language Models
- [2023]
- arxiv.org
- github.com

Vid2Seq / 2023

Video Caption.

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- [2023]
- arxiv.org

ChatCaptioner / 2023

ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
- [2023]
- arxiv.org

Unified Visual Relationship Detection / UniVRD / 2023

Unified Visual Relationship Detection with Vision and Language Models
- [2023]
- arxiv.org

LERF / 2023

Text to 3D.

LERF: Language Embedded Radiance Fields
- [2023]
- arxiv.org
- www.lerf.io

CG3D / 2023

CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
- [2023]
- arxiv.org
- jeya-maria-jose.github.io
- github.com

MM-REACT / 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
- [2023]
- arxiv.org
- multimodal-react.github.io
- github.com
- huggingface.co

DreamBooth3D / 2023

DreamBooth3D: Subject-Driven Text-to-3D Generation
- [2023]
- arxiv.org
- dreambooth3d.github.io

Follow Your Pose / 2023

Text to Video.

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
- [2023]
- arxiv.org
- follow-your-pose.github.io

TM2D / 2023

Text + Music -> 3D Dance.

TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration
- [2023]
- arxiv.org
- garfield-kh.github.io

Verb-Focused Contrastive / VFC / 2023

Verbs in Action: Improving verb understanding in video-language models
- [2023]
- arxiv.org

Soundini / 2023

Video + Sound -> Video.

Soundini: Sound-Guided Diffusion for Natural Video Editing
- [2023]
- arxiv.org
- kuai-lab.github.io

Cond Foley Gen / 2023

Video to Audio.

Conditional Generation of Audio from Video via Foley Analogies
- [2023]
- arxiv.org
- xypb.github.io

ImageBind / 2023

IMAGEBIND: One Embedding Space To Bind Them All
- [2023]
- arxiv.org
Meta open-sources its multimodal AI "ImageBind"
- www.itmedia.co.jp

Self-Chained Video Localization-Answering / SeViLA / 2023

Self-Chained Image-Language Model for Video Localization and Question Answering
- [2023]
- arxiv.org

Tasks

Task list
- yhayato1320.hatenablog.com

Vision Language

Vision Language
- yhayato1320.hatenablog.com

Speech Language

Speech Language
- yhayato1320.hatenablog.com

Speech to Image

SadTalker / 2022

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
- [2022]
- arxiv.org
- github.com

Image to Video

Image to Video
- yhayato1320.hatenablog.com

Tricks and Techniques

Natural Language Supervision

Natural Language Supervision
- A technique that uses raw natural language, rather than labeled training data, as the supervision signal for image prediction tasks.
- yhayato1320.hatenablog.com
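
One well-known form of natural language supervision is contrastive image-text training in the style of CLIP. The sketch below is only a toy illustration under that assumption: the embeddings are hand-written lists standing in for encoder outputs, and `contrastive_loss` and the `temperature` value are hypothetical names and settings chosen for this example, not taken from the linked post.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities.

    Pair k (image k, caption k) is treated as the only positive;
    every other combination in the batch acts as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image -> text direction: each row should peak on the diagonal.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text -> image direction: same idea over the columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy stand-ins for encoder outputs (assumed values, not real data).
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
```

With these toy values, the loss for correctly paired captions (`matched`) is smaller than the loss for swapped captions (`shuffled`), which is the signal that lets free-form text act as the teacher instead of class labels.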

GAN

Translation between different modalities (multimodal) using GANs.
- yhayato1320.hatenablog.com

Data Augmentation

Data Augmentation #Summary
- Data augmentation in multimodal settings
- yhayato1320.hatenablog.com

Transformer

Transformer #Summary
- yhayato1320.hatenablog.com

Diffusion Model

Diffusion Model
- yhayato1320.hatenablog.com

Datasets

Multimodal data
- yhayato1320.hatenablog.com

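Research Areas

Representation
- How to represent and summarize the data of each modality
Translation
- How to convert data from one modality to another
Alignment
- Tasks that identify direct relationships between modalities
- For example, given data from different modalities, finding the parts that correspond to each other
Fusion
- Prediction using data from multiple modalities
- One of the longest-studied topics in multimodal learning
- e.g., Audio-Visual Speech Recognition (AVSR)
Co-learning
- Transferring predictive models, vector representations, and the like built in one modality to another modality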
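
To make the Fusion item more concrete, here is a minimal sketch of two common model-agnostic variants: early fusion (concatenate the features, then predict once) and late fusion (predict per modality, then combine). All feature values and weights are made-up illustrations, and the function names are hypothetical; a real AVSR system would use learned encoders rather than these toy linear scorers.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(features, weights, bias=0.0):
    """Weighted sum of features; stands in for a trained model."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def early_fusion(audio_feat, visual_feat, joint_weights):
    """Concatenate both modalities and score the joint vector once."""
    joint = audio_feat + visual_feat  # list concatenation
    return sigmoid(linear_score(joint, joint_weights))


def late_fusion(audio_feat, visual_feat, audio_weights, visual_weights):
    """Score each modality separately, then average the two probabilities."""
    p_audio = sigmoid(linear_score(audio_feat, audio_weights))
    p_visual = sigmoid(linear_score(visual_feat, visual_weights))
    return (p_audio + p_visual) / 2.0


# Toy features for one utterance (assumed values, not real data).
audio = [0.2, -0.1]
visual = [0.5, 0.3, 0.0]

p_early = early_fusion(audio, visual, [1.0, 0.5, 0.8, -0.2, 0.1])
p_late = late_fusion(audio, visual, [1.0, 0.5], [0.8, -0.2, 0.1])
```

Early fusion lets the model pick up interactions between audio and visual features, while late fusion keeps the per-modality models independent, which makes it easier to cope with a missing modality.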

Target Data

Verbal
- text, words, language
Visual
- image, video
Vocal
- audio

References

Multimodal Machine Learning: A Survey and Taxonomy
- [2017]
- 1 INTRODUCTION
- arxiv.org
Deep Multimodal Representation Learning: A Survey
- [2019]
- 1 INTRODUCTION
- 2 DEEP MULTIMODAL REPRESENTATION LEARNING FRAMEWORKS
- 3 TYPICAL MODELS
- https://ieeexplore.ieee.org/abstract/document/8715409
- https://ieeexplore.ieee.org/iel7/6287639/8600701/08715409.pdf
Multimodal Learning with Transformers: A Survey
- [2022]
- arxiv.org

Websites

Research Trends in Multimodal Deep Learning
- Slides by Koichiro Mori
  www.slideshare.net
Natural language processing with AI has been advancing explosively lately, so I summarized it. [Part 2]
- 5 Multimodal AI
  - 5.1 Flamingo (April 2022)
  - 5.2 Gato (May 2022)
- note.com