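Stochastic MuZero is designed to overcome the limitations that the original MuZero faced in stochastic environments, and combines Monte Carlo tree search with a learned stochastic transition model. Tested on a range of games including the 2048 puzzle, backgammon, and Go, it substantially outperforms MuZero in stochastic environments and matches or exceeds existing methods.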
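
As a rough illustration of how planning with a stochastic model differs from the deterministic case, the sketch below evaluates each action by taking an expectation over its possible random outcomes. It is a toy under stated assumptions: a hand-written outcome table stands in for the learned transition model, and a one-step lookahead stands in for the full Monte Carlo tree search. The names `Outcome`, `chance_node_value`, and `best_action` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Outcome:
    """One possible random outcome after an action (hypothetical toy type)."""
    prob: float    # probability assigned by the (assumed) transition model
    reward: float  # immediate reward for this outcome
    value: float   # estimated value of the resulting state


def chance_node_value(outcomes: List[Outcome], discount: float = 1.0) -> float:
    """Expected return of an action: sum of p * (r + discount * v)."""
    total = sum(o.prob for o in outcomes)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(o.prob * (o.reward + discount * o.value) for o in outcomes)


def best_action(
    model: Dict[str, List[Outcome]], discount: float = 1.0
) -> Tuple[str, Dict[str, float]]:
    """One-step lookahead: pick the action with the highest expected return."""
    q = {a: chance_node_value(outs, discount) for a, outs in model.items()}
    return max(q, key=q.get), q


# Toy 2048-style example (made-up numbers): after a move, a random tile appears.
toy_model = {
    "left": [Outcome(0.9, 4.0, 10.0), Outcome(0.1, 4.0, 8.0)],
    "up": [Outcome(1.0, 0.0, 13.0)],
}
action, q_values = best_action(toy_model)
print(action, q_values)
```

A deterministic model would keep only a single successor per action; averaging over outcomes is what lets the planner account for randomness such as tile spawns or dice rolls.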

Planning in Stochastic Environments with a Learned Model
- [2023]
- openreview.net

Researchers from DeepMind and University College London Propose Stochastic MuZero for Stochastic Model Learning
- www.marktechpost.com

Techniques and Tricks

Imitation Learning

Imitation Learning / 模倣学習
- yhayato1320.hatenablog.com

Meta Reinforcement Learning

Adaptive Agent / AdA / 2023

Human-Timescale Adaptation in an Open-Ended Task Space
- [2023]
- arxiv.org

Transformer

Reinforcement Learning Using Transformers
- yhayato1320.hatenablog.com

Curriculum Reinforcement Learning / CRL

GRADIENT / 2023

Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation
- [2023]
- arxiv.org

ELLM / Exploring with LLMs / 2023

Guiding Pretraining in Reinforcement Learning with Large Language Models
- [2023]
- arxiv.org

Intrinsic Performance

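Introduced to demonstrate power laws in single-agent RL.

Power laws were observed with respect to model size and the number of environment interactions.

The optimal model size as a function of compute budget also follows a power law.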
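
To make "follows a power law" concrete, the sketch below fits y = a * x^b by ordinary least squares in log-log space, where a power law becomes a straight line with slope b. The data are synthetic and the constants (2.0 and 0.5) are made up for illustration; they are not results from the paper, and `fit_power_law` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y); returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept -> multiplicative constant
    return a, b


# Synthetic data only: compute budgets and a made-up "optimal model size".
compute = [1e3, 1e4, 1e5, 1e6]
optimal_size = [2.0 * c ** 0.5 for c in compute]
a, b = fit_power_law(compute, optimal_size)
print(f"a={a:.3f}, b={b:.3f}")
```

On real measurements the points will not lie exactly on a line, so the fitted exponent should be read together with the residuals in log-log space.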

【DL輪読会】Scaling laws for single-agent reinforcement learning (DL reading group slides by Deep Learning JP)
- www.slideshare.net

Offline

Cal QL / 2023

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning
- [2023]
- arxiv.org

Synthetic Experience Replay / SynthER / 2023

Synthetic Experience Replay
- [2023]
- arxiv.org

Dataset / Benchmark

ManiSkill2 / 2023

ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills
- [2023]
- arxiv.org

Research

Can Wikipedia Help Offline Reinforcement Learning?
- [2022]
- arxiv.org

In offline reinforcement learning with Transformers, models pre-trained on text corpora can transfer to unrelated downstream tasks (e.g., Atari games).

NeurIPS 2022 Participation Report, Part 2 (Reinforcement Learning: Offline Reinforcement Learning)
- blog.recruit.co.jp

On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning
- [2022]
- arxiv.org

The Role of Baselines in Policy Gradient Optimization
- [2023]
- arxiv.org

The Phenomenon of Policy Churn
- [2022]
- arxiv.org

Applications to Natural Language Processing / NLP

Natural Language Processing
- yhayato1320.hatenablog.com

Environments and Systems

AI Economist / 2021

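AI Economist is a reinforcement learning (RL) system developed by Salesforce AI. It aims to learn dynamic tax policies that optimize productivity and equality in a simulated economy. In simulation it outperforms alternative tax systems, and it is intended to help government officials design beneficial tax policies rather than to fully replace human decision-making. The system uses a two-level deep RL simulation that models the behavior of both the government and the economic agents in pursuit of social-welfare goals, and it suggests that collaboration between humans and AI will play an important role in future policymaking.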
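
As a hedged sketch of what "optimizing productivity and equality" can mean numerically, the code below scores an income distribution as equality times productivity, with equality derived from the Gini coefficient. This particular combination is an assumption chosen for illustration; the text above only says that both quantities are optimized, and the function names are hypothetical.

```python
from typing import Sequence


def gini(incomes: Sequence[float]) -> float:
    """Gini coefficient via mean absolute difference; 0 means perfectly equal."""
    n = len(incomes)
    total = sum(incomes)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in incomes for y in incomes)
    return diff_sum / (2 * n * total)


def equality(incomes: Sequence[float]) -> float:
    """Normalized equality in [0, 1]: 1 - gini * n / (n - 1)."""
    n = len(incomes)
    if n < 2:
        return 1.0
    return 1.0 - gini(incomes) * n / (n - 1)


def productivity(incomes: Sequence[float]) -> float:
    """Total income produced by all agents."""
    return float(sum(incomes))


def social_welfare(incomes: Sequence[float]) -> float:
    """Assumed objective: equality multiplied by productivity."""
    return equality(incomes) * productivity(incomes)


# Three made-up economies with the same total income of 8.
equal_split = social_welfare([2.0, 2.0, 2.0, 2.0])
winner_takes_all = social_welfare([0.0, 0.0, 0.0, 8.0])
mixed = social_welfare([1.0, 1.0, 2.0, 4.0])
print(equal_split, winner_takes_all, mixed)
```

With total income held fixed, the score falls as income becomes more concentrated, which is the trade-off a tax policy in such a simulation has to balance against incentives to produce.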

The AI Economist: Optimal Economic Policy Design via Two-level Deep Reinforcement Learning

Salesforce AI Introduces ‘AI Economist’: A Reinforcement Learning (RL) System That Learns Dynamic Tax Policies To Optimize Equality Along With Productivity In Simulated Economies, Outperforming Alternative Tax Systems
- www.marktechpost.com