Model-based Meta-Policy Optimization (MB-MPO)

Code in this package is adapted from https://github.com/jonasrothfuss/model_ensemble_meta_learning.

Overview

MBMPO is an on-policy, model-based algorithm. At a high level, MBMPO is model-based MAML: on top of MAML, it learns an ensemble of dynamics models. The dynamics models are trained on data collected from the real environment, while the actor and critic networks are trained on imagined data generated by the dynamics models. The actor and critic are updated via the MAML algorithm. In the distributed execution plan, MBMPO alternates between training the dynamics models and training the actor and critic networks. A minimal usage sketch is shown below.
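
The sketch below shows one possible way to configure and run the MBMPO trainer from the `ray.rllib.agents.mbmpo` package in this directory (Ray 1.x-era API). It is illustrative only: the `CartPoleWrapper` environment path and the specific config values (`ensemble_size`, `inner_adaptation_steps`, `num_workers`) are assumptions for the example, not prescriptions from this README.

```python
# Minimal, illustrative sketch of running MBMPO (assumes Ray 1.x RLlib API).
import ray
from ray.rllib.agents.mbmpo import MBMPOTrainer, DEFAULT_CONFIG
# MB-MPO needs an env that exposes its reward function to the dynamics
# models; RLlib ships example wrappers for this (path assumed here).
from ray.rllib.examples.env.mbmpo_env import CartPoleWrapper

config = DEFAULT_CONFIG.copy()
config.update({
    "framework": "torch",           # MB-MPO is implemented in PyTorch only.
    "ensemble_size": 5,             # Number of dynamics models in the ensemble.
    "inner_adaptation_steps": 1,    # MAML inner-loop adaptation steps (illustrative value).
    "num_workers": 2,               # Rollout workers collecting real-environment data.
})

ray.init()
trainer = MBMPOTrainer(config=config, env=CartPoleWrapper)
for i in range(3):
    result = trainer.train()
    print(i, result["episode_reward_mean"])
trainer.stop()
ray.shutdown()
```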

More details can be found in the paper "Model-Based Reinforcement Learning via Meta-Policy Optimization" (Clavera et al., 2018, https://arxiv.org/abs/1809.05214).

Documentation & Implementation:

MBMPO.

Detailed documentation: see the MBMPO section of the RLlib algorithms documentation.

Implementation: mbmpo.py (in this package).