# To generate training data, first run:
# $ ./train.py --run=PPO --env=CartPole-v0 \
#     --stop='{"timesteps_total": 50000}' \
#     --config='{"output": "/tmp/out", "batch_mode": "complete_episodes"}'
cartpole-marwil:
  env: CartPole-v0
  run: MARWIL
  stop:
    timesteps_total: 500000
  config:
    # Works for both torch and tf.
    framework: tf
    # In order to evaluate on an actual environment, use these following
    # settings:
    evaluation_num_workers: 1
    evaluation_interval: 1
    # Evaluate on live rollouts, not the offline data.
    evaluation_config:
      input: sampler
    beta: 1.0  # Compare to behavior cloning (beta=0.0).
    # The historic (offline) data file from the PPO run (at the top).
    input: /tmp/out