cartpole-appo-vtrace-separate-losses:
    env: CartPole-v0
    run: APPO
    stop:
        episode_reward_mean: 150
        timesteps_total: 200000
    config:
        # Only works for tf|tf2 so far.
        framework: tf
        # Switch on >1 loss/optimizer API for TFPolicy and EagerTFPolicy.
        _tf_policy_handles_more_than_one_loss: true
        # APPO will produce two separate loss terms:
        # policy loss + value function loss.
        _separate_vf_optimizer: true
        # Separate learning rate for the value function branch.
        _lr_vf: 0.00075

        num_envs_per_worker: 5
        num_workers: 1
        num_gpus: 0
        observation_filter: MeanStdFilter
        num_sgd_iter: 6
        vf_loss_coeff: 0.01
        vtrace: true
        model:
            fcnet_hiddens: [32]
            fcnet_activation: linear
            # Make sure we really have completely separate branches.
            vf_share_layers: false