import ray
from ray.rllib.evaluation.metrics import get_learner_stats
from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
from ray.rllib.utils.annotations import override
from ray.rllib.utils.timer import TimerStat
from ray.rllib.utils.memory import ray_get_and_free


class AsyncGradientsOptimizer(PolicyOptimizer):
    """An asynchronous RL optimizer, e.g. for implementing A3C.

    This optimizer asynchronously pulls and applies gradients from remote
    workers, sending updated weights back as needed. This pipelines the
    gradient computations on the remote workers.
    """

    def __init__(self, workers, grads_per_step=100):
        """Initialize an async gradients optimizer.

        Arguments:
            grads_per_step (int): The number of gradients to collect and apply
                per call to step(). This number should be sufficiently
                high to amortize the overhead of calling step().
        """
        PolicyOptimizer.__init__(self, workers)

        self.apply_timer = TimerStat()
        self.wait_timer = TimerStat()
        self.dispatch_timer = TimerStat()
        self.grads_per_step = grads_per_step
        self.learner_stats = {}
        if not self.workers.remote_workers():
            raise ValueError(
                "Async optimizer requires at least 1 remote worker")

    @override(PolicyOptimizer)
    def step(self):
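        # Broadcast the current local weights once via the object store so
        # every remote worker computes gradients against the same snapshot.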
        weights = ray.put(self.workers.local_worker().get_weights())
        pending_gradients = {}
        num_gradients = 0

        # Kick off the first wave of async tasks
        for e in self.workers.remote_workers():
            e.set_weights.remote(weights)
            future = e.compute_gradients.remote(e.sample.remote())
            pending_gradients[future] = e
            num_gradients += 1
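        # Main loop: wait for any one gradient to arrive, fold it into the
        # local policy, then immediately send that worker new weights and a
        # fresh gradient task, until grads_per_step gradients are collected.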
        while pending_gradients:
            with self.wait_timer:
                wait_results = ray.wait(
                    list(pending_gradients.keys()), num_returns=1)
                ready_list = wait_results[0]
                future = ready_list[0]

                gradient, info = ray_get_and_free(future)
                e = pending_gradients.pop(future)
                self.learner_stats = get_learner_stats(info)
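            # Apply the received gradient to the local (learner) copy of the
            # policy and update the step counters from the batch size
            # reported by the worker.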
            if gradient is not None:
                with self.apply_timer:
                    self.workers.local_worker().apply_gradients(gradient)
            self.num_steps_sampled += info["batch_count"]
            self.num_steps_trained += info["batch_count"]
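            # Keep the worker busy: push it the freshly updated weights and
            # request another gradient, unless enough gradients have already
            # been scheduled for this step() call.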
            if num_gradients < self.grads_per_step:
                with self.dispatch_timer:
                    e.set_weights.remote(
                        self.workers.local_worker().get_weights())
                    future = e.compute_gradients.remote(e.sample.remote())

                    pending_gradients[future] = e
                    num_gradients += 1

    @override(PolicyOptimizer)
    def stats(self):
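        # Merge the base PolicyOptimizer stats with per-phase timings (in
        # milliseconds) and the most recent learner statistics.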
        return dict(
            PolicyOptimizer.stats(self), **{
                "wait_time_ms": round(1000 * self.wait_timer.mean, 3),
                "apply_time_ms": round(1000 * self.apply_timer.mean, 3),
                "dispatch_time_ms": round(1000 * self.dispatch_timer.mean, 3),
                "learner": self.learner_stats,
            })