from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import ray
from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
from ray.rllib.evaluation.sample_batch import SampleBatch
from ray.rllib.utils.filter import RunningStat
from ray.rllib.utils.timer import TimerStat


class SyncSamplesOptimizer(PolicyOptimizer):
"""A simple synchronous RL optimizer.
In each step, this optimizer pulls samples from a number of remote
evaluators, concatenates them, and then updates a local model. The updated
model weights are then broadcast to all remote evaluators.
"""
def _init(self, num_sgd_iter=1, timesteps_per_batch=1):
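        # Timers and running stats for the sampling and optimization phases,
        # reported later from stats().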
self.update_weights_timer = TimerStat()
self.sample_timer = TimerStat()
self.grad_timer = TimerStat()
self.throughput = RunningStat()
self.num_sgd_iter = num_sgd_iter
self.timesteps_per_batch = timesteps_per_batch
self.learner_stats = {}
def step(self):
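        # Broadcast the latest local weights to all remote evaluators so that
        # sampling happens with an up-to-date policy.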
with self.update_weights_timer:
if self.remote_evaluators:
weights = ray.put(self.local_evaluator.get_weights())
for e in self.remote_evaluators:
e.set_weights.remote(weights)
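
        # Collect sample batches (remotely if evaluators are available,
        # otherwise locally) until at least `timesteps_per_batch` timesteps
        # have been gathered, then merge them into one batch.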
with self.sample_timer:
samples = []
while sum(s.count for s in samples) < self.timesteps_per_batch:
if self.remote_evaluators:
samples.extend(
ray.get([
e.sample.remote() for e in self.remote_evaluators
]))
else:
samples.append(self.local_evaluator.sample())
samples = SampleBatch.concat_samples(samples)
self.sample_timer.push_units_processed(samples.count)
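
        # Apply `num_sgd_iter` gradient updates over the collected batch on
        # the local evaluator, keeping any learner stats it reports.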
with self.grad_timer:
for i in range(self.num_sgd_iter):
fetches = self.local_evaluator.compute_apply(samples)
if "stats" in fetches:
self.learner_stats = fetches["stats"]
if self.num_sgd_iter > 1:
print(i, fetches)
self.grad_timer.push_units_processed(samples.count)
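
        # Update the running counts of timesteps sampled and trained on.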
self.num_steps_sampled += samples.count
self.num_steps_trained += samples.count
return fetches
def stats(self):
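        # Per-phase timings (in ms) and throughputs, merged with the
        # base-class stats.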
return dict(
PolicyOptimizer.stats(self), **{
"sample_time_ms": round(1000 * self.sample_timer.mean, 3),
"grad_time_ms": round(1000 * self.grad_timer.mean, 3),
"update_time_ms": round(1000 * self.update_weights_timer.mean,
3),
"opt_peak_throughput": round(self.grad_timer.mean_throughput,
3),
"sample_peak_throughput": round(
self.sample_timer.mean_throughput, 3),
"opt_samples": round(self.grad_timer.mean_units_processed, 3),
"learner": self.learner_stats,
})
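

# A minimal usage sketch (hypothetical wiring; the evaluator construction and
# the PolicyOptimizer constructor arguments below are assumptions for
# illustration, not taken from this file):
#
#     local_ev = ...        # local evaluator holding the trained policy
#     remote_evs = [...]    # list of remote evaluator actor handles
#     optimizer = SyncSamplesOptimizer(
#         local_ev, remote_evs,
#         {"num_sgd_iter": 10, "timesteps_per_batch": 4000})
#     while optimizer.num_steps_sampled < 1000000:
#         optimizer.step()  # broadcast weights -> sample -> gradient updates
#     print(optimizer.stats())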