* Treat actor creation like a regular task.
* Small cleanups.
* Change semantics of actor resource handling.
* Bug fix.
* Minor linting
* Bug fix
* Fix jenkins test.
* Fix actor tests
* Some cleanups
* Bug fix
* Fix bug.
* Remove cached actor tasks when a driver is removed.
* Add more info to taskspec in global state API.
* Fix cyclic import bug in tune.
* Fix
* Fix linting.
* Fix linting.
* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.
* Bug fix.
* Add test for 0 CPU case
* Fix linting
* Address comments.
* Fix typos and add comment.
* Add assertion and fix test.
* fix
* add example
* Allow passing in --object-store-memory to ray start.
* Allow setting ports for the redis shards.
* Reorder arguments and infer number of shards from ports.
* Move code block into only the head node case.
* Add test.
* spillback policy implementation: global + local scheduler
* modernize global scheduler policy state; factor out random number engine and generator
* Minimal version.
* Fix test.
* Make load balancing test less strenuous.
* Expose calls to get and set the actor frontier
* Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded
* Prototype for actor checkpointing
* Filter out duplicate tasks on the local scheduler
* Clean up some of the Python checkpointing code
* More cleanups
* Documentation
* cleanup and fix unit test
* Allow remote checkpoint calls through actor handle
* Check whether object is local before reconstructing
* Enable checkpointing for distributed actor handles, refactor tests
* Fix local scheduler tests
* lint
* Address comments
* lint
* Skip tests that fail on new GCS
* style
* Don't put same object twice when setting the actor frontier
* Address Philipp's comments, cleaner fbs naming
* patch up pbt
* add pbt
* clean up test
* review
* try out a ppo example
* some tweaks to ppo example
* add postprocess hook
* clean up custom explore fn
* improve tune doc
* concepts
* update humanoid
* fix example
* show error file
Adds a Population-Based Training scheduler (as described in https://arxiv.org/abs/1711.09846) to Ray.tune. It currently mutates hyperparameters either by resampling from a user-defined list of possible values (necessary when a hyperparameter can only take certain values, e.g. sgd_batch_size) or by perturbing the current value by a factor of 0.8 or 1.2.
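A minimal sketch of that mutation rule, assuming a hypothetical `mutate_config` helper and a `hyperparam_mutations` dict; neither name is the actual Tune API:

```python
import random

def mutate_config(config, hyperparam_mutations):
    """Apply the PBT mutation rule sketched above (illustrative only)."""
    new_config = dict(config)
    for key, allowed_values in hyperparam_mutations.items():
        if allowed_values is not None:
            # Hyperparameters restricted to certain values
            # (e.g. sgd_batch_size) are resampled from the explicit list.
            new_config[key] = random.choice(allowed_values)
        else:
            # Continuous hyperparameters are perturbed by 0.8x or 1.2x.
            new_config[key] = config[key] * random.choice([0.8, 1.2])
    return new_config

# Example: sgd_batch_size may only take listed values; lr is continuous.
print(mutate_config(
    {"sgd_batch_size": 128, "lr": 1e-3},
    {"sgd_batch_size": [32, 64, 128, 256], "lr": None}))
```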
* Bring cloudpickle version 0.5.2 inside the repo.
* Use internal copy of cloudpickle everywhere.
* Fix linting.
* Import ordering.
* Change __init__.py.
* Set pickler in serialization context.
* Don't check ray location.
Remove the rllib dependency: Trainable is now a standalone abstract class that can be easily subclassed.
Clean up HyperBand: fix the debug string and add an example.
Remove the YAML API / ScriptRunner: this was never really used.
Move ray.init() out of run_experiments(): this provides greater flexibility and should be less confusing, since there is no longer an implicit init() done there. Note that this is a breaking API change for tune.
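A sketch of the resulting flow under these changes; the Trainable method names (`_setup`, `_train`) and result keys here are assumptions based on the description, not a verbatim copy of the Tune API:

```python
import ray
from ray.tune import Trainable, register_trainable, run_experiments

class MyTrainable(Trainable):
    """Standalone Trainable subclass; no rllib dependency required."""

    def _setup(self):
        self.iterations = 0

    def _train(self):
        # One unit of training; returns a result dict for the scheduler.
        self.iterations += 1
        return {"mean_accuracy": self.iterations / 100.0}

register_trainable("my_trainable", MyTrainable)

# Breaking change: ray.init() is no longer called implicitly inside
# run_experiments(), so callers now do it explicitly.
ray.init()
run_experiments({
    "my_experiment": {
        "run": "my_trainable",
        "stop": {"mean_accuracy": 0.5},
    },
})
```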
* Add failing unit test for nondeterministic reconstruction
* Retry scheduling actor tasks if reassigned to local scheduler
* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction
* Fix bug for updating checkpoint task execution dependencies
* Update comments for deterministic reconstruction
* cleanup
* Add (and skip) failing test case for nondeterministic reconstruction
* Suppress test output
* working multi action distribution and multiagent model
* currently working but the splits aren't done in the right place
* added shared models
* added categorical support and mountain car example
* now compatible with generalized advantage estimation
* working multiagent code with discrete and continuous example
* moved reshaper to utils
* code review changes made; PPO action placeholder moved to the model catalog; all multiagent code moved out of fcnet
* added examples in
* added PEP8 compliance
* examples are mostly pep8 compliant
* removed all flake errors
* added examples to jenkins tests
* fixed custom options bug
* added lines to let the Dockerfile find multiagent tests
* shortened example run length
* corrected nits
* fixed flake errors
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
Based on the current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource: e.g., if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
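A rough sketch of the target-size arithmetic described above; all names here are illustrative, not the actual autoscaler internals:

```python
import math

def target_num_workers(used_resources, total_per_node, num_nodes,
                       target_utilization_fraction, min_workers,
                       max_workers):
    """Illustrative version of the scaling rule (names are assumptions)."""
    # "Used workers" is driven by the bottleneck resource: e.g. with
    # 8/8 GPUs busy, the cluster counts as fully used even if all CPUs
    # are idle. The result may be fractional.
    used_fraction = max(
        used_resources[r] / (total_per_node[r] * num_nodes)
        for r in total_per_node)
    used_workers = used_fraction * num_nodes
    # Scale by 1 / target_utilization_fraction, round up, and clamp to
    # the [min_workers, max_workers] range.
    target = math.ceil(used_workers / target_utilization_fraction)
    return min(max(target, min_workers), max_workers)

# 8/8 GPUs used on 8 nodes, CPUs idle, 0.8 target utilization:
# ceil(8 / 0.8) = 10 nodes (subject to max_workers).
print(target_num_workers(
    {"CPU": 0, "GPU": 8}, {"CPU": 4, "GPU": 1}, 8, 0.8, 0, 20))
```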
* Adding dataframe object and minor APIs
* Adding reduce functionality
* Adding some print statements and making reduce work on current Ray
* Cleanup
* Added new functionality and docs.
* Adding more functionality.
* New functionality with older cleanup
* Complying with flake8 formatting
* Added tests and addressed reviewer comments
* Complying with flake8.
* Adding pandas to travis and requirements doc
* Fixing flake8 failures
* Fixing flake8 errors from imports
* Fixing import error
* Fixing import errors
* Addressing reviewer comments
* Addressing lint error
* trying to fix jenkins tests
* comment out more tests
* remove pytorch stuff
* use non-monotonic clock (monotonic not supported on Python 2.7)
* whitespace
This introduces the rllib.Evaluator and rllib.Optimizer classes. Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph; once an Evaluator is implemented, any Optimizer may be "plugged in" to any algorithm that implements the Evaluator interface.
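A condensed sketch of that separation; the method names below approximate the interface described and are not a verbatim copy:

```python
class Evaluator:
    """Encapsulates the model graph for one algorithm."""

    def sample(self):
        # Collect a batch of experience from the environment.
        raise NotImplementedError

    def compute_gradients(self, samples):
        raise NotImplementedError

    def apply_gradients(self, grads):
        raise NotImplementedError

    def get_weights(self):
        raise NotImplementedError

    def set_weights(self, weights):
        raise NotImplementedError

class LocalSyncOptimizer:
    """One distributed strategy; any Evaluator can be plugged in."""

    def __init__(self, local_evaluator, remote_evaluators):
        self.local = local_evaluator
        self.remotes = remote_evaluators

    def step(self):
        # Broadcast weights, gather samples, and apply gradients locally.
        for ev in self.remotes:
            ev.set_weights(self.local.get_weights())
        for batch in [ev.sample() for ev in self.remotes]:
            self.local.apply_gradients(self.local.compute_gradients(batch))
```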
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* Give error if a worker has a version mismatch for Python, Ray, or cloudpickle.
* Check version when attaching driver to cluster.
* Only do check if the version info is present.
* Bug fix.
* Fix typo.