This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
From the current (instantaneous) load information, we compute the approximate number of "used workers". This count follows the bottleneck resource: e.g., if 8/8 GPUs are in use in an 8-node cluster but all of the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop then launches new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
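For illustration, here is a minimal sketch of the target-size computation described above. The function name and signature are hypothetical, not the actual autoscaler API; min_workers is omitted since it only constrains idle-node removal.

```python
import math

def target_num_workers(num_used_workers, target_utilization_fraction,
                       max_workers):
    """Hypothetical sketch: scale the (possibly fractional) used-worker
    count by 1 / target_utilization_fraction, round up, and cap the
    result at max_workers."""
    desired = int(math.ceil(num_used_workers / target_utilization_fraction))
    return min(max_workers, desired)

# Example: 6.5 of 8 workers busy with a 0.8 utilization target
# -> ceil(6.5 / 0.8) = ceil(8.125) = 9 workers requested.
print(target_num_workers(6.5, 0.8, max_workers=20))  # 9
```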
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
* revamp saving
* smaller jpgs
* hide verbose
* Tue Dec 19 22:25:01 PST 2017
* make sure temp dirs sort lexicographically
* save total reward too
* zero-pad i
* 160x160 dqn
* ever higher res dqn
* Adding dataframe object and minor APIs
* Adding reduce functionality
* Adding some print statements and making reduce work on current Ray
* Cleanup
* Added new functionality and docs.
* Adding more functionality.
* New functionality with older cleanup
* Complying with flake8 formatting
* Added tests and addressed reviewer comments
* Complying with flake8.
* Adding pandas to Travis and the requirements doc
* Fixing flake8 failures
* Fixing flake8 errors from imports
* Fixing import error
* Fixing import errors
* Addressing reviewer comments
* Addressing lint error
* Define execution dependencies flatbuffer and add to Redis commands
* Convert TaskSpec to TaskExecutionSpec
* Add execution dependencies to Python bindings
* Submitting actor tasks now uses the execution dependency API instead of a dummy argument
* Fix dependency getters and some cleanup for fetching missing dependencies
* C++ convention
* Make TaskExecutionSpec a C++ class
* Convert local scheduler to use TaskExecutionSpec class
* Convert some pointers to references
* Finish conversion to TaskExecutionSpec class
* fix
* Fix
* Fix memory errors?
* Cast flatbuffers GetSize to size_t
* Fixes
* add more retries in global scheduler unit test
* fix linting and cast fbb.GetSize to size_t
* Style and doc
* Fix linting and simplify from_flatbuf.
* Raise exception if pyarrow is imported before ray.
* Pip install pyarrow when building doc so we don't have to mock it.
* Raise ImportError instead of Exception.
* trying to fix Jenkins tests
* comment out more tests
* remove PyTorch stuff
* use non-monotonic clock (monotonic clocks are not supported on Python 2.7)
* whitespace
This introduces the rllib.Evaluator and rllib.Optimizer classes. An Optimizer encapsulates a particular distributed optimization strategy for RL. An Evaluator encapsulates the model graph; once implemented, any Optimizer can be "plugged in" to any algorithm that implements the Evaluator interface.
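As a rough sketch of the split, the interfaces might look like the following. The method names (sample, compute_gradients, apply_gradients, step) are illustrative assumptions, not necessarily the exact rllib signatures.

```python
# Hypothetical sketch of the Evaluator/Optimizer split; method names
# are assumptions for illustration, not the exact rllib interfaces.

class Evaluator(object):
    """Encapsulates the model graph for one RL algorithm."""

    def sample(self):
        """Return a batch of experience sampled with the current policy."""
        raise NotImplementedError

    def compute_gradients(self, samples):
        """Return gradients of the loss over the given batch."""
        raise NotImplementedError

    def apply_gradients(self, grads):
        """Apply the given gradients to the local model."""
        raise NotImplementedError


class Optimizer(object):
    """Encapsulates a distributed optimization strategy for RL."""

    def __init__(self, local_evaluator, remote_evaluators):
        self.local_evaluator = local_evaluator
        self.remote_evaluators = remote_evaluators

    def step(self):
        """Run one round of optimization, e.g. collect gradients from
        the remote evaluators and apply them to the local evaluator."""
        raise NotImplementedError
```

Because the Optimizer only touches Evaluators through this narrow interface, any distributed strategy (synchronous gradient averaging, asynchronous updates, etc.) can be reused across algorithms that implement it.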
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* Check version info in ray start for non-head nodes.
* Small fix.
* Fix
* Push error to all drivers when worker has version mismatch.
* Linting
* Linting
* Fix
* Unify methods.
* Fix bug.