This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

- Based on the current (instantaneous) load information, we compute the approximate number of "used workers". This is driven by the bottleneck resource: e.g., if 8/8 GPUs are in use in an 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can be fractional.
- We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop then launches new nodes until the target cluster size is met (see the sketch below).
- When a node has been idle for more than idle_timeout_minutes, we remove it from the cluster, provided that would not drop the cluster size below min_workers.

Note that we'll need to update the wheel in the example yaml file after this PR is merged.
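The following is a minimal sketch of the sizing logic described above, not the implementation in this PR; the function and parameter names (`target_num_workers`, `resource_used`, `last_used_time_by_node`, etc.) are illustrative assumptions.

```python
import math
import time


def target_num_workers(resource_used, resource_total, num_workers,
                       target_utilization_fraction, max_workers):
    """Compute the desired cluster size from instantaneous load.

    `resource_used` / `resource_total` map resource names (e.g. "CPU",
    "GPU") to the amounts currently in use / available cluster-wide.
    """
    # "Used workers" is driven by the bottleneck resource: the fraction of
    # each resource in use, times the current number of workers. The result
    # can be fractional.
    used_workers = 0.0
    for name, total in resource_total.items():
        if total > 0:
            used_workers = max(
                used_workers,
                num_workers * resource_used.get(name, 0.0) / total)

    # Scale by 1 / target_utilization_fraction, round up, and cap at
    # max_workers.
    target = int(math.ceil(used_workers / target_utilization_fraction))
    return min(target, max_workers)


def workers_to_remove(last_used_time_by_node, num_workers,
                      idle_timeout_minutes, min_workers, now=None):
    """Pick idle workers to terminate without dropping below min_workers."""
    now = time.time() if now is None else now
    horizon = now - 60 * idle_timeout_minutes
    removable = [node for node, last_used in last_used_time_by_node.items()
                 if last_used < horizon]
    # Never shrink the cluster below min_workers.
    max_removals = max(0, num_workers - min_workers)
    return removable[:max_removals]
```

For example, with 8/8 GPUs busy across 8 workers and a target_utilization_fraction of 0.8, this yields ceil(8 / 0.8) = 10 target workers, subject to max_workers.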