Specifically, this subtracts 1 from the target number of workers, taking into
account that the head node has some computational resources.
Do not kill an idle node if doing so would drop us below the target number of
nodes (in which case we would just relaunch it immediately).
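A minimal sketch of how these two rules could fit together; the function and variable names here are illustrative, not the actual autoscaler code:

```python
import math


def target_num_workers(num_used_nodes, target_utilization_fraction,
                       min_workers, max_workers):
    # Scale instantaneous usage up by the utilization target, then subtract
    # 1 because the head node also contributes some resources.
    ideal = int(math.ceil(num_used_nodes / target_utilization_fraction)) - 1
    return max(min_workers, min(max_workers, ideal))


def can_terminate_idle_worker(num_workers, target, idle_minutes,
                              idle_timeout_minutes):
    # Never drop below the target: killing the node would only force an
    # immediate relaunch.
    return (idle_minutes > idle_timeout_minutes
            and num_workers - 1 >= target)
```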
* Google Cloud Platform scaffolding
* Add minimal gcp config example
* Add googleapiclient discoveries, update gcp.config constants
* Rename and update gcp.config key pair name function
* Implement gcp.config._configure_project
* Fix the create project get project flow
* Implement gcp.config._configure_iam_role
* Implement service account iam binding
* Implement gcp.config._configure_key_pair
* Implement rsa key pair generation
* Implement gcp.config._configure_subnet
* Save work-in-progress gcp.config._configure_firewall_rules.
These rules are likely not needed at all, but are kept here in case we
need them later.
* Remove unnecessary firewall configuration
* Update example-minimal.yaml configuration
* Add new wait_for_compute_operation, rename old wait_for_operation
* Temporarily rename autoscaler tags due to gcp incompatibility
* Implement initial gcp.node_provider.nodes
* Still missing filter support
* Implement initial gcp.node_provider.create_node
* Implement another compute wait operation
(wait_for_compute_zone_operation). TODO: figure out if we can remove the
function.
* Implement initial gcp.node_provider._node and node status functions
* Implement initial gcp.node_provider.terminate_node
* Implement node tagging and ip getter methods for nodes
* Temporarily rename tags due to gcp incompatibility
* Tiny tweaks for autoscaler.updater
* Remove unused config from gcp node_provider
* Add new example-full example to gcp, update load_gcp_example_config
* Implement label filtering for gcp.node_provider.nodes
* Revert unnecessary change in ssh command
* Revert "Temporarily rename tags due to gcp incompatibility"
This reverts commit e2fe634c5d11d705c0f5d3e76c80c37394bb23fb.
* Revert "Temporarily rename autoscaler tags due to gcp incompatibility"
This reverts commit c938ee435f4b75854a14e78242ad7f1d1ed8ad4b.
* Refactor autoscaler tagging to support multiple tag specs
* Remove missing cryptography imports
* Update quote function import
* Fix threading issue in gcp.config with the compute discovery object
* Add gcs support for log_sync
* Fix the labels/tags naming discrepancy
* Add expanduser to file_mounts hashing
* Fix gcp.node_provider.internal_ip
* Add uuid to node name
* Remove 'set -i' from updater ssh command
* Also add TODO with the context and reason for the change.
* Update ssh key creation in autoscaler.gcp.config
* Fix wait_for_compute_zone_operation's threading issue
The Google discovery API's compute object is not thread safe, and thus
needs to be recreated for each thread. This moves
`wait_for_compute_zone_operation` under `autoscaler.gcp.config` and adds
compute as its argument (see the sketch after this list).
* Address PR feedback from @ericl
* Expand local file mount paths in NodeUpdater
* Add ssh_user name to key names
* Update updater ssh to attempt 'set -i' and fall back if that fails
* Update gcp/example-full.yaml
* Fix wait crm operation in gcp.config
* Update gcp/example-minimal.yaml to match aws/example-minimal.yaml
* Fix gcp/example-full.yaml comment indentation
* Add gcp/example-full.yaml to setup files
* Update example-full.yaml command
* Revert "Refactor autoscaler tagging to support multiple tag specs"
This reverts commit 9cf48409ca2e5b66f800153853072c706fa502f6.
* Update tag spec to only use characters [0-9a-z_-]
* Change the tag values to conform gcp spec
* Add project_id in the ssh key name
* Replace '_' with '-' in autoscaler tag names
* Revert "Update updater ssh to attempt 'set -i' and fall back if that fails"
This reverts commit 23a0066c5254449e49746bd5e43b94b66f32bfb4.
* Revert "Remove 'set -i' from updater ssh command"
This reverts commit 5fa034cdf79fa7f8903691518c0d75699c630172.
* Add fallback to `set -i` in force_interactive command
* Update autoscaler tests to match current implementation
* Update GCPNodeProvider.create_node to include hash in instance name
* Add support for creating multiple instances in one create_node call
* Clean TODOs
* Update styles
* Replace single quotes with double quotes
* Some minor indentation fixes etc.
* Remove unnecessary comment. Fix indentation.
* Yapfify files that fail flake8 test
* Yapfify more files
* Update project_id handling in gcp node provider
* temporary yapf mod
* Revert "temporary yapf mod"
This reverts commit b6744e4e15d4d936d1a14f4bf155ed1d3bb14126.
* Fix autoscaler/updater.py lint error, remove unused variable
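For the `wait_for_compute_zone_operation` threading fix mentioned above, here is a hedged sketch of the per-thread discovery-client pattern; the polling details are simplified and the exact helper in `autoscaler.gcp.config` may differ:

```python
import time

from googleapiclient import discovery


def wait_for_compute_zone_operation(compute, project_id, operation, zone):
    """Poll a zone operation until it completes.

    `compute` is passed in rather than shared as a module-level object
    because the discovery client is not thread safe; each calling thread
    builds its own client.
    """
    while True:
        result = compute.zoneOperations().get(
            project=project_id, zone=zone,
            operation=operation["name"]).execute()
        if "error" in result:
            raise RuntimeError(result["error"])
        if result["status"] == "DONE":
            return result
        time.sleep(5)


# Each thread constructs its own client before polling:
#   compute = discovery.build("compute", "v1")
#   wait_for_compute_zone_operation(compute, project_id, op, zone)
```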
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
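For concreteness, a sketch of the target-size computation described above (the resource-dictionary shapes and function names are illustrative, not the exact load-metrics API):

```python
import math


def approx_used_workers(used, total, num_nodes):
    # Fraction of the bottleneck resource in use, times the cluster size,
    # e.g. 8/8 GPUs busy on 8 nodes counts as 8 used workers even if every
    # CPU is idle. The result can be fractional.
    bottleneck = max(
        used[res] / total[res] for res in total if total[res] > 0)
    return bottleneck * num_nodes


def target_cluster_size(used, total, num_nodes, target_utilization_fraction,
                        max_workers):
    used_workers = approx_used_workers(used, total, num_nodes)
    # Scale by 1 / target_utilization_fraction, round up, and cap at
    # max_workers.
    return min(max_workers,
               int(math.ceil(used_workers / target_utilization_fraction)))


# Example: 8/8 GPUs in use on an 8-node cluster with a 0.75 utilization
# target -> ceil(8 / 0.75) = 11 target workers (subject to max_workers).
print(target_cluster_size({"GPU": 8, "CPU": 0}, {"GPU": 8, "CPU": 64},
                          8, 0.75, 20))
```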
Note that we'll need to update the wheel in the example yaml file after this PR is merged.