* prepare for head node
* move command runner interface outside _private
* remove space
* Eric
* flake
* min_workers in multi node type
* fixing edge cases
* eric not idle
* fix target_workers to consider min_workers of node types
* idle timeout
* minor
* minor fix
* test
* lint
* eric v2
* eric 3
* min_workers constraint before bin packing
* Update resource_demand_scheduler.py
* Revert "Update resource_demand_scheduler.py"
This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.
* reducing diff
* make get_nodes_to_launch return a dict
* merge
* weird merge fix
* auto fill instance types for AWS
* Alex/Eric
* Update doc/source/cluster/autoscaling.rst
* merge autofill and input from user
* logger.exception
* make the yaml use the default autofill
* docs Eric
* remove test_autoscaler_yaml from windows tests
* let's try changing the test a bit
* return test
* let's see
* edward
* Limit max launch concurrency
* commenting frac TODO
* move to resource demand scheduler
* use STATUS UP TO DATE
* Eric
* make logger of gc freed refs debug instead of info
* add cluster name to docker mount prefix directory
* grrR
* fix tests
* moving docker directory to sdk
* move the import to prevent circular dependency
* small fix
* ian
* fix max launch concurrency bug: treat failing nodes as pending and consider only load_metric's connected nodes as running
* small fix
* huh?
* set initialized status for head when launching head node
* test
* patch
* fix lint
Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
* Dashboard selects port; fix dashboard possibly hanging on exit
* Add test case
* Fix
* Fix test_stats_collector.py::test_get_all_node_details
* Refine dashboard error messages
* Refine code
* Refine code
* Show last 10 lines of dashboard log if the dashboard fails to start
* Fix ValueError: too many values to unpack (expected 2) when calling getsockname
* Fix possible failure of test_multi_node_3.py::test_calling_start_ray_head
* Fix Windows CI
* Disable dashboard in C++ test
* Refine code
* Fix issue 7084
Co-authored-by: 刘宝 <po.lb@antfin.com>
* make pg creation sync
* return success immediately on pg registration
* hold on
* fix ut
* make collection for callback
* make pg registration vector
* fix new cpp ut
* fix named py ut
* fix python ut bug
* fix python ut
* fix lint
* modify comment
* fix comment
* fix comment
* add new ut and fix old lint issue
* fix comment
* update comment
* fix conflict
* Refactor placement group factory object to accept placement_group arguments instead of callables
* Convert resources to pgf
* Enable placement groups per default
* Fix tests WIP
* Fix stop/resume with placement groups
* Fix progress reporter test
* Fix trial executor tests
* Check resource for trial, not resource object
* Move ENV vars into class
* Fix tests
* Sphinx
* Wait for trial start in PBT
* Revert merge errors
* Support trial reuse with placement groups
* Better check for just staged trials
* Fix trial queuing
* Wait for pg after trial termination
* Clean up PGs before tune run
* No PG settings in pbt scheduler
* Fix buffering tests
* Skip test if ray reports erroneous available resources
* Disable PG for cluster resource counting test
* Debug output for tests
* Output in-use resources for placement groups
* Don't start new trial on trial start failure
* Add docs
* Cleanup PGs once futures returned
* Fix placement group shutdown
* Use updated_queue flag
* Apply suggestions from code review
* Apply suggestions from code review
* Update docs
* Reuse placement groups independently from actors
* Do not remove placement groups for paused trials
* Only continue enqueueing trials if it didn't fail the first time
* Rename parameter
* Fix pause trial
* Code review + try_recover
* Update python/ray/tune/utils/placement_groups.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Move placement group lifecycle management
* Move total used resources to pg manager
* Update FAQ example
* Requeue trial if start was unsuccessful
* Do not cleanup pgs at start of run
* Revert "Do not cleanup pgs at start of run"
This reverts commit 933d9c4c
* Delayed PG removal
* Fix trial requeue test
* Trigger pg cleanup on status update
* Fix tests
* Fix docs
* fix-test
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Small improvements to the Ray Cluster docs
* Update quickstart.rst
Changed title for quick start
Co-authored-by: Javier Redondo <javier@Anyscale-MacBook-Pro.local>
* Add `ray get-logs` CLI command to fetch logs and state from nodes in a cluster
* Add dataclasses for py < 3.7
* Remove dataclasses dependency in setup.py
* Rename command, print what is collected
* Remove dataclass dependency
* Typo
* Lint
* Apply suggestions from code review