* scaffold of the code
* some scratch and options change
* NCCL mostly done, supporting API#1
* interface 2.1 2.2 scratch
* put code into ray and fix some importing issues
* add an additional Rendezvous class to safely meet at named actor
* fix some small bugs in nccl_util
* some small fix
* add a Backend class to make Backend string more robust
* add several useful APIs
* add some tests
* added allreduce test
* fix typos
* fix several bugs found via unittests
* fix and update torch test
* changed back actor
* rearrange a bit before importing distributed test
* add distributed test
* remove scratch code
* auto-linting
* linting 2
* linting 2
* linting 3
* linting 4
* linting 5
* linting 6
* 2.1 2.2
* fix small bugs
* minor updates
* linting again
* auto linting
* linting 2
* final linting
* Update python/ray/util/collective_utils.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update python/ray/util/collective_utils.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Update python/ray/util/collective_utils.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* added actor test
* lint
* remove local sh
* address most of richard's comments
* minor update
* remove the actor.option() interface to avoid changes in ray core
* minor updates
Co-authored-by: YLJALDC <dal177@ucsd.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* wrote code to enable cancellation of queued non-actor tasks
* minor changes
* bug fixes
* added comments
* rev1
* linting
* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error
* bug fix
* added two unit tests
* linting
* iterating through pending_normal_tasks starting from end
* fixup! iterating through pending_normal_tasks starting from end
* fixup! fixup! iterating through pending_normal_tasks starting from end
* post merge fixes
* added debugging instructions, pulled Accept() out of guarded loop
* removed debugging instructions, linting
* prepare for head node
* move command runner interface outside _private
* remove space
* Eric
* flake
* min_workers in multi node type
* fixing edge cases
* eric not idle
* fix target_workers to consider min_workers of node types
* idle timeout
* minor
* minor fix
* test
* lint
* eric v2
* eric 3
* min_workers constraint before bin packing
* Update resource_demand_scheduler.py
* Revert "Update resource_demand_scheduler.py"
This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.
* reducing diff
* make get_nodes_to_launch return a dict
* merge
* weird merge fix
* auto fill instance types for AWS
* Alex/Eric
* Update doc/source/cluster/autoscaling.rst
* merge autofill and input from user
* logger.exception
* make the yaml use the default autofill
* docs Eric
* remove test_autoscaler_yaml from windows tests
* let's try changing the test a bit
* return test
* let's see
* edward
* Limit max launch concurrency
* commenting frac TODO
* move to resource demand scheduler
* use STATUS UP TO DATE
* Eric
* make logger of gc freed refs debug instead of info
* add cluster name to docker mount prefix directory
* grrR
* fix tests
* moving docker directory to sdk
* move the import to prevent circular dependency
* small fix
* ian
* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running
* small fix
* deflake test_joblib
* lint
* placement groups bypass
* remove space
* Eric
* first commit
* lint
* example
* documentation
* hmm
* file path fix
* fix test
* some format issue in docs
* modified docs
Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>