Commit graph

3646 commits

Author SHA1 Message Date
Amog Kamsetty
6b477dd37a
[CI] Split test_multi_node to avoid timeouts (#13712) 2021-01-26 12:06:19 -08:00
Barak Michener
0c46d09940
[ray_client]: Monitor client stream errors (#13386) 2021-01-26 10:56:56 -08:00
Ian Rodney
5d82654022
[CLI] Fix Ray Status with ENV Variable set (#13707) 2021-01-26 10:29:42 -08:00
Dmitri Gekhtman
ddcbd229ba
Rename the ray.operator module to ray.ray_operator (#13705)
* Rename ray.operator module

* mypy
2021-01-26 10:29:07 -08:00
dependabot[bot]
148b1022d6
[tune](deps): Bump autogluon-core in /python/requirements (#13698)
Bumps [autogluon-core](https://github.com/awslabs/autogluon) from 0.0.16b20210122 to 0.0.16b20210125.
- [Release notes](https://github.com/awslabs/autogluon/releases)
- [Changelog](https://github.com/awslabs/autogluon/blob/master/docs/ReleaseInstructions.md)
- [Commits](https://github.com/awslabs/autogluon/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-01-26 11:32:56 +01:00
dependabot[bot]
ef1f7e4d42
[tune](deps): Bump smart-open[s3] in /python/requirements (#13699)
Bumps [smart-open[s3]](https://github.com/piskvorky/smart_open) from 4.0.1 to 4.1.2.
- [Release notes](https://github.com/piskvorky/smart_open/releases)
- [Changelog](https://github.com/RaRe-Technologies/smart_open/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/piskvorky/smart_open/compare/4.0.1...v4.1.2)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-01-26 11:32:17 +01:00
Hao Zhang
7a78f4e959
[Collective][PR 4/6] NCCL Communicator caching and preliminary stream management (#13030)
Co-authored-by: Dacheng Li <dal177@ucsd.edu>
2021-01-26 01:05:21 -08:00
Simon Mo
8b8d6b984b
[Buildkite] Add all Python tests (#13566) 2021-01-25 16:05:59 -08:00
dependabot[bot]
0d75f37c1f
[tune](deps): Bump distributed in /python/requirements (#13643)
Bumps [distributed](https://github.com/dask/distributed) from 2020.12.0 to 2021.1.1.
- [Release notes](https://github.com/dask/distributed/releases)
- [Changelog](https://github.com/dask/distributed/blob/master/docs/release-procedure.md)
- [Commits](https://github.com/dask/distributed/compare/2020.12.0...2021.01.1)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-01-26 00:03:38 +01:00
Amog Kamsetty
9feae90e3b
skip test_spill (#13693) 2021-01-25 14:37:07 -08:00
Dmitri Gekhtman
79209110c5
[kubernetes][operator][hotfix] Dictionary fix (#13663) 2021-01-25 10:40:59 -06:00
Sven Mika
9423930bcc
[RLlib] MAML: Add cartpole mass test for PyTorch. (#13679) 2021-01-25 12:32:41 +01:00
Ameer Haj Ali
4dabf017ee
Close #12031 (Autoscaler is overriding your resource for same quantity) (#13671) 2021-01-24 16:31:53 -08:00
SangBin Cho
edbb2937d3
[Object Spilling] Multi node file spilling V2. (#13542)
* done.

* done.

* Fix a mistake.

* Ready.

* Fix issues.

* fix.

* Finished the first round of code review.

* formatting.

* In progress.

* Formatting.

* Addressed code review.

* Formatting

* Fix tests.

* fix bugs.

* Skip flaky tests for now.
2021-01-23 23:15:32 -08:00
Barak Michener
e675e5b75a
[ray_client]: Add more retry logic (#13478) 2021-01-23 23:11:39 -08:00
Ameer Haj Ali
b7dd7ddb52
deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Kai Fricke
17760e1510
[tune] update Optuna integration to 2.4.0 API (#13631)
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-01-23 00:32:37 -08:00
Amog Kamsetty
01d74af89d
[horovod] Horovod+Ray Pytorch Lightning Accelerator (#13458) 2021-01-22 16:30:10 -08:00
Amog Kamsetty
25e1b78eed
[Dependencies] Move requirements.txt to requirements directory. (#13636) 2021-01-22 16:29:05 -08:00
architkulkarni
0c3d9a3eaa
[Metrics] Fix serialization for custom metrics (#13571) 2021-01-22 14:11:59 -06:00
Dmitri Gekhtman
7fec19dad2
[kubernetes][operator][minutiae] Backwards compatibility of operator (#13623) 2021-01-22 14:07:25 -06:00
architkulkarni
da5928304a
[Metrics] Cache metrics ports in a file at each node (#13501)
* cache metric ports in a file at each node

* remove old assignment of export port

* lint

* lint

* move e2e test to top of file to avoid shutdown bug
2021-01-22 09:59:20 -08:00
Amog Kamsetty
00c14ce4a4
[Object Spilling] Skip flaky tests (#13628)
* skip flaky tests

* lint

* skip one more

* fix
2021-01-22 00:31:33 -08:00
Amog Kamsetty
39755fdb20
Revert "[Serve] Refactor BackendState" (#13626)
This reverts commit 68038741ac.
2021-01-21 23:06:15 -08:00
Ameer Haj Ali
1fbb752f42
[autoscaler] remove worker_default_node_type that is useless. (#13588) 2021-01-21 17:04:38 -08:00
Nikita Vemuri
4e01a9ec38
[Autoscaler] Ensure ubuntu is owner of docker host mount folder (#13579)
* change ownership to ubuntu if root

* use ssh user in cluster config

* formatting

Co-authored-by: Nikita Vemuri <nikitavemuri@Nikitas-MacBook-Pro.local>
2021-01-21 17:01:55 -08:00
Stephanie Wang
0998d69968
[core] Admission control for pulling objects to the local node (#13514)
* Admission control, TODO: tests, object size

* Unit tests for admission control and some bug fixes

* Add object size to object table, only activate pull if object size is known

* Some fixes, reset timer on eviction

* doc

* update

* Trigger OOM from the pull manager

* don't spam

* doc

* Update src/ray/object_manager/pull_manager.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Remove useless tests

* Fix test

* osx build

* Skip broken test

* tests

* Skip failing tests

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-01-21 16:46:42 -08:00
Amog Kamsetty
ccc901f662
add 3.8 (#13608) 2021-01-21 16:38:51 -08:00
Amog Kamsetty
20acc3b05e
Revert "Inline small objects in GetObjectStatus response. (#13309)" (#13615)
This reverts commit a82fa80f7b.
2021-01-21 16:10:34 -08:00
Dmitri Gekhtman
87ca102c93
[Kubernetes] Unit test for cluster launch and teardown using K8s Operator (#13437) 2021-01-21 12:00:37 -06:00
Ian Rodney
68038741ac
[serve] Refactor BackendState to use ReplicaState classes (#13406) 2021-01-21 11:16:02 -06:00
Clark Zinzow
a82fa80f7b
Inline small objects in GetObjectStatus response. (#13309) 2021-01-21 09:15:18 -08:00
Alex Wu
b9ac3878ae
[Autoscaler] Display node status tag in autoscaler status (#13561)
* .

* .

* .

* .

* .

* lint

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-01-20 19:20:54 -08:00
Edward Oakes
b796de4104
[metrics] Check that all tag_keys are set when recording (#13420) 2021-01-20 13:09:44 -06:00
dmatch01
fd6882176a
Fix for operator role definition to add raycluster/finalizer (#13567) 2021-01-20 13:02:02 -06:00
Eric Liang
e6412efdf5
Extra fix ray client newline (#13577) 2021-01-20 09:23:14 -08:00
Kai Fricke
6c23bef2a7
[tune] Allow actor reuse for new trials (#13549)
* Allow actor reuse for new trials

* Fix tests and update conf when starting new trial

* Move magic config to `reset_trial`
2021-01-20 11:25:33 +01:00
Daan Klijn
800304acfb
[tune] wandb - WandbLogger now also accepts wandb.data_types.Video (#13169) 2021-01-20 01:19:54 -08:00
Eric Liang
d0f224d5cf
Revert "Pipe monitor.err logs to driver" (#13574)
This reverts commit a0d08c2cc6.
2021-01-20 00:29:19 -08:00
Eric Liang
a0d08c2cc6
Pipe monitor.err logs to driver 2021-01-19 12:27:07 -08:00
Simon Mo
c963cbc038
Fix Docker Permission for Serve release test again (#13543) 2021-01-19 12:23:30 -08:00
Dmitri Gekhtman
7b4a97c610
Make AWSNodeProvider.create_node return nodes created (#13498)
* Make AWSNodeProvider.create_node return node config

* return-dict

* Node provider interface create node return type Any

* Type clarification.

* Delete debug code

* Oops reset example-full changes

* Return type specified. GCP create node returns None.

* Article
2021-01-19 12:17:46 -08:00
Amog Kamsetty
20016c983f
[Tune] MLflow Credentials (#13533) 2021-01-19 11:55:13 -08:00
Edward Oakes
9b071eb449
[metrics] Better validation for tags (#13421) 2021-01-19 13:26:51 -06:00
SangBin Cho
99375c4cfc
[Object Spilling] Remove retries and use a timer instead. (#13175) 2021-01-19 11:01:45 -08:00
Sven Mika
e74947cc94
[RLlib] Env directory cleanup and tests. (#13082) 2021-01-19 10:09:39 +01:00
Todd A. Anderson
2506a6cd0e
Remove PYTHON_MODE that is not defined in Ray so that import * will work from other packages. (#13544) 2021-01-18 23:07:01 -08:00
Richard Liaw
7a2997ea8c
[tune] support experiment checkpointing for grid search (#13357) 2021-01-18 19:24:36 -08:00
Ameer Haj Ali
1fbc3ddfac
Add ability to not start Monitor when calling ray start (#13505) 2021-01-18 18:31:53 -08:00
Simon Mo
6341f1fa2e
[Serve] Allow ObjectRef for Composition (#12592) 2021-01-18 15:26:35 -08:00