Commit graph

7362 commits

Author SHA1 Message Date
architkulkarni
ba4b7ccfe8
[Serve] [Doc] Add basic Serve tutorial (#14256) 2021-02-25 14:10:08 -06:00
Guy Khazma
e3f3269b15
[doc] Fixes to RayDP docs (#14309)
* minor fix to raydp docs

* fix pytorch and tensorflow samples

* fix: minor fixes
2021-02-25 11:23:10 -08:00
Sven Mika
6cd0cd3bd9
[RLlib + Tune] Add placement group support to RLlib. (#14289) 2021-02-25 16:01:31 +01:00
Sven Mika
8000258333
[RLlib] R2D2 Implementation. (#13933) 2021-02-25 12:18:11 +01:00
SangBin Cho
4357055305
[Shuffle] Emulate multi node in shuffle.py (#14331)
* done.

* Formatting.

* done.

* Addressed code review.

* Addressed code review 2.
2021-02-24 23:49:29 -08:00
Kai Fricke
d9e5d5f47a
[RLlib] Cast fcnet_hiddens to list for DQN models (list vs tuple mismatch error) (#14308) 2021-02-25 08:06:08 +01:00
Eric Liang
adbdacae58
add more io workers (#14330) 2021-02-24 22:00:31 -08:00
Clark Zinzow
c1a1be1da6
[Core] Locality-aware leasing: Milestone 2 - Owned refs, cached locations (#14282)
* Adds locality-aware leasing for cached owned refs.

* Add tests for locality-aware leasing on cached owned refs.
2021-02-24 21:24:10 -08:00
Hao Zhang
11e721c9b3
[Collective] Address some comments and minor updates before merging multistream (#14302) 2021-02-24 20:43:42 -08:00
Kathryn Zhou
456d9aab47
Add Cypress test for Ray Dashboard (#14253) 2021-02-24 20:41:52 -08:00
Richard Liaw
80657e5dfe
Revert "[Core]Pull off timers out of heartbeat in raylet (#13963)" (#14319) 2021-02-24 19:44:31 -08:00
ZhuSenlin
be28e8fae4
use iterator instead of operator[] to avoid garbage (#14275) 2021-02-25 11:37:36 +08:00
niole
488f63efe3
[Dashboard] Make requests sent by the dashboard reverse proxy compatible (#14012) 2021-02-24 18:31:59 -08:00
architkulkarni
ef96193b8b
fix servehandle docstring for sync/async (#14312) 2021-02-24 16:41:15 -08:00
Kai Fricke
021ed92e8a
Add debug_state.txt to cluster dump (#14310) 2021-02-24 22:47:26 +01:00
dependabot[bot]
aa36a6622d
[tune](deps): Bump xgboost in /python/requirements (#14225)
Bumps [xgboost](https://github.com/dmlc/xgboost) from 1.3.0.post0 to 1.3.3.
- [Release notes](https://github.com/dmlc/xgboost/releases)
- [Changelog](https://github.com/dmlc/xgboost/blob/master/NEWS.md)
- [Commits](https://github.com/dmlc/xgboost/commits/v1.3.3)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-02-24 13:43:19 -08:00
Richard Liaw
4dd5c9e541
[tune] fix placement group timeout (#14313)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-24 13:35:13 -08:00
Richard Liaw
fd128a4533
disable object-spilling test (#14318) 2021-02-24 12:22:25 -08:00
Clark Zinzow
c867054f0c
Skip GCS fault-tolerance test on Windows. (#14311) 2021-02-24 11:44:41 -08:00
Eric Liang
4bae0c9228
[client] Allow ignoring version mismatch with env var for debugging (#14295) 2021-02-24 11:36:16 -08:00
Ameer Haj Ali
5155673404
set STATUS_UNINITIALIZED TAG launching head node (#14293)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* huh?

* set initialized status for head when launching head node

* test

* patch

* fix lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-24 18:34:05 +02:00
dependabot[bot]
94d9e0f35d
[tune](deps): Bump torchvision from 0.8.1 to 0.8.2 in /python/requirements (#14226)
* [tune](deps): Bump torchvision in /python/requirements

Bumps [torchvision](https://github.com/pytorch/vision) from 0.8.1 to 0.8.2.
- [Release notes](https://github.com/pytorch/vision/releases)
- [Commits](https://github.com/pytorch/vision/compare/v0.8.1...v0.8.2)

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements_tune.txt

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-02-24 16:36:12 +01:00
fangfengbin
482a00278b
[GCS]Fix flaky testcase: ServiceBasedGcsClientTest (#14248) 2021-02-24 20:35:30 +08:00
Tao Wang
6af0291347
[Core]Pull off timers out of heartbeat in raylet (#13963) 2021-02-24 11:59:13 +08:00
Amog Kamsetty
739f653983
[Tune] WandbLoggerCallback compatibility with Ray Client (#14280) 2021-02-23 18:31:19 -08:00
fyrestone
5e76a51d56
[Dashboard] Select port in dashboard (#13763)
* Dashboard select port; Fix dashboard hang on exit

* Add test case

* Fix

* Fix test_stats_collector.py::test_get_all_node_details

* Refine dashboard error messages

* Refine code

* Refine code

* Show last 10 lines of dashboard log if start dashboard failed

* Fix ValueError: too many values to unpack (expected 2) when getsockname

* Fix test_multi_node_3.py::test_calling_start_ray_head may fail

* Fix Windows CI

* Disable dashboard in C++ test

* Refine code

* Fix issue 7084

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-02-23 16:27:48 -08:00
SangBin Cho
b7c56b8a71
[Core] Improve the server startup error message. (#14267)
* Improve the error message further.

* fix comment.

* Fix comment 2.

* improve messages to be even more high level.

* Address code review.
2021-02-23 16:26:06 -08:00
DK.Pino
911b028c54
[Placement Group] Make the creation of placement group sync (#13858)
* make pg creation sync

* return successful immediately when pg registration

* hold on

* fix ut

* make collection for callback

* make pg registration vector

* fix new cpp ut

* fix named py ut

* fix python ut bug

* fix python ut

* fix lint

* modify comment

* fix comment

* fix comment

* add new ut and fix old lint issue

* fix comment

* update comment

* fix conflict
2021-02-23 16:11:43 -08:00
Alex Wu
fe8a500e98
[Monitor] Log some diagnosis information on startup (#14287)
* .

* done?

* .

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-23 14:30:27 -08:00
Alex Wu
96fe6481ec
[autoscaler] fix summary when tags and instance creation aren't atomic (#14286)
* .

* done?

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-23 13:56:11 -08:00
Clark Zinzow
d344e77109
Revert "Revert "Inline small objects in GetObjectStatus response. (#13309)" (#13615)" (#13618)
This reverts commit 20acc3b05e.
2021-02-23 12:06:37 -08:00
SangBin Cho
be68a78b3f
[Object Spilling] Support multiple directories for spilling. (#14240)
* Finish the initial implementation.

* Improve the doc.

* Addressed comment.

* lint.

* f
2021-02-23 11:51:57 -08:00
Richard Liaw
acd2b202b3
[tune] fix pbt test (#14281)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-23 11:48:09 -08:00
Clark Zinzow
488ba5e1fa
[Core] Start the non-primary Redis shard port range at a high random port. (#14266) 2021-02-23 11:00:31 -08:00
SangBin Cho
77fdc274eb
[LogMonitor] Add assertion on os.kill type (#14271) 2021-02-23 10:57:09 -08:00
Simon Mo
f7232f23e1
[Tune] Fix test_convergence_gaussian_process in Buildkite (#14263) 2021-02-23 10:41:40 -08:00
Simon Mo
dfd5eb4b0d
[Core] fix gcs use-after-free from ASAN (#14199) 2021-02-23 10:37:31 -08:00
Simon Mo
ed12114d70
[Buildkite] Fix java and lint failure (#14259) 2021-02-23 10:33:45 -08:00
Kai Fricke
757866ec01
[tune] enable placement groups per default (#13906)
* Refactor placement group factory object to accept placement_group arguments instead of callables

* Convert resources to pgf

* Enable placement groups per default

* Fix tests WIP

* Fix stop/resume with placement groups

* Fix progress reporter test

* Fix trial executor tests

* Check resource for trial, not resource object

* Move ENV vars into class

* Fix tests

* Sphinx

* Wait for trial start in PBT

* Revert merge errors

* Support trial reuse with placement groups

* Better check for just staged trials

* Fix trial queuing

* Wait for pg after trial termination

* Clean up PGs before tune run

* No PG settings in pbt scheduler

* Fix buffering tests

* Skip test if ray reports erroneous available resources

* Disable PG for cluster resource counting test

* Debug output for tests

* Output in-use resources for placement groups

* Don't start new trial on trial start failure

* Add docs

* Cleanup PGs once futures returned

* Fix placement group shutdown

* Use updated_queue flag

* Apply suggestions from code review

* Apply suggestions from code review

* Update docs

* Reuse placement groups independently from actors

* Do not remove placement groups for paused trials

* Only continue enqueueing trials if it didn't fail the first time

* Rename parameter

* Fix pause trial

* Code review + try_recover

* Update python/ray/tune/utils/placement_groups.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Move placement group lifecycle management

* Move total used resources to pg manager

* Update FAQ example

* Requeue trial if start was unsuccessful

* Do not cleanup pgs at start of run

* Revert "Do not cleanup pgs at start of run"

This reverts commit 933d9c4c

* Delayed PG removal

* Fix trial requeue test

* Trigger pg cleanup on status update

* Fix tests

* Fix docs

* fix-test

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-23 18:46:02 +01:00
Eric Liang
5107eabe1b
update (#14274) 2021-02-23 18:38:06 +02:00
javi-redondo
0408fe6a69
Small improvements to the Ray Cluster docs (#14241)
* Small improvements to the Ray Cluster docs

* Update quickstart.rst

Changed title for quick start

Co-authored-by: Javier Redondo <javier@Anyscale-MacBook-Pro.local>
2021-02-23 13:44:28 +02:00
ZhuSenlin
8be107196d
fix retry leasing worker (#14272) 2021-02-23 19:38:40 +08:00
Clark Zinzow
5ce9b93f47
[Core] Ownership-based Object Directory - Enabled by default (#14254) 2021-02-22 22:09:41 -08:00
Alex Wu
79653049d2
[core] Start less worker processes (#14202) 2021-02-22 22:01:38 -08:00
Farzan Taj
cf1bc66fb1
[logging] Don't try to kill autoscaler during log monitor cleanup (#14261) 2021-02-22 21:05:04 -08:00
ZhuSenlin
8e0b2d07f4
[Core] synchronize job config to worker when it registers to raylet (#13402) 2021-02-23 11:48:54 +08:00
Kai Fricke
fcd0dee581
[cli] Add ray cluster-dump CLI command to fetch logs (#14212)
* Add `ray get-logs` CLI command to fetch logs and state from nodes in a cluster

* Add dataclasses for py < 3.7

* Remove dataclasses dependency in setup.py

* Rename command, print what is collected

* Remove dataclass dependency

* Typo

* Lint

* Apply suggestions from code review
2021-02-22 19:38:33 -08:00
Ian Rodney
54ab6d2801
set additional_properties to false (#14244) 2021-02-23 00:58:31 +02:00
dependabot[bot]
49c901e33d
[tune](deps): Bump wandb from 0.10.12 to 0.10.19 in /python/requirements (#14224)
Bumps [wandb](https://github.com/wandb/client) from 0.10.12 to 0.10.19.
- [Release notes](https://github.com/wandb/client/releases)
- [Changelog](https://github.com/wandb/client/blob/master/CHANGELOG.md)
- [Commits](https://github.com/wandb/client/compare/v0.10.12...v0.10.19)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-02-22 14:54:09 -08:00
Simon Mo
f6a8a9be59
[Serve] Add RLlib tutorial (#14194) 2021-02-22 13:23:12 -08:00