Commit graph

3968 commits

Author SHA1 Message Date
dependabot[bot]
94d9e0f35d
[tune](deps): Bump torchvision from 0.8.1 to 0.8.2 in /python/requirements (#14226)
* [tune](deps): Bump torchvision in /python/requirements

Bumps [torchvision](https://github.com/pytorch/vision) from 0.8.1 to 0.8.2.
- [Release notes](https://github.com/pytorch/vision/releases)
- [Commits](https://github.com/pytorch/vision/compare/v0.8.1...v0.8.2)

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements_tune.txt

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-02-24 16:36:12 +01:00
Amog Kamsetty
739f653983
[Tune] WandbLoggerCallback compatibility with Ray Client (#14280) 2021-02-23 18:31:19 -08:00
fyrestone
5e76a51d56
[Dashboard] Select port in dashboard (#13763)
* Dashboard select port; fix dashboard possibly hanging on exit

* Add test case

* Fix

* Fix test_stats_collector.py::test_get_all_node_details

* Refine dashboard error messages

* Refine code

* Refine code

* Show last 10 lines of dashboard log if the dashboard fails to start

* Fix ValueError: too many values to unpack (expected 2) when getsockname

* Fix test_multi_node_3.py::test_calling_start_ray_head may fail

* Fix Windows CI

* Disable dashboard in C++ test

* Refine code

* Fix issue 7084

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-02-23 16:27:48 -08:00
DK.Pino
911b028c54
[Placement Group] Make the creation of placement group sync (#13858)
* make pg creation sync

* return success immediately upon pg registration

* hold on

* fix ut

* make collection for callback

* make pg registration vector

* fix new cpp ut

* fix named py ut

* fix python ut bug

* fix python ut

* fix lint

* modify comment

* fix comment

* fix comment

* add new ut and fix old lint issue

* fix comment

* update comment

* fix conflict
2021-02-23 16:11:43 -08:00
Alex Wu
fe8a500e98
[Monitor] Log some diagnosis information on startup (#14287)
* .

* done?

* .

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-23 14:30:27 -08:00
Alex Wu
96fe6481ec
[autoscaler] fix summary when tags and instance creation aren't atomic (#14286)
* .

* done?

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-23 13:56:11 -08:00
Clark Zinzow
d344e77109
Revert "Revert "Inline small objects in GetObjectStatus response. (#13309)" (#13615)" (#13618)
This reverts commit 20acc3b05e.
2021-02-23 12:06:37 -08:00
SangBin Cho
be68a78b3f
[Object Spilling] Support multiple directories for spilling. (#14240)
* Finish the initial implementation.

* Improve the doc.

* Addressed comment.

* lint.

* f
2021-02-23 11:51:57 -08:00
Richard Liaw
acd2b202b3
[tune] fix pbt test (#14281)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-23 11:48:09 -08:00
Clark Zinzow
488ba5e1fa
[Core] Start the non-primary Redis shard port range at a high random port. (#14266) 2021-02-23 11:00:31 -08:00
SangBin Cho
77fdc274eb
[LogMonitor] Add assertion on os.kill type (#14271) 2021-02-23 10:57:09 -08:00
Simon Mo
f7232f23e1
[Tune] Fix test_convergence_gaussian_process in Buildkite (#14263) 2021-02-23 10:41:40 -08:00
Kai Fricke
757866ec01
[tune] enable placement groups per default (#13906)
* Refactor placement group factory object to accept placement_group arguments instead of callables

* Convert resources to pgf

* Enable placement groups per default

* Fix tests WIP

* Fix stop/resume with placement groups

* Fix progress reporter test

* Fix trial executor tests

* Check resource for trial, not resource object

* Move ENV vars into class

* Fix tests

* Sphinx

* Wait for trial start in PBT

* Revert merge errors

* Support trial reuse with placement groups

* Better check for just staged trials

* Fix trial queuing

* Wait for pg after trial termination

* Clean up PGs before tune run

* No PG settings in pbt scheduler

* Fix buffering tests

* Skip test if ray reports erroneous available resources

* Disable PG for cluster resource counting test

* Debug output for tests

* Output in-use resources for placement groups

* Don't start new trial on trial start failure

* Add docs

* Cleanup PGs once futures returned

* Fix placement group shutdown

* Use updated_queue flag

* Apply suggestions from code review

* Apply suggestions from code review

* Update docs

* Reuse placement groups independently from actors

* Do not remove placement groups for paused trials

* Only continue enqueueing trials if it didn't fail the first time

* Rename parameter

* Fix pause trial

* Code review + try_recover

* Update python/ray/tune/utils/placement_groups.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Move placement group lifecycle management

* Move total used resources to pg manager

* Update FAQ example

* Requeue trial if start was unsuccessful

* Do not cleanup pgs at start of run

* Revert "Do not cleanup pgs at start of run"

This reverts commit 933d9c4c

* Delayed PG removal

* Fix trial requeue test

* Trigger pg cleanup on status update

* Fix tests

* Fix docs

* fix-test

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-23 18:46:02 +01:00
Eric Liang
5107eabe1b
update (#14274) 2021-02-23 18:38:06 +02:00
Farzan Taj
cf1bc66fb1
[logging] Don't try to kill autoscaler during log monitor cleanup (#14261) 2021-02-22 21:05:04 -08:00
ZhuSenlin
8e0b2d07f4
[Core] synchronize job config to worker when it registers to raylet (#13402) 2021-02-23 11:48:54 +08:00
Kai Fricke
fcd0dee581
[cli] Add ray cluster-dump CLI command to fetch logs (#14212)
* Add `ray get-logs` CLI command to fetch logs and state from nodes in a cluster

* Add dataclasses for py < 3.7

* Remove dataclasses dependency in setup.py

* Rename command, print what is collected

* Remove dataclass dependency

* Typo

* Lint

* Apply suggestions from code review
2021-02-22 19:38:33 -08:00
Ian Rodney
54ab6d2801
set additional_properties to false (#14244) 2021-02-23 00:58:31 +02:00
dependabot[bot]
49c901e33d
[tune](deps): Bump wandb from 0.10.12 to 0.10.19 in /python/requirements (#14224)
Bumps [wandb](https://github.com/wandb/client) from 0.10.12 to 0.10.19.
- [Release notes](https://github.com/wandb/client/releases)
- [Changelog](https://github.com/wandb/client/blob/master/CHANGELOG.md)
- [Commits](https://github.com/wandb/client/compare/v0.10.12...v0.10.19)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-02-22 14:54:09 -08:00
Simon Mo
f6a8a9be59
[Serve] Add RLlib tutorial (#14194) 2021-02-22 13:23:12 -08:00
Ameer Haj Ali
da6dbb0bfc
node_ip changes while sorting the node ips based on last used (#14234) 2021-02-22 12:30:24 -08:00
Clark Zinzow
708fb6b061
[Core - Autoscaler] Upon autoscaler failure, propagate error message to all current and future drivers (#14219) 2021-02-22 12:25:42 -08:00
Antoni Baum
ffbba8e699
[Tune] Batch suggestions for HEBO (#14246)
* Batch suggestions for HEBO

* Better documentation
2021-02-22 14:24:37 +01:00
Sven Mika
3d20d58c90
[RLlib] Tune trial + checkpoint selection example. (#14209) 2021-02-22 12:52:37 +01:00
SangBin Cho
de8d9d3e44
[Test] Skip test_load_balancing_under_constrained_memory on Windows (#14242)
* Skip the Windows test.

* Remove unrelated changes.

* Remove unrelated changes.
2021-02-21 23:32:48 -08:00
Kai Yang
e75b143faf
[Core] Some small fixes and improvements (#14210) 2021-02-22 12:02:30 +08:00
Dmitri Gekhtman
090970bdf5
[autoscaler] Max worker default infinity (#14201)
* random doc typo

* max-worker-default-inf

* fix

* -1 means infinity

* doc

* comment tweak

* fix random typo

* Cluster max-worker default

* fix

* typo

* test

* Git add the test

* doc-tweak

* rest of the test logistics

* periods in doc

* Address comments

* docstring
2021-02-22 05:14:00 +02:00
Richard Liaw
9eb79727aa
[tune] Support extending BOHB/Hyperband runs past max_t (#14171)
* initial-commit-to-support

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* basic-test

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* ok

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* smoke-test

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-20 14:28:59 -08:00
Kai Yang
d8c32be449
[Core] Simplify system config passing from Raylet to workers (#13860) 2021-02-20 20:20:13 +08:00
Stephanie Wang
a4d7792c0e
[core] Fix bugs in admission control again (#14222)
* Track which pull bundle requests are ready to run

* Regression test

* Reset retry timer on pull activation, don't count created objects towards memory usage, abort objects on pull deactivation

* Revert "Track which pull bundle requests are ready to run"

This reverts commit b5d0714783fa2fc842bdd4e2d2802228e25f03c2.

* Check object active before receiving chunk

* lint

* debug, unit test, fix race condition

* lint

* update

* lint

* fix

* fix build

* fix test

* remove print

* Fix bug in bytes accounting

* Split
2021-02-19 18:07:57 -08:00
SangBin Cho
5fcbf02bae
Fix. (#14218) 2021-02-19 18:06:34 -08:00
SangBin Cho
296792f963
Revert "[core] Fix bugs in admission control (#14157)" (#14217)
This reverts commit 94a819d00e.
2021-02-19 11:58:17 -08:00
Eric Liang
6a0b306221
fix stack (#14193) 2021-02-19 11:52:40 -08:00
Eric Liang
cc156f7b3c
Fix deadlock in unhandled exception handler and re-merge (#3) (#14192) 2021-02-19 11:52:09 -08:00
Amog Kamsetty
3ffe375a09
[Tune] Raise error when PBT is used with search algorithm (#14176) 2021-02-19 09:41:30 -08:00
Kai Yang
ec344b87c7
[Core] Fix grpc server is started check (#14183) 2021-02-19 16:48:28 +08:00
Stephanie Wang
94a819d00e
[core] Fix bugs in admission control (#14157)
* Track which pull bundle requests are ready to run

* Regression test

* Reset retry timer on pull activation, don't count created objects towards memory usage, abort objects on pull deactivation

* Revert "Track which pull bundle requests are ready to run"

This reverts commit b5d0714783fa2fc842bdd4e2d2802228e25f03c2.

* Check object active before receiving chunk

* lint

* debug, unit test, fix race condition

* lint

* update

* lint

* fix

* fix build

* fix test

* remove print

* Fix bug in bytes accounting
2021-02-18 20:39:00 -08:00
Kai Yang
66f6c3944d
[Java] Re-enable remaining skipped Java test cases (#13979)
Co-authored-by: loushang.ls <loushang.ls@antfin.com>
2021-02-19 10:57:28 +08:00
SangBin Cho
8b9e0d1e6c
Add tqdm to windows build. (#14197) 2021-02-18 16:01:04 -08:00
Simon Mo
3fb6b07aea
[Buildkite] Add wheels, jars, and docker builds. (#14190) 2021-02-18 14:19:28 -08:00
Kai Fricke
a3dc92ead6
[tune] fix specifying nested metrics in progress reporter (#14189) 2021-02-18 22:26:03 +01:00
Barak Michener
50ccd41cbf
fix and test the errors, limited to pickling (#14174)
Change-Id: I95c4715c0f54b1d5909aeb8eb96403db22aa0f07
2021-02-18 11:13:15 -08:00
SangBin Cho
3ad05337f7
[Shuffle] Use progress bar for experimental.shuffle (#14179)
* done.

* Add time.
2021-02-18 11:05:54 -08:00
architkulkarni
6d88036340
[ray_client]: Skip flaky test_cancel_chain on Windows (#14167)
* skip test_cancel_chain on windows

* lint

* lint
2021-02-18 10:43:15 -08:00
SangBin Cho
66f93a3d63
Revert "Fix OSX error and re-merge unhandled exceptions handling (#14138)" (#14180)
This reverts commit ee584e8328.
2021-02-18 10:35:38 -08:00
Qing Wang
b579186791
Fix reset load_code_from_local in 2nd session. (#13985)
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2021-02-18 13:52:36 +08:00
Siyuan (Ryans) Zhuang
af8c0c1add
fix numpy ufunc serialization failures (#14143) 2021-02-17 21:28:21 -08:00
dependabot[bot]
323c7da70c
[tune](deps): Bump matplotlib from 3.3.3 to 3.3.4 in /python/requirements (#14087)
Bumps [matplotlib](https://github.com/matplotlib/matplotlib) from 3.3.3 to 3.3.4.
- [Release notes](https://github.com/matplotlib/matplotlib/releases)
- [Commits](https://github.com/matplotlib/matplotlib/compare/v3.3.3...v3.3.4)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-02-17 19:31:07 -08:00
Amog Kamsetty
be7114639d
[Tune] Update Transformers Example (#14150)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-6-151.us-west-2.compute.internal>
2021-02-17 18:37:27 -08:00
EscapeReality846089495
5ce1d262a3
[tune] Fixed atomic_save w/ os.replace (#14089)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-17 15:48:39 -08:00