9281 commits

Author SHA1 Message Date
Sven Mika
599e589481
[RLlib] Move existing fake multi-GPU learning tests into separate buildkite job. (#18065) 2021-08-31 14:56:53 +02:00
Sven Mika
4888d7c9af
[RLlib] Replay buffers: Add config option to store contents in checkpoints. (#17999) 2021-08-31 12:21:49 +02:00
Kai Fricke
012f9eb687
[buildkite] Fix jar upload directory (#18253) 2021-08-31 11:18:34 +02:00
Simon Mo
2e0b816d64
[Buildkite] Upload jars to os specific dir (#18229) 2021-08-31 09:32:01 +02:00
SangBin Cho
eab506cc37
[Test] Disable non streaming shuffle 5000 partitions (#18224)
* Disable non streaming shuffle 5000 partitions

* increase timeout for 5000 partition shuffle
2021-08-31 00:28:15 -07:00
Chen Shen
5f3ec7634b
Fix off by one test bug (#18239) 2021-08-31 00:07:03 -07:00
Clark Zinzow
e154f87cab
Added split_at_indices to DatasetPipeline. (#18243) 2021-08-31 00:06:35 -07:00
Eric Liang
db9b5f142d
Disable worker logs temporarily during driver breakpoints (#18192) 2021-08-30 20:26:16 -07:00
Stephanie Wang
8e06db7280
Revert "[Core] revert: revert Unified worker starter (#18008)" (#18228)
This reverts commit b9978dd02b.
2021-08-30 17:28:41 -07:00
Tim Hopper
fd2a8a6b9c
[docs] Fix broken urls (#18206) 2021-08-30 17:24:06 -07:00
Yi Cheng
7a65815108
[workflow] Defer input preparation until run (#18225) 2021-08-30 16:37:34 -07:00
Antoni Baum
5be6bda4cf
[tests] Add Ludwig CI test (#18126) 2021-08-30 12:27:39 -07:00
SangBin Cho
2ee1b90c17
[Core] Batch obod location updates (#18016)
* Batch impl

* done

* Remove a client pool

* in progress

* Added unit tests.

* Handle owner failure case.

* Fix unit tests

* Addressed code review.
2021-08-30 11:04:08 -07:00
SangBin Cho
dfbad8668a
Support better infra failure detection + stable flag (#18202) 2021-08-30 10:51:03 -07:00
Eric Liang
1adce7da4e
Revert "Auto discover dashboard agent port (#17855)" (#18217)
This reverts commit 53ddb551d5.
2021-08-30 10:46:37 -07:00
Yi Cheng
f579822790
[workflow] Workflow inside virtual actor (#18066) 2021-08-30 10:40:22 -07:00
Alex Wu
ca86098680
Revert "[core] Refactor test_many_tasks (#18169)" (#18216)
This reverts commit eb6fd20d53.
2021-08-30 10:35:23 -07:00
Stephanie Wang
eb6fd20d53
[core] Refactor test_many_tasks (#18169)
* Improve test

test

* lint
2021-08-30 10:33:23 -07:00
Chen Shen
7631d042bb
[Test] increase timeout for object spilling test caused by EBS cold storage issue (#18200) 2021-08-30 00:28:26 -07:00
SangBin Cho
0e968c1e82
[Core] Reduce spilling threshold (#17910)
* Lower the threshold

* ip

* Handle test failure

* lint

* last fix

* .

* Retry
2021-08-30 00:09:35 -07:00
Zhi Lin
d3786ac131
Bump Java version to 2.0.0-SNAPSHOT (#15394)
* bump java version to 2.0.0-SNAPSHOT

* update
2021-08-30 12:25:30 +08:00
fyrestone
53ddb551d5
Auto discover dashboard agent port (#17855) 2021-08-30 12:06:28 +08:00
Stephanie Wang
7bc1ef0dd9
[core] Prestart workers up to available CPU limit (#18166)
* Prestart workers according to num available CPUs

* lint

* Prestart min(available CPU, backlog)

* Fix test, adjust policy

* debug

* retry

* lint
2021-08-29 14:11:53 -07:00
Yi Cheng
d5cd95364b
[workflow] Some usability issues fixing (#18133) 2021-08-28 16:51:00 -07:00
Amog Kamsetty
3b77840c1b
PyTorch Lightning Updates (#17876) 2021-08-27 23:15:51 -07:00
Richard Liaw
0621fc49d4
serve (#18178) 2021-08-27 20:29:34 -07:00
Antoni Baum
e7bbadb920
[tune] Extend Tune Callback API (#17794)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-08-27 18:05:12 -07:00
Antoni Baum
714193ce6f
[SGDv2] Tensorboard Callback (#17824)
* [SGD] save checkpoints to disk

* fix test; add logs

* Extend SGDv2 callback API

* Move json file creation to JsonLoggerCallback

* TBXLoggerCallback

* Simplify, fix linear example

* rename log_dir to logdir for consistency with tune

* Add test

* Fix

* Break up logging classes

* Fix error

* Update type hint for results

* Refactor

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-08-27 17:50:26 -07:00
Eric Liang
95b5ad12ba
Initial version of workflow documentation (#18138) 2021-08-27 16:20:48 -07:00
Jiao
c7e38ceb10
[serve] Better constructor failure handling (#16922) 2021-08-27 18:05:22 -05:00
mwtian
26679d62c5
[Core][ObjectRef] Change default to not record call stack during ObjectRef creation (#18078) 2021-08-27 15:45:34 -07:00
Clark Zinzow
c0598de82a
[Datasets] Port write APIs to use file-based datasources. (#18135) 2021-08-27 15:24:54 -07:00
Chen Shen
28e6ae5ce0
[Test] fix object spilling 2 (#18141) 2021-08-27 13:52:42 -07:00
Clark Zinzow
aee7ba2510
[Datasets] Add from_numpy() and to_numpy() APIs (#18146) 2021-08-27 13:33:11 -07:00
Yi Cheng
ed7124663a
[workflow] Fix nested workflow with catch exception bug (#18145) 2021-08-27 10:53:15 -07:00
Chen Shen
feeb20e920
[CI][rfc] Fix flaky test_multi_node:test_cleanup_on_driver_exit 2021-08-27 10:51:01 -07:00
Joseph Suarez
8136d2912b
[RLlib] Add policies arg to callback: on_episode_step (already exists in all other episode-related callbacks) (#18119) 2021-08-27 16:12:19 +02:00
Tao Wang
7620afb8be
[Deploy]Don't start shard redis in local if we specify external redis. (#17856)
* Don't start shard redis in local if we specify external redis

* lint

* reuse primary as shard

* add test

* lint

* lint

* lint
2021-08-27 16:45:09 +08:00
SangBin Cho
a25cc47399
[Core] Set keepalive only at gcs (#18086) 2021-08-27 01:26:51 -07:00
Antoni Baum
56089ae926
[tune] Add max_concurrent_trials argument to tune.run (#17905)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-08-27 09:12:50 +02:00
xwjiang2010
cc45d3a725
[tune] Update trial resources on resume. (#17975)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-08-27 09:12:18 +02:00
Eric Liang
d52ffd926e
Add task / actor name to driver log prefix (#18105)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-26 16:18:04 -07:00
Edward Oakes
f16bb9849e
[serve] Update code sample in Ray README (#18129) 2021-08-26 16:34:31 -05:00
Edward Oakes
5c4c735119
[runtime_env] Make log message when deleting runtime_env INFO instead of ERROR (#18083) 2021-08-26 15:21:59 -05:00
Edward Oakes
6fa05ed708
[runtime_env] Better error message for working_dirs that exceed the max size (#18092) 2021-08-26 15:21:12 -05:00
Edward Oakes
3dc3f6102f
[serve] Remove unused ServeRequest codepath (#18120) 2021-08-26 15:08:00 -05:00
Dmitri Gekhtman
5608a4e441
fix (#18123) 2021-08-26 14:14:09 -04:00
Chen Shen
a7365b74e6
[CI][easy] run test_nested_id both inlined and from plasma store. (#18081) 2021-08-26 10:32:06 -07:00
SangBin Cho
405418f8e8
[Object Spilling] Unpin before updating URL (#17994)
* Unpin before updating URL

* Remove unnecessary logs.

* update compiling issue

* Check the consistent local state instead of stale information from obod.

* Fix the test

* Addressed code review.
2021-08-26 10:23:53 -07:00
architkulkarni
ea4f54f8ef
[Serve] [doc] Add model URI to deployment example (#18085) 2021-08-26 11:37:32 -05:00