Commit graph

5333 commits

Author SHA1 Message Date
Guyang Song
337005d5a5
[C++ API][hotfix] fix C++ worker dynamic library loading issue on macOS (#18877)
* fix C++ worker in macOS

* fix
2021-09-24 23:39:00 +08:00
Simon Mo
565131a854
[Serve] Support http_location=FixedNumber (#18731) 2021-09-23 15:59:12 -07:00
Simon Mo
5aa1e08633
[Serve] Exit run_forever when actor shutdown (#18820) 2021-09-23 15:17:31 -07:00
Yi Cheng
b5ccee6ad3
Skip failed actor test (#18815) 2021-09-23 11:02:02 -07:00
Kai Fricke
2d46e0e14b
[tune] Fix Analysis.dataframe() documentation and enable passing of mode=None (#18850) 2021-09-23 18:27:54 +01:00
Stephanie Wang
7b1e594412
[core] Fix bug in ref counting protocol for nested objects (#18821)
* Fix assertion crash

* test, lint

* todo

* tests

* protocol

* test

* fix

* lint

* header

* recursive

* note

* forward test

* lock

* lint

* unneeded check
2021-09-23 09:45:12 -07:00
Alex Wu
5d57eed598
[Workflow] Serialization cleanup (#18328)
* notes

* notes

* .

* seems to work?

* .

* seems to work

* needs tests

* needs tests

* parallelize uploads

* fixed

* fixed

* .

* dumb test

* .

* .

* fix festsg

* .

* works

* .:

* .

* .

* .

* Update common.py

* .

* almost removed special case for inputs

* lint

* lint

* .

* handle edge case

* .

* .

* lint

* needs dedupe

* needs dedupe

* still need to not leak cache

* still need to not leak cache

* probably fails edge cases?

* probably fails edge cases?

* works?

* cleanup

* passes test?

* ???

* done?

* may work?

* may work?

* .

* .

* Revert "."

This reverts commit 6aee40630637783d1756e226861b518668112337.

* Revert "."

This reverts commit 040a0e59e731d1f4e3b85ca2153474fc97963ae8.

* Revert "may work?"

This reverts commit fc26b54627c3c72dfdbaf0e79ba89d7503db4a94.

* Revert "may work?"

This reverts commit 85f48bb11a5c1764ef2cf3701ec41eb948fc7fc1.

* Revert "done?"

This reverts commit 573f4e0cb98417494b30c7a36987391d9bb8d064.

* passes tests

* lint

* cleanup

* bug fix

* bug fix

* print

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-09-23 09:18:59 -07:00
Carl Assmann
882f7d3863
[tune] OptunaSearch: check compatibility of search space with evaluated_rewards (#18625)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-09-23 16:00:11 +01:00
Antoni Baum
361cae4d1c
[tune] Add save and restore methods for searchers that were missing it & test (#18760) 2021-09-23 09:45:47 +01:00
Eric Liang
2c15215833
Implement zip() function for dataset (#18833) 2021-09-23 00:12:29 -07:00
Guyang Song
237a2ade76
[wheel][cpp] recover cpp extra (#18597) 2021-09-23 12:10:03 +08:00
Amog Kamsetty
d354161528
[SGD] Link ray.sgd namespace to ray.util.sgd.v2 (#18732)
* wip

* add symlink

* update

* remove from init

* no require tune

* try fix

* change

* * import

* fix docs

* address comment
2021-09-22 18:49:41 -07:00
mwtian
e41109a5e7
[Client] Use async rpc for remote call and actor creation (#18298)
* Use async rpc for remote calls, task and actor creations.

* fix

* check placement

* check placement group. wait for id in destructor

* fix

* fix exception in destructor

* Add test

* revert change

* Fix comment

* fix
2021-09-22 18:30:50 -07:00
Amog Kamsetty
00dd190df9
[SGD] Retry sgd.local_rank() (#18824)
* finish

* fix

* wip

* address comment

* update

* fix test

* fix failing test

* address comments

* fix test

* fix
2021-09-22 15:48:38 -07:00
gjoliver
e6511bcf56
Revert "Upgrade default bazel installation to ver 4.2.1 (#18714)" (#18825) 2021-09-22 13:54:48 -07:00
Amog Kamsetty
d9b166252b
Revert "[SGD] sgd.local_rank" (#18822) 2021-09-22 13:50:00 -07:00
Chen Shen
9b1cd5d1ad
Disable spill test on macOS (#18801) 2021-09-22 09:57:53 -07:00
Amog Kamsetty
39bcbe03bc
[SGD] sgd.local_rank (#18686)
* finish

* fix

* wip

* address comment

* update

* fix test

* fix failing test

* address comments

* fix test
2021-09-22 08:10:49 -07:00
Kai Fricke
bbb207c36e
[sgd/v1] Add API annotations (#18790)
* [sgd/v1] Add API annotations

* Remove unnecessary annotations
2021-09-22 08:10:28 -07:00
Kai Fricke
f86fc277d6
[tune/rllib] Only disable ipython in remote actors (#18789) 2021-09-22 11:05:06 +01:00
gjoliver
eb3620898c
Upgrade default bazel installation to ver 4.2.1 (#18714) 2021-09-22 00:24:41 -07:00
Eric Liang
cf0bd00cc2
Improve the error message for failed task/actor imports on workers (#18792) 2021-09-21 19:49:59 -07:00
Antoni Baum
3106fc5365
[tune] Deprecate max_concurrent in TuneBOHB (#18770)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-09-21 19:17:19 +01:00
architkulkarni
aa6625e62a
[Serve] gate __del__ call behind hasattr check (#18773) 2021-09-21 10:48:40 -07:00
Antoni Baum
f4666f3a6d
[tune] Add on_trial_result to ConcurrencyLimiter (#18766) 2021-09-21 15:30:02 +01:00
Antoni Baum
ca3fabc4cb
[tune] Ensure arguments passed to tune remote_run match (#18733) 2021-09-21 15:29:29 +01:00
Clark Zinzow
0704b825ff
[Datasets] Add spread resource prefix for manual round-robin resource-based task load balancing. (#18776) 2021-09-20 22:41:11 -07:00
Eric Liang
361a13602c
Actor repr for log prefix should be computed after init, not before (#18749) 2021-09-20 21:34:53 -07:00
DK.Pino
d329101469
Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." (#18735)
* fix conflict

* cxx lint
2021-09-20 20:18:12 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registering (#18009)" (#18719)" (#18722) 2021-09-20 19:17:00 -07:00
Sasha Sobol
65c1c8bb9e
Add an integration test for scheduler_avoid_gpu_nodes (#18763) 2021-09-20 17:20:42 -07:00
Jiao
9bb4a87031
[runtime_env] Add experimental job yaml (#18768) 2021-09-20 18:00:25 -05:00
Stephanie Wang
eafe6d5c79
Fix ref counting assertion check (#18752)
* Fix assertion crash

* test, lint

* todo

* x
2021-09-20 15:16:19 -07:00
Kai Fricke
cee18152f1
[tune] Remove deprecated features, promote warnings to errors (#18595) 2021-09-20 22:54:28 +01:00
Kai Fricke
2e99fb215f
[tune] Cache unstaged placement groups for potential re-use (#18706) 2021-09-20 20:23:35 +01:00
Ian Rodney
8d6ddcee53
[GCP] Add conda to the path when possible. (#18653) 2021-09-19 23:06:48 -07:00
Eric Liang
2fa9648ef0
Revert "add integration test for gpu scheduling/avoidance (#18729)" (#18754)
This reverts commit 57edc0c607.
2021-09-19 17:05:05 -07:00
Dmitri Gekhtman
ffe533b297
[autoscaler] Log ips and ids when terminating nodes, code structure (#18180)
* recovery failure uses same termination function

* More cleanup

* More cleanup

* ips

* wip

* wip

* wip

* Fix tests

* tweak
2021-09-19 18:44:38 -04:00
xwjiang2010
5551cdac19
[Tune] Break from loop after warning msg is logged. (#18720) 2021-09-18 16:33:44 -07:00
mwtian
32f71765e9
[Client] Allow Client{Object,Actor}Ref to accept a future. (#18677)
* Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously.

* rename callback, fix
2021-09-18 16:32:02 -07:00
Sasha Sobol
57edc0c607
add integration test for gpu scheduling/avoidance (#18729) 2021-09-18 01:32:18 -07:00
Chen Shen
eab1d28fd3
fix test (#18737) 2021-09-18 00:57:34 -07:00
Jiao
948508efb8
[Serve] Add checkpoint options and custom storage option (#18657) 2021-09-18 00:04:29 -07:00
DK.Pino
4ef8fd6942
remove the legacy retry mechanism (#18589) 2021-09-18 11:11:19 +08:00
Amog Kamsetty
0211101e6f
[SGD] Redo Class API (#18728)
* wip

* wip

* add horovod example

* add example

* lint

* fix

* address comments

* updates

* lint

* update example

* address comment

* address comment

* update

* fix

* Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* add back name mangling

* fix tests

* Update python/ray/util/sgd/v2/trainer.py

* fix

* lint

* fix

* fix docstring

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* update

* fix failing test

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-17 18:17:51 -07:00
Clark Zinzow
aaa097c293
[Datasets] Shuffled data loading support (#18678) 2021-09-17 16:08:53 -07:00
Simon Mo
f2ea6c4e68
[Serve] Call Callable.__del__ explicit during graceful shutdown (#18446) 2021-09-17 15:19:57 -07:00
Chris K. W
8858489e2f
[client] let ray client reconnect on grpc failures (#18329)
* wip

* client tests working again

* extra prints

* start reconnect logic for proxier

* local proxy more wip

* delay cleanup logic working on proxy

* Fix up dataservicer logic

* lint + fix proxy data servicer exit logic

* hmmm

* delay cleanup always in dataservicer

* fix last_seen check

* cancel channel on error

* explicitly request cleanup

* cleanup request fixes

* fix dataclient proxy

* start idempotence logic

* change default channel state

* add backoff logic

* move connection logic back into worker.__init__

* add logic for replay cache case where request was received but response hasn't been fully resolved

* new proto entries for data stream caching

* start replay_cache logic, increase cleanup delay

* hardcode retries

* Let data channel attempt reconnects

* manually reset queue, remove replay_cache logic

* reduce cleanup delay to 5 minutes

* fix local tests

* Remove async cache logic

* retry async requests

* simplify backoff logic

* Fix ray client proto

* Configurable reconnect grace period

* Basic logsclient fix?

* Configure grace through environment variable

* Use stopped event to force faster datapath cleanup

* Better connect+reconnect logic

* fix reconnect_grace_period default

* init fixes for reconnect_grace_period

* cleanup

* fix _get_client_id_from_context call

* add logic for pathological cache cases

* less intrusive data channel error message

* fix tests

* Make stuff less painful to read

* add ordered replay cache for dataservicer, replay cache tests

* fix ordering import, start_reconnect test

* add middleman testing logic

* enforce ordering of dataclient requests

* retry wheels

* grace period through env only, restore test_dataclient_disconnect

* minor fixes

* force rerun

* less intrusive error msgs

* address review

* replay->response cache

* remove unneeded sleep

* typing

* extra response cache test

* fix error msg

* remove TODO

* add _reconnect_channel

* add grace period test

* store thread_id and req_id in metadata

* Revert "store thread_id and req_id in metadata"

This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.

* Revert "Revert "store thread_id and req_id in metadata""

This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.

* fix metadata check

* remove comment

* removed unused cv

* cast back to int

* refactor Datapath for readability

* Revert refactor

This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.

* fix comment

* merge fixes

* refactor _shutdown

* address reviews

* log errors in both cases

* add comments

* address reviews

* move reconnect test to medium

* Always propagate error to callbacks

* readability

* formatting

* Faster cleanup on uncaught dataservicer errors

* delete tmp file

* offset commit

* refactor

* propagate data servicer error message

* Stricter handling/propagation of errors

* remove tmp file

* better docs

* forward reconnecting metadata

* add annotation

* fix invalidate + add test

* fix docstrings and types

* disable retries and caching if reconnect grace period is set to 0

* update comments

* address review, increase ack batch size and skip ack's if reconnect isn't enabled

* Don't terminate data stream on missing reconnecting metadata
2021-09-18 01:11:00 +03:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registering (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
xwjiang2010
9c8c6c09cb
Revert "[SGD] v2 Class API (#18571)" (#18715)
This reverts commit de050e8187.
2021-09-17 10:34:36 -07:00