Antoni Baum
3106fc5365
[tune] Depreciate max_concurrent in TuneBOHB ( #18770 )
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-09-21 19:17:19 +01:00
architkulkarni
aa6625e62a
[Serve] gate __del__ call behind hasattr check ( #18773 )
2021-09-21 10:48:40 -07:00
Antoni Baum
f4666f3a6d
[tune] Add on_trial_result to ConcurrencyLimiter ( #18766 )
2021-09-21 15:30:02 +01:00
Antoni Baum
ca3fabc4cb
[tune] Ensure arguments passed to tune remote_run match ( #18733 )
2021-09-21 15:29:29 +01:00
Yi Cheng
fc6a739e4b
[nightly] Deflaky nightly test many_nodes_actor_test ( #18582 )
2021-09-20 22:43:48 -07:00
Clark Zinzow
0704b825ff
[Datasets] Add spread resource prefix for manual round-robin resource-based task load balancing. ( #18776 )
2021-09-20 22:41:11 -07:00
Eric Liang
361a13602c
Actor repr for log prefix should be computed after init, not before ( #18749 )
2021-09-20 21:34:53 -07:00
DK.Pino
d329101469
Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." ( #18735 )
* fix conflict
* cxx lint
2021-09-20 20:18:12 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registerring ( #18009 )" ( #18719 )" ( #18722 )
2021-09-20 19:17:00 -07:00
Ameer Haj Ali
9efbd80733
[core] avoid scheduling on gpu nodes by default ( #18743 )
* [core] avoid scheduling on gpu nodes by default
* Fix cluster_task_manager_test tests.
Made most tests in cluster_task_manager_test not use GPU on the head node.
Also added another test to scheduling_policy_test.
Co-authored-by: Sasha Sobol <sasha@asobol.com>
2021-09-20 17:38:40 -07:00
Sasha Sobol
65c1c8bb9e
Add an integration test for scheduler_avoid_gpu_nodes ( #18763 )
2021-09-20 17:20:42 -07:00
Jiao
9bb4a87031
[runtime_env] Add experimental job yaml ( #18768 )
2021-09-20 18:00:25 -05:00
Stephanie Wang
eafe6d5c79
Fix ref counting assertion check ( #18752 )
* Fix assertion crash
* test, lint
* todo
* x
2021-09-20 15:16:19 -07:00
Kai Fricke
cee18152f1
[tune] Remove deprecated features, promote warnings to errors ( #18595 )
2021-09-20 22:54:28 +01:00
gjoliver
5b6d69d61a
Minor change to switch result checking order so there is no artificial delay. ( #18555 )
Co-authored-by: Jun Gong <jungong@mbpro.local>
2021-09-20 22:49:17 +01:00
Simon Mo
29f89d8af7
[Serve] Doc: Mock ray.serve.generated package for doc building ( #18767 )
2021-09-20 14:33:33 -07:00
Kai Fricke
2e99fb215f
[tune] Cache unstaged placement groups for potential re-use ( #18706 )
2021-09-20 20:23:35 +01:00
Sven Mika
e6aae61487
[RLlib; testing] Fix bug in stress tests not handling >1 trials per experiment (due to grid-search in IMPALA stress tests). ( #18705 )
2021-09-20 15:31:57 +02:00
Ian Rodney
8d6ddcee53
[GCP] Add conda to the path when possible. ( #18653 )
2021-09-19 23:06:48 -07:00
Eric Liang
85aaca8d45
Update the contribution guide / style guide ( #18753 )
2021-09-19 20:14:51 -07:00
Chen Shen
b321abc560
[Core] fix another thread safety issue in instrumented_io_context
2021-09-19 17:44:31 -07:00
Eric Liang
2fa9648ef0
Revert "add integration test for gpu scheduling/avoidance ( #18729 )" ( #18754 )
This reverts commit 57edc0c607.
2021-09-19 17:05:05 -07:00
Dmitri Gekhtman
ffe533b297
[autoscaler] Log ips and ids when terminating nodes, code structure ( #18180 )
* recovery failure uses same termination function
* More cleanup
* More cleanup
* ips
* wip
* wip
* wip
* Fix tests
* tweak
2021-09-19 18:44:38 -04:00
Chen Shen
35aa944ef4
Fix thread-safety in global state accessor ( #18746 )
2021-09-19 12:01:31 -07:00
xwjiang2010
5551cdac19
[Tune] Break from loop after warning msg is logged. ( #18720 )
2021-09-18 16:33:44 -07:00
mwtian
32f71765e9
[Client] Allow Client{Object,Actor}Ref to accept a future. ( #18677 )
* Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously.
* rename callback, fix
2021-09-18 16:32:02 -07:00
Eric Liang
d6ff390858
Task failure should not log error ( #18742 )
2021-09-18 13:26:32 -07:00
Sasha Sobol
57edc0c607
add integration test for gpu scheduling/avoidance ( #18729 )
2021-09-18 01:32:18 -07:00
qicosmos
64c25987f3
[C++ Worker]Simple kv store example ( #18613 )
2021-09-18 16:02:44 +08:00
Chen Shen
eab1d28fd3
fix test ( #18737 )
2021-09-18 00:57:34 -07:00
Qing Wang
6f1d3f94db
Publish actor state PENDING_CREATION for dashboard showing. ( #18666 )
2021-09-18 15:44:58 +08:00
Jiao
948508efb8
[Serve] Add checkpoint options and custom storage option ( #18657 )
2021-09-18 00:04:29 -07:00
DK.Pino
4ef8fd6942
remove the legacy retry mechanism ( #18589 )
2021-09-18 11:11:19 +08:00
mwtian
efdbfcfdfb
[Build] Generate Bazel config for compiling with clang and libc++ in CI ( #18622 )
* Add Bazel config for building with llvm. Upgrade C++ std to 17.
* Fix redis. Try fixing asan and tsan
* Fix asan and format
* Update comments.
Co-authored-by: Chen Shen <scv119@gmail.com>
2021-09-17 19:01:07 -07:00
Amog Kamsetty
0211101e6f
[SGD] Redo Class API ( #18728 )
* wip
* wip
* add horovod example
* add example
* lint
* fix
* address comments
* updates
* lint
* update example
* address comment
* address comment
* update
* fix
* Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* address comments
* add back name mangling
* fix tests
* Update python/ray/util/sgd/v2/trainer.py
* fix
* lint
* fix
* fix docstring
* Update python/ray/util/sgd/v2/tests/test_trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* update
* fix failing test
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-17 18:17:51 -07:00
Clark Zinzow
aaa097c293
[Datasets] Shuffled data loading support ( #18678 )
2021-09-17 16:08:53 -07:00
Simon Mo
f2ea6c4e68
[Serve] Call Callable.__del__ explicit during graceful shutdown ( #18446 )
2021-09-17 15:19:57 -07:00
Chris K. W
8858489e2f
[client] let ray client reconnect on grpc failures ( #18329 )
* wip
* client tests working again
* extra prints
* start reconnect logic for proxier
* local proxy more wip
* delay cleanup logic working on proxy
* Fix up dataservicer logic
* lint + fix proxy data servicer exit logic
* hmmm
* delay cleanup always in dataservicer
* fix last_seen check
* cancel channel on error
* explicitly request cleanup
* cleanup request fixes
* fix dataclient proxy
* start idempotence logic
* change default channel state
* add backoff logic
* move connection logic back into worker.__init__
* add logic for replay cache case where request was received but response hasn't been fully resolved
* new proto entries for data stream caching
* start replay_cache logic, increase cleanup delay
* hardcode retries
* Let data channel attempt reconnects
* manually reset queue, remove replay_cache logic
* reduce cleanup delay to 5 minutes
* fix local tests
* Remove async cache logic
* retry async requests
* simplify backoff logic
* Fix ray client proto
* Configurable reconnect grace period
* Basic logsclient fix?
* Configure grace through environment variable
* Use stopped event to force faster datapath cleanup
* Better connect+reconnect logic
* fix reconnect_grace_period default
* init fixes for reconnect_grace_period
* cleanup
* fix _get_client_id_from_context call
* add logic for pathological cache cases
* less intrusive data channel error message
* fix tests
* Make stuff less painful to read
* add ordered replay cache for dataservicer, replay cache tests
* fix ordering import, start_reconnect test
* add middleman testing logic
* enforce ordering of dataclient requests
* retry wheels
* grace period through env only, restore test_dataclient_disconnect
* minor fixes
* force rerun
* less intrusive error msgs
* address review
* replay->response cache
* remove unneeded sleep
* typing
* extra response cache test
* fix error msg
* remove TODO
* add _reconnect_channel
* add grace period test
* store thread_id and req_id in metadata
* Revert "store thread_id and req_id in metadata"
This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.
* Revert "Revert "store thread_id and req_id in metadata""
This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.
* fix metadata check
* remove comment
* removed unused cv
* cast back to int
* refactor Datapath for readability
* Revert refactor
This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.
* fix comment
* merge fixes
* refactor _shutdown
* address reviews
* log errors in both cases
* add comments
* address reviews
* move reconnect test to medium
* Always propagate error to callbacks
* readability
* formatting
* Faster cleanup on uncaught dataservicer errors
* delete tmp file
* offset commit
* refactor
* propagate data servicer error message
* Stricter handling/propagation of errors
* remove tmp file
* better docs
* forward reconnecting metadata
* add annotation
* fix invalidate + add test
* fix docstrings and types
* disable retries and caching if reconnect grace period is set to 0
* update comments
* address review, increase ack batch size and skip ack's if reconnect isn't enabled
* Don't terminate data stream on missing reconnecting metadata
2021-09-18 01:11:00 +03:00
Jiajun Yao
ffe7108eae
Fix cpp api doc ( #18671 )
2021-09-17 14:01:23 -07:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registerring ( #18009 )" ( #18719 )
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
xwjiang2010
09e760a1fd
[Release] Change all cpus_per_actor in xgboost test. ( #18717 )
2021-09-17 12:57:21 -07:00
xwjiang2010
2c92f737f9
Fix dask_xgboost_test ( #18713 )
2021-09-17 11:25:54 -07:00
xwjiang2010
9c8c6c09cb
Revert "[SGD] v2 Class API ( #18571 )" ( #18715 )
This reverts commit de050e8187.
2021-09-17 10:34:36 -07:00
Yi Cheng
8ce01ea2cc
[core] Async submitting actor registerring ( #18009 )
2021-09-17 10:03:35 -07:00
Clark Zinzow
1da83c828c
[Datasets] Properly support fs inference on path with space. ( #18644 )
2021-09-17 10:02:43 -07:00
Sven Mika
fd13bac9b3
[RLlib] Add worker arg (optional) to policy_mapping_fn. ( #18184 )
2021-09-17 12:07:11 +02:00
Guyang Song
89ce8a3a02
support 'CustomFields' tooltip in dashboard ( #18698 )
2021-09-17 17:48:32 +08:00
architkulkarni
a9cce8a34b
[serve] Add basic calculate_desired_num_replicas function for autoscaling ( #18658 )
2021-09-17 00:18:51 -07:00
Qing Wang
11291029b1
Add Codeowners for Java API. ( #18663 )
2021-09-17 14:48:55 +08:00
Simon Mo
3029812b8b
[Serve] Autoscaling metric store take 2 ( #18683 )
2021-09-16 22:28:13 -07:00