Clark Zinzow
|
0704b825ff
|
[Datasets] Add spread resource prefix for manual round-robin resource-based task load balancing. (#18776)
|
2021-09-20 22:41:11 -07:00 |
|
Eric Liang
|
361a13602c
|
Actor repr for log prefix should be computed after init, not before (#18749)
|
2021-09-20 21:34:53 -07:00 |
|
DK.Pino
|
d329101469
|
Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." (#18735)
* fix conflict
* cxx lint
|
2021-09-20 20:18:12 -07:00 |
|
Yi Cheng
|
07babd807c
|
Revert "Revert "[core] Async submitting actor registerring (#18009)" (#18719)" (#18722)
|
2021-09-20 19:17:00 -07:00 |
|
Sasha Sobol
|
65c1c8bb9e
|
Add an integration test for scheduler_avoid_gpu_nodes (#18763)
|
2021-09-20 17:20:42 -07:00 |
|
Jiao
|
9bb4a87031
|
[runtime_env] Add experimental job yaml (#18768)
|
2021-09-20 18:00:25 -05:00 |
|
Stephanie Wang
|
eafe6d5c79
|
Fix ref counting assertion check (#18752)
* Fix assertion crash
* test, lint
* todo
* x
|
2021-09-20 15:16:19 -07:00 |
|
Kai Fricke
|
cee18152f1
|
[tune] Remove deprecated features, promote warnings to errors (#18595)
|
2021-09-20 22:54:28 +01:00 |
|
Kai Fricke
|
2e99fb215f
|
[tune] Cache unstaged placement groups for potential re-use (#18706)
|
2021-09-20 20:23:35 +01:00 |
|
Ian Rodney
|
8d6ddcee53
|
[GCP] Add conda to the path when possible. (#18653)
|
2021-09-19 23:06:48 -07:00 |
|
Eric Liang
|
2fa9648ef0
|
Revert "add integration test for gpu scheduling/avoidance (#18729)" (#18754)
This reverts commit 57edc0c607.
|
2021-09-19 17:05:05 -07:00 |
|
Dmitri Gekhtman
|
ffe533b297
|
[autoscaler] Log ips and ids when terminating nodes, code structure (#18180)
* recovery failure uses same termination function
* More cleanup
* More cleanup
* ips
* wip
* wip
* wip
* Fix tests
* tweak
|
2021-09-19 18:44:38 -04:00 |
|
xwjiang2010
|
5551cdac19
|
[Tune] Break from loop after warning msg is logged. (#18720)
|
2021-09-18 16:33:44 -07:00 |
|
mwtian
|
32f71765e9
|
[Client] Allow Client{Object,Actor}Ref to accept a future. (#18677)
* Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously.
* rename callback, fix
|
2021-09-18 16:32:02 -07:00 |
|
Sasha Sobol
|
57edc0c607
|
add integration test for gpu scheduling/avoidance (#18729)
|
2021-09-18 01:32:18 -07:00 |
|
Chen Shen
|
eab1d28fd3
|
fix test (#18737)
|
2021-09-18 00:57:34 -07:00 |
|
Jiao
|
948508efb8
|
[Serve] Add checkpoint options and custom storage option (#18657)
|
2021-09-18 00:04:29 -07:00 |
|
DK.Pino
|
4ef8fd6942
|
remove the legacy retry mechanism (#18589)
|
2021-09-18 11:11:19 +08:00 |
|
Amog Kamsetty
|
0211101e6f
|
[SGD] Redo Class API (#18728)
* wip
* wip
* add horovod example
* add example
* lint
* fix
* address comments
* updates
* lint
* update example
* address comment
* address comment
* update
* fix
* Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* address comments
* add back name mangling
* fix tests
* Update python/ray/util/sgd/v2/trainer.py
* fix
* lint
* fix
* fix docstring
* Update python/ray/util/sgd/v2/tests/test_trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* update
* fix failing test
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
|
2021-09-17 18:17:51 -07:00 |
|
Clark Zinzow
|
aaa097c293
|
[Datasets] Shuffled data loading support (#18678)
|
2021-09-17 16:08:53 -07:00 |
|
Simon Mo
|
f2ea6c4e68
|
[Serve] Call Callable.__del__ explicit during graceful shutdown (#18446)
|
2021-09-17 15:19:57 -07:00 |
|
Chris K. W
|
8858489e2f
|
[client] let ray client reconnect on grpc failures (#18329)
* wip
* client tests working again
* extra prints
* start reconnect logic for proxier
* local proxy more wip
* delay cleanup logic working on proxy
* Fix up dataservicer logic
* lint + fix proxy data servicer exit logic
* hmmm
* delay cleanup always in dataservicer
* fix last_seen check
* cancel channel on error
* explicitly request cleanup
* cleanup request fixes
* fix dataclient proxy
* start idempotence logic
* change default channel state
* add backoff logic
* move connection logic back into worker.__init__
* add logic for replay cache case where request was received but response hasn't been fully resolved
* new proto entries for data stream caching
* start replay_cache logic, increase cleanup delay
* hardcode retries
* Let data channel attempt reconnects
* manually reset queue, remove replay_cache logic
* reduce cleanup delay to 5 minutes
* fix local tests
* Remove async cache logic
* retry async requests
* simplify backoff logic
* Fix ray client proto
* Configurable reconnect grace period
* Basic logsclient fix?
* Configure grace through environment variable
* Use stopped event to force faster datapath cleanup
* Better connect+reconnect logic
* fix reconnect_grace_period default
* init fixes for reconnect_grace_period
* cleanup
* fix _get_client_id_from_context call
* add logic for pathological cache cases
* less intrusive data channel error message
* fix tests
* Make stuff less painful to read
* add ordered replay cache for dataservicer, replay cache tests
* fix ordering import, start_reconnect test
* add middleman testing logic
* enforce ordering of dataclient requests
* retry wheels
* grace period through env only, restore test_dataclient_disconnect
* minor fixes
* force rerun
* less intrusive error msgs
* address review
* replay->response cache
* remove unneeded sleep
* typing
* extra response cache test
* fix error msg
* remove TODO
* add _reconnect_channel
* add grace period test
* store thread_id and req_id in metadata
* Revert "store thread_id and req_id in metadata"
This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.
* Revert "Revert "store thread_id and req_id in metadata""
This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.
* fix metadata check
* remove comment
* removed unused cv
* cast back to int
* refactor Datapath for readability
* Revert refactor
This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.
* fix comment
* merge fixes
* refactor _shutdown
* address reviews
* log errors in both cases
* add comments
* address reviews
* move reconnect test to medium
* Always propogate error to callbacks
* readability
* formatting
* Faster cleanup on uncaught dataservicer errors
* delete tmp file
* offset commit
* rrefactor
* propagate data servicer error message
* Stricter handling/propagation of errors
* remove tmp file
* better docs
* forward reconnecting metadata
* add annotation
* fix invalidate + add test
* fix docstrings and types
* disable retries and caching if reconnect grace period is set to 0
* update comments
* address review, increase ack batch size and skip ack's if reconnect isn't enabled
* Don't terminate data stream on missing reconnecting metadata
|
2021-09-18 01:11:00 +03:00 |
|
Yi Cheng
|
cf64ab5b90
|
Revert "[core] Async submitting actor registerring (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
|
2021-09-17 13:34:12 -07:00 |
|
xwjiang2010
|
9c8c6c09cb
|
Revert "[SGD] v2 Class API (#18571)" (#18715)
This reverts commit de050e8187.
|
2021-09-17 10:34:36 -07:00 |
|
Yi Cheng
|
8ce01ea2cc
|
[core] Async submitting actor registerring (#18009)
|
2021-09-17 10:03:35 -07:00 |
|
Clark Zinzow
|
1da83c828c
|
[Datasets] Properly support fs inference on path with space. (#18644)
|
2021-09-17 10:02:43 -07:00 |
|
architkulkarni
|
a9cce8a34b
|
[serve] Add basic calculate_desired_num_replicas function for autoscaling (#18658)
|
2021-09-17 00:18:51 -07:00 |
|
Simon Mo
|
3029812b8b
|
[Serve] Autoscaling metric store take 2 (#18683)
|
2021-09-16 22:28:13 -07:00 |
|
Eric Liang
|
c9ca980c83
|
Check dataset pipeline is not read multiple times by accident (#18682)
|
2021-09-16 20:33:24 -07:00 |
|
Amog Kamsetty
|
84e958f330
|
[ML] Consolidate and upgrade Deep Learning Dependencies (#18574)
* wip
* upgrade requirements
* add file
* fix
* fixes
* Apply suggestions from code review
Try mlagents==0.21.0 for now (works with torch 1.9).
* Apply suggestions from code review
* wip
* wip
* fix
* fix
* upgrade lightning bolts
* address comment
Co-authored-by: Sven Mika <sven@anyscale.io>
|
2021-09-16 20:16:40 -07:00 |
|
Amog Kamsetty
|
de050e8187
|
[SGD] v2 Class API (#18571)
* wip
* wip
* add horovod example
* add example
* lint
* fix
* address comments
* updates
* lint
* update example
* address comment
* address comment
* update
* fix
* Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* address comments
* add back name mangling
* fix tests
* Update python/ray/util/sgd/v2/trainer.py
* fix
* lint
* fix
* fix docstring
* Update python/ray/util/sgd/v2/tests/test_trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* update
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
|
2021-09-16 12:33:38 -07:00 |
|
Simon Mo
|
eeaae5aa08
|
Revert "[Serve] Add InMemoryMetricsStore for Autoscaling (#18458)" (#18675)
This reverts commit a024effac7.
|
2021-09-16 11:37:31 -07:00 |
|
Simon Mo
|
a024effac7
|
[Serve] Add InMemoryMetricsStore for Autoscaling (#18458)
|
2021-09-16 11:08:42 -07:00 |
|
Simon Mo
|
317a34c523
|
[Serve] Use BackendConfig Protobuf (#17835)
|
2021-09-16 11:08:23 -07:00 |
|
Edward Oakes
|
e7ea1f9a82
|
[runtime_env] Remove global logger from working_dir code (#18605)
|
2021-09-16 10:37:45 -05:00 |
|
Jernej Makovsek
|
b5c5247ad4
|
Update example yaml file for running local clusters (#18530)
|
2021-09-16 02:24:45 -07:00 |
|
xwjiang2010
|
ea48b1227f
|
[Tune] Do not crash when resources are insufficient. (#18611)
|
2021-09-15 23:00:53 -07:00 |
|
Stephanie Wang
|
be7cb70c30
|
[core] Fix ref counting during actor construction (#18646)
* test
* fix
* cpp
* skip windows
Co-authored-by: Eric Liang <ekhliang@gmail.com>
|
2021-09-15 22:16:53 -07:00 |
|
Chris K. W
|
7df3441ae9
|
[client] Fix credential generation when secure=True but no credentials provided (#18636)
* set self._credentials if not provided
* fix credential generation
|
2021-09-16 00:37:33 +03:00 |
|
Antoni Baum
|
7e95f330d5
|
[ci] Fix xgboost_ray install from git (#18640)
|
2021-09-15 18:07:15 +01:00 |
|
Antoni Baum
|
d50ff16ccf
|
[ci] Fix HEBO breaking Tune tests (#18629)
|
2021-09-15 10:01:29 -07:00 |
|
Kai Fricke
|
0223ae9605
|
[xgboost] Bump xgboost_ray requirements_upstream.txt version to 0.1.3 (#18632)
|
2021-09-15 18:01:15 +01:00 |
|
Edward Oakes
|
7736cdd91d
|
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214)
|
2021-09-15 11:17:15 -05:00 |
|
Edward Oakes
|
7d0a2b39e3
|
[runtime_env] Remove dynamically imported setup_hook (#18601)
|
2021-09-15 10:19:55 -05:00 |
|
Antoni Baum
|
eeb67a42cc
|
pip install xgboost_ray -> xgboost_ray[default] (#18607)
Co-authored-by: Kai Fricke <kai@anyscale.com>
|
2021-09-15 14:45:56 +01:00 |
|
Sven Mika
|
8a00154038
|
[RLlib] Bump tf version in ML docker to tf==2.5.0; add tfp to ML-docker. (#18544)
|
2021-09-15 08:46:37 +02:00 |
|
SangBin Cho
|
0684531e22
|
[Test] Break down placement group tests (#18612)
|
2021-09-14 21:55:18 -07:00 |
|
Chris K. W
|
cc1d7b8174
|
[client] Refactors for Reconnect PR (#18484)
* add refactors
* add worker annotation
* Regenerate credentials by default
* use self._secure
* infer secure if credentials provided
* separate _shutdown
|
2021-09-14 16:13:35 -07:00 |
|
Eric Liang
|
15512c27c2
|
Revert "Revert "Route core worker ERROR/FATAL logs to driver logs (#1… (#18604)
|
2021-09-14 13:32:07 -07:00 |
|
SangBin Cho
|
31e1638fb3
|
[CLI] Improve ray status for placement groups (#18289)
|
2021-09-14 11:29:13 -07:00 |
|