Commit graph

9522 commits

Author SHA1 Message Date
Jiajun Yao
db7f4d7f30 1.7.0rc0 2021-09-17 15:20:23 -07:00
Chris K. W
8858489e2f
[client] let ray client reconnect on grpc failures (#18329)
* wip

* client tests working again

* extra prints

* start reconnect logic for proxier

* local proxy more wip

* delay cleanup logic working on proxy

* Fix up dataservicer logic

* lint + fix proxy data servicer exit logic

* hmmm

* delay cleanup always in dataservicer

* fix last_seen check

* cancel channel on error

* explicitly request cleanup

* cleanup request fixes

* fix dataclient proxy

* start idempotence logic

* change default channel state

* add backoff logic

* move connection logic back into worker.__init__

* add logic for replay cache case where request was received but response hasn't been fully resolved

* new proto entries for data stream caching

* start replay_cache logic, increase cleanup delay

* hardcode retries

* Let data channel attempt reconnects

* manually reset queue, remove replay_cache logic

* reduce cleanup delay to 5 minutes

* fix local tests

* Remove async cache logic

* retry async requests

* simplify backoff logic

* Fix ray client proto

* Configurable reconnect grace period

* Basic logsclient fix?

* Configure grace through environment variable

* Use stopped event to force faster datapath cleanup

* Better connect+reconnect logic

* fix reconnect_grace_period default

* init fixes for reconnect_grace_period

* cleanup

* fix _get_client_id_from_context call

* add logic for pathological cache cases

* less intrusive data channel error message

* fix tests

* Make stuff less painful to read

* add ordered replay cache for dataservicer, replay cache tests

* fix ordering import, start_reconnect test

* add middleman testing logic

* enforce ordering of dataclient requests

* retry wheels

* grace period through env only, restore test_dataclient_disconnect

* minor fixes

* force rerun

* less intrusive error msgs

* address review

* replay->response cache

* remove unneeded sleep

* typing

* extra response cache test

* fix error msg

* remove TODO

* add _reconnect_channel

* add grace period test

* store thread_id and req_id in metadata

* Revert "store thread_id and req_id in metadata"

This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.

* Revert "Revert "store thread_id and req_id in metadata""

This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.

* fix metadata check

* remove comment

* removed unused cv

* cast back to int

* refactor Datapath for readability

* Revert refactor

This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.

* fix comment

* merge fixes

* refactor _shutdown

* address reviews

* log errors in both cases

* add comments

* address reviews

* move reconnect test to medium

* Always propagate error to callbacks

* readability

* formatting

* Faster cleanup on uncaught dataservicer errors

* delete tmp file

* offset commit

* refactor

* propagate data servicer error message

* Stricter handling/propagation of errors

* remove tmp file

* better docs

* forward reconnecting metadata

* add annotation

* fix invalidate + add test

* fix docstrings and types

* disable retries and caching if reconnect grace period is set to 0

* update comments

* address review, increase ack batch size and skip ack's if reconnect isn't enabled

* Don't terminate data stream on missing reconnecting metadata
2021-09-18 01:11:00 +03:00
Jiajun Yao
ffe7108eae
Fix cpp api doc (#18671) 2021-09-17 14:01:23 -07:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registering (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
xwjiang2010
09e760a1fd
[Release] Change all cpus_per_actor in xgboost test. (#18717) 2021-09-17 12:57:21 -07:00
xwjiang2010
2c92f737f9
Fix dask_xgboost_test (#18713) 2021-09-17 11:25:54 -07:00
xwjiang2010
9c8c6c09cb
Revert "[SGD] v2 Class API (#18571)" (#18715)
This reverts commit de050e8187.
2021-09-17 10:34:36 -07:00
Yi Cheng
8ce01ea2cc
[core] Async submitting actor registering (#18009) 2021-09-17 10:03:35 -07:00
Clark Zinzow
1da83c828c
[Datasets] Properly support fs inference on path with space. (#18644) 2021-09-17 10:02:43 -07:00
Sven Mika
fd13bac9b3
[RLlib] Add worker arg (optional) to policy_mapping_fn. (#18184) 2021-09-17 12:07:11 +02:00
Guyang Song
89ce8a3a02
support 'CustomFields' tooltip in dashboard (#18698) 2021-09-17 17:48:32 +08:00
architkulkarni
a9cce8a34b
[serve] Add basic calculate_desired_num_replicas function for autoscaling (#18658) 2021-09-17 00:18:51 -07:00
Qing Wang
11291029b1
Add Codeowners for Java API. (#18663) 2021-09-17 14:48:55 +08:00
Simon Mo
3029812b8b
[Serve] Autoscaling metric store take 2 (#18683) 2021-09-16 22:28:13 -07:00
qicosmos
4af3d86d8a
Remove abi flag (#18538) 2021-09-17 12:13:56 +08:00
Eric Liang
c9ca980c83
Check dataset pipeline is not read multiple times by accident (#18682) 2021-09-16 20:33:24 -07:00
Amog Kamsetty
84e958f330
[ML] Consolidate and upgrade Deep Learning Dependencies (#18574)
* wip

* upgrade requirements

* add file

* fix

* fixes

* Apply suggestions from code review

Try mlagents==0.21.0 for now (works with torch 1.9).

* Apply suggestions from code review

* wip

* wip

* fix

* fix

* upgrade lightning bolts

* address comment

Co-authored-by: Sven Mika <sven@anyscale.io>
2021-09-16 20:16:40 -07:00
DK.Pino
12b3b1f723
[core] Log resource name not id (#18598) 2021-09-16 16:28:09 -07:00
Amog Kamsetty
de050e8187
[SGD] v2 Class API (#18571)
* wip

* wip

* add horovod example

* add example

* lint

* fix

* address comments

* updates

* lint

* update example

* address comment

* address comment

* update

* fix

* Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* add back name mangling

* fix tests

* Update python/ray/util/sgd/v2/trainer.py

* fix

* lint

* fix

* fix docstring

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* update

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-16 12:33:38 -07:00
Simon Mo
eeaae5aa08
Revert "[Serve] Add InMemoryMetricsStore for Autoscaling (#18458)" (#18675)
This reverts commit a024effac7.
2021-09-16 11:37:31 -07:00
Simon Mo
a024effac7
[Serve] Add InMemoryMetricsStore for Autoscaling (#18458) 2021-09-16 11:08:42 -07:00
Simon Mo
317a34c523
[Serve] Use BackendConfig Protobuf (#17835) 2021-09-16 11:08:23 -07:00
Jiao
ca3be60291
[Release] change headnode type for serve benchmark (#18672)
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-09-16 10:57:36 -07:00
Sven Mika
ba1c489b79
[RLlib Testing] Lower --smoke-test "time_total_s" to make sure it doesn't time out. (#18670) 2021-09-16 18:22:23 +02:00
Edward Oakes
e7ea1f9a82
[runtime_env] Remove global logger from working_dir code (#18605) 2021-09-16 10:37:45 -05:00
Guyang Song
187e4a86ca
[C++ API] expose C++ task failure event (#18596) 2021-09-16 19:20:16 +08:00
Jernej Makovsek
b5c5247ad4
Update example yaml file for running local clusters (#18530) 2021-09-16 02:24:45 -07:00
Sasha Sobol
2f0e22aa4e
prioritize non-gpu nodes when scheduling CPU-only requests (#18615) 2021-09-16 09:57:24 +01:00
gjoliver
df32ed35fd
Extend --smoke-test deadlines for learning and stress regression tests. (#18667) 2021-09-16 09:18:39 +01:00
DK.Pino
99043e5045
[Hotfix] [Issue template] Fix the yaml grammar in feature request issue template (#18624) 2021-09-15 23:01:48 -07:00
xwjiang2010
ea48b1227f
[Tune] Do not crash when resources are insufficient. (#18611) 2021-09-15 23:00:53 -07:00
Stephanie Wang
be7cb70c30
[core] Fix ref counting during actor construction (#18646)
* test

* fix

* cpp

* skip windows

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-09-15 22:16:53 -07:00
liuyang-my
ed04ab7140
Define protobuf for RequestMetadata and HTTPRequestWrapper (#18203) 2021-09-15 14:39:27 -07:00
Chris K. W
7df3441ae9
[client] Fix credential generation when secure=True but no credentials provided (#18636)
* set self._credentials if not provided

* fix credential generation
2021-09-16 00:37:33 +03:00
Sven Mika
8a72824c63
[RLlib Testing] Split and unflake more CI tests (make sure all jobs are < 30min). (#18591) 2021-09-15 22:16:48 +02:00
Chen Shen
28c9c1fd98
fix windows pg test by skipping (#18649) 2021-09-15 11:39:13 -07:00
Antoni Baum
7e95f330d5
[ci] Fix xgboost_ray install from git (#18640) 2021-09-15 18:07:15 +01:00
Antoni Baum
d50ff16ccf
[ci] Fix HEBO breaking Tune tests (#18629) 2021-09-15 10:01:29 -07:00
Kai Fricke
0223ae9605
[xgboost] Bump xgboost_ray requirements_upstream.txt version to 0.1.3 (#18632) 2021-09-15 18:01:15 +01:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Edward Oakes
7d0a2b39e3
[runtime_env] Remove dynamically imported setup_hook (#18601) 2021-09-15 10:19:55 -05:00
Antoni Baum
eeb67a42cc
pip install xgboost_ray -> xgboost_ray[default] (#18607)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-09-15 14:45:56 +01:00
Kai Fricke
15a83d104d
[ci/release] remove legacy release tests (#18592) 2021-09-15 14:42:58 +01:00
Kai Fricke
c186253fc5
[github] fix feature request template (#18627) 2021-09-15 11:33:19 +01:00
DK.Pino
9d41aafcce
Adapt GitHub new issue template (#18516) 2021-09-15 00:57:57 -07:00
Sven Mika
8a00154038
[RLlib] Bump tf version in ML docker to tf==2.5.0; add tfp to ML-docker. (#18544) 2021-09-15 08:46:37 +02:00
Sven Mika
c5d20849ae
[RLlib] Rename rllib rollout into rllib evaluate (backward compatible) to match Trainer API. (#18467) 2021-09-15 08:45:17 +02:00
qicosmos
d7c631209b
[C++ Worker]Add api get placement group (#18535) 2021-09-15 14:11:31 +08:00
qicosmos
15881acffd
[C++ Worker]Update cpp worker doc (#18537) 2021-09-15 14:11:17 +08:00
Simon Mo
497c5f56fa
[CI] Temporary disable worker-in-container test (#18606)
* revert again

* disable tmp
2021-09-14 22:38:20 -07:00