Commit graph

10167 commits

Author SHA1 Message Date
Simon Mo
3e038aebb2
[CI] Allow release tests infra to accept buildkite artifacts (#19803) 2021-10-27 13:04:01 -07:00
Yi Cheng
98961d1ee2
[core] Fix the wrong error message in gcs for worker exits (#19774) 2021-10-27 12:55:27 -07:00
matthewdeng
aa5499ef0f
[Train] implement CheckpointStrategy (#19111)
* [SGD] implement CheckpointStrategy

* address comments

* update docs

* Update doc/source/train/user_guide.rst

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* best checkpoint

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-10-27 11:31:04 -07:00
Amog Kamsetty
5d54412f1c
[Docker] Alias ray-ml:nightly to ray-ml:nightly-gpu (#19726)
* wip

* wip

* update

* finish

* deprecate

* debug

* fix and address comments

* try catch

* fix

* split tests

* force

* merge

* docs

* wip

* fix and check

* update readme

* fix

* fix

* fix sanity checking

* format

* alias

* fix

* comment
2021-10-27 11:30:49 -07:00
Edward Oakes
1f681981af
[serve] Bump controller max concurrency to 15k, make long poll timeout random (#19790) 2021-10-27 13:28:16 -05:00
Yi Cheng
abec07700a
[nightly] Adding more tests related to grpc broadcasting to staging mode (#19779)
## Why are these changes needed?
We are concerned that grpc-based broadcasting might have a negative impact on pg-related workloads. This test ensures it runs well before merging.

## Related issue number
#19438
2021-10-27 10:46:13 -07:00
Edward Oakes
acc5702535
[runtime_env] Fix hash length in URI (#19777) 2021-10-27 12:22:20 -05:00
mwtian
b238297bfb
[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687)
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.

The most important logic added is:
* GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
* GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
* GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}

Other files are modified to add boilerplate and plumbing, remove dead code, and clean up.
This PR can also be reviewed commit by commit. 418f065 and 3279430 are cleanups. 028939c is a pure refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to "[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher" (#19600). 286161f parameterizes gcs_server_test to test GCS pubsub. The remaining commits add new logic.
All new logic is behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.
The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.

Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.
2021-10-28 01:18:54 +08:00
Sven Mika
80eeb13175
[RLlib; Docs overhaul] Docstring cleanup: Trainer, trainer_template, Callbacks. (#19758) 2021-10-27 19:15:35 +02:00
Sven Mika
f2cb2ed203
[RLlib; Docs overhaul] Docstring cleanup: Policies, policy_templates. (#19759) 2021-10-27 19:14:39 +02:00
SangBin Cho
418b4a94e6
[Core] Remove legacy scheduler code (#19780)
* Remove unused worker APIs

* Remove unused scheduling resources.

* lint
2021-10-27 06:57:08 -07:00
Simon Mo
40d52edabc
[CI] Upload wheels to artifact store in all jobs (#19778) 2021-10-27 10:27:56 +01:00
Simon Mo
6afbd1f558
[Serve] /api/snapshot works with all Serve KVStores (#19772) 2021-10-26 23:27:38 -07:00
Jiao
3f628d4f6b
increase long poll timeout and wrk trial cpu resource (#19768) 2021-10-26 21:31:39 -07:00
SangBin Cho
bcd27b708f
[Test] Mark many ppo as unstable (#19769) 2021-10-26 21:27:43 -07:00
Qing Wang
7647ea3512
[Java] Add helper method to build driver process. (#19740)
This turns the buildDriver() process into a helper utility to avoid duplicated code.
2021-10-27 10:17:37 +08:00
architkulkarni
6bd49a8cd5
[runtime env] Improve working dir messaging (#18893) 2021-10-26 20:58:02 -05:00
Amog Kamsetty
db863aafc0
Revert "Revert "[Docker] Support multiple CUDA Versions (#19505)" (#19756)" (#19763)
This reverts commit e58fcca404.
2021-10-26 17:32:56 -07:00
Jiajun Yao
47744d282c
[data] Fix arrow dataset sort on empty blocks (#19707) 2021-10-26 15:30:23 -07:00
SangBin Cho
3e81506d90
[Threaded actor] Fix threaded actor race condition (#19751) 2021-10-26 15:17:53 -07:00
Eric Liang
2652ae7905
[client] Put of a list should not return a list, this is a client bug (#19737) 2021-10-26 13:51:37 -07:00
Amog Kamsetty
e58fcca404
Revert "[Docker] Support multiple CUDA Versions (#19505)" (#19756)
This reverts commit f0053d405b.
2021-10-26 12:55:20 -07:00
Yi Cheng
2ec9a70e24
[gcs] Fix the regression of enabling grpc based broadcasting in actor scheduling (#19664)
## Why are these changes needed?
Previously, we did not send a request if another request was still in flight. This is actually bad, because it prevents the raylet from getting the latest information. For example, if a request needs 200ms to arrive at the raylet, the raylet will lose one update. In this case, the next request will arrive after 200 + 100 + (in flight time) ms. So we should still send the request.

TODO:
- Push the snapshot to raylet if the message is lost.
- Handle message loss in raylet better.
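
The effect of the two policies described above can be illustrated with a toy model. This is only a sketch and not Ray's GCS code: the function `send_times`, the 100 ms broadcast period (inferred from the "200 + 100" arithmetic), and the treatment of a request as "in flight" for a fixed latency are all assumptions made for illustration.

```python
def send_times(period_ms, latency_ms, horizon_ms, skip_if_in_flight):
    """Return the times (ms) at which a periodic sender issues requests.

    Hypothetical model: a request sent at time t counts as in flight
    until t + latency_ms.
    """
    sent = []
    busy_until = 0  # time at which the last request stops being in flight
    for t in range(0, horizon_ms, period_ms):
        in_flight = t < busy_until
        if skip_if_in_flight and in_flight:
            continue  # old behavior: drop this update entirely
        sent.append(t)
        busy_until = t + latency_ms
    return sent


# Old policy: skip a tick while a request is in flight.
old_policy = send_times(100, 200, 1000, skip_if_in_flight=True)
# New policy: always send, even if a request is still in flight.
new_policy = send_times(100, 200, 1000, skip_if_in_flight=False)
```

In this model the old policy sends only on every other tick, so the raylet misses half of the updates, while the new policy sends on every tick.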


## Related issue number
#19438
2021-10-26 12:00:37 -07:00
gjoliver
99a0088233
[RLlib] Unify the way we create local replay buffer for all agents (#19627)
* [RLlib] Unify the way we create and use LocalReplayBuffer for all the agents.

This change
1. Get rid of the try...except clause when we call execution_plan(),
   and get rid of the Deprecation warning as a result.
2. Fix the execution_plan() call in Trainer._try_recover() too.
3. Most importantly, makes it much easier to create and use different types
   of local replay buffers for all our agents.
   E.g., allow us to easily create a reservoir sampling replay buffer for
   APPO agent for Riot in the near future.
* Introduce explicit configuration for replay buffer types.
* Fix is_training key error.
* actually deprecate buffer_size field.
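
Item 3 above mentions a reservoir sampling replay buffer as a possible future buffer type. As a rough illustration of that technique only (this is not RLlib's LocalReplayBuffer API; the class name `ReservoirBuffer` and its methods are hypothetical), a minimal sketch of reservoir sampling (Algorithm R) could look like this:

```python
import random


class ReservoirBuffer:
    """Hypothetical fixed-size buffer that keeps a uniform sample of a stream."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0  # total number of items offered so far
        self._rng = random.Random(seed)

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the item.
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / num_seen,
            # overwriting a uniformly chosen slot.
            j = self._rng.randrange(self.num_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw k distinct items from the buffer without replacement.
        return self._rng.sample(self.items, k)


buf = ReservoirBuffer(capacity=3, seed=0)
for step in range(100):
    buf.add(step)
batch = buf.sample(2)
```

Unlike a FIFO buffer, every item seen so far has the same chance of being retained, regardless of when it arrived.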
2021-10-26 20:56:02 +02:00
xwjiang2010
ab15dfd478
[Tune release test] Set 500G disk space for rllib_tests. (#19730) 2021-10-26 10:12:03 -07:00
Avnish Narayan
ad87ddf93e
[rllib] Add deterministic test to gpu (#19306)
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-26 10:11:39 -07:00
iasoon
b5158ca0ab
[serve] Correctly set num_replicas when deploying autoscaling deployment (#19520) 2021-10-26 12:10:59 -05:00
Lixin Wei
c937950910
Add 'local' Tag to @com_github_antirez_redis//:bin (#19685)
* Build redis locally

* fix
2021-10-26 09:17:52 -07:00
SangBin Cho
00ea716ada
Revert "Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)" (#19724)" (#19736)
This reverts commit d453afbab8.
2021-10-26 08:25:09 -07:00
Jiao
aaef82920d
[serve] Add periodic timeouts to long poll client to avoid accumulating concurrent tasks in the controller (#19728) 2021-10-26 09:44:00 -05:00
SangBin Cho
e914ea930d
[Core] Stop reporting tasks spec to GCS that are unnecessary #19699 (#19699)
This RPC comes from legacy code and is no longer needed (the task spec is already in the actor table), but it adds quite a lot of keys to Redis.

Below is the total size (in bytes? I am not sure it is the byte size; I took the length of each value when I queried Redis) for each key prefix when running many_ppo. As you can see, TASK& and TASK: take up a large share although they are not really used.

In [82]: b
Out[82]:
defaultdict(int,
            {b'WORKE': 1080864,
             b'ACTOR': 1470931,
             b'TASK&': 1020646,
             b'TASK:': 870551,
             b'PROFI': 360000,
             b'PLACE': 10107,
             b'JOB:\x01': 8,
             b'JOB:\x04': 8,
             b'NODE:': 99,
             b'NODE_': 126,
             b'INTER': 44,
             b'JOB:\x03': 8,
             b'redis': 16,
             b'JOB:\x02': 8,
             b'JOB:\x05': 8})
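
The tally above groups values by the first five bytes of each key. The query the author actually ran against Redis is not shown, so the following is only a sketch of how such a per-prefix total could be computed from an iterable of key/value byte pairs; `tally_by_prefix` and the sample data are hypothetical.

```python
from collections import defaultdict


def tally_by_prefix(items, prefix_len=5):
    """Sum len(value) per key prefix for an iterable of (key, value) bytes pairs."""
    totals = defaultdict(int)
    for key, value in items:
        totals[key[:prefix_len]] += len(value)
    return totals


# Made-up sample data standing in for the key/value pairs read from Redis.
sample = [
    (b"TASK:abc", b"x" * 10),
    (b"TASK:def", b"x" * 5),
    (b"ACTOR:1", b"y" * 7),
]
totals = tally_by_prefix(sample)
```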
2021-10-26 04:17:58 -07:00
Kai Fricke
98244ad130
[ci/release] Report error to database on alert (#19743) 2021-10-26 10:48:02 +01:00
Kai Fricke
96ddf5b9ac
[ci/release] Choose cloud by name or ID (#19742) 2021-10-26 10:21:54 +01:00
Kai Fricke
3081488a99
[tune] Fix local checkpoint deletion for remote trials (#19632) 2021-10-26 09:18:07 +01:00
Amog Kamsetty
6e61ca623d
[CI] Infra for "user" tests (#19662) 2021-10-26 08:47:22 +01:00
SangBin Cho
ba61c436ea
Revert "Try enabling event stats by default (#19650)" (#19735)
This reverts commit 6081cf870e.
2021-10-26 14:33:40 +09:00
Eric Liang
81b0eb297c
Un-revert size estimator and fix Train test (#19719) 2021-10-25 22:09:24 -07:00
Eric Liang
10e27892c2
Suppress tsan false positive in gcs-pub-sub-test (#19727) 2021-10-25 19:52:53 -07:00
Amog Kamsetty
f0053d405b
[Docker] Support multiple CUDA Versions (#19505)
* wip

* wip

* update

* finish

* deprecate

* debug

* fix and address comments

* try catch

* fix

* split tests

* force

* merge

* docs

* wip

* fix and check

* update readme

* fix

* fix

* fix sanity checking

* format
2021-10-25 18:57:05 -07:00
SangBin Cho
d453afbab8
Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)" (#19724)
This reverts commit e3ced0e59e.
2021-10-26 09:14:25 +09:00
Simon Mo
5330aab27a
[CI] Deflake test metrics (#19711) 2021-10-25 16:34:20 -07:00
Alex Wu
045d72cdc0
[docs] Fix typo in installation instructions (#19721) 2021-10-25 15:30:34 -07:00
Eric Liang
66818d11b8
Revert "[data] Add serialized size estimator to block builder (#19681)" (#19717)
This reverts commit 8c37311c41.
2021-10-25 15:06:58 -07:00
Eric Liang
8c37311c41
[data] Add serialized size estimator to block builder (#19681) 2021-10-25 14:58:49 -07:00
SangBin Cho
ecd5a622ef
[Tests] Add a memory usage on dask on ray tests (#19674) 2021-10-25 14:58:26 -07:00
SangBin Cho
544f774245
[Autoscaler/Core] Drain node API (#19350)
* Initial version done. Graceful shutdown is possible with direct raylet RPCs

* .

* .

* ip

* Done.

* done tests might fail

* fix lint + cpp tests

* fix 2

* Fix issues.

* Addressed code review.

* Fix another cpp test failure

* completed

* Skip windows tests

* Update the comment

* complete

* addressed code review.
2021-10-25 14:57:50 -07:00
Linsong Chu
13d4894789
[workflow] Add get_metadata() for workflow (#19372)
## Why are these changes needed?

Add the functionality to retrieve metadata for a workflow or workflow step.

Design:
- Similar to `get_output`, this will return either the metadata for a workflow (`workflow.get_metadata(workflow_id)`) or the metadata for a specific step (`workflow.get_metadata(workflow_id, step_id)`).
- Exceptions will only be raised if the workflow id or step id does not exist. Canceled jobs, running jobs, etc. will return proper metadata by retrieving information from the checkpoint. See [here](8c8ca609d7/python/ray/workflow/tests/test_metadata_get.py (L67)) for more details.
- The returned metadata is aggregated from multiple checkpoint files, based on a previous [discussion](https://github.com/ray-project/ray/issues/17090#issuecomment-920481789). The aggregation logic is [here for step metadata](8c8ca609d7/python/ray/workflow/workflow_storage.py (L451)) and [here for workflow metadata](8c8ca609d7/python/ray/workflow/workflow_storage.py (L484)), and can be tuned after further discussion.

Example:
```python
>>> from ray import workflow
>>> user_step_metadata = {"k1": "v1"}
>>> user_run_metadata = {"k2": "v2"}
>>> step_name = "simple_step"
>>> workflow_id = "simple"

>>> @workflow.step
... def simple():
...     return 0

>>> simple.options(name=step_name, metadata=user_step_metadata).step().run(workflow_id, metadata=user_run_metadata)

# get workflow-level metadata
>>> workflow.get_metadata("simple")
{'status': 'SUCCESSFUL',
 'user_metadata': {'k2': 'v2'},
 'stats': {'start_time': 1634173413.116535, 'end_time': 1634173413.149051}}

# get step-level metadata
>>> workflow.get_metadata("simple", "simple_step")
{'name': '__main__.simple',
 'step_type': 'FUNCTION',
 'workflows': [],
 'max_retries': 3,
 'workflow_refs': [],
 'catch_exceptions': False,
 'ray_options': {},
 'user_metadata': {'k1': 'v1'},
 'stats': {'start_time': 1634173413.131262, 'end_time': 1634173413.1347651}}
```

## Related issue number
https://github.com/ray-project/ray/issues/17090
2021-10-25 14:52:51 -07:00
Alex Wu
58b28f04cd
[docs/usability] Apple Silicon support (#19705)
This PR puts the final touches on Apple silicon support. There are 3 main caveats to supporting M1 Macs right now (described in the docs):

* Requires using forge.
* Requires special installation instructions to get grpc working (this is an underlying grpc issue, so ideally it will be fixed upstream).
* We're only publishing release wheels, not nightlies, right now.

This also includes a grpc import check so that we can provide an actionable error message on how to properly install grpcio if the user follows the regular pip install ray process.
2021-10-25 14:49:28 -07:00
DK.Pino
e3ced0e59e
[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)
* fixed

* lint

* add cxx ut

* fix comment

* Revert "fix comment"

This reverts commit 32ea2558166a7674d7efe2e0c0a66ea7409c7d99.

* fix comment
2021-10-25 14:15:36 -07:00
architkulkarni
2c64b2b0e8
[Doc] Move all contribution info to getting-involved.html and link to it from CONTRIBUTING.rst (#19571) 2021-10-25 14:23:23 -05:00