Commit graph

11288 commits

Author SHA1 Message Date
Jiajun Yao
ff8af2edba
Remove TaskExecutionSpec (#22155) 2022-02-06 21:59:23 -08:00
SangBin Cho
6235b6d7e9
Revert "[Release 1.11.0][Core] avoid unnecessary work during event st… (#22144)
This reverts commit 9ac3f6879d.

Seems like this makes this test flaky, so I will revert it for now.
2022-02-06 18:19:44 -08:00
Chen Shen
e531ee907b
[microbenchmark] avoid noisy neighbor #22133
Why are these changes needed?
See #22045. Add a sleep between benchmark tests to avoid noisy-neighbor interference.
2022-02-06 17:30:56 -08:00
Yi Cheng
b729d458e2
[client] Move Client implementation of ObjectRef/ActorRef to python (#22148)
`__dealloc__` is not allowed to call python code and this leads to two problems:

- The data has already been cleaned up
- Deadlock if there are locks used.

This PR moves the implementation to the Python layer to avoid this.
2022-02-06 13:03:51 -08:00
Sven Mika
8b678ddd68
[RLlib] Issue 22036: Client should handle concurrent episodes with one being training_enabled=False. (#22076) 2022-02-06 12:35:03 +01:00
Clark Zinzow
fb0d6e6b0b
[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067) 2022-02-05 16:59:34 -08:00
Jiao
c065e3f69e
[Ray DAG] Implement experimental Ray DAG API for task/class (#22058) 2022-02-05 15:05:07 -08:00
Jiajun Yao
88d2e21585
Disable scheduler_report_pinned_bytes_only (#22132) 2022-02-05 11:06:59 -08:00
Dmitri Gekhtman
fc00369ae5
Log resource message (#22136)
We've had multiple issues that manifest as unexpected autoscaler logs about resource demands.

To make it easier to debug such issues, this PR adds a debug flag to allow logging the entire resource message used by the autoscaler as its source of truth about the Ray internals' resource usage.

If the env AUTOSCALER_LOG_RESOURCE_BATCH_DATA=1 is set, the autoscaler will log the resource message.
2022-02-05 10:08:37 -08:00
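A minimal sketch of the debug flag described in fc00369ae5. Only the environment variable name comes from the commit message; the helper names, the logger name, and the choice to treat exactly "1" as enabled are assumptions made for illustration, and the autoscaler's own code may differ.

```python
import logging
import os

logger = logging.getLogger("autoscaler_sketch")  # hypothetical logger name

# The variable name is taken from the commit message above.
ENV_FLAG = "AUTOSCALER_LOG_RESOURCE_BATCH_DATA"


def should_log_resource_message(environ=None):
    """Return True only when the flag is set to exactly "1" (an assumption)."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_FLAG) == "1"


def maybe_log_resource_message(message, environ=None):
    """Log the full resource message when the debug flag is enabled."""
    if should_log_resource_message(environ):
        logger.info("Resource message: %s", message)
        return True
    return False
```

Passing `environ` explicitly keeps the check easy to test without touching the process environment.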
mwtian
98be9fb5e0
[Test][Client] make sure Ray cluster shuts down after test terminates (#22128)
Apply the same fix in #21589 to another test fixture for Ray client tests. Let's see if this can reduce flakiness in unit tests.
2022-02-04 18:12:20 -08:00
SangBin Cho
dbd28cc861
[Test] Fix flaky Drain node tests (#22104)
Fix the flaky test by waiting for the condition instead of verifying it immediately.
2022-02-05 09:41:14 +09:00
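The "wait instead of verify immediately" pattern from dbd28cc861 can be sketched as a small polling helper. The helper below is a hypothetical stdlib-only version written for illustration; it is not the test utility used in the actual fix, and `node_drained` is a made-up condition.

```python
import time


def wait_for_condition(predicate, timeout_s=10.0, interval_s=0.1):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage: a condition that only becomes true on the third poll.
calls = {"n": 0}


def node_drained():
    calls["n"] += 1
    return calls["n"] >= 3


drained = wait_for_condition(node_drained, timeout_s=2.0, interval_s=0.01)
```

An immediate `assert node_drained()` would fail on the first call here, which is the kind of flakiness the polling loop avoids.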
shrekris-anyscale
a61d974dd5
[serve] Implement experimental deploy_group API (#22039)
If the declarative API issues a code change to a group of deployments at once, it needs to deploy the group of updated deployments atomically. This ensures any deployment using another deployment's handle inside its own __init__() function can access that handle regardless of the deployment order. This change adds deploy_group to the ServeController class, allowing it to deploy a list of deployments atomically. It also adds a new public API command, serve.deploy_group(), exposing the controller's functionality publicly, so atomic deployments can also be executed via Python API.

Closes #21873.
2022-02-04 18:12:14 -06:00
Jiao
a692e7d05e
[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938)
As titled. There is a corner case on a user's laptop where the user might have left RAY_ADDRESS set to an http address but restarted the local Ray cluster. In this case we would try to submit the job with an http-prefixed address.

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2022-02-04 17:51:43 -06:00
Jules S. Damji
c5c5e01b5d
[Doc] [Serve] Fixed minor typo and removed extra ',' (#22101) 2022-02-04 14:51:38 -08:00
Yi Cheng
5ae8d5b8af
Revert "Revert "[client] Fix ray client object ref releasing in wrong context."" (#22091)
Reverts ray-project/ray#22090
2022-02-04 14:50:23 -08:00
Archit Kulkarni
182dbfbfdb
[runtime env] Fix bug where options (e.g. --extra-index-url) could not be specified in requirements.txt (#22065)
In https://github.com/ray-project/ray/pull/20341 the behavior of `pip` was changed to install the specified packages in the existing environment rather than in a new environment. This posed a problem when specifying Ray libraries like "ray[serve]" in the `pip` field, because the installer would install Ray at runtime and this new Ray would take precedence over the Ray existing on the cluster. This could cause version mismatch issues. Skipping some details, the approach taken in that PR was essentially to parse the `pip` list and remove Ray.

However, not every line in a `pip` `requirements.txt` file is a requirement specifier; a line can also just specify options, like `--extra-index-url my-index-url.com`. This caused the parsing library to raise an exception when trying to parse the line. This PR fixes this by catching the exception and skipping the line in this case, since it's not a line that specifies `ray` and that's all we're looking for when parsing.
2022-02-04 15:32:32 -06:00
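The catch-and-skip approach from 182dbfbfdb can be sketched as follows. `parse_requirement_name` is a hypothetical regex-based stand-in for the real parsing library (which the commit message does not name), and the sketch assumes that "skipping" an unparsable line means leaving it in the list unchanged.

```python
import re

# Hypothetical stand-in for a requirements parser: it raises ValueError on
# lines that are not requirement specifiers (for example, pip options).
_NAME_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def parse_requirement_name(line):
    match = _NAME_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a requirement specifier: {line!r}")
    return match.group(1).lower()


def remove_ray(requirement_lines):
    """Drop lines that specify `ray`; keep everything else, including options."""
    kept = []
    for line in requirement_lines:
        try:
            name = parse_requirement_name(line)
        except ValueError:
            # Option lines such as `--extra-index-url ...` cannot name `ray`,
            # so keep them untouched instead of failing.
            kept.append(line)
            continue
        if name != "ray":
            kept.append(line)
    return kept


lines = ["--extra-index-url my-index-url.com", "ray[serve]", "requests==2.27.1"]
filtered = remove_ray(lines)
```

Here `filtered` keeps the option line and `requests==2.27.1`, while `ray[serve]` is removed.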
Archit Kulkarni
d7be4e1d3c
[doc] [runtime env] Add note that referencing local files in requirements.txt is not supported (#22095) 2022-02-04 15:32:19 -06:00
SangBin Cho
ea4079465d
[Runtime Env] Support runtime env error message for actors (#22109) 2022-02-04 15:32:02 -06:00
Sven Mika
f6617506a2
[RLlib] Add on_sub_environment_created to DefaultCallbacks class. (#21893) 2022-02-04 22:22:47 +01:00
Nikita Vemuri
d9dc388082
[jobs] Support ray client format of connection string address for external module (#22116)
Ray client currently supports connection strings for external modules of the format `"other_module://"`; however, `ray job` commands don't support this format because the trailing `/` is removed. Update so `ray job` commands also support this format.
2022-02-04 13:35:10 -06:00
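One way to picture the issue in d9dc388082 is a normalization step that strips trailing slashes but leaves a bare `module://` address alone. The function below is a hypothetical illustration of that idea, not the code changed by the PR, and the example addresses are placeholders.

```python
def strip_trailing_slash(address):
    """Remove trailing "/" characters unless they belong to a bare "module://"."""
    if address.endswith("://"):
        # e.g. "other_module://": the slashes are part of the scheme separator,
        # so stripping them would corrupt the connection string.
        return address
    return address.rstrip("/")


module_address = strip_trailing_slash("other_module://")
http_address = strip_trailing_slash("http://127.0.0.1:8265/")
```

With this check, `module_address` is returned unchanged while `http_address` loses only its final slash.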
matthewdeng
014a9959f1
Revert "[train] add TorchTensorboardProfilerCallback (#21864)" (#22117)
This reverts commit f064306de9.
2022-02-04 08:54:16 -08:00
Sven Mika
38d75ce058
[RLlib] Cleanup SlateQ algo; add test + add target Q-net (#21827) 2022-02-04 17:01:12 +01:00
Avnish Narayan
0d2ba41e41
[RLlib] [CI] Deflake longer running RLlib learning tests for off policy algorithms. Fix seeding issue in TransformedAction Environments (#21685) 2022-02-04 14:59:56 +01:00
SangBin Cho
6dda196f47
Revert "[core] Increment ref count when creating an ObjectRef to prev… (#22106)
This reverts commit e3af828220.
2022-02-04 00:55:45 -08:00
SangBin Cho
a887763b38
Revert "[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learni… (#22105)
This reverts commit 3f03ef8ba8.
2022-02-04 00:54:50 -08:00
Clark Zinzow
743ce65da8
[Dask-on-Ray] Add support for Dask annotations. (#22057) 2022-02-03 22:15:38 -08:00
matthewdeng
f064306de9
[train] add TorchTensorboardProfilerCallback (#21864)
Implement a TorchTensorboardProfilerCallback and corresponding TorchWorkerProfiler to support distributed PyTorch Profiler With TensorBoard integration.
2022-02-03 19:28:12 -08:00
Stephanie Wang
e3af828220
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#21719)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
2022-02-03 17:31:27 -08:00
SangBin Cho
d7fc7d2e9d
[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032)
This PR improves the runtime env exceptions. After all 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers.

The first PR only handles task exceptions.

TODO
- [x] Task (this PR)
- [ ] Actor
- [ ] Turn off runtime env logs & improve error msgs
2022-02-03 16:47:04 -08:00
Kai Fricke
dd935874ee
[ci/release] Fix job submission command (#22093)
Ray job submission does not accept quoted commands anymore (#22011). This PR updates the command to fix job submission within e2e tests.
2022-02-04 00:05:52 +01:00
Yi Cheng
7ff1cbbb12
Revert "[client] Fix ray client object ref releasing in wrong context." (#22090)
Reverts ray-project/ray#22025
2022-02-03 13:59:52 -08:00
mwtian
b528bf9202
Revert "[e2e] Remove unnecessary logic around copying results (#22034)" (#22088)
This reverts commit 92d7e9bf98.
2022-02-03 13:42:40 -08:00
mwtian
92d7e9bf98
[e2e] Remove unnecessary logic around copying results (#22034)
After #21905, some of the logic around handling result artifacts became unnecessary or incorrect (in generating error logs). It has been removed.
2022-02-03 12:15:06 -08:00
mwtian
9ac3f6879d
[Release 1.11.0][Core] avoid unnecessary work during event stats collection (#22054)
This PR avoids some unnecessary copying and branching when recording event stats. It improves / recovers ~10% of `single_client_get_calls_Plasma_Store` performance. On AWS EC2 `m5.8xlarge`,
- `single_client_get_calls_Plasma_Store` current: ~5200/s
- `single_client_get_calls_Plasma_Store` with PR: ~5800/s

When `RAY_event_stats=0`, `single_client_get_calls_Plasma_Store` can reach ~6800/s. If we want to optimize further, we can record data in opencensus only in intervals, or when the data are exported.
2022-02-03 12:01:42 -08:00
Jiajun Yao
44db41c0fb
No spreading if a node is selected for lease request due to locality (#22015)
1. If the node is selected based on locality, we always run the task on the node selected by locality if the node is available.
2. For spread scheduling strategy, we always select the local node as the first raylet to request lease, no locality involved.
2022-02-03 12:00:54 -08:00
Kai Fricke
bbc64eba32
[tune/wandb] Fix WandbTrainableMixin config for rllib trainables (#22063)
The WandbTrainableMixin doesn't work with RLLib trainables as they won't recognize the wandb parameter. Thus we should pop the wandb config before we initialize the rest of the trainable.
2022-02-03 17:20:27 +01:00
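The "pop the wandb config before initializing the rest of the trainable" fix from bbc64eba32 can be sketched with plain classes. All class names and config keys below are hypothetical stand-ins; `BaseTrainable` merely imitates a trainable that rejects unknown config keys.

```python
class BaseTrainable:
    """Hypothetical trainable that rejects config keys it does not know."""

    allowed_keys = {"lr", "num_workers"}

    def __init__(self, config):
        unknown = set(config) - self.allowed_keys
        if unknown:
            raise ValueError(f"Unknown config keys: {sorted(unknown)}")
        self.config = config


class WandbMixinSketch:
    """Pops the `wandb` entry before the rest of the trainable is initialized."""

    def __init__(self, config, *args, **kwargs):
        config = dict(config)  # copy so the caller's dict is not mutated
        self.wandb_config = config.pop("wandb", {})
        super().__init__(config, *args, **kwargs)


class MyTrainable(WandbMixinSketch, BaseTrainable):
    pass


trainable = MyTrainable({"lr": 0.01, "wandb": {"project": "demo"}})
```

Because the mixin comes first in the method resolution order, the base class never sees the `wandb` key.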
Sven Mika
3f03ef8ba8
[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. (#21356) 2022-02-03 09:32:09 +01:00
Max Pumperla
092598774a
[Docs] Executable notebook tutorial (#22030)
We're introducing the usage of [MyST Notebooks](https://myst-nb.readthedocs.io/en/latest/index.html) here and demonstrating how it works by rewriting (and extending) the RLlib Serve tutorial. Benefits:

- [x] Write notebooks in markdown. Can be converted into other formats e.g. with `jupytext`
- [x] Tutorials like this have a binderhub link added to the top nav (launch button).
- [x] Notebooks get executed when docs are built, so it's impossible to have stale docs.
- [x] But locally those builds are cached so that you don't have to wait too long.
- [x] The notebook cell outputs can be shown, hidden or removed.  In particular, we can now avoid adding expected code output as comments in our scripts (which might get outdated).

We're also clarifying #22022.

Old tutorial: [here](https://docs.ray.io/en/latest/serve/tutorials/rllib.html)
New tutorial (preview): [here](https://ray--22030.org.readthedocs.build/en/22030/serve/tutorials/rllib.html)

Co-authored-by: simon-mo <simon.mo@hey.com>
2022-02-03 08:13:04 +00:00
Eric Liang
c8f93dfdec
[data] Misc: add_column() and allow specifying decoding error handling in from_text() (#21967)
This adds some utility functions to make it easier to manipulate structured data in Datasets. While in principle you can already do this with map_batches, this makes it a little easier to test things out for development.
2022-02-02 20:47:17 -08:00
birgerbr
826a8bb06c
[core] Use FileLock on ports_by_node.json (#18909)
The new code uses a file-lock before reading and writing to `ports_by_node.json`.
Without it, multiple nodes may write to ports_by_node.json at the same time.
2022-02-02 14:57:19 -06:00
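The read-modify-write under a lock described in 826a8bb06c can be sketched as below. Note the swap: the commit uses a `FileLock`, while this sketch substitutes the standard-library `fcntl.flock` (Unix only) so it runs without extra packages. The function name, the `.lock` suffix, and the JSON layout are assumptions for illustration.

```python
import fcntl
import json
import os
import tempfile


def update_ports(path, node_id, port_name, port):
    """Read-modify-write a JSON file while holding an exclusive lock."""
    lock_path = path + ".lock"  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            ports = {}
            if os.path.exists(path):
                with open(path) as f:
                    ports = json.load(f)
            ports.setdefault(node_id, {})[port_name] = port
            with open(path, "w") as f:
                json.dump(ports, f)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return ports


# Usage: two updates to the same file, each made under the lock.
with tempfile.TemporaryDirectory() as tmp_dir:
    ports_file = os.path.join(tmp_dir, "ports_by_node.json")
    update_ports(ports_file, "node-a", "metrics", 5001)
    final_ports = update_ports(ports_file, "node-b", "metrics", 5002)
```

Holding the lock across both the read and the write is what prevents two nodes from overwriting each other's entries.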
SangBin Cho
3c056a6b92
Revert "[Nightly Test] Add more metadata to test result (#21990)" (#22052)
This reverts commit fd20cf3239.
2022-02-02 12:56:42 -08:00
Archit Kulkarni
78f882dbbc
[runtime env] Local uri caching for working_dir, py_modules and conda (#20273)
Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them. In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simply mark it as unused in the cache. When the cache exceeds its size limit (default 10 GB) it will delete unused URIs until the cache is back under the size limit or there are no more unused URIs.

Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit

- Adds unit tests for caching and integration tests for working_dir caching
2022-02-02 14:53:03 -06:00
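The caching policy described in 78f882dbbc can be sketched as a small class. The class and method names are hypothetical, and evicting the oldest unused URI first is an assumption; the commit message only says that unused URIs are deleted until the cache is back under its limit or none remain.

```python
class URICacheSketch:
    """Tracks local URIs; unused ones are deleted only when over the size limit."""

    def __init__(self, max_total_bytes, delete_fn):
        self.max_total_bytes = max_total_bytes
        self.delete_fn = delete_fn  # called with a URI to remove its local files
        self.sizes = {}             # URI -> size in bytes
        self.unused = []            # URIs with zero references, oldest first

    def total_bytes(self):
        return sum(self.sizes.values())

    def add(self, uri, size_bytes):
        """Register a newly created URI as in use, then evict if needed."""
        self.sizes[uri] = size_bytes
        self._evict()

    def mark_used(self, uri):
        if uri in self.unused:
            self.unused.remove(uri)

    def mark_unused(self, uri):
        """Called when the reference count drops to zero; no eager deletion."""
        if uri in self.sizes and uri not in self.unused:
            self.unused.append(uri)
        self._evict()

    def _evict(self):
        # Delete unused URIs until under the limit or none are left.
        while self.total_bytes() > self.max_total_bytes and self.unused:
            uri = self.unused.pop(0)
            del self.sizes[uri]
            self.delete_fn(uri)


deleted = []
cache = URICacheSketch(max_total_bytes=100, delete_fn=deleted.append)
cache.add("uri-a", 60)
cache.mark_unused("uri-a")  # still under the limit, so nothing is deleted yet
cache.add("uri-b", 60)      # pushes the total over the limit; "uri-a" is evicted
```

A URI that is still in use is never evicted, so the cache can temporarily exceed its limit when everything is referenced.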
Edward Oakes
e85bbfb338
[jobs] Enable default port in http:// addresses (#22014)
Closes https://github.com/ray-project/ray/issues/22012
2022-02-02 14:34:34 -06:00
Edward Oakes
8bbc5b936a
[jobs] Use subprocess.list2cmdline to properly handle quotes in CLI entrypoints (#22011) 2022-02-02 14:33:57 -06:00
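The title of 8bbc5b936a names `subprocess.list2cmdline`, a standard-library function that joins an argument list into a single command string and quotes arguments containing whitespace. How the jobs CLI calls it is not shown in this log; the argument list below is purely illustrative.

```python
import subprocess

# An entrypoint split into arguments, one of which contains a space.
argv = ["echo", "hello world"]

# list2cmdline wraps the argument containing a space in double quotes.
entrypoint = subprocess.list2cmdline(argv)
print(entrypoint)
```

Joining with `" ".join(argv)` instead would lose the information that `hello world` was a single argument.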
Chris K. W
c95abe75a9
[client] Consistent ray.init return value (#21355)
Proposal document: https://docs.google.com/document/d/1ln7_fUST18GOz4jJnI_zN00hfczXY48V5Ajy6fCmJCE/edit# 

This PR changes the return value of ray.init when not in client mode to be a RayContext, which acts as a context manager and has the same public fields as ClientContext, as well as a disconnect method (which calls shutdown under the hood).

To prevent breaking scripts that rely on accessing through dict methods, RayContext also subclasses collections.abc.Mapping (can be treated as an immutable dict). This behavior will be removed in 2.0, so deprecation warnings are raised when __getitem__ is used. To make migration simple, an additional dict field address_info is added with the same values as the original return value.
2022-02-02 19:39:03 +02:00
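The design described in c95abe75a9 (a context manager that is also a read-only mapping with deprecation warnings) can be sketched as follows. `ContextSketch` is a hypothetical stand-in rather than the real RayContext, and the `address` key and its value are placeholders.

```python
import warnings
from collections.abc import Mapping


class ContextSketch(Mapping):
    """Hypothetical stand-in for the RayContext described above."""

    def __init__(self, address_info):
        self.address_info = dict(address_info)
        self.connected = True

    # Context-manager protocol: disconnect on exit.
    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.disconnect()

    def disconnect(self):
        self.connected = False  # the real method calls shutdown under the hood

    # Mapping protocol: read-only dict-style access, with a deprecation warning.
    def __getitem__(self, key):
        warnings.warn(
            "Dict-style access is deprecated; use `address_info` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.address_info[key]

    def __iter__(self):
        return iter(self.address_info)

    def __len__(self):
        return len(self.address_info)


with ContextSketch({"address": "127.0.0.1:6379"}) as ctx:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        legacy_value = ctx["address"]  # old dict-style access still works
    new_value = ctx.address_info["address"]  # preferred access, no warning
```

Subclassing `Mapping` rather than `dict` keeps the object immutable from the caller's side, which matches the "immutable dict" behavior the commit message describes.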
Rodrigo de Lazcano
a258f9c692
[RLlib] Neural-MMO keep_per_episode_custom_metrics patch (toward making Neuro-MMO RLlib's default massive-multi-agent learning test environment). (#22042) 2022-02-02 17:28:42 +01:00
SangBin Cho
9531887590
[Placement Group] Fix infeasible placement group not scheduled after node is added (#21993)
It looks like the handling of existing infeasible placement groups in the placement group manager didn't work properly. I don't know how we added this feature when it cannot pass this simple test case.

This is what happened:

(1) The PG is not schedulable because it is infeasible.
(2) A new node is added.
(3) After the new node is added, the placement group manager tries rescheduling all infeasible PGs.
(4) Here, when we added the new node, we didn't report its resources (this seems very weird; we report resources using a separate RPC). So when (3) happened, the PG was still unschedulable.

This PR fixes the issue by adding the resource information when the new node is added.

Note that in the long term, we'd like to have a separate resource path from (4). This won't be addressed in this PR.
2022-02-02 06:44:42 -08:00
Jun Gong
9c95b9a5fa
[RLlib] Add an env wrapper so RecSim works with our Bandits agent. (#22028) 2022-02-02 12:15:38 +01:00
Jun Gong
87fe033f7b
[RLlib] Request CPU resources in Trainer.default_resource_request() if using dataset input. (#21948) 2022-02-02 10:20:37 +01:00
Jun Gong
a55258eb9c
[RLlib] Move bandit example scripts into examples folder. (#21949) 2022-02-02 09:20:47 +01:00