hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Jiao	293e45c527	[Doc] [Serve] Fix README's quick_start and add to test suite (#22228 )	2022-02-09 11:49:47 -08:00
Jiao	54a71e6c4f	[Ray DAG] Add `execute()` interface to take user inputs with ENTRY_POINT tag. (#22196 ) ## Diff Summary The current implementation of DAGNode pre-binds inputs, and the signature of `def execute(self)` doesn't take user input yet. This PR extends the interface to take user input and marks DAG entrypoint methods as the first stop of all user requests in a DAG. It's needed to unblock the next step, the serve pipeline implementation that serves user requests. Closes #22196 #22197 Notable changes: - Added a `DAG_ENTRY_POINT` flag in the ray dag API to annotate DAG entrypoint functions. Function or class method only. All marked functions will receive identical input from the user as the first layer of the DAG. - Changed implementations of ClassNode and FunctionNode accordingly to handle different execution for a node marked as entrypoint or not. - Added a `kwargs_to_resolve` kwarg in the interface of `DAGNode` to handle args that sub-classes need to use to resolve its implementation without exposing all terms to parent class level. - This is particularly important for ClassMethodNode binding, so we can have implementations to track method name, parent ClassNode as well as previous class method call without existing - Changed implementation of `_copy()` to handle execution of `kwargs_to_resolve`. - Changed implementation of `_apply_and_replace_all_child_nodes()` to fetch DAGNode type in `kwargs_to_resolve`. - Added pretty printed lines for `kwargs_to_resolve`	2022-02-09 13:29:28 -06:00
Jiajun Yao	673ecd1241	Isolate ray configs for each job (#22206 ) If we run multiple jobs in the same process (this is basically the behavior of python tests), they should be isolated in the sense that system config for job 1 shouldn't affect config for job 2. ``` ray.init(_system_config={}) # job 1 ray.shutdown() ray.init(_system_config={}) # job 2 ray.shutdown() ``` Currently it's not the case, since RayConfig is a static variable and it's shared across drivers in the same process. This PR resets the configs to default value before applying job specific _system_config. Note: it's backward incompatible change if user depends on the current behavior but I'm not aware of such case.	2022-02-09 10:18:46 -08:00
Alex Wu	c9a419ac76	[Autoscaler] Remove staroid node provider (#22236 ) The Staroid node provider has been abandoned and unmaintained for quite some time now. Due to the fact that there are no active maintainers, the original contributors cannot be reached, and there is no clear interest, we are no longer officially endorsing or supporting the node provider. Co-authored-by: Alex Wu <alex@anyscale.com>	2022-02-09 09:18:18 -08:00
xwjiang2010	323511b716	[tune] Single wait refactor. (#21852 ) This is a down-scoped change. For the full overview picture of the Tune control loop, see [`Tune control loop refactoring`](https://docs.google.com/document/d/1RDsW7SVzwMPZfA0WLOPA4YTqbRyXIHGYmBenJk33HaE/edit#heading=h.2za3bbxbs5gn) 1. Previously there were separate waits on pg ready and other events. As a result, there were quite a few timing tweaks that are inefficient, hard to understand and hard to unit test. This PR consolidates them into a single wait that is handled by TrialRunner in each step. - A few event types are introduced, and their mapping into scenarios * PG_READY --> Should place a trial onto it. If somehow there is no trial to be placed there, the pg will be put in _ready momentarily. This is because historically resources were conceptualized as a pull-based model. * NO_RUNNING_TRIALS_TIME_OUT --> possibly not sufficient resources case * TRAINING_RESULT * SAVING_RESULT * RESTORING_RESULT * YIELD --> This just means that training is simply taking very long. We need to punt back to the main loop to print out status info etc. 2. Previously TrialCleanup was not very efficient and could race between Trainable.stop() and `return_placement_group`. This PR streamlines the Trial cleanup process by explicitly letting Trainable.stop() finish, followed by `return_placement_group(pg)`. Note, graceful shutdown is needed in cases like `pause_trial` where checkpointing to memory needs to be given the time to happen before the actor is gone. 3. Quite a few env variables (timing tweaks) are removed, which I consider OK to proceed with without a deprecation cycle.	2022-02-09 15:31:17 +00:00
Clark Zinzow	f264cf800a	[Datasets] Support ignoring NaNs in aggregations. (#20787 ) Adds support for ignoring NaNs in aggregations. NaNs will now be ignored by default, and the user can pass in `ds.mean("A", ignore_nulls=False)` if they would rather have the NaN be propagated to the output. Specifically, we'd have the following null-handling semantics: 1. Mix of values and nulls - `ignore_nulls`=True: Ignore the nulls, return aggregation of values 2. Mix of values and nulls - `ignore_nulls`=False: Return `None` 3. All nulls: Return `None` 4. Empty dataset: Return `None` This all null and empty dataset handling matches the semantics of NumPy and Pandas.	2022-02-09 00:07:58 -08:00
mwtian	71f63593f4	[Client] avoid locking in async send (#22193 ) As @iycheng discovered in https://github.com/ray-project/ray/issues/22082#issuecomment-1031821631, when `ClientObjectRef` is being GC'ed, `DataClient.lock` is acquired which may cause deadlock. This change avoids acquiring lock in `DataClient._async_send()`.	2022-02-08 22:14:15 -08:00
SangBin Cho	20ab9188c6	[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170 ) This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details. The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files. The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. After this PR, we will add code to enable usage report "off by default".	2022-02-08 22:12:36 -08:00
SangBin Cho	d7cead7519	[Core] Improve ray stop (#22159 ) I’d like to make a small proposal to change the behavior of ray stop ### Status quo It basically just sends a SIGTERM and finishes asynchronously. It is vulnerable to leaked processes or port conflicts when you run ray stop && ray start. I feel like the right behavior is as follows: ### New - Send SIGTERM and wait for processes to terminate - Display the progress to users - If procs are not terminated within X seconds, send SIGKILL. ### API change We will add a `--grace-period` flag. The default is `ray stop --grace-period=X` (10 seconds by default). And if users don’t want to be blocked, they can use `ray stop --force` (which already exists. It just sends SIGKILL, so procs are guaranteed to be terminated)	2022-02-08 22:07:44 -08:00
Gagandeep Singh	000c56f764	[Serve] [Windows] Unskip all but `test_redeploy_single_replica` in `test_deploy.py` (#21391 )	2022-02-08 16:30:25 -08:00
Balaji Veeramani	31ed9e5d02	[CI] Replace YAPF disables with Black disables (#21982 )	2022-02-08 16:29:25 -08:00
Stephanie Wang	dcd96ca348	[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120 ) When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not. This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs. This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.	2022-02-08 14:50:50 -08:00
Simon Mo	a3efee7ecf	[Serve] Add regression test for out of order submit (#20629 )	2022-02-08 10:38:36 -08:00
Sven Mika	c17a44cdfa	Revert "Revert "[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learni…" (#22153 )	2022-02-08 16:43:00 +01:00
Gagandeep Singh	0f2a2224c2	PoolActor now uses num_cpus=0 to avoid any deadlock (#22048 ) https://github.com/ray-project/ray/issues/21488#issuecomment-1027122177 : > We discussed this issue in a bit more detail and came to the conclusion that we should set the CPU resource requirement for each actor in the actor pool to 0, to make the Ray Pool compatible/same behavior as the Python multiprocessing pool. Would that work for you @yogeveran ? (very similar to solution 4 mentioned above, but with 0.0 instead of 0.1, so it works in all cases).	2022-02-08 01:59:46 -08:00
SangBin Cho	1c41b0f566	[Test] Unflake pg test + add pg tests that weren't running (#22204 ) Unflake pg test (pg test 3 times out occasionally) + add pg tests that weren't running	2022-02-08 01:47:22 -08:00
Sriram Sankar	d06317eb1a	[Kuberay] Updated kuberay-autoscaler.yaml to create service account (#22188 ) Added lines to autoscaler configuration yaml to create a service account that is used to give the autoscaler permissions to list and read pods and patch the cluster CRD for up/downscaling.	2022-02-07 22:04:34 -08:00
Eric Liang	8f7db1c6ab	Properly release resources of workers exiting due to max_calls (#22146 ) Previously code incorrectly assumed that an exiting worker would disconnect from the raylet promptly to release resources. This isn't the case if the worker is owning references. This PR plumbs through the right release resources call even in this scenario. Closes https://github.com/ray-project/ray/issues/10960 Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2022-02-07 21:57:11 -08:00
Eric Liang	00b5801d71	Fix datasets leaking worker processes due to closure capture of stats actor handle (#22156 )	2022-02-07 14:05:44 -08:00
Edward Oakes	8806b2d5c4	[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180 )	2022-02-07 15:25:25 -06:00
Max Pumperla	5cc9355303	[Docs ] Tune docs overhaul (first part) (#22112 ) Continuing docs overhaul, tune now has: - [x] better landing page - [x] a getting started guide - [x] user guide was cut down, partially merged with FAQ, and partially integrated with tutorials - [x] the new user guide contains guides to tune features and practical integrations - [x] we rewrote some of the feature guides for clarity - [x] we got rid of sphinx-gallery for this sub-project (only data and core left), as it looks bad and is unnecessarily complicated anyway (plus, makes the build slower) - [x] sphinx-gallery examples are now moved to markdown notebook, as started in #22030. - [x] Examples are tested in the new framework, of course. There's still a lot one can do, but this is already getting too large. Will follow up with more fine-tuning next week. Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-02-07 15:47:03 +00:00
Chen Shen	e531ee907b	[microbenchmark] avoid noisy neighbor #22133 Why are these changes needed? See #22045. Add sleep between benchmark tests to avoid noisy neighbor tests.	2022-02-06 17:30:56 -08:00
Yi Cheng	b729d458e2	[client] Move Client implementation of ObjectRef/ActorRef to python (#22148 ) `__dealloc__` is not allowed to call python code and this leads to two problems: - The data has already been cleaned up - Deadlock if there are locks used. This PR moves the implementation to the python layer to avoid this	2022-02-06 13:03:51 -08:00
Jiao	c065e3f69e	[Ray DAG] Implement experimental Ray DAG API for task/class (#22058 )	2022-02-05 15:05:07 -08:00
Jiajun Yao	88d2e21585	Disable scheduler_report_pinned_bytes_only (#22132 )	2022-02-05 11:06:59 -08:00
Dmitri Gekhtman	fc00369ae5	Log resource message (#22136 ) We've had multiple issues that manifest as unexpected autoscaler logs about resource demands. To make it easier to debug such issues, this PR adds a debug flag to allow logging the entire resource message used by the autoscaler as its source of truth about the Ray internals' resource usage. If the env AUTOSCALER_LOG_RESOURCE_BATCH_DATA=1 is set, the autoscaler will log the resource message.	2022-02-05 10:08:37 -08:00
mwtian	98be9fb5e0	[Test][Client] make sure Ray cluster shuts down after test terminates (#22128 ) Apply the same fix in #21589 to another test fixture for Ray client tests. Let's see if this can reduce flakiness in unit tests.	2022-02-04 18:12:20 -08:00
SangBin Cho	dbd28cc861	[Test] Fix flaky Drain node tests (#22104 ) Fix the flaky test by waiting instead of immediately verifying it	2022-02-05 09:41:14 +09:00
shrekris-anyscale	a61d974dd5	[serve] Implement experimental deploy_group API (#22039 ) If the declarative API issues a code change to a group of deployments at once, it needs to deploy the group of updated deployments atomically. This ensures any deployment using another deployment's handle inside its own __init__() function can access that handle regardless of the deployment order. This change adds deploy_group to the ServeController class, allowing it to deploy a list of deployments atomically. It also adds a new public API command, serve.deploy_group(), exposing the controller's functionality publicly, so atomic deployments can also be executed via Python API. Closes #21873.	2022-02-04 18:12:14 -06:00
Yi Cheng	5ae8d5b8af	Revert "Revert "[client] Fix ray client object ref releasing in wrong context."" (#22091 ) Reverts ray-project/ray#22090	2022-02-04 14:50:23 -08:00
Archit Kulkarni	182dbfbfdb	[runtime env] Fix bug where options (e.g. `--extra-index-url`) could not be specified in `requirements.txt` (#22065 ) In https://github.com/ray-project/ray/pull/20341 the behavior of `pip` was changed to install the specified packages in the existing environment rather than in a new environment. This posed a problem when specifying Ray libraries like "ray[serve]" in the `pip` field, because the installer would install Ray at runtime and this new Ray would take precedence over the Ray existing on the cluster. This could cause version mismatch issues. Skipping some details, the approach taken in that PR was essentially to parse the `pip` list and remove Ray. However, not every line in a `pip` `requirements.txt` file is a requirements specifier; a line can also just specify options, like `--extra-index-url my-index-url.com`. This caused the parsing library to raise an exception when trying to parse the line. This PR fixes this by catching the exception and skipping the line in this case, since it's not a line that specifies `ray` and that's all we're looking for when parsing.	2022-02-04 15:32:32 -06:00
SangBin Cho	ea4079465d	[Runtime Env] Support runtime env error message for actors (#22109 )	2022-02-04 15:32:02 -06:00
Nikita Vemuri	d9dc388082	[jobs] Support ray client format of connection string address for external module (#22116 ) Ray client currently supports connection strings for external modules of the format `"other_module://"`, however `ray job` commands don't support this format because trailing `/` is removed. Update so `ray job` commands also support this format.	2022-02-04 13:35:10 -06:00
matthewdeng	014a9959f1	Revert "[train] add TorchTensorboardProfilerCallback (#21864 )" (#22117 ) This reverts commit `f064306de9`.	2022-02-04 08:54:16 -08:00
SangBin Cho	6dda196f47	Revert "[core] Increment ref count when creating an ObjectRef to prev… (#22106 ) This reverts commit `e3af828220`.	2022-02-04 00:55:45 -08:00
SangBin Cho	a887763b38	Revert "[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learni… (#22105 ) This reverts commit `3f03ef8ba8`.	2022-02-04 00:54:50 -08:00
Clark Zinzow	743ce65da8	[Dask-on-Ray] Add support for Dask annotations. (#22057 )	2022-02-03 22:15:38 -08:00
matthewdeng	f064306de9	[train] add TorchTensorboardProfilerCallback (#21864 ) Implement a TorchTensorboardProfilerCallback and corresponding TorchWorkerProfiler to support distributed PyTorch Profiler With TensorBoard integration.	2022-02-03 19:28:12 -08:00
Stephanie Wang	e3af828220	[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#21719 ) When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not. This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.	2022-02-03 17:31:27 -08:00
SangBin Cho	d7fc7d2e9d	[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032 ) This is the PR to write better runtime env exceptions. After 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers. The first PR only handles task exceptions. TODO - [x] Task (this PR) - [ ] Actor - [ ] Turn off runtime env logs & improve error msgs	2022-02-03 16:47:04 -08:00
Yi Cheng	7ff1cbbb12	Revert "[client] Fix ray client object ref releasing in wrong context." (#22090 ) Reverts ray-project/ray#22025	2022-02-03 13:59:52 -08:00
Jiajun Yao	44db41c0fb	No spreading if a node is selected for lease request due to locality (#22015 ) 1. If the node is selected based on locality, we always run the task on the node selected by locality if the node is available. 2. For spread scheduling strategy, we always select the local node as the first raylet to request lease, no locality involved.	2022-02-03 12:00:54 -08:00
Kai Fricke	bbc64eba32	[tune/wandb] Fix WandbTrainableMixin config for rllib trainables (#22063 ) The WandbTrainableMixin doesn't work with RLLib trainables as they won't recognize the wandb parameter. Thus we should pop the wandb config before we initialize the rest of the trainable.	2022-02-03 17:20:27 +01:00
Sven Mika	3f03ef8ba8	[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. (#21356 )	2022-02-03 09:32:09 +01:00
Max Pumperla	092598774a	[Docs] Executable notebook tutorial (#22030 ) We're introducing the usage of [MyST Notebooks](https://myst-nb.readthedocs.io/en/latest/index.html) here and demonstrate how it works by rewriting (and extending) the RLLib Serve tutorial. Benefits: - [x] Write notebooks in markdown. Can be converted into other formats e.g. with `jupytext` - [x] Tutorials like this have a binderhub link added to the top nav (launch button). - [x] Notebooks get executed when docs are built, so it's impossible to have stale docs. - [x] But locally those builds are cached so that you don't have to wait too long. - [x] The notebook cell outputs can be shown, hidden or removed. In particular, we can now avoid adding expected code output as comments in our scripts (which might get outdated). We're also clarifying #22022. Old tutorial: [here](https://docs.ray.io/en/latest/serve/tutorials/rllib.html) New tutorial (preview): [here](https://ray--22030.org.readthedocs.build/en/22030/serve/tutorials/rllib.html) Co-authored-by: simon-mo <simon.mo@hey.com>	2022-02-03 08:13:04 +00:00
Eric Liang	c8f93dfdec	[data] Misc: add_column() and allow specifying decoding error handling in from_text() (#21967 ) This adds some utility functions to make it easier to manipulate structured data in Datasets. While in principle you can already do this with map_batches, this makes it a little easier to test things out for development.	2022-02-02 20:47:17 -08:00
birgerbr	826a8bb06c	[core] Use FileLock on ports_by_node.json (#18909 ) The new code uses a file-lock before reading and writing to `ports_by_node.json`. Without it, multiple nodes may write to ports_by_node.json at the same time.	2022-02-02 14:57:19 -06:00
Archit Kulkarni	78f882dbbc	[runtime env] Local uri caching for working_dir, py_modules and conda (#20273 ) Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them. In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simply mark it as unused in the cache. When the cache exceeds its size limit (default 10 GB) it will delete unused URIs until the cache is back under the size limit or there are no more unused URIs. Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit - Adds unit tests for caching and integration tests for working_dir caching	2022-02-02 14:53:03 -06:00
Chris K. W	c95abe75a9	[client] Consistent ray.init return value (#21355 ) Proposal document: https://docs.google.com/document/d/1ln7_fUST18GOz4jJnI_zN00hfczXY48V5Ajy6fCmJCE/edit# This PR changes the return value of ray.init when not in client mode to be a RayContext, which acts as a context manager and has the same public fields as ClientContext, as well as a disconnect method (calls shutdown under the hood). To prevent breaking scripts that rely on accessing through dict methods, RayContext also subclasses collections.abc.Mapping (can be treated as an immutable dict). This behavior will be removed in 2.0, so deprecation warnings are raised when __getitem__ is used. To make migration simple, an additional dict field address_info is added with the same values as the original return value.	2022-02-02 19:39:03 +02:00
SangBin Cho	9531887590	[Placement Group] Fix infeasible placement group not scheduled after node is added (#21993 ) It looks like the existing handling of infeasible placement groups in the placement group manager didn't work properly. Idk how we added this feature when we cannot pass this simple test case. But this is what has happened: (1) PG is not schedulable because it is infeasible (2) New node is added (3) After a new node is added, placement group manager tries rescheduling all infeasible pgs. (4) Here, when we add a new node, we didn't report resources (this seems to be very weird. We are reporting resources using a separate RPC here). So when (3) happens, the pg was still unschedulable. This PR fixes the issue by adding the resource information when the new node is added. Note that in the long term, we'd like to have a separate resource path from (4). This won't be addressed in this PR.	2022-02-02 06:44:42 -08:00
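
Note on f264cf800a ([Datasets] Support ignoring NaNs in aggregations): the four null-handling cases listed in that commit message can be sketched in plain Python. This is only an illustration of the stated semantics, not the Ray Datasets implementation; the helper names `is_null` and `mean_with_nulls` are hypothetical, and treating both `None` and float NaN as "null" is an assumption.

```python
import math
from typing import Iterable, Optional


def is_null(value) -> bool:
    """Hypothetical helper: treat None and float NaN as null (an assumption)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_with_nulls(values: Iterable, ignore_nulls: bool = True) -> Optional[float]:
    """Mean following the four cases described in the commit message."""
    values = list(values)
    non_null = [v for v in values if not is_null(v)]
    # Cases 3 and 4: all nulls, or an empty dataset -> None.
    if not non_null:
        return None
    # Case 2: mix of values and nulls with ignore_nulls=False -> None.
    if len(non_null) < len(values) and not ignore_nulls:
        return None
    # Case 1 (or no nulls at all): aggregate over the non-null values.
    return sum(non_null) / len(non_null)
```

With these definitions, `mean_with_nulls([1.0, None, 3.0])` averages only the two real values, while passing `ignore_nulls=False` for the same input yields `None`.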

6221 commits