When a Ray program first creates an ObjectRef (via `ray.put` or a task call), we add it with a ref count of 0 in the C++ backend, because the language frontend will increment the initial local ref once we return the allocated ObjectID and then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window, between the backend creating the entry and the frontend adding the initial local ref, during which the ObjectRef appears to be out of scope.
This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.
This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
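To make the race window concrete, here is a minimal, purely illustrative simulation of the handoff described above (the `RefCounter` class and its methods are hypothetical, not Ray's actual C++ API):

```python
# Purely illustrative: a toy model of the owner-side ref count handoff.
class RefCounter:
    def __init__(self):
        self.counts = {}

    def create_owned(self, object_id, initial_count):
        # Before this PR the backend used initial_count=0 and relied on the
        # frontend to add the first local ref; after this PR the backend
        # starts at 1 and the frontend skips its initial increment.
        self.counts[object_id] = initial_count

    def add_local_ref(self, object_id):
        self.counts[object_id] += 1

    def in_scope(self, object_id):
        return self.counts.get(object_id, 0) > 0


counter = RefCounter()
counter.create_owned("obj", initial_count=0)
assert not counter.in_scope("obj")  # the race window: the object looks dead
counter.add_local_ref("obj")        # frontend's initial local ref arrives
assert counter.in_scope("obj")

counter.create_owned("obj2", initial_count=1)  # behavior after this PR
assert counter.in_scope("obj2")                # no window where it looks dead
```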
This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
https://github.com/ray-project/ray/issues/21488#issuecomment-1027122177 :
> We discussed this issue in a bit more detail and came to the conclusion that we should set the CPU resource requirement for each actor in the actor pool to 0, to make the Ray Pool compatible/same behavior as the Python multiprocessing pool. Would that work for you @yogeveran ? (very similar to solution 4 mentioned above, but with 0.0 instead of 0.1, so it works in all cases).
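The standalone sketch below shows why a 0-CPU reservation makes the pool actors schedulable anywhere, matching `multiprocessing.Pool` semantics. It is illustrative only, since the actual change lives inside `ray.util.multiprocessing.Pool`:

```python
# Illustrative only: ten 0-CPU actors fit on a 1-CPU cluster, just as a
# multiprocessing.Pool can oversubscribe the local machine.
import ray

ray.init(num_cpus=1)

@ray.remote(num_cpus=0)
class PoolActor:
    def run(self, func, arg):
        return func(arg)

actors = [PoolActor.remote() for _ in range(10)]
results = ray.get([a.run.remote(lambda x: x * x, i) for i, a in enumerate(actors)])
print(results)  # [0, 1, 4, 9, ...]
```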
Added lines to the autoscaler configuration YAML to create a service account that gives the autoscaler permission to list and read pods and to patch the cluster CRD for up/downscaling.
Previously, the code incorrectly assumed that an exiting worker would disconnect from the raylet promptly to release its resources. This isn't the case if the worker still owns references. This PR plumbs through the correct release-resources call even in this scenario.
Closes https://github.com/ray-project/ray/issues/10960
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Followup from https://github.com/ray-project/ray/pull/21788. Previously we had `serialized_runtime_env == "{}" || serialized_runtime_env == ""` checks scattered around the C++ code; this PR moves that check into a helper function.
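A Python sketch of the check the helper centralizes (the function name here is hypothetical; the real helper lives in the C++ code):

```python
def is_runtime_env_empty(serialized_runtime_env: str) -> bool:
    # Treat both an unset env and an empty JSON object as "no runtime env".
    return serialized_runtime_env in ("", "{}")

assert is_runtime_env_empty("")
assert is_runtime_env_empty("{}")
assert not is_runtime_env_empty('{"pip": ["requests"]}')
```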
Continuing the docs overhaul, Tune now has:
- [x] better landing page
- [x] a getting started guide
- [x] the user guide was cut down, partially merged with the FAQ, and partially integrated with the tutorials
- [x] the new user guide contains guides to tune features and practical integrations
- [x] we rewrote some of the feature guides for clarity
- [x] we got rid of sphinx-gallery for this sub-project (only data and core are left), as it looks bad and is unnecessarily complicated anyway (plus, it makes the build slower)
- [x] sphinx-gallery examples have been moved to Markdown notebooks, as started in #22030.
- [x] Examples are tested in the new framework, of course.
There's still a lot one can do, but this is already getting too large. Will follow up with more fine-tuning next week.
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
`__dealloc__` is not allowed to call Python code, which leads to two problems:
- The Python data may already have been cleaned up by the time `__dealloc__` runs
- A deadlock can occur if locks are involved
This PR moves the implementation to the Python layer to avoid these issues.
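A pure-Python sketch of the pattern (names are illustrative, not the actual Ray internals): cleanup that needs to call Python code or take locks runs from a Python-level method instead of the Cython `__dealloc__` hook.

```python
import threading

class CoreWorkerWrapper:
    """Illustrative only: not Ray's real class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False

    def shutdown(self):
        # Runs in the Python layer, so it is safe to call Python code and
        # take locks here, unlike in the C-level __dealloc__ slot.
        with self._lock:
            if not self._closed:
                self._closed = True
                # ... release Python-level resources here ...

    def __del__(self):
        self.shutdown()
```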
We've had multiple issues that manifest as unexpected autoscaler logs about resource demands.
To make it easier to debug such issues, this PR adds a debug flag to allow logging the entire resource message used by the autoscaler as its source of truth about the Ray internals' resource usage.
If the environment variable AUTOSCALER_LOG_RESOURCE_BATCH_DATA=1 is set, the autoscaler will log the entire resource message.
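A sketch of the env-gated logging pattern this adds (only the environment variable name comes from this PR; the helper below is illustrative):

```python
import json
import logging
import os

logger = logging.getLogger(__name__)

AUTOSCALER_LOG_RESOURCE_BATCH_DATA = (
    os.environ.get("AUTOSCALER_LOG_RESOURCE_BATCH_DATA") == "1"
)

def maybe_log_resource_batch(resource_message: dict) -> None:
    # Dump the full resource message only when explicitly enabled, since it
    # can be large and noisy; it is the autoscaler's source of truth about
    # the cluster's resource usage.
    if AUTOSCALER_LOG_RESOURCE_BATCH_DATA:
        logger.info("Resource batch data: %s", json.dumps(resource_message))
```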
If the declarative API issues a code change to a group of deployments at once, it needs to deploy the group of updated deployments atomically. This ensures that any deployment using another deployment's handle inside its own `__init__()` function can access that handle regardless of deployment order. This change adds `deploy_group` to the `ServeController` class, allowing it to deploy a list of deployments atomically. It also adds a new public API command, `serve.deploy_group()`, exposing the controller's functionality so that atomic deployments can also be executed via the Python API.
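A usage sketch based on the description above (the exact signature of `serve.deploy_group()` and the handle-retrieval details may differ from the final API):

```python
import ray
from ray import serve

ray.init()
serve.start()

@serve.deployment
class Downstream:
    def __call__(self, x):
        return x + 1

@serve.deployment
class Upstream:
    def __init__(self):
        # Getting the handle here works regardless of deployment order,
        # because the whole group below is deployed atomically.
        self.downstream = serve.get_deployment("Downstream").get_handle(sync=False)

    async def __call__(self, x):
        ref = await self.downstream.remote(x)
        return await ref

# Deploy both deployments as one atomic group instead of calling
# .deploy() on each one individually.
serve.deploy_group([Downstream, Upstream])
```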
Closes #21873.
As titled. We have a corner case on a user's laptop where the user might have left RAY_ADDRESS set to an HTTP address but then restarted the local Ray cluster. In this case we would try to do job submission with an HTTP-prefixed address.
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
In https://github.com/ray-project/ray/pull/20341 the behavior of `pip` was changed to install the specified packages in the existing environment rather than in a new environment. This posed a problem when specifying Ray libraries like "ray[serve]" in the `pip` field, because the installer would install Ray at runtime and this new Ray would take precedence over the Ray already existing on the cluster. This could cause version mismatch issues. Skipping some details, the approach taken in that PR was essentially to parse the `pip` list and remove Ray.
However not every line in a `pip` `requirements.txt` file is a requirements specifier; a line can also just specify options, like `--extra-index-url my-index-url.com`.
Such lines caused the parsing library to raise an exception. This PR fixes this by catching the exception and skipping the line in that case, since an option line cannot specify `ray`, and that is all we are looking for when parsing.
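A sketch of the skip-on-parse-error behavior, using the `packaging` library for illustration (Ray's actual parsing code may use a different library, but the idea is the same):

```python
from packaging.requirements import InvalidRequirement, Requirement

def strip_ray_from_pip_list(pip_list):
    result = []
    for line in pip_list:
        try:
            name = Requirement(line).name
        except InvalidRequirement:
            # Option lines like "--extra-index-url my-index-url.com" are not
            # requirement specifiers; keep them instead of raising.
            result.append(line)
            continue
        if name.lower() != "ray":
            result.append(line)
    return result

print(strip_ray_from_pip_list(
    ["--extra-index-url my-index-url.com", "ray[serve]", "requests==2.26.0"]
))
# ['--extra-index-url my-index-url.com', 'requests==2.26.0']
```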
Ray Client currently supports connection strings for external modules of the format `"other_module://"`; however, `ray job` commands don't support this format because the trailing `/` is removed. This PR updates `ray job` commands to also support this format.
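A minimal illustration of how the trailing slash gets mangled (illustrative only; the actual stripping happens inside the `ray job` address handling):

```python
address = "other_module://"
print(address.rstrip("/"))  # "other_module:" -- the scheme is no longer
                            # recognizable as an external-module connection string
```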