hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	6313ddc47c	[tune] Refactor Syncer / deprecate Sync client (#25655 ) This PR includes / depends on #25709 The two concepts of Syncer and SyncClient are confusing, as is the current API for passing custom sync functions. This PR refactors Tune's syncing behavior. The Sync client concept is hard deprecated. Instead, we offer a well-defined Syncer API that can be extended to provide custom syncing functionality. However, the default will be to use Ray AIR's file transfer utilities. New API: - Users can pass `syncer=CustomSyncer` which implements the `Syncer` API - Otherwise our off-the-shelf syncing is used - As before, syncing to cloud disables syncing to driver Changes: - Sync client is removed - Syncer interface introduced - _DefaultSyncer is a wrapper around the URI upload/download API from Ray AIR - SyncerCallback only uses remote tasks to synchronize data - Rsync syncing is fully deprecated and removed - Docker and kubernetes-specific syncing is fully deprecated and removed - Testing is improved to use `file://` URIs instead of mock sync clients	2022-06-14 14:46:30 +02:00
kourosh hakhamaneshi	f597e21ac8	[RLlib] Fix sample batch concat samples. (#25572 )	2022-06-14 12:47:29 +02:00
kourosh hakhamaneshi	25940cb95b	[RLlib] CRR documentation. (#25667 )	2022-06-14 12:45:36 +02:00
Avnish Narayan	804719876b	[RLlib] Remove execution plan code no longer used by RLlib. (#25624 )	2022-06-14 10:57:27 +02:00
Tao Wang	593a522abd	[Cpp worker]Support cpp call java actor (#25581 )	2022-06-14 14:17:14 +08:00
clarng	e27cb0a585	Make isort compatible with black (#25748 ) Make isort compatible with black	2022-06-13 23:00:48 -07:00
sychen52	d5b8a1caab	[docs] actor is not created in driver1 (#25749 ) call .remote() after .option	2022-06-13 21:41:14 -07:00
clarng	d971d3bde4	Fix import order that is causing CI to fail (#25728 ) Fix import ordering on master.	2022-06-13 17:36:00 -07:00
Ricky Xu	b1d0b12b4e	[Core \ State Observability] Use Submission client (#25557 ) ## Why are these changes needed? This refactors the interaction between the state CLI and the API server from a hard-coded request workflow to one based on `SubmissionClient`. See #24956 for more details. ## Summary - Created a `StateApiClient` that inherits from the `SubmissionClient` and refactors various listing commands into class methods. ## Related issue number Closes #24956 Closes #25578	2022-06-13 17:11:19 -07:00
Dmitri Gekhtman	e745cd0e7b	[Docs] Note that certain features are community maintained (#25687 ) Adds notes explaining that Ray's support on Azure, Aliyun, and SLURM is community-maintained. Rephrases the mention of K8s support in the intro. This PR replaces https://github.com/ray-project/ray/pull/25504.	2022-06-13 16:10:32 -07:00
Larry	679f66eeee	[Core/PG/Schedule 1/2]Optimize the scheduling performance of actors/tasks with PG specified only for gcs schedule (#24677 ) ## Why are these changes needed? When scheduling actors on a PG, instead of iterating over all nodes in the cluster resource, this optimization directly queries the corresponding nodes by looking at the PG location index. This reduces the complexity of the algorithm from O(N) to O(1), where N is the number of nodes. The more nodes a large-scale cluster has, the greater the benefit. This PR only optimizes scheduling by GCS; I will submit a PR for raylet scheduling later. At Ant Group, we have implemented the optimization in GCS scheduling mode and obtained the following performance test results. 1. The average time of selecting nodes is reduced from 330us to 30us, an improvement of about 11 times. 2. The total time of creating and executing 12,000 actors drops from 271 s to 225 s on average, a 17% reduction in time. More detailed solution information is in the issue. ## Related issue number [Core/PG/Schedule]Optimize the scheduling performance of actors/tasks with PG specified #23881	2022-06-13 15:31:00 -07:00
Clark Zinzow	ae9285eced	[Datasets] Add outputs to data generation examples in API docstrings. (#25674 ) This PR adds outputs to data generation examples in the API docstrings, namely for `from_items()`, `range()`, `range_table()`, and `range_tensor()`.	2022-06-13 15:28:37 -07:00
Eric Liang	ff2cfbe351	[air] Add streaming BatchPredictor support (#25693 )	2022-06-13 15:22:36 -07:00
Antoni Baum	182f604d32	[docs] Fix bad argument name in PTL docs (#25736 ) Fixes bad argument name in PTL docs. This is just a quick fix - we should be testing the code snippet.	2022-06-13 15:20:24 -07:00
Eric Liang	fde61a77be	[rfc] [data] SPREAD actor pool actors evenly across the cluster by default (#25705 )	2022-06-13 15:16:14 -07:00
Eric Liang	1f90858c9e	[data] Fix stage fusion between equivalent resource args (fixes BatchPredictor) (#25706 )	2022-06-13 15:15:59 -07:00
xwjiang2010	cc53a1e28b	[air] update checkpoint.py to deal with metadata in conversion. (#25727 ) This is carved out from https://github.com/ray-project/ray/pull/25558. tl;dr: checkpoint.py currently doesn't support the following ``` a. from fs to dict checkpoint; b. drop some marker to dict checkpoint; c. convert back to fs checkpoint; d. convert back to dict checkpoint. Assert that the marker should still be there ```	2022-06-13 15:15:27 -07:00
clarng	73e113152b	Add import sorting to format.sh (#25678 ) It will be easier to develop if we could use a tool to organize / sort imports and not have to move them around by hand. This PR shows how we could do this with isort (black doesn't quite do this per https://github.com/psf/black/issues/333) After this PR lands everyone will need to update their formatter to include isort if they don't have it already, i.e. pip install -r ./python/requirements_linters.txt All future file changes will go through isort and may introduce a slightly larger PR the first time as it will clean up the imports. The plan is to land this PR and also clean up the rest of the code in parallel by using this PR to format the codebase (so people won't get surprised by the formatter if the file hasn't been touched yet) Co-authored-by: Clarence Ng <clarence@anyscale.com>	2022-06-13 14:08:51 -07:00
Antoni Baum	5e9a8eb5f6	[AIR/data] Move preprocessors to `ray.data` (#25599 ) Moves ray.air.Preprocessor and ray.air.preprocessors to ray.data to converge on the agreed upon package structure discussed internally.	2022-06-13 12:57:59 -07:00
Simon Mo	7727dcdac7	[AIR][Serve] Accept predictor.predict kwargs in init (#25537 )	2022-06-13 11:46:43 -07:00
Dmitri Gekhtman	5b341ee666	[KubeRay][Minor][CI] Deflake autoscaling test Minor adjustment to e2e test logic of KubeRay test.	2022-06-13 11:00:47 -07:00
shrekris-anyscale	3278763dd7	[Serve] Start all Serve actors in the `"serve"` namespace only (#25575 )	2022-06-13 10:31:28 -07:00
shrekris-anyscale	2950a4c37a	[Serve] Persist Serve config for REST API (#25651 )	2022-06-13 09:53:21 -07:00
Jimmy Yao	7bb142e3e4	[AIR] Refactor `ScalingConfig` key validation (#25549 ) Follow another approach mentioned in #25350. The scaling config is now converted to the dataclass letting us use a single function for validation of both user supplied dicts and dataclasses. This PR also fixes the fact the scaling config wasn't validated in the GBDT Trainer and validates that allowed keys set in Trainers are present in the dataclass. Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-06-13 18:43:24 +02:00
Simon Mo	feb8c29063	Revert "Revert "Revert "use an agent-id rather than the process PID (#24968 )"… (#25376 )" (#25669 ) This reverts commit `cb151d5ad6`.	2022-06-13 09:22:52 -07:00
Kai Fricke	b574f75a8f	[tune/ci] Multinode support killing nodes in Ray client mode (#25709 ) The multi node testing utility currently does not support controlling cluster state from within Ray tasks or actors; it currently requires Ray client. This makes it impossible to properly test e.g. fault tolerance, as the driver has to be executed on the client machine in order to control cluster state. However, this client machine is not part of the Ray cluster and can't schedule tasks on the local node, which is required by some utilities, e.g. checkpoint to driver syncing. This PR introduces a remote control API for the multi node cluster utility that utilizes a Ray queue to communicate with an execution thread. That way we can issue cluster commands from within the Ray cluster.	2022-06-13 18:17:12 +02:00
Amog Kamsetty	7a81d488e5	[Autoscaler] Update default AMIs to latest versions (#25684 ) Closes #25588 NVIDIA recently pushed updates to the CUDA image removing support for end of life drivers. Therefore, the default AMIs that we previously had for OSS cluster launcher are not able to run the Ray GPU Docker images. This PR updates the default AMIs to the latest Deep Learning versions. In general, we should periodically update these AMIs, especially when we add support for new CUDA versions. I manually confirmed that the nightly Ray docker images work with the new AMI in us-west-2.	2022-06-13 17:00:43 +02:00
SangBin Cho	856bea31fb	[State Observability] Ray log CLI / API (#25481 ) This PR implements the basic log APIs. Higher-level APIs (such as ray logs actors) will be implemented after the internal API review is done. # If there's only 1 match, print the file content. Otherwise, print all files that match the glob. ray logs [glob_filter] --node-id=[head node by default] Args: --tail: Tail the last X lines --follow: Follow the new logs --actor-id: The actor id --pid --node-ip: For worker logs --node-id: The node id of the log --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)	2022-06-13 05:52:57 -07:00
Sven Mika	ca10530a1a	[Serve; RLlib; Docs] Change terms in Serve+RLlib example (Trainer -> Algorithm). (#25700 )	2022-06-13 11:43:38 +02:00
Jiao	f8b0ab7e78	[Ray DAG] Add documentation in `more options` section (#25528 )	2022-06-12 09:47:20 -07:00
Eric Liang	b52cd964cb	[docs] Move the workflows (alpha) library to the more libraries section for now (#25704 )	2022-06-11 19:47:45 -07:00
Philipp Moritz	d8ec5929b6	Exclude Bazel build files from Ray wheels (#25679 ) Including the Bazel build files in the wheel leads to problems if the Ray wheels are brought in as a dependency from another bazel workspace, since that workspace will not recurse into the directories of the wheel that contain BUILD files -- this can lead to dropped files. This only happens for macOS wheels, on linux wheels the BUILD files were already excluded.	2022-06-11 16:05:59 -07:00
Kai Fricke	736c7b13c4	[CI] Fix team to `rllib` (from `ml`) for some replay buffer API tests. (#25702 )	2022-06-11 18:05:16 +02:00
Sven Mika	130b7eeaba	[RLlib] `Trainer` to `Algorithm` renaming. (#25539 )	2022-06-11 15:10:39 +02:00
Yi Cheng	0c527b4502	[1/2][serve] Use GcsClient to replace the kv client to use timeout. (#25633 ) Timeout is only introduced in GcsClient because Ray client does not define timeouts well for its API, and it's a lot of effort to make it work e2e. For built-in components, we should use GcsClient directly. This PR uses GcsClient to replace the old one to integrate GCS HA with Ray Serve.	2022-06-10 23:41:49 -07:00
Eric Liang	d36fd77548	[air] Allow fusing task and actor stages if they have compatible resource types (#25683 )	2022-06-10 19:04:27 -07:00
Clark Zinzow	4fb92dd2f1	[Datasets] Fix `__array__` protocol on `TensorArrayElement` and `TensorArray`. (#25647 ) This PR fixes two issues with the __array__ protocol on the tensor extension: 1. The __array__ protocol on TensorArrayElement was missing the dtype parameter, causing np.asarray(tae, dtype=some_dtype) calls to fail. This PR adds support for the dtype argument. 2. TensorArray and TensorArrayElement didn't support NumPy's scalar casting semantics for single-element tensors. This PR adds support for these scalar casting semantics.	2022-06-10 16:42:16 -07:00
Richard Liaw	1dd714e0fa	[rfc][doc] Add clarity to stability guidelines (#25611 )	2022-06-10 15:19:21 -07:00
Avnish Narayan	d0f975e00f	[RLlib] Fix broken link replay buffer docs. (#25666 )	2022-06-10 21:18:59 +02:00
mwtian	dcfed617e5	[Core] fix gRPC handlers' unlimited active calls configuration (#25626 ) Ray's gRPC server wrapper configures a max active call setting for each handler. When the max active call is -1, the handler is supposed to allow handling an unlimited number of requests concurrently. However, in practice it is often observed that handlers configured with unlimited active calls are still handling at most 100 requests concurrently. This is a result of the existing logic: At a high level, each gRPC method is associated with a number of ServerCall objects (acting as "tags") in the gRPC completion queue. When there is no tag for a method, the gRPC server thread will not be able to poll requests for that method from the completion queue. After a request is polled from the completion queue, it is processed by the polling gRPC server thread, then queued to an eventloop. When a handler is in the "unlimited" mode, it creates a new ServerCall object (tag) before actual processing. The problem is that new ServerCalls are created on the eventloop instead of the gRPC server thread. When the event loop runs a callback from the gRPC server, the callback creates a new ServerCall object, and can run the gRPC handler to completion if the handler does not have any async step. So overall, the event loop will not run more callbacks than the initial number of ServerCalls, which is 100 in the "unlimited" mode. The solution is to create a new ServerCall in the gRPC server thread, before sending the ServerCall to the eventloop. Running some nightly tests to verify the fix does not introduce instabilities: https://buildkite.com/ray-project/release-tests-branch/builds/652 Also, looking into adding gRPC server / client stress tests with a large number of concurrent requests.	2022-06-10 11:28:41 -07:00
Jiao	6b9b1f135b	[Deployment Graph] Move files out of `pipeline` folder (#25630 )	2022-06-10 10:39:03 -07:00
Sihan Wang	2546fbf99d	[Serve] Autoscaling for deployment graph (#25424 )	2022-06-10 10:21:49 -07:00
mwtian	1483c4553c	use smaller instance for scheduling tests (#25635 ) m5.16xlarge instances have 64 CPU and 256GB memory, which are overkill for scheduling tests that do not have a lot of computations. Use smaller instance m5.4xlarge to save cost and make allocating instances easier.	2022-06-10 17:09:35 +00:00
Simon Mo	271c7d73ac	[AIR][Serve] Add support for multi-modal array input (#25609 )	2022-06-10 09:19:42 -07:00
Sven Mika	7c39aa5fac	[RLlib] Trainer.training_iteration -> Trainer.training_step; Iterations vs reportings: Clarification of terms. (#25076 )	2022-06-10 17:09:18 +02:00
Artur Niederfahrenhorst	94d6c212df	[RLlib] Replay Buffer API documentation. (#24683 )	2022-06-10 16:47:51 +02:00
Artur Niederfahrenhorst	c3645928ca	[RLlib] Fix no gradient clipping happening in QMix. (#25656 )	2022-06-10 13:51:26 +02:00
Avnish Narayan	730df43656	[RLlib] Issue 25503: Replace torch.range with torch.arange. (#25640 )	2022-06-10 13:21:54 +02:00
kourosh hakhamaneshi	b3a351925d	[RLlib] Added meaningful error for multi-agent failure of SampleCollector in case no agent steps in episode. (#25596 )	2022-06-10 12:30:43 +02:00
Artur Niederfahrenhorst	8af9ef8fee	[RLlib] Discussion 6432: Automatic `train_batch_size` calculation fix. (#25621 )	2022-06-10 12:15:57 +02:00
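
Commit 679f66eeee above describes replacing a scan over all cluster nodes with a lookup in a placement-group location index. The following is a minimal sketch of that general idea only, not Ray's GCS scheduler code: the class `BundleLocationIndex`, the function `find_nodes_by_scan`, and the node and placement-group IDs are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def find_nodes_by_scan(nodes, pg_id):
    """O(N) baseline: inspect every node for bundles of the given group.

    `nodes` maps a node ID to a list of (pg_id, bundle_index) pairs.
    This layout is an assumption made for the sketch.
    """
    return [
        node_id
        for node_id, bundles in nodes.items()
        if any(bundle_pg == pg_id for bundle_pg, _ in bundles)
    ]


class BundleLocationIndex:
    """Hypothetical index from placement-group ID to hosting node IDs."""

    def __init__(self):
        self._nodes_by_pg = defaultdict(set)

    def add_bundle(self, pg_id, bundle_index, node_id):
        # Record the location once, when the bundle is placed.
        self._nodes_by_pg[pg_id].add(node_id)

    def nodes_for(self, pg_id):
        # One dictionary lookup, independent of the number of nodes.
        return self._nodes_by_pg.get(pg_id, set())


# Usage: build both views of the same (made-up) cluster state.
nodes = {
    "n1": [("pg1", 0)],
    "n2": [("pg2", 0)],
    "n3": [("pg1", 1)],
}
index = BundleLocationIndex()
for node_id, bundles in nodes.items():
    for pg_id, bundle_index in bundles:
        index.add_bundle(pg_id, bundle_index, node_id)
```

The trade-off in this sketch is the usual one for an index: each bundle placement pays a small bookkeeping cost so that each scheduling decision avoids touching every node.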

13172 commits
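
Commit 7bb142e3e4 above describes converting the scaling config to a dataclass so that one function can validate both user-supplied dicts and dataclass instances, and checking that the allowed keys declared by a Trainer exist on the dataclass. A minimal sketch of that pattern follows, using only the standard library. `ExampleScalingConfig`, its fields, and `validate_keys` are hypothetical stand-ins and do not reproduce Ray AIR's actual `ScalingConfig` or its validation code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExampleScalingConfig:
    """Hypothetical config; field names are assumptions for this sketch."""

    num_workers: Optional[int] = None
    use_gpu: bool = False
    trainer_resources: Optional[dict] = None


def to_dataclass(config):
    """Normalize a dict or dataclass instance to the dataclass form."""
    if isinstance(config, ExampleScalingConfig):
        return config
    # Unknown dict keys raise TypeError from the generated __init__.
    return ExampleScalingConfig(**config)


def validate_keys(config, allowed_keys):
    """Single validation path for dicts and dataclass instances."""
    known = {f.name for f in fields(ExampleScalingConfig)}
    unknown_allowed = sorted(set(allowed_keys) - known)
    if unknown_allowed:
        # The allowed keys themselves must exist on the dataclass.
        raise ValueError(f"Allowed keys not in dataclass: {unknown_allowed}")

    cfg = to_dataclass(config)
    defaults = ExampleScalingConfig()
    # A key counts as "set" if its value differs from the default.
    disallowed = sorted(
        name
        for name in known - set(allowed_keys)
        if getattr(cfg, name) != getattr(defaults, name)
    )
    if disallowed:
        raise ValueError(f"Keys not allowed for this trainer: {disallowed}")
    return cfg
```

Under these assumptions, `validate_keys({"num_workers": 2}, ["num_workers"])` returns a dataclass instance, while passing `{"use_gpu": True}` with the same allowed keys raises `ValueError`.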