This PR includes / depends on #25709
The two concepts of Syncer and SyncClient are confusing, as is the current API for passing custom sync functions.
This PR refactors Tune's syncing behavior. The Sync client concept is hard deprecated. Instead, we offer a well defined Syncer API that can be extended to provide own syncing functionality. However, the default will be to use Ray AIRs file transfer utilities.
New API:
- Users can pass `syncer=CustomSyncer` which implements the `Syncer` API
- Otherwise our off-the-shelf syncing is used
- As before, syncing to cloud disables syncing to driver
Changes:
- Sync client is removed
- Syncer interface introduced
- _DefaultSyncer is a wrapper around the URI upload/download API from Ray AIR
- SyncerCallback only uses remote tasks to synchronize data
- Rsync syncing is fully depracated and removed
- Docker and kubernetes-specific syncing is fully deprecated and removed
- Testing is improved to use `file://` URIs instead of mock sync clients
## Why are these changes needed?
This is to refactor the interaction of state cli to API server from a hard-coded request workflow to `SubmissionClient` based.
See #24956 for more details.
## Summary
<!-- Please give a short summary of the change and the problem this solves. -->
- Created a `StateApiClient` that inherits from the `SubmissionClient` and refactor various listing commands into class methods.
## Related issue number
Closes#24956Closes#25578
## Why are these changes needed?
When schedule actors on pg, instead of iterating all nodes in the cluster resource, This optimize will directly queries corresponding nodes by looking at pg location index.
This optimization can reduce the complexity of the algorithm from O (N) to o (1),and N is the number of nodes. In particular, the more nodes in large-scale clusters, the better the optimization effect.
**This PR only optimize schedule by gcs, I will submit a PR for raylet scheduling later.**
In ant group, Now we have achieved the optimization in the GCS scheduling mode and obtained the following performance test results.
1、The average time of selecting nodes is reduced from 330us to 30us, and the performance is improved by about 11 times.
2、The total time of creating & executing 12,000 actors ranges from 271 (s) - > 225 (s) on average. Reduce time consumption by 17%.
More detailed solution information is in the issue.
## Related issue number
[Core/PG/Schedule]Optimize the scheduling performance of actors/tasks with PG specified #23881
This is carved out from https://github.com/ray-project/ray/pull/25558.
tlrd: checkpoint.py current doesn't support the following
```
a. from fs to dict checkpoint;
b. drop some marker to dict checkpoint;
c. convert back to fs checkpoint;
d. convert back to dict checkpoint.
Assert that the marker should still be there
```
It will be easier to develop if we could use a tool to organize / sort imports and not have to move them around by hand.
This PR shows how we could do this with isort (black doesn't quite do this per https://github.com/psf/black/issues/333)
After this PR lands everyone will need to update their formatter to include isort if they don't have it already, i.e.
pip install -r ./python/requirements_linters.txt
All future file changes will go through isort and may introduce a slightly larger PR the first time as it will clean up the imports.
The plan is to land this PR and also clean up the rest of the code in parallel by using this PR to format the codebase (so people won't get surprised by the formatter if the file hasn't been touched yet)
Co-authored-by: Clarence Ng <clarence@anyscale.com>
Follow another approach mentioned in #25350.
The scaling config is now converted to the dataclass letting us use a single function for validation of both user supplied dicts and dataclasses. This PR also fixes the fact the scaling config wasn't validated in the GBDT Trainer and validates that allowed keys set in Trainers are present in the dataclass.
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
The multi node testing utility currently does not support controlling cluster state from within Ray tasks or actors., but it currently requires Ray client. This makes it impossible to properly test e.g. fault tolerance, as the driver has to be executed on the client machine in order to control cluster state. However, this client machine is not part of the Ray cluster and can't schedule tasks on the local node - which is required by some utilities, e.g. checkpoint to driver syncing.
This PR introduces a remote control API for the multi node cluster utility that utilizes a Ray queue to communicate with an execution thread. That way we can instruct cluster commands from within the Ray cluster.
Closes#25588
NVIDIA recently pushed updates to the CUDA image removing support for end of life drivers. Therefore, the default AMIs that we previously had for OSS cluster launcher are not able to run the Ray GPU Docker images.
This PR updates the default AMIs to the latest Deep Learning versions. In general, we should periodically update these AMIs, especially when we add support for new CUDA versions.
I manually confirmed that the nightly Ray docker images work with the new AMI in us-west-2.
This PR implements the basic log APIs. For the better APIs (like higher level APIs like ray logs actors), it will be implemented after the internal API review is done.
# If there's only 1 match, print a file content. Otherwise, print all files that match glob.
ray logs [glob_filter] --node-id=[head node by default]
Args:
--tail: Tail the last X lines
--follow: Follow the new logs
--actor-id: The actor id
--pid --node-ip: For worker logs
--node-id: The node id of the log
--interval: When --follow is specified, logs are printed with this interval. (should we remove it?)
Including the Bazel build files in the wheel leads to problems if the Ray wheels are brought in as a dependency from another bazel workspace, since that workspace will not recurse into the directories of the wheel that contain BUILD files -- this can lead to dropped files.
This only happens for macOS wheels, on linux wheels the BUILD files were already excluded.
Timeout is only introduced in GcsClient due to the reason that ray client is not defining the timeout well for their API and it's a lot of effort to make it work e2e. For built-in component, we should use GcsClient directly.
This PR use GcsClient to replace the old one to integrate GCS HA with Ray Serve.
This PR fixes two issues with the __array__ protocol on the tensor extension:
1. The __array__ protocol on TensorArrayElement was missing the dtype parameter, causing np.asarray(tae, dtype=some_dtype) calls to fail. This PR adds support for the dtype argument.
2. TensorArray and TensorArrayElement didn't support NumPy's scalar casting semantics for single-element tensors. This PR adds support for these scalar casting semantics.
Currently when Raylets die, it is hard to figure out:
if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well.
reason of Raylet's death.
With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.
This exposes a low-cost way to perform a pseudo global shuffle.
For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during iteration of the dataset in which single nodes (1) must send a lot of data over the network, and (2) perform lots of disk reads if the dataset is spilled to disk.
This allows the workload to be spread across the nodes on which the dataset blocks are on.
Stats construction on the from_arrow and from_numpy (and from_pandas with Pandas block support disabled) is currently broken since we weren't resolving the block metadata before passing it to the stats, causing future ds.stats() calls to fail. This PR fixes this and adds some test coverage.
Drivebys:
- Adds stats for from_pandas() zero-copy path (metadata fetch only).
- Changes "from_numpy" stats stage name to "from_numpy_refs", to be consistent with stats for other from_*() APIs.