Why are these changes needed?
This adds a ray-storage based spilling backend, which can be enabled by setting the spill config to {"type": "ray_storage", "buffer_size": N}. This will cause Ray to spill to the configured storage (pyarrow FS).
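A minimal sketch of enabling the new backend, assuming it is wired up through `object_spilling_config` in `_system_config` like the existing spilling backends; the exact nesting of `buffer_size` may differ from what is shown here.

```python
import json
import ray

# Assumption: the new "ray_storage" spill type is passed through the same
# object_spilling_config mechanism used by the existing filesystem backend.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "ray_storage", "buffer_size": 1024 * 1024}
        )
    }
)
```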
In a future PR, I'll add documentation and deprecate the existing smart_open backend.
Differentiate between a "resources not available" error vs. other types of errors.
This happened to me when I was trying out the fake cluster: I was using Ray Client incorrectly, but because we were catching a generic `except Exception`, the error surfaced as "Timed out waiting for resources".
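As a rough illustration of the distinction (the exception and helper names here are illustrative, not the exact Ray internals):

```python
# Only a genuine "resources not available" condition should be reported as a
# resource timeout; any other failure (e.g. a misused Ray Client connection)
# should propagate with its real error message.
class ResourcesNotAvailableError(Exception):
    pass

def wait_for_resources(check_resources):
    try:
        check_resources()
    except ResourcesNotAvailableError as e:
        raise TimeoutError("Timed out waiting for resources") from e
    # Other exceptions are intentionally not caught here.
```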
This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.
- Move the URI reference logic from raylet to agent.
- Redefine the runtime env agent RPCs as `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible` (see the sketch after this list)
- More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528
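To illustrate the get-or-create / delete-if-possible semantics, here is a conceptual sketch with reference counting; the names and structure are illustrative and do not mirror the actual agent code.

```python
from collections import defaultdict
from threading import Lock

class RuntimeEnvCache:
    """Idempotent create-or-get plus reference-counted deletion."""

    def __init__(self):
        self._lock = Lock()
        self._envs = {}                # serialized runtime env -> created context
        self._refs = defaultdict(int)  # serialized runtime env -> reference count

    def create_or_get(self, serialized_env: str):
        with self._lock:
            if serialized_env not in self._envs:
                # Hypothetical setup step; the real agent sets up pip/conda/URIs here.
                self._envs[serialized_env] = {"context": f"ctx-{hash(serialized_env)}"}
            self._refs[serialized_env] += 1
            return self._envs[serialized_env]

    def delete_if_possible(self, serialized_env: str):
        with self._lock:
            self._refs[serialized_env] -= 1
            if self._refs[serialized_env] <= 0:
                # Last reference released: actually tear the environment down.
                self._envs.pop(serialized_env, None)
                self._refs.pop(serialized_env, None)
```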
Future work
- We don't remove `RuntimeEnvUris` from the `RuntimeEnv` protobuf in this PR because the GCS also uses those URIs to do GC via the `runtime_env_manager`. We should clean this up as well.
- The Ray Client server shouldn't interact with the agent directly; alternatively, the Ray Client server should also decrement the reference count.
- Currently, `WorkerPool::HandleJobStarted` is called multiple times for one job, so we need to make sure this function is idempotent. Can we change this logic so that the function is called only once?
Some commands in the Serve CLI use Ray Client and some commands ping the Ray dashboard; however, all commands read `RAY_ADDRESS` to get the address. This change raises a clear exception if the user accidentally passes a Ray Client address as the Ray dashboard address.
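A minimal sketch of the idea (the helper name and message are illustrative, not the actual Serve CLI code):

```python
import os

def get_dashboard_address() -> str:
    address = os.environ.get("RAY_ADDRESS", "http://localhost:8265")
    if address.startswith("ray://"):
        # A Ray Client address was supplied where a dashboard address is expected.
        raise ValueError(
            "RAY_ADDRESS is set to a Ray Client address (ray://...), but this "
            "command talks to the Ray dashboard REST API. Use an http:// address instead."
        )
    return address
```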
* Add new interface to policy for batch scheduling and unify the scheduling result and context
* Remove the dependence of GcsClient on ClusterResourceScheduler
* Fix compile error
* Fix lint error
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
The product backend doesn't yet understand that nightly Ray uses GCS-Ray. (This will be fixed the next time the product control plane is deployed.)
This PR introduces the env variable required to signal to the product backend that we're using GCS-Ray so that the autoscaler can start up correctly.
#23336 reverted #23283. #23283 did pass CI before merging. However, once merged it began to fail because it used commands in `test_cli.py` that were outdated on the master branch (specifically `serve info` instead of `serve config`). This change restores #23283 and updates its test commands.
Make sure Python dependencies can be imported on demand, without the background importer thread. Use cases are:
- If the pubsub notification for a new export is lost, importing can still be done.
- The background importer thread can be left unstarted without affecting Ray's functionality.
Add a feature flag to support forking from Python workers, by:
- Enabling fork support in gRPC.
- Disabling the importer thread and leaving only the main thread in the Python worker. The importer thread will not run after forking anyway.
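A conceptual sketch of what this enables (the Ray feature flag itself is not shown; `GRPC_ENABLE_FORK_SUPPORT` is a standard gRPC environment variable, and the task body is illustrative):

```python
import os

# Must be set before gRPC is loaded for gRPC's own fork handlers to engage.
os.environ["GRPC_ENABLE_FORK_SUPPORT"] = "1"

import ray

ray.init()

@ray.remote
def parent_task():
    # With the importer thread disabled, the worker is effectively
    # single-threaded on the Python side, so forking is much safer.
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: do some work, then exit without returning to Ray
    os.waitpid(pid, 0)
    return "forked and reaped a child process"

print(ray.get(parent_task.remote()))
```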
- Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improves the Job Submission main page
- Adds a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@SongGuyang @Catch-Bull @edoakes I know we discussed this earlier, but after thinking about it some more I think a more reasonable default is for `pip check` to be `False` by default. My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise. So a lot of users will hit an error when trying a simple `runtime_env` `pip` example, and possibly give up. Another less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR.
This also matches the default behavior of pip, which allows this situation to occur in the first place: `pip install` doesn't error when there's a dependency conflict; rather, the command succeeds, the package is installed and usable, and pip prints a warning (which is confusingly titled "ERROR").
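For example, with the proposed default a simple `runtime_env` `pip` example would look like this (the package name is just a placeholder):

```python
import ray

ray.init(
    runtime_env={
        "pip": {
            "packages": ["requests"],
            # Proposed default: don't fail environment creation if `pip check` fails.
            "pip_check": False,
        }
    }
)
```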
`serve shutdown` is not idempotent with the new Serve CLI. When Serve shuts down, it kills the controller. The REST API does not refresh its cached controller handle, so it attempts to make requests to a dead actor, and those requests fail.
This change updates the REST API and `serve.start()` to refresh the controller handle if the controller has been killed.
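A conceptual sketch of the refresh logic (names are illustrative, not the actual Serve internals):

```python
import ray
from ray.exceptions import RayActorError

_cached_controller = None

def get_controller_handle(controller_name: str):
    """Return a live controller handle, refreshing it if the cached actor is dead."""
    global _cached_controller
    if _cached_controller is not None:
        try:
            ray.get(_cached_controller.ping.remote())  # hypothetical health check
            return _cached_controller
        except RayActorError:
            _cached_controller = None  # controller was killed; look it up again
    _cached_controller = ray.get_actor(controller_name)
    return _cached_controller
```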
These changes expose `Application` as a public API. They also introduce a new public method, `serve.run()`, which allows users to deploy their `Applications` or `DeploymentNodes`. Additionally, the Serve CLI's `run` command and Serve's REST API are updated to use `Applications` and `serve.run()`.
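A short usage sketch of the new API (the deployment body is illustrative):

```python
from ray import serve

@serve.deployment
class Greeter:
    def __call__(self, request):
        return "Hello!"

# serve.run() deploys the Application built from this DeploymentNode.
serve.run(Greeter.bind())
```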
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR adds support for checkpoint ser/de. In particular, it special-cases the local data representation, which is converted into a bytes checkpoint on serialization. This way, checkpoint objects sent to remote tasks are guaranteed to always point to a valid data location within the remote task.
For now, we are not detecting pickling to/from disk (e.g. to pickle files).
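A conceptual sketch of the special-casing described above (not the actual Checkpoint implementation; names are illustrative):

```python
class LocalCheckpoint:
    """Checkpoint backed either by a node-local file or by raw bytes."""

    def __init__(self, local_path=None, data_bytes=None):
        self.local_path = local_path
        self.data_bytes = data_bytes

    def __getstate__(self):
        state = self.__dict__.copy()
        if self.local_path is not None and self.data_bytes is None:
            # On serialization, replace the node-local path with its contents so
            # the deserialized checkpoint is valid inside any remote task.
            with open(self.local_path, "rb") as f:
                state["data_bytes"] = f.read()
            state["local_path"] = None
        return state
```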
#22749 broke the release unit tests by not providing a legacy key; that key should be optional because we will be dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on Buildkite while they fail locally and in the release test pipeline. I'm investigating this now...