Providing easy-access datasets is table stakes for a good Getting Started UX, but even with good in-package data, it can be difficult to make these paths accessible to the user. This PR adds an "example://" protocol that will resolve passed paths directly to our canned in-package example data.
Currently, when an actor has `max_restarts` > 0 and has crashed, the actor will enter RESTARTING state and then ALIVE. Imagine this scenario: an online service provides HTTP service and the proxy actor receives requests, forwards them to worker actors, and replies to clients with the execution results from worker actors.
```
-> Worker A (actor)
/
/
HTTP requests -------> Proxy (actor with HTTP server) ---> Worker B (actor)
\
\
-> ...
```
For each HTTP request, the proxy picks one worker (e.g. worker A) based on some algorithm, sends the request to it, and calls `ray.get()` to wait for the result. If for some reason the picked worker crashed, Ray will restart the actor, and `ray.get()` will throw an error. The proxy may pick another worker (e.g. worker B) and re-send the request to it. This is OK.
But new requests keep coming. The proxy may pick worker A again. But because worker A is still in RESTARTING state, it's not ready to serve requests. `ray.get()` on subsequent requests sent to worker A will hang until worker A is back online (ALIVE state). The proxy won't be able to reschedule these requests to another worker because currently there's no way to know if worker A is alive or not before sending a request. We can't say worker A is not alive just based on whether `ray.get()` hangs either.
To solve this issue, we change the semantics of `max_task_retries`.
* When max_task_retries is 0 (which is the default value), if the callee actor is in the RESTARTING state, subsequently submitted tasks will fail immediately with a RayActorError. Users can catch the RayActorError and implement their own fallback strategies to improve service availability and mitigate service outages.
* When max_task_retries is not 0, subsequently submitted tasks will be queued on the caller side and we only send them to the callee when the callee actor is back to the ALIVE state.
TODO
- [x] Add test cases.
- [ ] Update docs.
- [x] API change review.
This PR adds more example data for ongoing feature guide work. In addition to adding the new datasets, this also puts all example data under examples/data in order to separate it from the example code.
Fix the failure to unbreak nightly and unblock 1.13 release.
The root cause is the upgrade of GRPC to 1.45.2 made it slightly slow; this is an acceptable regression which is needed to make this upgrade.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
To facilitate easy demo/documentation/testing with realistic, small-sized yet ML-familiar data. Have it as a source file with code will make it self-contained, i.e. after user "pip install" Ray, they are all set to run it.
IRIS is a great fit: super classic ML dataset, simple schema, only 150 rows.
This is a notebook showing how to tune an xgboost model and analyze the results.
Also adds a `get_dataframe()` method to `ResultsGrid` to fetch the trial results.
Depends on #24483 for toctree.
Aiming to:
1. addressing the bug about concurrency group, see #19593
2. improving the stability of the ray call latency perf in online applications.
we're proposing using async post instead of `PostBlocking` in threadpool.
Note that since we have already had back pressure in the caller side, I believe this change is safe to merge and it doesn't break any behavior.
This PR uses the job id and group name as the prefix for storing meta information, aiming to provide the isolate ability for different jobs and different groups.
Before this PR, we can't use 2 groups in 1 Ray cluster, and we can not rerun a collective job once the last one is failed at initializing.
Some tests relying on AutoScalingCluster are flaky because ray.init() after AutoscalingCluster.start() is not guaranteed to work. Sometimes, it cannot find any running ray instances.
This was a holdover from when local resources for URIs were deleted directly from the runtime env agent, and the URI name itself needed to store the information of which plugin it corresponded to so the appropriate plugin's `delete()` function could be called. After the URI reference refactor, I don't think this is needed anymore.
Adds `get_dataframe()` method to pass through experiment analysis to result grid. Also fixes config dictionary returns which previously did not flatten the dict (even though the docstring suggested it would)