This PR adds support for specifying an exception allowlist (List[Exception]) as the retry_exceptions argument, such that an application-level exception will only be retried if it is in the allowlist.
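The allowlist semantics can be sketched in plain Python, without Ray. This is a minimal illustration, not Ray's implementation: `run_with_retries` is a hypothetical helper, and matching subclasses via `isinstance` is an assumption.

```python
from typing import Callable, Sequence, Type


def run_with_retries(
    fn: Callable[[], object],
    max_retries: int,
    retry_exceptions: Sequence[Type[Exception]],
) -> object:
    """Call fn, retrying only when it raises an allowlisted exception type.

    Hypothetical helper that models the described behavior; it is not
    part of Ray.
    """
    allowlist = tuple(retry_exceptions)
    attempts_left = max_retries
    while True:
        try:
            return fn()
        except Exception as exc:
            # Exceptions outside the allowlist, or raised after the retry
            # budget is spent, propagate to the caller unchanged.
            if not isinstance(exc, allowlist) or attempts_left == 0:
                raise
            attempts_left -= 1
```

With an allowlist of `[ValueError]`, a function that raises `ValueError` is re-run up to `max_retries` times, while a `KeyError` surfaces on the first attempt.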
The contents of the two docs were switched.
The Unnecessary Ray Get images were already in the correct file, `unnecessary-ray-get.rst`, which made the swap noticeable beyond the URL.
Duplicate of #25247.
Adds a fix for Dask-on-Ray. Previously, for tasks with multiple return values, we implicitly allowed returning a dict with the return index as the key. This was used by Dask-on-Ray, but this is not documented behavior, and we now require task returns to be iterable instead.
This PR allows the user to override the global default for max_retries for non-actor tasks. It adds an OS environment variable called RAY_task_max_retries, which can be passed to the driver or set with runtime envs. Any future tasks submitted by that worker will default to this value instead of 3, the hard-coded default.
It would be nicer if we could have a standard way of setting these defaults, but I think this is fine as a one-off for now (not a clear need for overriding defaults of other @ray.remote options yet).
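The lookup order described above could look roughly like the following sketch. `resolve_max_retries` is a hypothetical function, not Ray's code, and the assumption that an explicit per-task `max_retries` takes precedence over the environment variable is ours.

```python
import os
from typing import Mapping, Optional

# Hard-coded default stated in the PR description.
HARD_CODED_DEFAULT_MAX_RETRIES = 3


def resolve_max_retries(
    explicit: Optional[int] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Pick the max_retries value for a non-actor task.

    Hypothetical helper. Assumed precedence: explicit per-task option,
    then the RAY_task_max_retries environment variable, then 3.
    """
    env = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    raw = env.get("RAY_task_max_retries")
    if raw is None:
        return HARD_CODED_DEFAULT_MAX_RETRIES
    return int(raw)
```

Passing `environ` explicitly keeps the sketch easy to exercise without mutating the process environment.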
Related issue number
Closes #24854.
Adds support for Python generators instead of just normal return functions when a task has multiple return values. This will allow developers to cut down on total memory usage for tasks, as they can free previous return values before allocating the next one on the heap.
The semantics for num_returns are about the same as for regular tasks: the function will throw an error if the number of values returned by the generator does not match the number of return values specified by the user. The one difference is that if num_returns=1, the task will throw the usual Python exception that the generator cannot be pickled.
As an example, this feature will allow us to reduce memory usage in Datasets shuffle operations (see #25200 for a prototype).
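The count check on generator tasks can be modeled in plain Python. This is a minimal sketch under stated assumptions: `collect_returns` is a hypothetical stand-in for the worker-side logic, and raising `ValueError` is an assumption about the error type.

```python
from typing import Any, Callable, Iterator, List


def collect_returns(
    gen_fn: Callable[[], Iterator[Any]], num_returns: int
) -> List[Any]:
    """Consume a generator one value at a time and enforce num_returns.

    Hypothetical helper; the list stands in for wherever each return
    value is stored once it has been produced.
    """
    stored: List[Any] = []
    for index, value in enumerate(gen_fn()):
        if index >= num_returns:
            raise ValueError(
                f"generator yielded more than num_returns={num_returns} values"
            )
        # Each value is handed off before the generator resumes, so the
        # task body can drop its reference before building the next one.
        stored.append(value)
    if len(stored) != num_returns:
        raise ValueError(
            f"generator yielded {len(stored)} values, expected {num_returns}"
        )
    return stored
```

Because values are consumed lazily, the task body never needs to hold all of its return values in memory at once, which is the saving the description refers to.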
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:
1. Specify node using id instead of ip since ip may not be unique for each node.
2. Support soft constraint so the task can be tolerant to node failures.
After this PR, the node-ip custom resource can be deprecated.
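The difference between a hard and a soft node constraint can be illustrated with a small sketch. `pick_node` is a hypothetical function, not Ray's scheduler, and falling back to the first alive node in sorted order is an arbitrary choice made only to keep the example deterministic.

```python
from typing import Set


def pick_node(target_node_id: str, soft: bool, alive_node_ids: Set[str]) -> str:
    """Choose a node for a task pinned to target_node_id.

    Hypothetical model of the described semantics: a hard constraint
    fails when the target node is gone, a soft one falls back elsewhere.
    """
    if target_node_id in alive_node_ids:
        return target_node_id
    if soft and alive_node_ids:
        # Assumed fallback policy, for illustration only.
        return sorted(alive_node_ids)[0]
    raise RuntimeError(f"node {target_node_id} is unavailable")
```

Keying on node ID rather than IP avoids ambiguity when several nodes share an IP address, as the first benefit above notes.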
This PR makes a number of major overhauls to the Ray core docs:
- Add a key-concepts section for {Tasks, Actors, Objects, Placement Groups, Env Deps}.
- Re-org the user guide to align with key concepts.
- Rewrite the walkthrough to link to mini-walkthroughs in the key concept sections.
- Minor tweaks and additional transition material.