Cherry-picks all docs changes for Serve in Ray 2.0.
I did this by overwriting the entire `doc/source/serve/` directory in addition to `doc/source/_toc.yml`. The changes should be isolated to Serve (manually verified).
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Why are these changes needed?
This PR updates the workflow docs to reflect recent changes, focusing on position changes and other updates.
Documentation updates for the newly introduced HTTPEventProvider and HTTPListener in Ray 2.0.
Co-authored-by: Yuan-Chi Chang <84025022+yuanchi2807@users.noreply.github.com>
The test was written incorrectly. The root cause was that the trainer and worker each require 1 CPU, meaning the placement group requires {CPU: 1} * 2 resources.
And when the max fraction is 0.001, we only allow up to 1 CPU for the placement group, so the requested placement groups cannot be scheduled in any case.
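For context, a rough sketch of the kind of setup the test exercises, assuming the `_max_cpu_fraction_per_node` argument on `ray.util.placement_group` (names and numbers here are illustrative, not copied from the test):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=2)

# The trainer and the worker each need 1 CPU, so the placement group asks
# for two {"CPU": 1} bundles. With _max_cpu_fraction_per_node=0.001 we only
# allow up to 1 CPU on the node to be reserved by placement groups, so this
# pg can never be scheduled and ready() never resolves.
pg = placement_group(
    bundles=[{"CPU": 1}, {"CPU": 1}],
    strategy="PACK",
    _max_cpu_fraction_per_node=0.001,
)
# ray.get(pg.ready(), timeout=5)  # would time out here
```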
Why are these changes needed?
This PR fixes edge cases when the max_cpu_fraction argument is used with a placement group. Specifically, there was an edge case where the placement group could not be scheduled when a task or actor was already scheduled and occupying resources.
The original logic to decide whether bundle scheduling exceeds the CPU fraction was as follows (a rough sketch follows the steps).
Calculate max_reservable_cpus of the node.
Calculate currently_used_cpus + bundle_cpu_request (per bundle) == total_allocation of the node.
Don't schedule if total_allocation > max_reservable_cpus for the node.
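A rough Python rendering of that original check (purely illustrative; the real logic lives in the C++ scheduler):

```python
from types import SimpleNamespace

def can_schedule_original(node, bundle_cpu_request, max_cpu_fraction):
    # Original (buggy) check: counts *all* CPUs in use on the node,
    # including CPUs occupied by plain tasks and actors.
    max_reservable_cpus = max_cpu_fraction * node.total_cpus
    total_allocation = node.currently_used_cpus + bundle_cpu_request
    return total_allocation <= max_reservable_cpus

# The scenario described below: 4 CPUs, an actor using 1 CPU,
# a 3-CPU placement group, and a 0.999 max fraction.
node = SimpleNamespace(total_cpus=4, currently_used_cpus=1)
print(can_schedule_original(node, bundle_cpu_request=3, max_cpu_fraction=0.999))  # False
```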
However, the following scenario caused issues because currently_used_cpus can include resources that are not allocated by placement groups (e.g., actors). As a result, when an actor was already occupying resources, the total_allocation was incorrect. For example:
4 CPUs
0.999 max fraction (so it can reserve up to 3 CPUs)
1 Actor already created (1 CPU)
PG with CPU: 3
Now the pg cannot be scheduled because total_allocation == 1 actor (1 CPU) + 3 bundles (3 CPUs) == 4 CPUs > 3 CPUs (max fraction CPUs).
However, this should work because the pg can use up to 3 CPUs, and we have enough resources (see the reproduction sketch below).
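A reproduction sketch of this scenario using the public API (assuming the `_max_cpu_fraction_per_node` argument; before this fix, `pg.ready()` below would never resolve even though 3 CPUs remain free):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=4)

@ray.remote(num_cpus=1)
class Occupier:
    def ping(self):
        return "pong"

# An actor already occupies 1 of the 4 CPUs.
occupier = Occupier.remote()
ray.get(occupier.ping.remote())

# 0.999 max fraction on a 4-CPU node allows up to 3 CPUs for placement groups.
pg = placement_group(
    bundles=[{"CPU": 1}] * 3,
    strategy="PACK",
    _max_cpu_fraction_per_node=0.999,
)
# Before the fix, the actor's CPU counted toward the fraction (1 + 3 > 3),
# so the pg never became ready; after the fix this returns promptly.
ray.get(pg.ready())
```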
The root cause is that when we calculate the max fraction, we should only take into account resources allocated by bundles. To fix this, I changed the logic as follows (sketched in code after the steps).
Calculate max_reservable_cpus of the node.
Calculate **currently_used_cpus_by_pg_bundles** + **bundle_cpu_request (sum of all bundles)** == total_allocation_from_pgs_and_bundles of the node.
Don't schedule if total_allocation_from_pgs_and_bundles > max_reservable_cpus for the node.
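And the corresponding sketch of the corrected check (again illustrative only):

```python
from types import SimpleNamespace

def can_schedule_fixed(node, bundle_cpu_request, max_cpu_fraction):
    # Fixed check: only CPUs already reserved by placement group bundles
    # count toward the fraction; CPUs used by tasks and actors are ignored.
    max_reservable_cpus = max_cpu_fraction * node.total_cpus
    total_allocation_from_pgs_and_bundles = (
        node.currently_used_cpus_by_pg_bundles + bundle_cpu_request
    )
    return total_allocation_from_pgs_and_bundles <= max_reservable_cpus

# Same example as above: the actor's CPU no longer counts, so the 3-CPU pg fits.
node = SimpleNamespace(total_cpus=4, currently_used_cpus_by_pg_bundles=0)
print(can_schedule_fixed(node, bundle_cpu_request=3, max_cpu_fraction=0.999))  # True
```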
When the node the controller was on died, GCS would try to reschedule the controller to the same node. But GCS only marks the node as failed after 120s when GCS restarts (or 30s if only the raylet died).
This PR fixes it by unpinning the controller from the head node, so as long as GCS is alive, it will reschedule the controller immediately. But we can't turn this on by default, so we introduce an internal flag for it.
Objects freed by the manual, internal free call previously would not get reconstructed. This PR introduces the following semantics after a free call (see the sketch after the list):
If no failure occurs, and the object is needed by a downstream task, an ObjectFreedError will be thrown.
If a failure occurs, causing a downstream task to be re-executed, the freed object will get reconstructed as usual.
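A small sketch of the first case, assuming the internal free API lives at `ray._private.internal_api.free` and that the error surfaces when the downstream task runs:

```python
import ray
from ray._private.internal_api import free

ray.init()

@ray.remote
def produce():
    return list(range(100_000))

@ray.remote
def consume(x):
    return len(x)

ref = produce.remote()
ray.wait([ref])   # make sure the object exists before freeing it
free([ref])       # manual/internal free call

try:
    # No failure happened, but a downstream task needs the freed object,
    # so per the new semantics this should raise (ObjectFreedError).
    ray.get(consume.remote(ref))
except Exception as e:
    print("downstream task failed as expected:", type(e).__name__)
```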
Also fixes some incidental bugs:
Don't crash on failure to contact local raylet during object recovery. This will produce a nicer error message because we will instead throw an application-level error when someone tries to get an object.
Fix a circular lock dependency between task failure and task dependency resolution.
Related issue number
Closes #27265.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
When we deserialize an actor handle via pickle, we register it with an outer object ref equal to itself, which is wrong. For out-of-band deserialization, there should be no outer object ref.
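For reference, "out-of-band" here means serializing the handle ourselves (e.g., with `ray.cloudpickle`) rather than passing it through a task or object; a rough sketch, assuming the deserializing code runs in a process attached to the same cluster:

```python
import ray
from ray import cloudpickle

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

handle = Counter.remote()

# Out-of-band round trip: we serialize and deserialize the handle ourselves.
# Previously, deserialization registered the handle with an outer object ref
# equal to itself; now no outer object ref is recorded on this path.
blob = cloudpickle.dumps(handle)
restored = cloudpickle.loads(blob)
print(ray.get(restored.incr.remote()))  # 1
```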
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
The cluster address is now written to a temp file. Previously we raised an error if `ray start --head` tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows `ray start --head` to continue if it can't find any GCS process associated with the recorded cluster address.
Related issue number
Closes #27021.