mirror of
https://github.com/vale981/ray
synced 2025-03-06 10:31:39 -05:00
96 lines
5 KiB
Markdown
96 lines
5 KiB
Markdown
![]() |
# Aliasing
|
||
|
|
||
|
An important feature of Photon is that a remote call sent to the scheduler
|
||
|
immediately returns object references to the outputs of the task, and the actual
|
||
|
outputs of the task are only associated with the relevant object references
|
||
|
after the task has been executed and the outputs have been computed. This allows
|
||
|
the worker to continue without blocking.
|
||
|
|
||
|
However, to provide a more flexible API, we allow tasks to not only return
|
||
|
values, but to also return object references to values. As an examples, consider
|
||
|
the following code.
|
||
|
```python
|
||
|
@op.distributed([], [np.ndarray])
|
||
|
def f()
|
||
|
return np.zeros(5)
|
||
|
|
||
|
@op.distributed([], [np.ndarray])
|
||
|
def g()
|
||
|
return f()
|
||
|
|
||
|
@op.distributed([], [np.ndarray])
|
||
|
def h()
|
||
|
return g()
|
||
|
```
|
||
|
A call to `h` will immediate return an object reference `ref_h` for the return
|
||
|
value of `h`. The task of executing `h` (call it `task_h`) will then be
|
||
|
scheduled for execution. When `task_h` is executed, it will call `g`, which will
|
||
|
immediately return an object reference `ref_g` for the output of `g`. Then two
|
||
|
things will happen and can happen in any order: `AliasObjRefs(ref_h, ref_g)`
|
||
|
will be called and `task_g` will be scheduled and executed. When `task_g` is
|
||
|
executed, it will call `f`, and immediately obtain an object reference `ref_f`
|
||
|
for the output of `f`. Then two things will happen and can happen in either
|
||
|
order, `AliasObjRefs(ref_g, ref_f)` will be called, and `f` will be executed.
|
||
|
When `f` is executed, it will create an actual array and `put_object` will be
|
||
|
called, which will store the array in the object store (it will also call
|
||
|
`SchedulerService::AddCanonicalObjRef(ref_f)`).
|
||
|
|
||
|
From the scheduler's perspective, there are three important calls,
|
||
|
`AliasObjRefs(ref_h, ref_g)`, `AliasObjRefs(ref_g, ref_f)`, and
|
||
|
`AddCanonicalObjRef(ref_f)`. These three calls can happen in any order.
|
||
|
|
||
|
The scheduler maintains a data structure called `target_objrefs_`, which keeps
|
||
|
track of which object references have been aliased together (`target_objrefs_`
|
||
|
is a vector, but we can think of it as a graph). The call
|
||
|
`AliasObjRefs(ref_h, ref_g)` updates `target_objrefs_` with `ref_h -> ref_g`.
|
||
|
The call `AliasObjRefs(ref_g, ref_f)` updates it with `ref_g -> ref_f`, and the
|
||
|
call `AddCanonicalObjRef(ref_f)` updates it with `ref_f -> ref_f`. The data
|
||
|
structure is initialized with `ref -> UNINITIALIZED_ALIAS` for each object
|
||
|
reference `ref`.
|
||
|
|
||
|
We refer to `ref_f` as a "canonical" object reference. And in a pair such as
|
||
|
`ref_h -> ref_g`, we refer to `ref_h` as the "alias" object reference and to
|
||
|
`ref_g` as the "target" object reference. These details are available to the
|
||
|
scheduler, but a worker process just has an object reference and doesn't know if
|
||
|
it is canonical or not.
|
||
|
|
||
|
We also maintain a data structure `reverse_target_objrefs_`, which maps in the
|
||
|
reverse direction (in the above example, we would have `ref_g -> ref_h`,
|
||
|
`ref_f -> ref_g`, and `ref_h -> UNINITIALIZED_ALIAS`). This data structure is
|
||
|
not particuarly important for the task of aliasing, but when we do reference
|
||
|
counting and attempt to deallocate an object, we need to be able to determine
|
||
|
all of the object references that refer to the same object, and this data
|
||
|
structure comes in handy for that purpose.
|
||
|
|
||
|
## Pulls and Remote Calls
|
||
|
|
||
|
When a worker calls `pull(ref)`, it first sends a message to the scheduler
|
||
|
asking the scheduler to ship the object referred to by `ref` to the worker's
|
||
|
local object store. Then the worker asks its local object store for the object
|
||
|
referred to by `ref`. If `ref` is a canonical object reference, then that's all
|
||
|
there is too it. However, if `ref` is not a canonical object reference but
|
||
|
rather is an alias for the canonical object reference `c_ref`, then the
|
||
|
scheduler also notifies the worker's local object store that `ref` is an
|
||
|
alias for `c_ref`. This is important because the object store does not keep
|
||
|
track of aliasing on its own (it only knows the bits about aliasing that the
|
||
|
scheduler tells it). Lastly, if the scheduler does not yet have enough
|
||
|
information to determine if `ref` is canonical, or if the scheduler cannot
|
||
|
yet determine what the canonical object reference for `ref` is, then the
|
||
|
scheduler will wait until it has the relevant information.
|
||
|
|
||
|
Similar things happen when a worker performs a remote call. If an object
|
||
|
reference is passed to a remote call, the object referred to by that object
|
||
|
reference will be shipped to the local object store of the worker that executes
|
||
|
the task. The scheduler will notify that object store about any aliasing that it
|
||
|
needs to be aware of.
|
||
|
|
||
|
## Passing Object References by Value
|
||
|
Currently, the graph of aliasing looks like a collection of chains, as in the
|
||
|
above example with `ref_h -> ref_g -> ref_f -> ref_f`. In the future, we will
|
||
|
allow object references to be passed by value to remote calls (so the worker
|
||
|
has access to the object reference object and not the object that the object
|
||
|
reference refers to). If an object reference that is passed by value is then
|
||
|
returned by the task, it is possible that a given object reference could be
|
||
|
the target of multiple alias object references. In this case, the graph of
|
||
|
aliasing will be a tree.
|