## Why are these changes needed?
Previously, we did not send a new request while another request was still in flight. This is actually bad, because it prevents the raylet from getting the latest information. For example, if a request takes 200ms to reach the raylet, the raylet misses one update, and the next request only arrives after 200 + 100 + (in-flight time) ms. So we should still send the request.
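By way of illustration, a minimal async sketch of the new sending behavior (all names here are hypothetical; the actual sender lives in the raylet/GCS C++ code):

```python
import asyncio


async def report_resources_periodically(send_request, get_snapshot, period_s=0.1):
    """Hypothetical periodic reporter: always send the latest snapshot.

    Previously the loop skipped a tick while a request was still in flight,
    so a slow (e.g. 200ms) request meant the raylet missed one update and
    the next one arrived only after 200 + 100 + in-flight ms. Now we send
    on every tick regardless of outstanding requests.
    """
    in_flight = set()
    while True:
        snapshot = get_snapshot()  # latest resource usage
        # Do NOT skip when `in_flight` is non-empty; just send again.
        task = asyncio.ensure_future(send_request(snapshot))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)
        await asyncio.sleep(period_s)
```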
TODO:
- Push the snapshot to raylet if the message is lost.
- Handle message loss in raylet better.
## Related issue number
#19438
* [RLlib] Unify the way we create and use LocalReplayBuffer for all the agents.
This change:
1. Gets rid of the try...except clause around our execution_plan() calls,
   and with it the deprecation warning it produced.
2. Fixes the execution_plan() call in Trainer._try_recover() as well.
3. Most importantly, makes it much easier to create and use different types
   of local replay buffers for all our agents, e.g. a reservoir-sampling
   replay buffer for the APPO agent for Riot in the near future.
* Introduce explicit configuration for replay buffer types (see the sketch after this list).
* Fix `is_training` key error.
* Actually deprecate the buffer_size field.
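A hypothetical sketch of what such an explicit replay buffer configuration could look like (the key names and buffer class here are illustrative assumptions, not the final API):

```python
# Hypothetical config shape: the local replay buffer becomes an explicit,
# per-agent configuration entry instead of being hard-coded inside each
# agent's execution_plan().
config = {
    "env": "CartPole-v0",
    "replay_buffer_config": {
        # Could later point at e.g. a reservoir-sampling buffer for APPO.
        "type": "MultiAgentReplayBuffer",
        "capacity": 50000,
    },
}

# The trainer would then construct the configured buffer in its execution
# plan, e.g. DQNTrainer(config=config); kept as a comment because the exact
# keys above are illustrative.
```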
This RPC comes from legacy code and is not needed anymore (the task spec is already in the actor table), but it adds quite a few keys to Redis.
Below is the total size per key prefix (I am not sure it is exactly bytes; I summed the lengths of the values when querying Redis) when running many_ppo. As you can see, the TASK& and TASK: prefixes account for a large share even though they are not really used.
    defaultdict(int,
                {b'WORKE': 1080864,
                 b'ACTOR': 1470931,
                 b'TASK&': 1020646,
                 b'TASK:': 870551,
                 b'PROFI': 360000,
                 b'PLACE': 10107,
                 b'JOB:\x01': 8,
                 b'JOB:\x04': 8,
                 b'NODE:': 99,
                 b'NODE_': 126,
                 b'INTER': 44,
                 b'JOB:\x03': 8,
                 b'redis': 16,
                 b'JOB:\x02': 8,
                 b'JOB:\x05': 8})
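For reference, a minimal sketch of how such per-prefix totals could be collected with redis-py (hypothetical; it is not the exact query used above):

```python
from collections import defaultdict

import redis

# Hypothetical measurement script: group Redis keys by their first 5 bytes
# and sum the serialized value lengths under each prefix.
client = redis.Redis(host="127.0.0.1", port=6379)
prefix_sizes = defaultdict(int)

for key in client.scan_iter(count=1000):
    value = client.dump(key)  # serialized value; its length is a rough size proxy
    if value is not None:
        prefix_sizes[key[:5]] += len(value)

print(dict(prefix_sizes))
```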
## Why are these changes needed?
Add the functionality to retrieve metadata for a workflow or workflow step.
Design:
- Similar to `get_output`, this will return either the metadata for a workflow (`workflow.get_metadata(workflow_id)`) or the metadata for a specific step (`workflow.get_metadata(workflow_id, step_id)`).
- Exceptions are only raised if the workflow id or step id does not exist. A canceled job, a running job, etc. will return proper metadata by retrieving information from the checkpoint. See [here](8c8ca609d7/python/ray/workflow/tests/test_metadata_get.py (L67)) for more details.
- The returned metadata is an aggregated result from multiple checkpoint files, based on a previous [discussion](https://github.com/ray-project/ray/issues/17090#issuecomment-920481789). The aggregation logic is [here for step metadata](8c8ca609d7/python/ray/workflow/workflow_storage.py (L451)) and [here for workflow metadata](8c8ca609d7/python/ray/workflow/workflow_storage.py (L484)), and can be tuned with further discussion.
Example:
```python
>>> user_step_metadata = {"k1": "v1"}
>>> user_run_metadata = {"k2": "v2"}
>>> step_name = "simple_step"
>>> workflow_id = "simple"
>>> from ray import workflow
>>> @workflow.step
... def simple():
...     return 0
>>> simple.options(name=step_name, metadata=user_step_metadata).step().run(
...     workflow_id, metadata=user_run_metadata)
# get workflow-level metadata
>>> workflow.get_metadata("simple")
{'status': 'SUCCESSFUL',
'user_metadata': {'k2': 'v2'},
'stats': {'start_time': 1634173413.116535, 'end_time': 1634173413.149051}}
# get step-level metadata
>>> workflow.get_metadata("simple", "simple_step")
{'name': '__main__.simple',
'step_type': 'FUNCTION',
'workflows': [],
'max_retries': 3,
'workflow_refs': [],
'catch_exceptions': False,
'ray_options': {},
'user_metadata': {'k1': 'v1'},
'stats': {'start_time': 1634173413.131262, 'end_time': 1634173413.1347651}}
```
## Related issue number
https://github.com/ray-project/ray/issues/17090
This PR puts the final touches on Apple silicon support. There are 3 main caveats to supporting M1 Macs right now (described in the docs):
- Requires using forge.
- Requires special installation instructions to get grpc working (this is an underlying grpc issue, so ideally it will be fixed upstream).
- We are only publishing release wheels, not nightlies, right now.
This also includes a grpc import check to ensure that we provide an actionable error message if the user tries the regular `pip install ray` process, which does not properly install grpcio on M1.
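A minimal sketch of what such an import check could look like (the function name and message are illustrative, not the exact code added in this PR):

```python
import platform


def _check_grpcio_import():
    """Illustrative: fail fast with an actionable message on Apple silicon."""
    try:
        import grpc  # noqa: F401
    except ImportError:
        if platform.system() == "Darwin" and platform.machine() == "arm64":
            raise ImportError(
                "Failed to import grpcio on Apple silicon. A plain "
                "`pip install ray` does not install a working grpcio on M1 "
                "Macs; please follow the M1 installation instructions in "
                "the Ray docs."
            )
        raise
```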
## Why are these changes needed?
Recently we found a bug in the named actor cache. It only existed in our internal codebase, not in the community version, and the case was not covered by a test, so we did not know about it until a user reported it.
This adds an extra test to cover it.
Bug detail: we did not publish the actor's name when the actor died, so the cache kept mapping the name to the old actor handle. The actor's owner could not hit this bug because the cache currently does not apply to the owner.
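A hedged sketch of the kind of test that could catch this (modeled on Ray's pytest fixtures; not necessarily the exact test added in this PR):

```python
import time

import pytest

import ray


@ray.remote
class Counter:
    def ping(self):
        return "pong"


@ray.remote
def lookup_by_name(name):
    # Runs in a worker other than the actor's owner, where the name cache is
    # used (the owner bypasses the cache, so it never saw the bug).
    try:
        ray.get_actor(name)
        return True
    except ValueError:
        return False


def test_named_actor_cache_invalidated_on_death(ray_start_regular):
    a = Counter.options(name="cached_actor", lifetime="detached").remote()
    assert ray.get(a.ping.remote()) == "pong"
    assert ray.get(lookup_by_name.remote("cached_actor"))

    ray.kill(a)

    # After the actor dies, lookups from other workers should eventually fail
    # instead of resolving to a stale cached handle.
    deadline = time.time() + 10
    while time.time() < deadline:
        if not ray.get(lookup_by_name.remote("cached_actor")):
            break
        time.sleep(0.5)
    else:
        pytest.fail("Name cache still resolves to the dead actor.")
```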
postprocess_trajectory is referred to incorrectly in the rllib-environments documentation. When defining a custom policy, a user never directly modifies `Policy.postprocess_trajectory`; they define a `postprocess_fn`, which is in turn called by `postprocess_trajectory`.
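For context, a hedged sketch of the user-facing hook (the body is illustrative; the signature matches the `postprocess_fn` argument accepted by RLlib's policy builders such as `build_tf_policy`/`build_torch_policy`):

```python
from ray.rllib.evaluation.postprocessing import compute_advantages


def my_postprocess_fn(policy, sample_batch, other_agent_batches=None, episode=None):
    """User-supplied hook; Policy.postprocess_trajectory() calls it.

    Users pass this as the ``postprocess_fn`` argument of the policy builder
    instead of overriding Policy.postprocess_trajectory themselves.
    """
    # Example: add (critic-free) advantage estimates to the collected batch.
    return compute_advantages(
        sample_batch, last_r=0.0, gamma=0.99, use_gae=False, use_critic=False)
```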