## Why are these changes needed?
Add metadata to workflow. Currently users have no way to attach metadata to a step or a workflow run, and workflow runtime metrics (other than status) are neither captured nor checkpointed.
We are adding several kinds of metadata:
1. Step-level user metadata. Can be set with `step.options(metadata={})`.
2. Step-level pre-run metadata. Captures pre-run values such as step_start_time; more metrics can be added later.
3. Step-level post-run metadata. Captures post-run values such as step_end_time; more metrics can be added later.
4. Workflow-level user metadata. Can be set with `workflow.run(metadata={})`.
5. Workflow-level pre-run metadata. Captures pre-run values such as workflow_start_time; more metrics can be added later.
6. Workflow-level post-run metadata. Captures post-run values such as workflow_end_time; more metrics can be added later.
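
The three step-level kinds of metadata listed above can be illustrated with a small standalone sketch. It does not use Ray; `run_step` and the layout of the `record` dictionary are hypothetical and only show how user metadata, pre-run metadata, and post-run metadata could sit side by side. The field names `step_start_time` and `step_end_time` come from the description above.

```python
import time


def run_step(func, *args, metadata=None):
    """Hypothetical helper (not Ray's API): run func and record metadata."""
    # User metadata is stored as given, separate from captured metrics.
    record = {"user_metadata": dict(metadata or {})}
    # Pre-run metadata is captured just before the step executes.
    record["pre_run"] = {"step_start_time": time.time()}
    result = func(*args)
    # Post-run metadata is captured once the step has finished.
    record["post_run"] = {"step_end_time": time.time()}
    return result, record


def add(a, b):
    return a + b


result, record = run_step(add, 1, 2, metadata={"owner": "alice"})
```

Keeping user metadata apart from the captured metrics means new metrics can be added later without colliding with user-chosen keys.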
## Related issue number
https://github.com/ray-project/ray/issues/17090
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
* [core] nicer error message for unpickleable exceptions
I ran into a case where we had an exception that could not be unpickled:
```
pickle.loads(pickle.dumps(filelock.Timeout()))
```
When a filelock.Timeout is raised on a worker, it gets surfaced in a way
that makes ray look like it was responsible:
```
ray.exceptions.RaySystemError: System error: __init__() missing 1 required positional argument: 'lock_file'
```
This PR turns the following stacktrace:
```
return ray.get(refs, timeout=timeout)
File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get
raise value
ray.exceptions.RaySystemError: System error: __init__() missing 1 required positional argument: 'lock_file'
traceback: Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 254, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 213, in _deserialize_object
return RayError.from_bytes(obj)
File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 28, in from_bytes
return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'lock_file'
```
into this:
```
...
return ray.get(refs, timeout=timeout)
File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get
raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 29, in from_bytes
return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'lock_file'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 254, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 213, in _deserialize_object
return RayError.from_bytes(obj)
File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 31, in from_bytes
raise RuntimeError("Failed to unpickle serialized exception") from e
RuntimeError: Failed to unpickle serialized exception
```
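
A minimal standalone reproduction of the failure mode and of the wrapping shown in the new stacktrace might look like the sketch below. `LockTimeout` is a hypothetical stand-in for an exception like `filelock.Timeout`, and this `from_bytes` is a simplified free function, not Ray's actual `RayError.from_bytes`.

```python
import pickle


class LockTimeout(Exception):
    """Hypothetical exception with a required constructor argument."""

    def __init__(self, lock_file):
        # args stays empty, so pickle will call LockTimeout() on load.
        super().__init__()
        self.lock_file = lock_file


def from_bytes(serialized_exception):
    """Simplified sketch: wrap unpickling failures in a clearer error."""
    try:
        return pickle.loads(serialized_exception)
    except Exception as e:
        raise RuntimeError("Failed to unpickle serialized exception") from e


# Pickling succeeds; only unpickling fails.
payload = pickle.dumps(LockTimeout("/tmp/example.lock"))

caught = None
try:
    from_bytes(payload)
except RuntimeError as err:
    caught = err
```

Because of `raise ... from e`, the original `TypeError` is preserved as `caught.__cause__`, which is what produces the "direct cause" section in the traceback.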
* lint
* test_unpickleable_stacktrace
* lint
* .
* .
Co-authored-by: hauntsaninja <>
* Make binpacking prioritize nodes better
Make binpacking prefer nodes that match multiple
resource types.
* spelling
* order demands when binpacking, starting from complex ones
* add stability to resource demand ordering
* lint
* logging
* add comments
* +comment
* use set
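
As a rough illustration of the ideas in this commit list (prefer nodes that match multiple resource types, order demands starting from the complex ones, keep that ordering stable), here is a simplified sketch. The function names and the scoring rule are assumptions made for illustration and are not the autoscaler's actual implementation.

```python
def sort_demands(demands):
    """Order demands so those with more resource types come first.

    sorted() is stable, so demands with the same number of resource
    types keep their original relative order.
    """
    return sorted(demands, key=len, reverse=True)


def matching_types(node_resources, demands):
    """Count the node's resource types requested by at least one demand."""
    requested = set()
    for demand in demands:
        requested.update(demand)
    return len(requested & set(node_resources))


def pick_node(nodes, demands):
    """Pick the node name that matches the most requested resource types."""
    return max(nodes, key=lambda name: matching_types(nodes[name], demands))


demands = [{"CPU": 1}, {"CPU": 1, "GPU": 1}, {"CPU": 2}]
nodes = {"cpu_only": {"CPU": 8}, "cpu_gpu": {"CPU": 4, "GPU": 1}}

ordered = sort_demands(demands)
best = pick_node(nodes, demands)
```

Here the `{"CPU": 1, "GPU": 1}` demand is placed first, the two CPU-only demands keep their original order, and `cpu_gpu` is preferred because it matches both requested resource types.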
## Why are these changes needed?
This PR implements `workflow.delete`, which allows users to delete the information in storage related to a workflow. (This assumes the workflow isn't currently running.)
## Related issue number
Closes #18848
In general, broadcasting changes to the replicas via the LongPollClient is hard to reason about (it circumvents our versioning semantics as there's no rolling update). Ideally we would only be using the LongPollClient to broadcast replica membership and nothing else.