## Why are these changes needed?
There are a few issues left over from previous PRs:
- Put `gcs_actor_scheduler_mock_test` back.
- Add a comment for the named actor creation behavior.
- Fix the comments for some flags.
## Related issue number
#19177
Quoting #19177 (comment) here:
The following tests fail when not skipped:
```
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro...
FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr...
FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209...
Results (395.22s):
36 passed
3 failed
- ray\tests/test_basic.py:197 test_user_setup_function
- ray\tests/test_basic.py:220 test_disable_cuda_devices
- ray\tests/test_basic.py:265 test_wait_timing
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23
Results (198.33s):
1 failed
- ray\tests/test_basic_3.py:169 test_fair_queueing
```
The following test passed when not skipped. Opening a PR to verify that.
`def test_oversized_function(ray_start_shared_local_modes)`
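For context, this is the kind of platform skip that gets removed when un-skipping such a test; the condition and reason below are hypothetical, not the actual marker in `python/ray/tests/test_basic.py`:

```python
import sys

import pytest

# Hypothetical skip marker; the real condition and reason in test_basic.py may differ.
@pytest.mark.skipif(sys.platform == "win32", reason="Previously flaky on Windows.")
def test_oversized_function(ray_start_shared_local_modes):
    ...
```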
## Why are these changes needed?
There are only two simple threaded actor tests in the Ray repo. This PR adds more complicated threaded actor tests to make sure this path is well tested.
The third test prints a lot of
```
(pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
```
which is the bug @scv119 fixed. Maybe we can start debugging from here to figure out when this happens and fix the real shutdown bugs.
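For reference, here is a minimal sketch of a threaded actor test in the spirit of the ones added here; the actor class, fixture, and call counts below are illustrative assumptions, not the PR's actual tests:

```python
import ray

# A threaded actor: method calls are served concurrently by a pool of 4 threads.
@ray.remote(max_concurrency=4)
class Echo:
    def ping(self, i):
        return i

def test_threaded_actor_many_calls(ray_start_regular_shared):
    actor = Echo.remote()
    # Issue many concurrent calls and verify that every one of them completes.
    results = ray.get([actor.ping.remote(i) for i in range(200)])
    assert results == list(range(200))
```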
## Related issue number
## Checks
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
## Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(
This lint rule cannot be applied only to changed lines because Ray currently builds with `-Winconsistent-missing-override`: either all or none of the overriding member functions in a derived class can carry the `override` / `final` annotation.
## Why are these changes needed?
The theory around #19270 is that two create-actor requests are sent to the same threaded actor due to retry logic. Specifically:
- The first request arrives and calls `CoreWorkerDirectTaskReceiver::HandleTask`; it is queued to be executed by the thread pool.
- Then the second request arrives and calls `CoreWorkerDirectTaskReceiver::HandleTask` again, before the first request has been executed and has called `worker_context_.SetCurrentTask`.
- This defeats the current dedupe logic and leads to `SetMaxActorConcurrency` being called twice, which fails the `RAY_CHECK`.
In this PR, we fix the dedupe logic by adding `SetCurrentActorId` and calling it from the task execution thread. This ensures the dedupe logic works for threaded actors.
We also noticed that `WorkerContext` is not thread safe for threaded actors, so this PR makes it thread safe as well.
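As an illustration only (a toy Python sketch, not the actual C++ core-worker code): deduping inside the request handler races with the execution thread, while deduping on an id that the execution thread itself records always catches the retried duplicate.

```python
import queue
import threading

class FakeReceiver:
    """Toy model of a receiver that must run actor creation exactly once."""

    def __init__(self):
        self._queue = queue.Queue()
        self._current_actor_id = None  # written only by the execution thread
        self.creations = 0

    def handle_task(self, actor_id):
        # Racy dedupe: if the first request is still queued, the id has not
        # been recorded yet, so a retried duplicate slips through this check.
        if self._current_actor_id == actor_id:
            return
        self._queue.put(actor_id)

    def execution_loop(self):
        while True:
            actor_id = self._queue.get()
            if actor_id is None:
                return
            # Reliable dedupe: the id is checked and recorded on the same
            # thread that executes the task, so the duplicate is dropped here.
            if self._current_actor_id == actor_id:
                continue
            self._current_actor_id = actor_id
            self.creations += 1  # the step that must not run twice

    def stop(self):
        self._queue.put(None)

receiver = FakeReceiver()
worker = threading.Thread(target=receiver.execution_loop)
worker.start()
receiver.handle_task("actor-1")  # original create-actor request
receiver.handle_task("actor-1")  # retried duplicate, arrives before execution
receiver.stop()
worker.join()
assert receiver.creations == 1
```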
## Related issue number
Closes #19270
## Checks
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
## Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(
## Why are these changes needed?
There are two issues fixed in this PR:
- Make sure we wait for the session's alive node count.
- Upgrade the machine type to match what's tested in OSS Ray.
## Related issue number
https://github.com/ray-project/ray/issues/19084
## Why are these changes needed?
Add metadata to workflows. Currently there is no way for users to attach metadata to a step or a workflow run, and workflow runtime metrics (other than status) are neither captured nor checkpointed.
We are adding several kinds of metadata (a usage sketch follows the list):
1. Step-level user metadata, which can be set with `step.options(metadata={})`.
2. Step-level pre-run metadata, which captures metrics such as `step_start_time`; more metrics can be added later.
3. Step-level post-run metadata, which captures metrics such as `step_end_time`; more metrics can be added later.
4. Workflow-level user metadata, which can be set with `workflow.run(metadata={})`.
5. Workflow-level pre-run metadata, which captures metrics such as `workflow_start_time`; more metrics can be added later.
6. Workflow-level post-run metadata, which captures metrics such as `workflow_end_time`; more metrics can be added later.
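A rough sketch of the user-facing calls above, assuming the `@workflow.step` API of this era; the step body, workflow id, and metadata keys are made-up examples, and storage/initialization setup is omitted:

```python
from ray import workflow

@workflow.step
def preprocess(x):
    return x * 2

# Step-level user metadata, attached via step.options(metadata={...}).
step = preprocess.options(metadata={"owner": "data-team"}).step(10)

# Workflow-level user metadata, attached at run time; pre-/post-run metadata
# such as *_start_time and *_end_time is captured by the workflow itself.
result = step.run(workflow_id="metadata_example",
                  metadata={"experiment": "baseline"})
```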
## Related issue number
https://github.com/ray-project/ray/issues/17090
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
## Why are these changes needed?
Before this PR, there is a race condition:
- Job registration starts.
- The driver starts to launch an actor.
- GCS registers the actor ===> crash.
- Job registration ends.
Actor registration should be forced to happen after driver registration. This PR enforces that.
## Related issue number
Closes #19172