Resubmit the PR https://github.com/ray-project/ray/pull/19936
I've figure out that the test case `//rllib:tests/test_gpus::test_gpus_in_local_mode` failed due to deadlock in local mode.
In local mode, if the user code submits another task during the executing of current task, the `CoreWorker::actor_task_mutex_` may cause deadlock.
The solution is quite simple, release the lock before executing task in local mode.
In the commit 7c2f61c76c:
1. Release the lock in local mode to fix the bug. @scv119
2. `test_local_mode_deadlock` added to cover the case. @rkooo567
3. Left a trivial change in `rllib/tests/test_gpus.py` to make the `RAY_CI_RLLIB_DIRECTLY_AFFECTED ` to take effect.
Fix Remote code injection in Log4j
Log4j versions prior to 2.15.0 are subject to a remote code execution vulnerability via the ldap JNDI parser.
Check this refer: [CVE-2021-44228](https://github.com/advisories/GHSA-jfh8-c2jp-5v3q)
Now the Java LongPollClient's is not singleton, and a new polling thread will be created within a new LongPollClient for per RayServeHandle. It will degrade the performance of Replica Actor. So we change the LongPollClient's polling thread to singleton.
Why are these changes needed?
Replace the existing temp file to avoid the issue that the previous worker dies and leaves the temp file there, resulting in the next coming workers are not able to write a new temp file since there is an existing one.
Why are these changes needed?
If max concurrency is 1 in default group, a blocking task executing in default group will block the following tasks in different group. See reproduction script in #20475
The issue is due to tasks executing in the default concurrent group run in the main task execution thread, and tasks in other concurrent groups will be blocked if the main task execution thread is blocked.
This PR only changes concurrent actor behavior that default group will not block other groups.
Related issue number
Fix#20475
Why are these changes needed?
Add timeout(ms) param for Java ray.get. The API changes have been updated to doc ([Ray Core Walkthrough]->[Fetching Results]).
eg:
ObjectRef<Integer> objRef = Ray.put(1);
objRef.get(1000)
Ray.get(Ray.task(MyRayApp::slowFunction).remote(), 3000)
Related issue number
#20247
## Why are these changes needed?
This is a part of redis removal. This PR remove redis kv in function table.
rpush related code is not updated in this PR.
## Related issue number
This PR removes global named actor and global PGs.
I believe these APIs are not used widely in OSS.
CPP part is not included in this PR.
@kfstorm @clay4444 @raulchen Please take a look if this change is reasonable.
IMPORTANT NOTE: This is a Java API change and will lead backward incompatibility in Java global named actor and global PG usage.
CPP part is not included in this PR.
INCLUDES:
Remove setGlobalName() and getGlobalActor() APIs.
Remove getGlobalPlacementGroup() and setGlobalPG
Add getActor(name, namespace) API
Add getPlacementGroup(name, namespace) API
Update doc pages.
## Why are these changes needed?
This is part of redis removal project. In this PR all direct usage of redis got removed except function table.
Function table will be migrated in the next PR
## Related issue number
#19443
Why are these changes needed?
For Java worker, we generate a UUID string as the namespace if a job is not specified a namespace by user.
Related issue number
#16474
In general, broadcasting changes to the replicas via the LongPollClient is hard to reason about (it circumvents our versioning semantics as there's no rolling update). Ideally we would only be using the LongPollClient to broadcast replica membership and nothing else.