Why are these changes needed?
This is one of a series of PRs to make CoreWorkerProcess thread-safe and the CoreWorker code easier to read. [#19675, #19677, #19678, #19679]
Move CoreWorkerOptions out of core_worker.h; makes the code easier to read.
Next PR: #19677
The ray-ml image depends on numpy ~=1.19.2 via the tensorflow==2.6 requirement. Unfortunately, that version is incompatible with Dataset (see the comment on #20258).
This PR upgrades the numpy dependency only for the nightly test.
The default block size of 500MiB seems too low for some common workloads, e.g. shuffling 500GB. That workload creates about 1000 blocks, which means about 1 million intermediate shuffle objects until we implement #20500.
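The arithmetic behind those numbers can be sketched as follows. This is only an illustration: the helper name `shuffle_object_count` is hypothetical, and it assumes that every input block produces one intermediate object per output block and that the shuffle has as many output blocks as input blocks.

```python
import math


def shuffle_object_count(dataset_bytes, block_bytes):
    """Return (num_blocks, num_intermediate_objects) for an all-to-all shuffle.

    Assumption: each of the N input blocks emits one object per output
    block, and there are N output blocks, so the total is N * N.
    """
    num_blocks = math.ceil(dataset_bytes / block_bytes)
    return num_blocks, num_blocks * num_blocks


# 500 GiB of data split into 500 MiB blocks.
blocks, objects = shuffle_object_count(500 * 1024**3, 500 * 1024**2)
print(blocks, objects)
```

Under these assumptions, doubling the block size halves the block count and cuts the number of intermediate objects by a factor of four.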
Before this PR, `ds.iter_batches()` would yield no batches if `prefetch_blocks > ds.num_blocks()` was given, since the sliding window semantics were to return no windows if `window_size > len(iterable)`. This PR tweaks the sliding window implementation to always return at least one window, even if the one window is smaller than the given window size.
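A minimal sketch of the adjusted sliding window semantics is shown below. It is an illustration rather than the exact code in this PR; the function name `sliding_window` is an assumption, and so is the choice to yield nothing for an empty iterable.

```python
from collections import deque
from itertools import islice


def sliding_window(iterable, n):
    """Yield sliding windows of size n over iterable.

    If the iterable has fewer than n items, a single shorter window is
    yielded instead of none. An empty iterable yields no windows
    (an assumption of this sketch).
    """
    it = iter(iterable)
    # Fill the first window with up to n items.
    window = deque(islice(it, n), maxlen=n)
    if len(window) > 0:
        yield tuple(window)
    # Slide the window forward one item at a time.
    for elem in it:
        window.append(elem)
        yield tuple(window)


print(list(sliding_window([1, 2, 3, 4], 2)))
print(list(sliding_window([1, 2], 5)))
```

The second call models `prefetch_blocks > ds.num_blocks()`: the window is larger than the input, and one short window is still produced.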
This should fix the long running release tests that are failing to build their app configs.
It seems like `pip install ray[all]` now downgrades the Ray version. It's unclear why, but most likely a dependency now pins the Ray version. This PR explicitly installs the version of Ray that we want after the `pip install ray[all]` step to fix the problem.
XGBoost's train_small timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test pass consistently by requesting 0 CPUs for the remote wrapper script.
## Why are these changes needed?
This PR adds the hiredis dependency for non-M1 machines.
This removes the `redis < 4.0` pin.
Since hiredis doesn't have M1 Mac wheels yet, users on those machines will see extra warning messages in their output if they use redis 4.0.
## Related issue number
## Checks
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Co-authored-by: Alex Wu <alex@anyscale.com>
## Why are these changes needed?
The `>` introduced in #20374 was interpreted by docker as a file redirect rather than a "greater than" comparison (strangely enough, bash interprets it differently when run locally).
## Related issue number
## Checks
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Co-authored-by: Alex <alex@anyscale.com>
Ray currently does not filter GCP TPU nodes by cluster name, which causes conflicts when multiple Ray clusters run in the same GCP account.
This change updates the TPU behavior to match the GCP compute node behavior, i.e. it only returns TPU nodes that belong to the current cluster.
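The filtering idea can be sketched as below. This is a simplified illustration, not the node provider code from this PR: nodes are modeled as plain dicts, and the helper name `filter_nodes_by_cluster` and the label key `ray-cluster-name` are assumptions made for the example.

```python
# Assumed label key used to tag nodes with their owning cluster.
CLUSTER_NAME_LABEL = "ray-cluster-name"


def filter_nodes_by_cluster(nodes, cluster_name):
    """Keep only nodes whose cluster-name label matches cluster_name.

    Each node is a dict with an optional "labels" dict (a stand-in for
    the GCP API response). Unlabeled nodes are treated as not belonging
    to any cluster.
    """
    return [
        node
        for node in nodes
        if node.get("labels", {}).get(CLUSTER_NAME_LABEL) == cluster_name
    ]


tpu_nodes = [
    {"name": "tpu-a", "labels": {CLUSTER_NAME_LABEL: "cluster-1"}},
    {"name": "tpu-b", "labels": {CLUSTER_NAME_LABEL: "cluster-2"}},
    {"name": "tpu-c"},  # no labels at all
]
print([n["name"] for n in filter_nodes_by_cluster(tpu_nodes, "cluster-1")])
```

With this kind of filter, two clusters in the same account no longer see each other's TPU nodes.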
Why are these changes needed?
Add a timeout parameter (in milliseconds) to the Java `Ray.get` API. The docs have been updated to cover the API change ([Ray Core Walkthrough] -> [Fetching Results]).
e.g.:
ObjectRef<Integer> objRef = Ray.put(1);
objRef.get(1000);
Ray.get(Ray.task(MyRayApp::slowFunction).remote(), 3000);
Related issue number
#20247