Why are these changes needed?
Today, the Ray scheduler always picks a random node if the resource requirement is empty, regardless of the scheduling policy/strategy.
However, for the node affinity scheduling policy, we should not pick a random node but instead try to stick to the node affinity constraints.
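The intended ordering can be illustrated with a toy model. This is only a sketch, not the scheduler's code: `NodeAffinity` and `pick_node` are hypothetical names, and resource feasibility is ignored entirely. The point is that the strategy is consulted before the random fallback, even when the resource request is empty.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeAffinity:
    """Hypothetical stand-in for a node affinity strategy."""
    node_id: str
    soft: bool = False  # if True, fall back to another node when the target is gone


def pick_node(
    nodes: List[str],
    resources: Dict[str, float],
    strategy: Optional[NodeAffinity] = None,
) -> Optional[str]:
    # Consult the strategy first, even if `resources` is empty.
    if strategy is not None:
        if strategy.node_id in nodes:
            return strategy.node_id
        if not strategy.soft:
            return None  # hard affinity and the target node is unavailable
    if not nodes:
        return None
    # Toy fallback: no feasibility check, just a random node.
    return random.choice(nodes)
```

With an empty resource request, `pick_node(["a", "b", "c"], {}, NodeAffinity("b"))` returns `"b"` rather than a random node.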
Newly pushed actors will never be used with existing pending submits, so the worker will not be used to speed up existing tasks. If _return_actor is called at the end of push instead, the actor is pushed to _idle_actors and immediately used if there are pending submits.
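A simplified model of the behavior described above, assuming a pool that keeps idle actors in `_idle_actors` and queued work in a `_pending_submits` queue. `ToyActorPool` is a hypothetical class written for illustration; it is not the real actor pool, and actors are never returned after finishing work in this sketch.

```python
from collections import deque


class ToyActorPool:
    """Hypothetical, single-threaded model of an actor pool."""

    def __init__(self, actors):
        self._idle_actors = list(actors)
        self._pending_submits = deque()
        self.started = []  # (actor, result) pairs for dispatched work

    def submit(self, fn, value):
        if self._idle_actors:
            actor = self._idle_actors.pop()
            self.started.append((actor, fn(actor, value)))
        else:
            # No idle actor: queue the work until one is returned.
            self._pending_submits.append((fn, value))

    def _return_actor(self, actor):
        self._idle_actors.append(actor)
        # Immediately drain one pending submit, if any.
        if self._pending_submits:
            self.submit(*self._pending_submits.popleft())

    def push(self, actor):
        # Going through _return_actor lets the new actor pick up queued work.
        self._return_actor(actor)
```

If `push` only appended to `_idle_actors`, the queued submit in this model would stay queued until some other actor was returned.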
This reverts commit 02f220b755.
## Why are these changes needed?
Looks like this commit makes `test_ray_shutdown` way more flaky. cc @mattip for further investigation after revert
<img width="760" alt="Screen Shot 2022-05-31 at 11 14 48 PM" src="https://user-images.githubusercontent.com/18510752/171339737-f48e6e90-391a-4235-bfac-a0aa0e563eb7.png">
## Related issue number
## Checks
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Move initialization of the `callback.results_preprocessor` property to the `callback.start_training()` method, which is called only once when training starts. Currently, initialization is triggered for every message.
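A minimal sketch of the pattern, assuming a callback that receives one call per message. Only `results_preprocessor` and `start_training()` come from the description above; `ToyCallback`, `handle_result`, and the factory argument are hypothetical.

```python
class ToyCallback:
    """Hypothetical callback that builds its preprocessor once."""

    def __init__(self, preprocessor_factory):
        self._factory = preprocessor_factory
        self.results_preprocessor = None

    def start_training(self):
        # Called once at the start of training: build the preprocessor here.
        self.results_preprocessor = self._factory()

    def handle_result(self, results):
        # Called per message: reuse the preprocessor instead of rebuilding it.
        return self.results_preprocessor(results)
```

The per-message path now only applies the preprocessor, so the construction cost is paid once per training run.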
This fixes `AttributeError: 'list' object has no attribute 'schema'`, which is raised when the read fusion flag is disabled and pipelines are windowed by bytes.
Broken out from https://github.com/ray-project/ray/pull/25167/files
This line:
```
pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm
```
also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested:
```
[INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ...
[INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm (ff6ae9f9)
```
Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after installing additional packages in order to catch these errors earlier.
Pointing contributors to the latest documentation is important because the workflow is always evolving. For example, the installation instructions for bazel on release are not representative of the current state on master. Hence, I propose updating the contribution links in the documentation to point to the latest state on master.
NOTE: This is not the official API improvement, but it will help with dogfooding the feature before the output is finalized.
This PR improves the output state/metadata of existing state APIs.
Ray sometimes stores errors as the object value in shared memory. These objects have no data since the error is stored in the metadata field. #25085 describes a bug where these objects fail to spill because the IO worker assumes that the data field must be non-empty. This would cause head-of-line blocking for any other objects to spill and cause the whole job to hang. This PR fixes the issue by spilling these objects anyway.
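To illustrate why an empty data field need not block spilling, here is a sketch of a length-prefixed layout that round-trips objects whose payload lives only in the metadata. The layout, `pack_object`, and `unpack_object` are assumptions made for this example and do not describe the actual spill format.

```python
import struct

# Hypothetical header: metadata length and data length as unsigned 64-bit ints.
HEADER = struct.Struct("<QQ")


def pack_object(metadata: bytes, data: bytes) -> bytes:
    # A zero-length data field is valid; only the lengths are recorded.
    return HEADER.pack(len(metadata), len(data)) + metadata + data


def unpack_object(blob: bytes):
    meta_len, data_len = HEADER.unpack_from(blob, 0)
    start = HEADER.size
    metadata = blob[start:start + meta_len]
    data = blob[start + meta_len:start + meta_len + data_len]
    return metadata, data
```

An error object with empty data, such as `pack_object(b"ERROR", b"")`, restores to the same metadata and an empty data field.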
Related issue number
Closes #25085.
If you pass a multidimensional input to `TorchPredictor.predict`, AIR raises an error. For more information about the error, see #25194.
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
When using Ray inside a virtualenv on Windows, the `python.exe` reported by `sys.executable` is a PEP 397 launcher for the actual Python interpreter, which is the process whose PID `os.getpid()` reports:
```
>>> import sys, os, psutil
>>> print(sys.executable)
C:\temp\issue24361\Scripts\python.exe
>>> os.getpid()
2208
>>> child = psutil.Process(2208)
>>> child.cmdline()
['C:\\oss\\CPython38\\python.exe']
>>> child.parent().cmdline()
['C:\\temp\\issue24361\\Scripts\\python.exe']
>>> child.parent().pid
6424
```
When the agent_manager launches the agent process via Process::Process(), it gets the PID of the launcher process (6424), which is what is expected as an ID when registering the agent in the gRPC callback. But inside agent.py, the child process reports the PID via os.getpid(), which is 2208, and this is the wrong PID to register the agent.
The solution proposed here is another version of #24905: it creates an `int agent_id = rand();` before starting the Python process and passes the `agent_id` to the process.
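The idea can be sketched from the Python side as follows. This is an illustration only: the parent here is Python rather than the C++ agent manager, and `launch_with_agent_id` and the inline child script are hypothetical. The child identifies itself with the ID it was handed, so it does not matter whether its own PID matches the PID the parent observed.

```python
import random
import subprocess
import sys

# Child script: report the ID received on the command line, not os.getpid().
CHILD_CODE = "import sys; print(sys.argv[1])"


def launch_with_agent_id():
    # Generate the ID before starting the child process.
    agent_id = random.randint(1, 2**31 - 1)
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(agent_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    reported_id = int(proc.stdout.strip())
    return agent_id, reported_id
```

Because the ID travels as an argument, an intermediate launcher process between parent and child does not change what the child reports.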
The tests in `test_torch_predictor.py` weren't running in CI. Also, `test_torch_predictor.py::test_init` was failing.
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>