Rolling out next deprecation cycle:
- Deprecation warnings that were previously emitted via `warnings.warn` or `logger.warn` are now raised as errors
- Deprecation warnings that were already raised as errors are now removed
- Notably, this involves deprecating the TrialCheckpoint functionality and associated cloud tests
- Added annotations to the deprecation warnings indicating when each can be fully removed
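The two stages of the cycle described above can be sketched as follows. This is a minimal illustration, not Ray's actual helper: the `deprecated` decorator, its `error` flag, and the `remove_in` annotation are hypothetical names chosen for the example.

```python
import functools
import warnings


def deprecated(message, *, error=False, remove_in=None):
    """Hypothetical helper: warn in the first stage, raise in the second."""
    note = f" (scheduled for removal in {remove_in})" if remove_in else ""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            text = f"{func.__name__} is deprecated: {message}{note}"
            if error:
                # Second stage: the former warning is now a hard error.
                raise DeprecationWarning(text)
            # First stage: warn, then still run the old code path.
            warnings.warn(text, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated("use new_api() instead", remove_in="a future release")
def old_api_soft():
    return "still works"


@deprecated("use new_api() instead", error=True)
def old_api_hard():
    return "unreachable"
```

Calling `old_api_soft()` still returns a value but emits a `DeprecationWarning`, while calling `old_api_hard()` raises immediately. The final stage of the cycle would delete `old_api_hard` altogether.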
Ray SGD v1 has been marked as a deprecated API for a while. This PR fully deprecates Ray SGD v1: an error is now raised on any attempt to import the `ray.util.sgd` package.
Closes #16435
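One way to make a package fail at import time is to raise from its `__init__.py`. The sketch below assumes that approach and uses a throwaway package named `legacy_sgd`, a hypothetical stand-in that is unrelated to the real `ray.util.sgd` code; the error message is also invented for the example.

```python
import importlib
import pathlib
import sys
import tempfile

# Build a throwaway package whose __init__ raises on import.
# "legacy_sgd" is a hypothetical stand-in, not the real ray.util.sgd.
_pkg_root = pathlib.Path(tempfile.mkdtemp())
(_pkg_root / "legacy_sgd").mkdir()
(_pkg_root / "legacy_sgd" / "__init__.py").write_text(
    "raise DeprecationWarning(\n"
    '    "legacy_sgd has been removed; migrate to its replacement."\n'
    ")\n"
)
sys.path.insert(0, str(_pkg_root))
importlib.invalidate_caches()


def try_import(name):
    """Return the error message if importing `name` fails, else None."""
    try:
        importlib.import_module(name)
    except DeprecationWarning as exc:
        return str(exc)
    return None


message = try_import("legacy_sgd")
print(message)
```

Because the exception is raised while the module body executes, the import never completes and the package does not end up in `sys.modules`.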
* Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)"
This reverts commit 00595653ed.
The failure on Windows has been addressed by registering the signal handler only if the signal is available on the platform.
- Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus").
- Adds end-to-end test logic confirming the fix.
- Adds end-to-end test logic confirming autoscaling with custom resources works.
- Slightly refines developer instructions.
- Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.
Ray Tune currently stops training gracefully on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) from being processed by child tasks, which means that Ray Tune runs started in remote tasks (e.g. via Ray client) cannot be interrupted gracefully.
In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this led to test flakiness, as the final experiment state could not be gracefully persisted to cloud storage.
This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.
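The idea can be sketched with the standard `signal` module. This is not Ray Tune's implementation: `GracefulStopper` and `run_training` are hypothetical names, and the loop body is a placeholder. The handler is installed only for signals the platform defines (SIGUSR1 does not exist on Windows), and `signal.signal` must be called from the main thread.

```python
import signal
import threading


class GracefulStopper:
    """Sets a flag when one of the chosen signals arrives."""

    def __init__(self, signal_names=("SIGINT", "SIGUSR1")):
        self.stop_requested = threading.Event()
        self.registered = []
        for name in signal_names:
            # SIGUSR1 is not defined on Windows, so look it up defensively.
            signum = getattr(signal, name, None)
            if signum is None:
                continue
            signal.signal(signum, self._handle)
            self.registered.append(name)

    def _handle(self, signum, frame):
        # Only record the request; the loop decides when it is safe to stop.
        self.stop_requested.set()


def run_training(stopper, max_steps, on_step=None):
    """Hypothetical training loop that checks the flag between steps."""
    completed = 0
    for step in range(max_steps):
        if stopper.stop_requested.is_set():
            break
        if on_step is not None:
            on_step(step)
        completed += 1
    # A real run would persist the final experiment state here.
    return completed
```

Checking the flag between steps, rather than raising from inside the handler, lets the current step finish before the loop exits.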
The recursive grep in the banned words check can get really messy when running locally depending on each person's directory structure or where the format script is being called from.
Moves the banned words check into a separate script so that it is no longer called by default in ./format.sh. Also adds this to the documentation.
In the [docs contributing page](https://docs.ray.io/en/master/ray-contribute/docs.html), the links to other docs pages point to master/ instead of latest/, which can be a bit confusing since this is not the live version of the docs that people will be used to seeing.
I added a couple additional clarifications and fixed a typo as well. I also mentioned the need for an image and linked to the image directory (though some subprojects have their own image directories as well, which I did not mention).
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
What: This PR adds a generic BatchPredictor class that offers an interface to run batch inference on Ray datasets. It takes a Predictor class and a checkpoint as input, and provides a predict(dataset) method to run scalable scoring inference.
Why: Currently users have to implement scorers themselves. This is mostly boilerplate and prone to errors, so we should provide a simple solution instead.
Note that this predictor also implements the Predictor interface.
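The shape of this interface can be illustrated with a small pure-Python sketch. It is not the code added by this PR: plain lists stand in for Ray datasets, a dict stands in for the checkpoint, and `ScalePredictor`, the `from_checkpoint` signatures, and the `batch_size` parameter are assumptions made for the example.

```python
from typing import Any, Dict, List, Type


class Predictor:
    """Hypothetical minimal interface: build from a checkpoint, then predict."""

    @classmethod
    def from_checkpoint(cls, checkpoint: Dict[str, Any]) -> "Predictor":
        raise NotImplementedError

    def predict(self, data: List[float]) -> List[float]:
        raise NotImplementedError


class ScalePredictor(Predictor):
    """Toy model: multiplies every input by a factor stored in the checkpoint."""

    def __init__(self, factor: float):
        self.factor = factor

    @classmethod
    def from_checkpoint(cls, checkpoint):
        return cls(checkpoint["factor"])

    def predict(self, data):
        return [x * self.factor for x in data]


class BatchPredictor(Predictor):
    """Runs a wrapped Predictor over a dataset one batch at a time."""

    def __init__(self, checkpoint: Dict[str, Any], predictor_cls: Type[Predictor]):
        self.checkpoint = checkpoint
        self.predictor_cls = predictor_cls

    @classmethod
    def from_checkpoint(cls, checkpoint, predictor_cls):
        return cls(checkpoint, predictor_cls)

    def predict(self, data, batch_size: int = 2):
        # Restore the model once, then score each batch independently.
        predictor = self.predictor_cls.from_checkpoint(self.checkpoint)
        results = []
        for start in range(0, len(data), batch_size):
            results.extend(predictor.predict(data[start:start + batch_size]))
        return results


batch_predictor = BatchPredictor.from_checkpoint({"factor": 3.0}, ScalePredictor)
predictions = batch_predictor.predict([1.0, 2.0, 3.0, 4.0, 5.0])
```

Since each batch is scored independently, a distributed version could hand batches to separate workers without changing the user-facing `predict` call.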
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:
1. Specify node using id instead of ip since ip may not be unique for each node.
2. Support soft constraint so the task can be tolerant to node failures.
After this PR, the node-ip custom resource can be deprecated.
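The two benefits can be illustrated with a toy model of the placement decision. This is not Ray's scheduler: `NodeAffinity` and `choose_node` are hypothetical stand-ins, and the fallback rule for the soft case (first alive node in sorted order) is an assumption made only to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class NodeAffinity:
    """Hypothetical stand-in for the strategy: a target node ID plus a soft flag."""

    node_id: str
    soft: bool = False


def choose_node(strategy: NodeAffinity, alive: Dict[str, str]) -> str:
    """Pick a node ID from `alive`, a mapping of node ID to IP address."""
    if strategy.node_id in alive:
        return strategy.node_id
    if strategy.soft and alive:
        # Soft constraint: tolerate the failure and fall back to another node.
        return sorted(alive)[0]
    raise RuntimeError(f"node {strategy.node_id} is unavailable")


# Two nodes share one IP, so only the ID identifies them unambiguously.
cluster = {"node-a": "10.0.0.5", "node-b": "10.0.0.5"}
```

With this model, `choose_node(NodeAffinity("node-b"), cluster)` selects `node-b` even though both nodes share an IP. A request for a missing node raises unless `soft=True`, in which case it falls back to another alive node.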
`ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).
1. Dataset pipelines are an advanced use of Ray Datasets and should not be crammed into the Getting Started page
2. We already have a separate/dedicated page called Pipelining Compute to cover the same content
Adding a FAQ page. Currently has some basic questions that have come up in the past.
Explains how to use Matplotlib inside the distributed training function, where threading requires special handling.