Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
What: Quotes pip install packages in local environment setup for client runner.
Why: Strings like pyarrow>=6.0.1<7.0.0 currently don't work as they are interpreted as output redirection.
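A minimal sketch of the quoting, assuming the install command is assembled as a shell string. `build_pip_command` is a hypothetical helper for illustration, not the client runner's actual code; `shlex.quote` is the standard-library function that makes a string safe to pass to a shell.

```python
import shlex


def build_pip_command(packages):
    """Return a shell-safe `pip install` command for the given requirement strings."""
    # Quote each requirement so the shell does not treat `>` or `<` as redirection.
    quoted = " ".join(shlex.quote(pkg) for pkg in packages)
    return f"pip install {quoted}"


print(build_pip_command(["pyarrow>=6.0.1<7.0.0", "requests"]))
# prints: pip install 'pyarrow>=6.0.1<7.0.0' requests
```

Only strings containing shell-special characters are wrapped in quotes; plain package names pass through unchanged.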
What: This PR adds a generic BatchPredictor class that offers an interface to run batch inference on Ray datasets. It takes a Predictor class and checkpoint as an input, and provides a predict(dataset) method to run scalable scoring inference.
Why: Currently users have to implement scorers themselves. This is mostly boilerplate and prone to errors, so we should provide a simple solution instead.
Note that this predictor also implements the Predictor interface.
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:
1. Specify node using id instead of ip since ip may not be unique for each node.
2. Support soft constraint so the task can be tolerant to node failures.
After this PR, the node-ip custom resource can be deprecated.
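The difference between a hard and a soft constraint can be illustrated with a toy model in plain Python. This is not Ray's scheduler: `NodeAffinity` and `choose_node` are hypothetical names, and the fallback rule (pick any other live node) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeAffinity:
    """Toy stand-in for a node-affinity strategy (hypothetical, not Ray's class)."""
    node_id: str
    soft: bool = False


def choose_node(strategy, alive_node_ids):
    """Pick a node for a task under the given affinity strategy."""
    if strategy.node_id in alive_node_ids:
        return strategy.node_id
    if strategy.soft and alive_node_ids:
        # Soft constraint: tolerate the missing node and fall back to another one.
        return sorted(alive_node_ids)[0]
    # Hard constraint: the task cannot be placed anywhere else.
    raise RuntimeError(f"node {strategy.node_id} is not available")
```

With a soft constraint, `choose_node(NodeAffinity("n3", soft=True), {"n1", "n2"})` falls back to a live node, while the same call with `soft=False` raises.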
`ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).
In some cases, the UDF for map_groups() may return values of different types, which should be disallowed.
This PR adds a unit test to make sure we raise an error in that case.
Update the torch policy to find the seq_lens using state_batches instead of input_dict. This helps handle the complex inputs to the model when the inbuilt preprocessing API is disabled.
What: Only open (create) CSV files when actually reporting results.
Why: When trials crash before they report first (e.g. on init), they will have created an empty CSV file. When results are subsequently written, the CSV header is then missing.
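The idea can be sketched as a logger that defers opening the file until the first result arrives. `LazyCSVLogger` is a hypothetical class for illustration, not the actual Tune logger; it assumes results are flat dictionaries.

```python
import csv


class LazyCSVLogger:
    """Write results to CSV, creating the file only on the first result."""

    def __init__(self, path):
        self._path = path
        self._file = None
        self._writer = None

    def on_result(self, result):
        if self._writer is None:
            # First result: create the file and write the header row.
            self._file = open(self._path, "w", newline="")
            self._writer = csv.DictWriter(
                self._file, fieldnames=list(result.keys()), extrasaction="ignore"
            )
            self._writer.writeheader()
        self._writer.writerow(result)
        self._file.flush()

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
            self._writer = None
```

A trial that crashes before reporting never calls `on_result`, so no empty file is left behind and the header is always written together with the first row.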
Copied from #23784.
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
Modified to fix lint.
This is a rebased version of #11592. Task spec info is only needed when the GCS creates or starts an actor, so we can remove it from the actor table and save serialization time and memory/network cost when GCS clients fetch actor info from the GCS.
Because the internal repository differs substantially from the community one, this PR is a simple cherry-pick with some manual checks. Comments are welcome; in the meantime I'll check whether any test cases fail or anything was missed.
What: Skips left-over checkpoint_tmp* directories when loading experiment analysis. Also loads iteration number from metadata file rather than parsing the checkpoint directory name.
Why: Sometimes temporary checkpoint directories are not deleted correctly when restoring (e.g. when interrupted). In these cases, they shouldn't be included in experiment analysis. Parsing their iteration number also failed, and should generally be done by reading the metadata file, not by inferring it from the directory name.
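A rough sketch of this loading logic, under stated assumptions: the metadata file name (`metadata.json`), its JSON layout with an `iteration` key, and the helper `list_checkpoints` are all hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path


def list_checkpoints(trial_dir, metadata_name="metadata.json"):
    """Return sorted (iteration, path) pairs, skipping temporary checkpoint dirs."""
    checkpoints = []
    for path in Path(trial_dir).iterdir():
        if not path.is_dir() or not path.name.startswith("checkpoint_"):
            continue
        if path.name.startswith("checkpoint_tmp"):
            # Left-over temporary directory, e.g. from an interrupted restore.
            continue
        meta_file = path / metadata_name
        if not meta_file.is_file():
            continue
        # Read the iteration from metadata instead of parsing the directory name.
        iteration = json.loads(meta_file.read_text())["iteration"]
        checkpoints.append((iteration, str(path)))
    return sorted(checkpoints)
```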
What: This introduces a general utility to synchronize directories between two nodes, derived from the RemoteTaskClient. This implementation uses chunked transfers for more efficient communication.
Why: Transferring files over 2GB in size leads to superlinear time complexity in some setups (e.g. local macbooks). This could be due to memory limits, swapping, or gRPC limits, and is explored in a different thread. To overcome this limitation, we use chunked data transfers which show quasi-linear scalability for larger files.
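Chunking itself can be sketched as follows. This shows only the splitting and reassembly of a file, not the actual node-to-node transport; `iter_chunks`, `write_chunks`, and the default chunk size are hypothetical choices for the example.

```python
def iter_chunks(path, chunk_size=8 * 1024 * 1024):
    """Yield the file at `path` as a sequence of byte chunks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def write_chunks(chunks, dest_path):
    """Reassemble chunks on the receiving side, one chunk in memory at a time."""
    with open(dest_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
```

Because only one chunk is held in memory at a time, memory use is bounded by the chunk size rather than by the file size.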
This PR preserves block order when transforming under the actor compute model. Before this PR, we were submitting block transformations in reverse order and creating the output block list in completion order.
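The ordering fix can be illustrated with a thread pool standing in for the actor pool (an assumption for the sake of a self-contained example; `transform_blocks` is a hypothetical helper). Each result is stored at its submission index, so the output order no longer depends on completion order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def transform_blocks(blocks, fn, max_workers=4):
    """Apply fn to every block in parallel and return results in input order."""
    results = [None] * len(blocks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit in input order and remember each block's position.
        future_to_index = {pool.submit(fn, block): i for i, block in enumerate(blocks)}
        for future in as_completed(future_to_index):
            # Tasks may finish in any order; the index restores the original one.
            results[future_to_index[future]] = future.result()
    return results
```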
Adds back the "Issue severity" dropdown to the bug template so that Ray users can have a way of reporting UX problems. Made some changes to try to streamline issue reporting:
- made this field optional
- moved the field to be last
- slightly changed some of the wording
NaN values do not have a well-defined ordering. When sorting metrics to determine the best checkpoint, we should always filter out checkpoints that are associated with NaN values.
Closes #23812
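A minimal sketch of the NaN filtering described in this change, assuming checkpoints are represented as (path, metric) pairs; `best_checkpoint` is a hypothetical helper, not the actual implementation.

```python
import math


def best_checkpoint(checkpoints, mode="max"):
    """Return the path of the best checkpoint, ignoring NaN metrics."""
    # Comparisons with NaN are always False, so NaN entries would make
    # max()/min() depend on input order. Drop them before comparing.
    valid = [(path, metric) for path, metric in checkpoints if not math.isnan(metric)]
    if not valid:
        return None
    pick = max if mode == "max" else min
    return pick(valid, key=lambda item: item[1])[0]
```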
Takes care of the TODO left for SimpleImputer with most_frequent strategy by refactoring and optimising the logic for computing the most frequent value.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
The test_many_tasks.py case fails frequently, and we have found the reason.
We sleep for sleep_time seconds to wait for all tasks to finish, but the actual sleep time is computed as 0.1 * #rounds, where 0.1 is the sleep time per round.
This looks correct, but it misses one factor: the computation time elapsed in each round. In this case, that is the time consumed by
    cur_cpus = ray.available_resources().get("CPU", 0)
    min_cpus_available = min(min_cpus_available, cur_cpus)
In particular, ray.available_resources() takes quite a long time when the cluster is large (in our case, more than 1s with 1500 nodes).
What we expected to happen:
    for _ in range(int(sleep_time / 0.1)):
        sleep(0.1)
What actually happens:
    for _ in range(int(sleep_time / 0.1)):
        do_something()  # this takes time, sometimes a lot
        sleep(0.1)
We don't know why ray.available_resources() is slow or whether that is expected, but we can add a time check to make the sleep time precise.
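One way to make the total wait precise is to bound the loop by a wall-clock deadline instead of a round count. The sketch below is an illustration under that assumption; `poll_until_deadline` and its `poll` callback are hypothetical names, with `poll` standing in for the per-round work such as the ray.available_resources() call.

```python
import time


def poll_until_deadline(sleep_time, poll, interval=0.1):
    """Call poll() repeatedly for about sleep_time seconds in total.

    Time spent inside poll() counts against the budget instead of
    being added on top of it. Returns the number of rounds executed.
    """
    deadline = time.monotonic() + sleep_time
    rounds = 0
    while time.monotonic() < deadline:
        poll()
        rounds += 1
        remaining = deadline - time.monotonic()
        if remaining > 0:
            # Never sleep past the deadline.
            time.sleep(min(interval, remaining))
    return rounds
```

The total wait can still overshoot by the duration of the last `poll()` call, but it no longer grows with the number of rounds.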
Today we have two storage interfaces in GCS: InternalKvInterface, which exposes a key-value interface, and StoreClient, which is a key-value interface with secondary index support.
To make GCS storage pluggable, we need to narrow down and unify the storage interface. This PR is an attempt to use only the kv store and build the index purely in memory.
Known limitations:
- We need to rebuild the index during GCS startup.
- There might be consistency issues with concurrent changes (write/delete) to the same key, but the current Redis-based solution suffers from the same issue.
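The approach of keeping a secondary index purely in memory on top of a plain kv store can be sketched as follows. This is a simplified single-threaded illustration, not the GCS code: `IndexedKV`, the `index_of` callback, and the dict used as the kv backend are all assumptions.

```python
from collections import defaultdict


class IndexedKV:
    """Plain key-value store with a secondary index kept only in memory."""

    def __init__(self, store, index_of):
        self._store = store        # dict standing in for the persistent kv backend
        self._index_of = index_of  # maps a value to its secondary index key
        self._index = defaultdict(set)
        # The index is not persisted, so it is rebuilt from the store on startup.
        for key, value in store.items():
            self._index[index_of(value)].add(key)

    def put(self, key, value):
        self.delete(key)  # drop any stale index entry for this key first
        self._store[key] = value
        self._index[self._index_of(value)].add(key)

    def delete(self, key):
        if key in self._store:
            old_value = self._store.pop(key)
            self._index[self._index_of(old_value)].discard(key)

    def get_by_index(self, index_key):
        return {key: self._store[key] for key in self._index.get(index_key, ())}
```

The constructor loop corresponds to the first limitation: its cost grows with the number of stored keys and is paid on every startup.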
The PR https://github.com/ray-project/ray/pull/22820 introduced an API breakage for cross-language (xlang) usage: `ray.java_actor_class` has been unavailable since then.
This PR fixes it. We should remove these top-level APIs in 2.0 rather than in a minor version.
This PR adds an RLTrainer to Ray AIR. It works for both offline and online use cases. In offline training, it leverages the datasets key of the Trainer API to specify a dataset reader input, used e.g. in Behavioral Cloning (BC). In online training, it is a wrapper around the RLlib trainables that makes use of the parameter layering enabled by the Trainer API.