Although there's enough quota, it is possible the AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate)
Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests.
Related issue number
#24176: the added tests would verify the instance types which support the new shuffle implementations.
Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported.
Closes#16435
dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.
Copied from #23784.
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
Modified to fix lint.
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
What: Long running tests should use sdk file manager
Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.
Adds basic jobs release tests that connects to the test cluster and runs a basic tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests:
(1) Uses a local `working_dir`
(2) Uses a remote working_dir from a zip github URL.
This PR addresses recent failures in the tune cloud tests.
In particular, this PR changes the following:
The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected)
We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes).
Local release test runs succeeded.
https://buildkite.com/ray-project/release-tests-branch/builds/189https://buildkite.com/ray-project/release-tests-branch/builds/191
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.
RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.
Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.
Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
Fix bug from the previous fixes.
Add more tests
Stop using m5.xlarge (not supported now)
There are 2 hard blockers from the infra: 1. Large size disk is not supported. 2. m5.xlarge is not supported. Both are considered as a high priority to be fixed soon.
Run benchmark tests on k8s as well.
Note that until k8s cluster stability is confirmed, we will run the same tests twice at AWS and k8s. Once all benchmark tests look stable, we will start full migration
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.
The PR handles edge cases that originally existed in the old e2e.py job-based runners.