Release tests are failing in the Buildkite run, but succeed reliably on manual retry.
Suspected cause: not all nodes are available when the test runs with a large number of actors.
This PR adds a Serve HA (high availability) test. The flow of the test (sketched in code after the list) is:
1. Check the KubeRay build.
2. Start the Ray Serve service.
3. Warm up the cluster.
4. Start killing nodes.
5. Collect the stats and make sure they look good.
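A rough sketch of this flow, assuming a hypothetical Serve ingress URL and a caller-supplied node-killing hook (in the real test this goes through the KubeRay operator); durations, intervals, and the success-rate threshold are illustrative:

```python
import time
from typing import Callable

import requests

SERVE_URL = "http://localhost:8000/"   # assumed Serve ingress address
CHAOS_DURATION_S = 10 * 60             # how long to keep killing nodes
KILL_INTERVAL_S = 60                   # pause between node kills

def warm_up(num_requests: int = 1000) -> None:
    # Step 3: drive traffic until replicas are up and latencies stabilize.
    for _ in range(num_requests):
        requests.get(SERVE_URL, timeout=10)

def chaos_loop(kill_one_node: Callable[[], None]) -> float:
    # Steps 4-5: keep serving while nodes die, then report the success rate.
    ok = total = 0
    deadline = time.monotonic() + CHAOS_DURATION_S
    next_kill = time.monotonic() + KILL_INTERVAL_S
    while time.monotonic() < deadline:
        if time.monotonic() >= next_kill:
            kill_one_node()              # hypothetical hook, e.g. delete a worker pod
            next_kill += KILL_INTERVAL_S
        total += 1
        try:
            ok += requests.get(SERVE_URL, timeout=10).status_code == 200
        except requests.RequestException:
            pass
    return ok / total

# Step 5: the stats are "good" if the success rate stays above a threshold, e.g.:
# assert chaos_loop(kill_one_node=delete_random_worker_pod) > 0.999
```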
This is needed since we are stress-testing the State APIs in the release test, so we need a larger max limit than the system default; otherwise, the APIs would return an error.
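For context, a minimal sketch of the kind of State API call the test makes, assuming the Ray 2.0-era import path (the API later moved to `ray.util.state`); the limit value is illustrative:

```python
# Illustrative only: list a large number of actors via the state API. If the
# requested limit exceeds the server-side maximum, the call errors out, which
# is why the default maximum has to be raised for this stress test.
from ray.experimental.state.api import list_actors

actors = list_actors(limit=20_000)  # well above the default per-request limit
print(len(actors))
```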
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, since the setup overhead of vanilla training cannot be measured accurately and was simply disregarded.
By comparing the raw training times of the actual training loop, we get a more accurate picture of any overhead or benefit of using Ray vs. vanilla TensorFlow/Torch.
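A minimal sketch of the intended measurement: start the clock only once setup is done, so the same number is comparable between the Ray and vanilla runs. The callables passed in are placeholders for whatever model/data/step code the benchmark uses.

```python
import time
from typing import Callable, Iterable

def timed_training_loop(
    make_model: Callable[[], object],
    get_batches: Callable[[], Iterable],
    train_step: Callable[[object, object], None],
    num_epochs: int = 4,
) -> float:
    model = make_model()             # setup is excluded from the measurement
    start = time.monotonic()         # start timing only after all setup is done
    for _ in range(num_epochs):
        for batch in get_batches():
            train_step(model, batch)
    return time.monotonic() - start  # raw training-loop time, comparable across runs
```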
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: scv119 <scv119@gmail.com>
Why are these changes needed?
The microbenchmarks failed, complaining:
raise ValueError(f"Malformed address: {address}")
ValueError: Malformed address:
This is due to 55a0f7b; fix it by setting `RAY_ADDRESS="local"`.
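A minimal sketch of the effect of that setting (the benchmark itself may set the variable in its runtime environment rather than in Python):

```python
import os
import ray

# "local" tells ray.init() to start a fresh local cluster instead of trying to
# parse/connect to an existing address, avoiding the "Malformed address" error.
os.environ["RAY_ADDRESS"] = "local"
ray.init()
```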
- Add chaos tests for the 1 TB dataset random shuffle: both simple shuffle and push-based shuffle (a rough sketch of the shuffle workload follows the list).
- Mark dataset_shuffle_push_based_random_shuffle_1tb as stable.
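A minimal sketch of the shuffle workload these tests exercise, with a tiny placeholder dataset instead of 1 TB; the `RAY_DATASET_PUSH_BASED_SHUFFLE` environment variable matches the Ray Data docs of this era, but verify it against your Ray version:

```python
import os

# Must be set before the dataset is created; leave unset for the simple shuffle.
os.environ["RAY_DATASET_PUSH_BASED_SHUFFLE"] = "1"

import ray

ds = ray.data.range(100_000)     # placeholder for the ~1 TB synthetic dataset
shuffled = ds.random_shuffle()   # the operation the chaos test stresses while killing nodes
print(shuffled.count())
```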
Makes sure that tuning multiple trials in parallel is not significantly slower than training each trial individually.
Some overhead is expected.
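A minimal sketch of the comparison, assuming a placeholder trainable and an illustrative overhead threshold; the real test uses an actual training workload:

```python
import time
from ray import tune

def train_once(train_time_s: float) -> None:
    # Placeholder workload; the release test runs a real training loop here.
    time.sleep(train_time_s)

def trainable(config):
    train_once(config["train_time_s"])
    return {"train_time_s": config["train_time_s"]}  # final result for Tune

# Baseline: one trial, run directly without Tune.
start = time.monotonic()
train_once(30)
single_trial_s = time.monotonic() - start

# Tune: many identical trials in parallel across the cluster.
start = time.monotonic()
tune.run(trainable, config={"train_time_s": 30}, num_samples=16)
parallel_s = time.monotonic() - start

# Some overhead is expected, but it should stay within a small factor.
assert parallel_s < single_trial_s * 1.5, (single_trial_s, parallel_s)
```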
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Add benchmark data for 4x4 GPU setup.
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Following up on #26436, this PR adds a distributed benchmark test for TensorFlow FashionMNIST training. It compares training with Ray AIR to training with vanilla TensorFlow.
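A minimal sketch of the Ray AIR side of this comparison, assuming the Ray 2.0-era AIR APIs (`TensorflowTrainer`, `ScalingConfig`); the model, worker count, and epoch count are illustrative rather than the release test's exact configuration:

```python
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop_per_worker(config):
    import tensorflow as tf

    # Ray Train sets TF_CONFIG for each worker; the strategy picks it up.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    (x, y), _ = tf.keras.datasets.fashion_mnist.load_data()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    model.fit(x / 255.0, y, epochs=config["epochs"], batch_size=64)

trainer = TensorflowTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 4},
    scaling_config=ScalingConfig(num_workers=4),
)
result = trainer.fit()
```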
Signed-off-by: Kai Fricke <kai@anyscale.com>
This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works.
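A minimal sketch of one way such a test can exercise the autoscaler, with illustrative trial counts, CPU requests, and durations:

```python
import time
from ray import tune

def trainable(config):
    time.sleep(300)  # long enough for the autoscaler to react
    return {"done": True}

analysis = tune.run(
    trainable,
    num_samples=20,                   # more trials than fit on the initial nodes
    resources_per_trial={"cpu": 4},   # forces additional worker nodes to come up
)
# After the run, the test would poll the cluster until the extra worker nodes
# are scaled back down (e.g. by checking ray.nodes() over time).
```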
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR adds a distributed benchmark test for PyTorch MNIST training. It compares training with Ray AIR to training with vanilla PyTorch.
In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off in subprocesses on each node (using Ray tasks). In both cases, we collect the end-to-end runtime; a minimal sketch of the two paths follows.
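In the sketch below, `train_func` and `train_script.py` stand in for the shared training loop, and the node-pinning details of the vanilla launch are omitted; the AIR calls use the Ray 2.0-era APIs.

```python
import subprocess
import time

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Placeholder for the shared per-worker training loop (MNIST in the real test).
    pass

def run_air() -> float:
    """Ray AIR path: TorchTrainer with 4 CPU workers."""
    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    )
    start = time.monotonic()
    trainer.fit()
    return time.monotonic() - start

@ray.remote(num_cpus=1)
def run_script_on_node() -> None:
    # Placeholder: the real test ships a training script to each node and runs it
    # in a subprocess (e.g. via a torch.distributed launch command).
    subprocess.run(["python", "train_script.py"], check=True)

def run_vanilla(num_nodes: int = 4) -> float:
    """Vanilla path: kick off the same script on each node via Ray tasks."""
    start = time.monotonic()
    ray.get([run_script_on_node.remote() for _ in range(num_nodes)])
    return time.monotonic() - start
```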
Signed-off-by: Kai Fricke <kai@anyscale.com>
Adds a CI test for a 100 TB shuffle.
There is a custom config for this nightly test that (1) gives each node 4 TB of storage, (2) gives the head node 0 CPUs, and (3) sets worker nodes to half their actual vCPU count.
Related issue number
Closes #24480.
This adds "environments" to the release package that can be used to configure some environment variables. These variables will be loaded either by an `--env` argument or a `env` definition in the test definition and can be used to e.g. run release tests on staging.