Release tests are failing in the Buildkite run, but succeed reliably on manual retry.
We suspect this is because not all nodes are available when running with a large number of actors.
CI is red because of a dependency issue around `dataclass_transform`.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
The minimum size is 300GB.
Signed-off-by: Alex Wu <alex@anyscale.io>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
This PR adds a Serve HA test. The flow of the test is as follows (a rough sketch follows the list):
1. Check the KubeRay build.
2. Start the Ray service.
3. Warm up the cluster.
4. Start killing nodes.
5. Collect the stats and make sure they look good.
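A minimal sketch of steps 4–5 (failure injection and the stats check). The pod label selector, Serve URL, request counts, and success threshold below are assumptions for illustration, not the actual test code:

```python
# Hypothetical failure-injection loop; assumes kubectl access to the
# KubeRay cluster and a Serve endpoint reachable at SERVE_URL.
import random
import subprocess
import time

import requests

SERVE_URL = "http://localhost:8000/"  # assumed Serve endpoint

def kill_random_worker_pod() -> None:
    # Delete a random Ray worker pod to simulate a node failure.
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-l", "ray.io/node-type=worker",
         "-o", "jsonpath={.items[*].metadata.name}"],
        text=True,
    )
    subprocess.check_call(["kubectl", "delete", "pod", random.choice(out.split())])

successes = total = 0
for i in range(300):
    if i % 60 == 0:
        kill_random_worker_pod()  # step 4: kill a node roughly once a minute
    total += 1
    try:
        successes += requests.get(SERVE_URL, timeout=5).status_code == 200
    except requests.RequestException:
        pass
    time.sleep(1)

# Step 5: make sure the stats look good.
assert successes / total > 0.95, f"success rate too low: {successes}/{total}"
```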
This is needed because we stress-test the State APIs in the release test and need a max limit larger than the system default; otherwise, the APIs would return an error.
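For context, a sketch of the kind of call that hits the cap, using the Ray 2.0-era experimental state API (module path and default limits may differ across versions):

```python
import ray
from ray.experimental.state.api import list_actors

ray.init()
# With thousands of actors, a limit above the server-side default max
# would be rejected unless that max is raised, as this PR does.
actors = list_actors(limit=20_000)
print(len(actors))
```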
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, since the setup overhead of vanilla training cannot be measured accurately and was simply disregarded.
By comparing the raw times spent in the actual training loop, we get a more accurate picture of any overhead or benefit of using Ray versus vanilla TensorFlow/Torch.
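A minimal sketch of the distinction being made here; the function names are illustrative, not the benchmark's actual code:

```python
import time

def run_benchmark(setup, train_one_epoch, num_epochs: int) -> None:
    e2e_start = time.monotonic()
    setup()  # cluster/data/model setup: counted today, but pure overhead
    loop_time = 0.0
    for _ in range(num_epochs):
        t = time.monotonic()
        train_one_epoch()
        loop_time += time.monotonic() - t
    e2e_time = time.monotonic() - e2e_start
    # Only loop_time is directly comparable against vanilla training.
    print(f"end-to-end: {e2e_time:.1f}s, training loop only: {loop_time:.1f}s")
```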
Signed-off-by: Kai Fricke <kai@anyscale.com>
First of all, sorry, I messed up the previous PR while syncing with master (#27374). This PR duplicates the previous PR and additionally adds a version check on ray_lightning for compatibility. Also, apologies for the massive review requests on the previous PR.
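A sketch of the kind of compatibility gate described, assuming the package exposes `__version__`; the cutoff version is a placeholder, not the one this PR actually checks:

```python
import ray_lightning
from packaging.version import Version

# Hypothetical cutoff; the real check may gate on a different version.
if Version(ray_lightning.__version__) < Version("0.3.0"):
    raise RuntimeError(
        f"Incompatible ray_lightning {ray_lightning.__version__}; "
        "please upgrade to >= 0.3.0."
    )
```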
Ray automatically sets OMP_NUM_THREADS=1, potentially limiting multithreading in native PyTorch/TensorFlow. If this leads to performance differences, we should address it either in Ray Train or in Ray Core.
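A quick way to observe the effect inside a worker; the workaround at the end is just one option for benchmarking, not the fix this note proposes:

```python
import os

import torch

print(os.environ.get("OMP_NUM_THREADS"))  # "1" inside a Ray worker
print(torch.get_num_threads())            # follows OMP_NUM_THREADS by default

# Possible per-worker workaround while benchmarking:
torch.set_num_threads(os.cpu_count() or 1)
```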
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR adds a `--disable-check` flag to the XGBoost benchmark script, which disables the RuntimeError raised when training or prediction takes too long. This is meant for non-CI, exploratory use cases (a sketch follows below).
Specifically, the reason is this:
We will include the XGBoost benchmark as an example workload in the KubeRay documentation.
The actual performance of the workload is highly sensitive to the infrastructure environment, so we don't want to raise an alarming RuntimeError just because the workload took too long on the user's infrastructure.
(When I tried the 100GB benchmark on KubeRay, training ran just a couple of minutes longer than the 1000-second cutoff.)
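Roughly how the flag behaves; the function name and the wiring are illustrative, not the benchmark script's actual code:

```python
import argparse

def maybe_raise_on_slow_run(training_time_s: float, disable_check: bool,
                            threshold_s: float = 1000.0) -> None:
    # 1000s matches the cutoff mentioned above.
    if not disable_check and training_time_s > threshold_s:
        raise RuntimeError(
            f"Training took {training_time_s:.0f}s (> {threshold_s:.0f}s)."
        )

parser = argparse.ArgumentParser()
parser.add_argument("--disable-check", action="store_true",
                    help="Do not raise if training/prediction took too long.")
args = parser.parse_args()

maybe_raise_on_slow_run(1200.0, args.disable_check)  # dummy measured time
```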
Why are these changes needed?
Also:
- Add validation to make sure multi-GPU and microbatching are not used together (see the sketch below).
- Update the A2C learning test to hit the microbatching branch.
- Minor comment updates.
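A minimal sketch of that validation, assuming RLlib-style config keys (`num_gpus`, `microbatch_size`); the exact keys in the A2C config may differ:

```python
def validate_config(config: dict) -> None:
    # Microbatch accumulation and multi-GPU data parallelism conflict,
    # so reject configs that enable both.
    if config.get("num_gpus", 0) > 1 and config.get("microbatch_size"):
        raise ValueError(
            "Microbatching is not supported with multi-GPU training; "
            "set num_gpus <= 1 or microbatch_size = None."
        )

validate_config({"num_gpus": 2, "microbatch_size": 32})  # raises ValueError
```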
Currently running into an issue:
Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB
Removes all ML-related code from `ray.util`
Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`
Moves `ray.util.ml_utils` to other locations
Closes #23900
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
- Update the XGBoost test (catch test failures properly).
- Remove `path` from `from_model` for `XGBoostCheckpoint` and `LightGBMCheckpoint` (sketch below).
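After this change, building a checkpoint should no longer need a path; a sketch assuming the Ray AIR 2.0-era module layout:

```python
import xgboost
from ray.train.xgboost import XGBoostCheckpoint

booster = xgboost.Booster()  # stand-in for a trained model
checkpoint = XGBoostCheckpoint.from_model(booster)  # no `path` argument
```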
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
If you read a folder containing differently sized images, `ImageFolderDatasource` errors out. This PR fixes the issue by resizing images to a user-specified size.
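A hedged usage sketch, assuming the fix exposes the target size via a `size` read argument and a local image folder:

```python
import ray
from ray.data.datasource import ImageFolderDatasource

ds = ray.data.read_datasource(
    ImageFolderDatasource(),
    root="/tmp/images",  # folder containing differently sized images
    size=(64, 64),       # every image is resized to 64x64 on read
)
```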
Signed-off-by: scv119 <scv119@gmail.com>
Why are these changes needed?
The microbenchmarks failed, complaining:
raise ValueError(f"Malformed address: {address}")
ValueError: Malformed address:
This is due to 55a0f7b; fix it by setting RAY_ADDRESS="local".
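The fix, sketched: setting the environment variable before `ray.init()` makes Ray start a fresh local cluster instead of parsing an empty address string.

```python
import os

os.environ["RAY_ADDRESS"] = "local"  # set before Ray is initialized

import ray

ray.init()  # starts a new local cluster; no address string to parse
```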