First of all, sorry, I messed up the previous PR while syncing with master (#27374). This PR duplicates the previous one, with one update: a version check on ray_lightning for compatibility. Apologies as well for the massive review requests on the previous PR.
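A minimal sketch of what the compatibility check might look like; the version bound and error message are assumptions for illustration, not the exact values in this PR:

```python
# Hypothetical version check for ray_lightning compatibility; the pinned
# minimum version (0.2.0) is an assumption for illustration.
from packaging.version import Version

import ray_lightning

if Version(ray_lightning.__version__) < Version("0.2.0"):
    raise ImportError(
        f"ray_lightning >= 0.2.0 is required, found {ray_lightning.__version__}."
    )
```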
Ray automatically sets OMP_NUM_THREADS=1, potentially limiting multithreading in native PyTorch/TensorFlow. If this leads to performance differences, we should address it either in Ray Train or in Ray Core.
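As a workaround, users can set the variable themselves; a minimal sketch, relying on Ray leaving OMP_NUM_THREADS alone when the user sets it explicitly (the value here is illustrative):

```python
# Propagate an explicit OMP_NUM_THREADS to all workers via runtime_env so
# native PyTorch/TensorFlow ops can use more than one thread.
import os

import ray

ray.init(runtime_env={"env_vars": {"OMP_NUM_THREADS": str(os.cpu_count())}})
```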
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR adds a --disable-check flag to the XGBoost benchmark script, which disables the RuntimeError raised when training or prediction takes too long. This is meant for non-CI exploratory use cases.
Specifically, the reason is this:
We will include the XGBoost benchmark as an example workload for the KubeRay documentation.
The actual performance of the workload is highly sensitive to the infrastructure environment, so we don't want to raise an alarming RuntimeError just because the workload took too long on the user's infrastructure.
(When I tried the 100 GB benchmark on KubeRay, training ran just a couple of minutes longer than the 1000-second cutoff.)
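A minimal sketch of how such a flag can be wired up; `train()` and the threshold constant are hypothetical stand-ins for the benchmark script's internals:

```python
# Sketch only: `--disable-check` per this PR; everything else is a stand-in.
import argparse
import time

TRAINING_TIME_THRESHOLD = 1000  # seconds, matching the cutoff mentioned above

def train():
    """Stand-in for the actual XGBoost benchmark workload."""
    time.sleep(1)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--disable-check",
    action="store_true",
    help="Don't raise a RuntimeError when training/prediction runs too long.",
)
args = parser.parse_args()

start = time.monotonic()
train()
elapsed = time.monotonic() - start

if not args.disable_check and elapsed > TRAINING_TIME_THRESHOLD:
    raise RuntimeError(f"Training took {elapsed:.0f}s (limit: {TRAINING_TIME_THRESHOLD}s).")
```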
Why are these changes needed?
Also:
- Add validation to make sure multi-GPU training and microbatching are not used together (see the sketch after this list).
- Update the A2C learning test to hit the microbatching branch.
- Minor comment updates.
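A hedged sketch of the validation; `num_gpus` and `microbatch_size` are assumed config keys following common RLlib conventions, and the check itself is illustrative:

```python
# Illustrative validation sketch; key names are assumptions, not verified
# against the actual PR. Microbatching accumulates gradients over small
# batches, which conflicts with the multi-GPU optimizer path.
def validate_config(config: dict) -> None:
    if config.get("num_gpus", 0) > 1 and config.get("microbatch_size") is not None:
        raise ValueError(
            "Microbatching (`microbatch_size`) cannot be combined with "
            "multi-GPU training (`num_gpus` > 1). Use one or the other."
        )
```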
Currently running into an issue:
Cluster startup failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB
Removes all ML-related code from `ray.util`.
Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`
Moves `ray.util.ml_utils` to other locations
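For downstream code, the removed modules were (to my understanding) thin wrappers around standalone packages, so migrating is mostly an import change; a sketch under that assumption:

```python
# Migration sketch, assuming the removed ray.util wrappers re-exported the
# standalone packages (xgboost_ray, lightgbm_ray, ray_lightning, horovod).
# Before (removed):
#   from ray.util.xgboost import RayDMatrix, RayParams, train
# After:
from xgboost_ray import RayDMatrix, RayParams, train
```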
Closes #23900
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
- Update the XGBoost test (catch test failures properly)
- Remove `path` from `from_model` for `XGBoostCheckpoint` and `LightGBMCheckpoint`.
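A usage sketch after the change, assuming the Ray 2.x `ray.train.xgboost` checkpoint API:

```python
# Sketch: after this change, from_model no longer takes a `path` argument.
import xgboost

from ray.train.xgboost import XGBoostCheckpoint

booster = xgboost.Booster()  # stand-in for a trained model
checkpoint = XGBoostCheckpoint.from_model(booster)
```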
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
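A hedged usage sketch; the `size` parameter name and the exact call shape are assumptions for illustration:

```python
# Hypothetical usage: resizing every image to a common shape lets
# differently-sized images be read into one dataset without erroring.
import ray
from ray.data.datasource import ImageFolderDatasource

ds = ray.data.read_datasource(
    ImageFolderDatasource(),
    root="/path/to/images",  # hypothetical path
    size=(64, 64),           # resize all images to 64x64
)
```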
Signed-off-by: scv119 <scv119@gmail.com>
Why are these changes needed?
The microbenchmarks failed, complaining:
raise ValueError(f"Malformed address: {address}")
ValueError: Malformed address:
This is due to 55a0f7b; fix it by setting RAY_ADDRESS="local".
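A minimal sketch of the fix (its placement within the benchmark harness is assumed):

```python
# Setting RAY_ADDRESS="local" forces ray.init() to start a fresh local
# cluster instead of trying to parse an (empty) existing address, which is
# what raised the "Malformed address" ValueError above.
import os

os.environ["RAY_ADDRESS"] = "local"

import ray

ray.init()
```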
- Add chaos tests for the 1TB dataset random shuffle: both simple shuffle and push-based shuffle (see the sketch after this list).
- Mark dataset_shuffle_push_based_random_shuffle_1tb as stable.
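For reference, a hedged sketch of triggering a random shuffle with the push-based implementation; the environment variable follows the Ray Datasets docs of this era, and the dataset size is purely illustrative:

```python
# Run with: RAY_DATASET_PUSH_BASED_SHUFFLE=1 python shuffle_example.py
# (without the flag, the default simple shuffle implementation is used).
import ray

ds = ray.data.range(10_000)
shuffled = ds.random_shuffle()  # push-based when the env var is set
print(shuffled.count())
```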
Making sure that tuning multiple trials in parallel is not significantly slower than training each trial individually.
Some overhead is expected.
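A minimal sketch of the comparison being exercised, assuming a trivial stand-in trainable (the actual release test and its timings differ):

```python
# Illustrative only: eight parallel trials of a ~5s trainable should finish
# in far less than 8 * 5s, modulo scheduling overhead.
import time

from ray import tune

def trainable(config):
    time.sleep(5)  # stand-in for real training work
    tune.report(done=1)

start = time.monotonic()
tune.run(trainable, num_samples=8)  # trials run in parallel, given resources
print(f"8 parallel trials took {time.monotonic() - start:.1f}s (one trial ~5s).")
```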
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Add benchmark data for a 4x4 GPU setup.
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
* Revert "Revert "Bump pytest from 5.4.3 to 7.0.1""
This reverts commit ab10890e90.
Signed-off-by: Riatre Foo <foo@riat.re>
* Fix missing test data files dependency in rllib/BUILD
See #26334 and #26517 for context.
Once this is in, it should be good to roll forward again.
Signed-off-by: Riatre Foo <foo@riat.re>
* debug: run all tests
Signed-off-by: Riatre Foo <foo@riat.re>
* Revert "debug: run all tests"
This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970.
Signed-off-by: Riatre Foo <foo@riat.re>
* Fix new tests since the last rebase
Signed-off-by: Riatre Foo <foo@riat.re>