hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	00947fd949	[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562 )	2022-07-18 12:14:10 -07:00
Jiao	98a07920d3	[AIR][CUJ] Make distributing training benchmark at silver tier (#26640 )	2022-07-17 22:07:09 -07:00
Eric Liang	0855bcb77e	[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633 )	2022-07-16 17:37:06 -07:00
Jiao	196e52ad7c	[AIR][CUJ] E2E Pytorch training (#26621 )	2022-07-16 08:23:19 -07:00
Jiao	988ffd494b	[AIR][CUJ] Add GPU bench prediction benchmark (#26614 )	2022-07-16 08:22:37 -07:00
matthewdeng	e3a096f412	[air] add bulk ingest benchmarks (#26618 )	2022-07-15 22:01:23 -07:00
xwjiang2010	a241e6a0f5	[air] Add xgboost release test for silver tier(10-node case). (#26460 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-07-15 13:21:10 -07:00
Kai Fricke	213a96e239	[air/benchmarks] Add distributed Tensorflow benchmarks (CPU only) (#26519 ) Following up from #26436, this PR adds a distributed benchmark test for Tensorflow FashionMNIST training. It compares training with Ray AIR with training with vanilla PyTorch. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-14 22:08:43 +01:00
Kai Fricke	cd95569b01	[tune/release] Add up/down scaling release test (#25392 ) This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 22:57:24 +01:00
Kai Fricke	cf75cf7232	[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436 ) This PR adds a distributed benchmark test for Pytorch MNIST training. It compares training with Ray AIR with training with vanilla PyTorch. In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both cases, we collect the end to end runtime. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 10:53:24 +01:00
Jian Xiao	923209895d	Pipelined training test: change num of windows; log the ingestion perf (#26429 ) Why are these changes needed? Improve test perf Log the perf stats With 2 windows there are a lot of spilling, slowing down the throughput.	2022-07-11 11:03:35 -07:00
Stephanie Wang	dcc913073f	[testing] Run 100TB shuffle test nightly (#26306 ) Run this test nightly to collect more datapoints on stability and performance of 100TB shuffle.	2022-07-07 09:59:54 -07:00
Stephanie Wang	a90e53b76f	[core] Add weekly test for 100TB random shuffle (#25908 ) Adds a CI test for 100TB shuffle. There is a custom config for this nightly test to: (1) make sure each node gets 4TB of storage, (2) head node has 0 CPUs, (3) worker nodes have half their actual vCPU count. Related issue number Closes #24480.	2022-07-01 13:30:07 -07:00
Kai Fricke	e2d8e7a6ae	[ci/release/ml] Run ML release tests on staging (#26168 ) This moves all ML release tests to staging. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-06-30 13:24:28 -07:00
Kai Fricke	f9e787115f	[ci/release/core] Run `many_nodes` test on staging (#26164 ) This moves many_nodes to anyscale staging.	2022-06-29 11:07:32 -07:00
Kai Fricke	7091a32fe1	[ci/release] Support running tests on staging (#25889 ) This adds "environments" to the release package that can be used to configure some environment variables. These variables will be loaded either by an `--env` argument or a `env` definition in the test definition and can be used to e.g. run release tests on staging.	2022-06-28 10:14:01 -07:00
Jun Gong	c026374acb	[RLlib] Fix the 2 failing RLlib release tests. (#25603 )	2022-06-14 14:51:08 +02:00
Kai Fricke	c3b608f757	[tune] Fix cloud tests, mark as stable (#25583 ) #25063 broke release tests, but they've been consistently stable before. This PR fixes the tests and marks tune cloud tests as stable.	2022-06-08 17:47:54 +01:00
Stephanie Wang	f7692e4602	[core] Remove more expensive shuffle tests (#25165 ) Now that the "smaller_instances" versions of these tests are stable, we can stop running the version that uses bigger instances.	2022-05-24 18:05:18 -07:00
Jiajun Yao	00cdd8dce5	Add chaos test for dataset shuffle (#25161 ) Add chaos tests for dataset shuffle: both push-based and non-push-based.	2022-05-24 15:12:20 -07:00
Jiajun Yao	b825a839f9	Mark dataset_shuffle_push_based_sort_1tb as stable (#25162 ) dataset_shuffle_push_based_sort_1tb is consistently passing for weeks.	2022-05-24 11:07:27 -07:00
mwtian	36d6d59169	[Datasets] mark nightly test dataset_shuffle_random_shuffle_1tb_small_instances stable #24861	2022-05-17 09:56:05 -07:00
Kai Fricke	6c5229295e	[ci/release] Support running tests with different python versions (#24843 ) OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.	2022-05-17 17:03:12 +01:00
Sven Mika	0cd7bc4054	[RLlib] Re-establish dashboard performance tests. (#24728 )	2022-05-16 13:13:49 +02:00
Kai Fricke	de69b0d6d6	[train/release] Fix horovod user test master app config (#24734 )	2022-05-14 21:20:45 -07:00
Amog Kamsetty	a36e2a8f51	[Tune] Deprecate DistributedTrainableCreator (#24453 ) Fully deprecate DistributedTrainableCreator for Ray 2.0 Closes #24453	2022-05-10 11:06:43 -07:00
mwtian	918d3601c6	[Datasets] mark nightly test dataset_shuffle_sort_1tb_small_instances stable (#24481 )	2022-05-06 15:55:59 -07:00
SangBin Cho	295b4436b3	[Nightly tests] Increase wait for nodes timeout (#24457 ) Although there's enough quota, it is possible the AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate)	2022-05-04 19:42:21 -07:00
Sihan Wang	3f5da8af7a	[Serve] Add serve handle graph workload nightly tests (#24435 )	2022-05-04 09:07:50 -07:00
Stephanie Wang	fbbc9c33d6	Add nightly tests for push-based shuffle (#24352 ) Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.	2022-05-02 11:35:14 -07:00
Jiao	ba7cc1803a	[Deployment Graph] Add release test for long chain & wide fanout pattern (#24246 )	2022-04-29 17:03:33 -07:00
SangBin Cho	46cd7f1830	Make large multi tests to nightly + remove k8s tests (#24302 ) As discussed, to reduce backlog for large tests, we will (1) remove k8s tests (2) make large multi daily tests to nightly tests	2022-04-29 03:40:12 -07:00
mwtian	afdfd20a5b	[Release tests] Create compute config for new dataset shuffle tests (#24239 ) Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests. Related issue number #24176: the added tests would verify the instance types which support the new shuffle implementations.	2022-04-27 11:50:12 -07:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported. Closes #16435	2022-04-25 16:12:57 -07:00
Stephanie Wang	1de9f3457e	[nightly tests] Mark Datasets shuffle tests stable (#24175 ) dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.	2022-04-25 09:01:37 -07:00
Stephanie Wang	71e142b1fa	[core][tests] Add nightly test for datasets random_shuffle and sort (#23807 ) Copied from #23784. Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory. Modified to fix lint.	2022-04-12 12:53:57 -07:00
Eric Liang	1ff874e8e8	[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817 )	2022-04-10 16:12:53 -07:00
Archit Kulkarni	7a1a7e1844	Revert "[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 )" (#23805 ) This reverts commit `ba484feac0`. Broke lint.	2022-04-08 13:18:13 -07:00
Stephanie Wang	ba484feac0	[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 ) Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.	2022-04-08 11:31:10 -07:00
Kai Fricke	0b804e5162	[ci/release] Move ML long running tests to sdk file manager (#23745 ) What: Long running tests should use sdk file manager Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.	2022-04-06 10:50:49 -07:00
Archit Kulkarni	582bf4e8f8	Add basic jobs release test with Tune script (#23474 ) Adds basic jobs release tests that connects to the test cluster and runs a basic tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests: (1) Uses a local `working_dir` (2) Uses a remote working_dir from a zip github URL.	2022-04-05 13:31:11 -05:00
Chen Shen	3e80da7e9f	[ci/release] long running / change failed test to sdk (#23602 ) close #23592. Talking with @krfricke and he suggested we move to use sdk for those long running tasks.	2022-03-30 12:57:21 -07:00
Kai Fricke	e8abffb017	[tune/release] Improve Tune cloud release tests for durable storage (#23277 ) This PR addresses recent failures in the tune cloud tests. In particular, this PR changes the following: The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests. We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected) We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days. Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes). Local release test runs succeeded. https://buildkite.com/ray-project/release-tests-branch/builds/189 https://buildkite.com/ray-project/release-tests-branch/builds/191	2022-03-30 09:28:33 -07:00
Kai Fricke	922367d158	[ci/release] Fix smoke test compute templates (#23561 ) The smoke test definitions of a few tests were faulty for compute template override. Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294	2022-03-29 13:48:09 -07:00
Chen Shen	c3e04ab275	[nighly-test] try out spot instances for chaos test #23507	2022-03-27 20:10:21 -07:00
Stephanie Wang	aa6f773283	Switch long running tests to SDK (#23433 ) These tests are flakey on the job-based test submission system. Switching them to the SDK-based test runner for now.	2022-03-23 17:44:26 -07:00
Amog Kamsetty	6d776976c1	[Train] Fix multi node horovod bug (#22564 ) Closes #20956	2022-03-22 16:22:53 -07:00
Jiajun Yao	bab19e8e68	Add perf metrics for test_many_tasks.py (#23318 ) Add perf metrics for test_many_tasks.py Use the new smoke test structure	2022-03-22 16:16:42 -07:00
Kai Fricke	e48c407b13	[release] long running many drivers: Use SDK file manager (#23379 ) This will make the test pass again: https://buildkite.com/ray-project/release-tests-branch/builds/226#_	2022-03-21 09:56:59 -07:00
Stephanie Wang	5ab634f285	[core] Disable threaded_actors_stress_test (#23292 ) * disable * smoke	2022-03-18 15:57:53 -07:00

1 2

81 commits