hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Sven Mika	0cd7bc4054	[RLlib] Re-establish dashboard performance tests. (#24728 )	2022-05-16 13:13:49 +02:00
Kai Fricke	de69b0d6d6	[train/release] Fix horovod user test master app config (#24734 )	2022-05-14 21:20:45 -07:00
Amog Kamsetty	a36e2a8f51	[Tune] Deprecate DistributedTrainableCreator (#24453 ) Fully deprecate DistributedTrainableCreator for Ray 2.0 Closes #24453	2022-05-10 11:06:43 -07:00
mwtian	918d3601c6	[Datasets] mark nightly test dataset_shuffle_sort_1tb_small_instances stable (#24481 )	2022-05-06 15:55:59 -07:00
SangBin Cho	295b4436b3	[Nightly tests] Increase wait for nodes timeout (#24457 ) Although there's enough quota, it is possible that AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate).	2022-05-04 19:42:21 -07:00
Sihan Wang	3f5da8af7a	[Serve] Add serve handle graph workload nightly tests (#24435 )	2022-05-04 09:07:50 -07:00
Stephanie Wang	fbbc9c33d6	Add nightly tests for push-based shuffle (#24352 ) Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.	2022-05-02 11:35:14 -07:00
Jiao	ba7cc1803a	[Deployment Graph] Add release test for long chain & wide fanout pattern (#24246 )	2022-04-29 17:03:33 -07:00
SangBin Cho	46cd7f1830	Make large multi tests to nightly + remove k8s tests (#24302 ) As discussed, to reduce backlog for large tests, we will (1) remove k8s tests (2) make large multi daily tests to nightly tests	2022-04-29 03:40:12 -07:00
mwtian	afdfd20a5b	[Release tests] Create compute config for new dataset shuffle tests (#24239 ) Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests. Related issue number #24176: the added tests would verify the instance types which support the new shuffle implementations.	2022-04-27 11:50:12 -07:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported. Closes #16435	2022-04-25 16:12:57 -07:00
Stephanie Wang	1de9f3457e	[nightly tests] Mark Datasets shuffle tests stable (#24175 ) dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.	2022-04-25 09:01:37 -07:00
Stephanie Wang	71e142b1fa	[core][tests] Add nightly test for datasets random_shuffle and sort (#23807 ) Copied from #23784. Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory. Modified to fix lint.	2022-04-12 12:53:57 -07:00
Eric Liang	1ff874e8e8	[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817 )	2022-04-10 16:12:53 -07:00
Archit Kulkarni	7a1a7e1844	Revert "[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 )" (#23805 ) This reverts commit `ba484feac0`. Broke lint.	2022-04-08 13:18:13 -07:00
Stephanie Wang	ba484feac0	[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 ) Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.	2022-04-08 11:31:10 -07:00
Kai Fricke	0b804e5162	[ci/release] Move ML long running tests to sdk file manager (#23745 ) What: Long running tests should use sdk file manager Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.	2022-04-06 10:50:49 -07:00
Archit Kulkarni	582bf4e8f8	Add basic jobs release test with Tune script (#23474 ) Adds basic jobs release tests that connects to the test cluster and runs a basic tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests: (1) Uses a local `working_dir` (2) Uses a remote working_dir from a zip github URL.	2022-04-05 13:31:11 -05:00
Chen Shen	3e80da7e9f	[ci/release] long running / change failed test to sdk (#23602 ) close #23592. Talking with @krfricke and he suggested we move to use sdk for those long running tasks.	2022-03-30 12:57:21 -07:00
Kai Fricke	e8abffb017	[tune/release] Improve Tune cloud release tests for durable storage (#23277 ) This PR addresses recent failures in the tune cloud tests. In particular, this PR changes the following: The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests. We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected). We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days. Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes). Local release test runs succeeded. https://buildkite.com/ray-project/release-tests-branch/builds/189 https://buildkite.com/ray-project/release-tests-branch/builds/191	2022-03-30 09:28:33 -07:00
Kai Fricke	922367d158	[ci/release] Fix smoke test compute templates (#23561 ) The smoke test definitions of a few tests were faulty for compute template override. Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294	2022-03-29 13:48:09 -07:00
Chen Shen	c3e04ab275	[nighly-test] try out spot instances for chaos test #23507	2022-03-27 20:10:21 -07:00
Stephanie Wang	aa6f773283	Switch long running tests to SDK (#23433 ) These tests are flakey on the job-based test submission system. Switching them to the SDK-based test runner for now.	2022-03-23 17:44:26 -07:00
Amog Kamsetty	6d776976c1	[Train] Fix multi node horovod bug (#22564 ) Closes #20956	2022-03-22 16:22:53 -07:00
Jiajun Yao	bab19e8e68	Add perf metrics for test_many_tasks.py (#23318 ) Add perf metrics for test_many_tasks.py. Use the new smoke test structure.	2022-03-22 16:16:42 -07:00
Kai Fricke	e48c407b13	[release] long running many drivers: Use SDK file manager (#23379 ) This will make the test pass again: https://buildkite.com/ray-project/release-tests-branch/builds/226#_	2022-03-21 09:56:59 -07:00
Stephanie Wang	5ab634f285	[core] Disable threaded_actors_stress_test (#23292 ) * disable * smoke	2022-03-18 15:57:53 -07:00
Eric Liang	015181ab9a	Add random access support for Datasets (experimental feature) (#22749 ) This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset. RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``. Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.	2022-03-17 15:01:12 -07:00
SangBin Cho	b350fe9ee8	[Nightly test] Fix additional k8s issues + add new tests (#23231 ) Fix bug from the previous fixes. Add more tests. Stop using m5.xlarge (not supported now). There are 2 hard blockers from the infra: 1. Large size disk is not supported. 2. m5.xlarge is not supported. Both are considered as a high priority to be fixed soon.	2022-03-16 16:37:29 -07:00
Stephanie Wang	ce71c5bbbd	[core][tests] Mark threaded_actors_stress_test as unstable	2022-03-16 15:31:19 -07:00
Kai Fricke	e3987d85c3	[tune] Mark cloud OSS release tests as unstable (#23240 ) These tests have been flaky for a while. Until this is addressed, mark them as unstable.	2022-03-16 17:37:58 +00:00
Kai Fricke	830238cce2	[ci/release] Migrate ML user tests (#22953 ) Most recent tests: https://buildkite.com/ray-project/release-tests-branch/builds/156 https://buildkite.com/ray-project/release-tests-branch/builds/158	2022-03-14 11:50:16 +00:00
Kai Fricke	430ea3e636	[ci/release] Migrate golden notebook tests (#22949 ) Migrating golden notebook tests to new release test package. Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155	2022-03-13 21:39:41 +00:00
Kai Fricke	956ad95d67	[ci/release] Fix release test config (#23122 ) Currently the test is failing due to an invalid config (merged before validation was properly enforced).	2022-03-13 19:48:34 +00:00
Kai Fricke	76a939c820	[ci/release] Migrate long running (+distributed) tests (#22955 ) Migrating to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/103 Tests pass: https://buildkite.com/ray-project/release-tests-branch/builds/143#_	2022-03-13 18:47:17 +00:00
SangBin Cho	8c1a6f9138	[Nightly Test] Fix a dataset test (#23106 ) Fix a broken dataset test (due to incorrect working dir)	2022-03-12 08:16:08 -08:00
SangBin Cho	c0f8de9c3c	[Nightly tests] Run benchmark tests on k8s as well (#23100 ) Run benchmark tests on k8s as well. Note that until k8s cluster stability is confirmed, we will run the same tests twice at AWS and k8s. Once all benchmark tests look stable, we will start full migration	2022-03-11 19:40:37 -08:00
SangBin Cho	97383e4c1b	[Nightly test] Fix a broken nightly test due to the wrong config (#23097 )	2022-03-11 16:47:06 -08:00
SangBin Cho	2b38fe89e2	[Nightly tests] Migrate rest of core tests (#23085 ) Migrate the rest of core tests	2022-03-11 10:41:14 -08:00
Kai Fricke	a8bed94ed6	[ci/release] Always use full cluster address (#23067 ) Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09	2022-03-11 16:31:21 +00:00
SangBin Cho	965d609627	[Nightly test] Fix a minor syntax issue for core nightly tests (#23069 ) Add frequency to smoke tests. Remove unnecessary alerts.	2022-03-11 04:58:40 -08:00
Kai Fricke	5b2d58674b	[ci/release] Migrate horovod tests (#22951 ) Migrating horovod tests to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/125	2022-03-11 09:53:29 +00:00
SangBin Cho	ebac18d163	[Nightly test] Support Job based file manager + runner (#22860 ) This PR supports the job-based file manager and runner. It will be the backbone of k8s migration. The PR handles edge cases that originally existed in the old e2e.py job-based runners.	2022-03-10 15:03:50 -08:00
SangBin Cho	92b50ff5da	Migrate multi nightly tests (#23005 )	2022-03-11 01:32:10 +09:00
SangBin Cho	4fa294ca49	[Nightly tests] Stop running broken tests (#22993 )	2022-03-10 06:59:51 -08:00
SangBin Cho	e88abe4c8e	[Nightly tests] migrated most of daily tests (#22960 ) * migrated most of daily tests * Addressed code review.	2022-03-10 05:49:16 -08:00
Kai Fricke	007cf03d7a	[ci/release] Migrate RLLib tests (#22967 ) Migrate to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/111	2022-03-10 10:26:03 +00:00
Kai Fricke	fee4065daf	[ci/release] Migrate SGD tests (#22966 ) Migrate to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/110	2022-03-10 10:23:50 +00:00
Kai Fricke	614dc6b511	[ci/release] Migrate Serve tests (#22965 ) Migrate to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/109	2022-03-10 10:23:25 +00:00
Kai Fricke	ccda1555cc	[ci/release] Migrate Runtime Env tests (#22963 ) Migrating to new release test package. https://buildkite.com/ray-project/release-tests-branch/builds/108	2022-03-10 10:22:57 +00:00
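
The message for commit 015181ab9a describes partitioning a dataset by a sort key and serving record lookups via binary search. The idea can be sketched in plain Python as below. This is only an illustration of the technique, not Ray's implementation: the class name `SortedPartitionIndex` and its methods are hypothetical, it runs in a single process without worker actors, and it assumes keys are unique and `num_partitions` is at least 1.

```python
import bisect
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class SortedPartitionIndex:
    """Hypothetical single-process stand-in for a key-sorted, partitioned dataset."""

    def __init__(self, records: List[Record], key: str, num_partitions: int) -> None:
        # Sort once by the key, then cut the sorted run into contiguous partitions.
        ordered = sorted(records, key=lambda r: r[key])
        size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
        self.partitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # Smallest key of each partition, used to route a lookup to one partition.
        self.lower_bounds = [p[0][key] for p in self.partitions]
        # Per-partition key lists, used for the binary search inside a partition.
        self.partition_keys = [[r[key] for r in p] for p in self.partitions]

    def get(self, value: Any) -> Optional[Record]:
        """Return the record whose key equals value, or None if it is absent."""
        # Step 1: find the last partition whose lower bound is <= value.
        p = bisect.bisect_right(self.lower_bounds, value) - 1
        if p < 0:
            return None
        # Step 2: binary search inside that partition.
        keys = self.partition_keys[p]
        i = bisect.bisect_left(keys, value)
        if i < len(keys) and keys[i] == value:
            return self.partitions[p][i]
        return None

    def multiget(self, values: List[Any]) -> List[Optional[Record]]:
        """Look up several keys; missing keys map to None."""
        return [self.get(v) for v in values]


if __name__ == "__main__":
    rows = [{"id": i, "square": i * i} for i in reversed(range(10))]
    index = SortedPartitionIndex(rows, key="id", num_partitions=3)
    print(index.get(5))
    print(index.multiget([0, 9, 11]))
```

Each lookup costs one binary search over the partition boundaries plus one inside the chosen partition. In the design the commit message describes, the partitions would instead be held by separate worker actors, which is why throughput is said to scale with the number of workers.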

58 commits