hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Artur Niederfahrenhorst	9a64bd4e9b	[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842 )	2022-03-29 14:44:40 +02:00
Yi Cheng	7de751dbab	[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518 ) This PR removes the enable_gcs_bootstrap flag in cpp.	2022-03-28 21:37:24 -07:00
Chen Shen	c3e04ab275	[nighly-test] try out spot instances for chaos test #23507	2022-03-27 20:10:21 -07:00
Sven Mika	22c9c4aa39	[RLlib] Slate-Q +GPU torch bug fix. (#23464 )	2022-03-24 17:39:33 +01:00
Avnish Narayan	9040f54060	[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452 )	2022-03-24 09:17:30 +01:00
Stephanie Wang	aa6f773283	Switch long running tests to SDK (#23433 ) These tests are flaky on the job-based test submission system. Switching them to the SDK-based test runner for now.	2022-03-23 17:44:26 -07:00
Kai Fricke	724377163f	[ci/release] Unstable tests should only soft fail the build (#23403 ) This will leave the tests green if the test is failing but marked as unstable.	2022-03-23 09:38:56 +00:00
Amog Kamsetty	6d776976c1	[Train] Fix multi node horovod bug (#22564 ) Closes #20956	2022-03-22 16:22:53 -07:00
Jiajun Yao	bab19e8e68	Add perf metrics for test_many_tasks.py (#23318 ) Add perf metrics for test_many_tasks.py. Use the new smoke test structure.	2022-03-22 16:16:42 -07:00
SangBin Cho	0cd687cc19	[Nightly test] Fix job download retry (#23401 ) Currently, when we download a file to the cluster using a job, we don't retry.	2022-03-22 08:31:24 -07:00
Kai Fricke	02644ab4d8	[ci/release] Retry cluster env build on failure (#23378 ) Failed cluster env builds should be retried.	2022-03-22 09:45:22 +00:00
Avnish Narayan	754bcd16f8	[rllib] Pin gym everywhere (#23384 ) This PR Pins gym in the app config.yaml's for rllib and tune so that release tests are no longer broken by the new gym version.	2022-03-22 09:44:22 +00:00
Kai Fricke	e48c407b13	[release] long running many drivers: Use SDK file manager (#23379 ) This will make the test pass again: https://buildkite.com/ray-project/release-tests-branch/builds/226#_	2022-03-21 09:56:59 -07:00
Kai Fricke	7085749d50	[tune] Adjust release test timeouts (#23362 ) Currently release tests fail because they exceed the (rather arbitrary) timeout by 1-2 seconds.	2022-03-20 17:05:20 +00:00
Avnish Narayan	e008a48ef2	[release tests] Pin gym everywhere (#23349 )	2022-03-19 02:52:54 -07:00
Dmitri Gekhtman	561e7a9677	[RELEASE] Add autoscaler env to fix nightly tests (#23345 ) The product backend doesn't yet understand that nightly Ray uses GCS-Ray. (This will be fixed the next time the product control plane is deployed.) This PR introduces the env required to signal to the product backend that we're using GCS-Ray so that the autoscaler can start up correctly.	2022-03-18 17:48:27 -07:00
Archit Kulkarni	db2c37c760	[serve] [release] Disable smoke test by default (#23334 )	2022-03-18 18:40:48 -05:00
Stephanie Wang	5ab634f285	[core] Disable threaded_actors_stress_test (#23292 ) * disable * smoke	2022-03-18 15:57:53 -07:00
Kai Fricke	ca5354ffb1	[ci/release] Fix test_wheels (#23329 )	2022-03-18 14:39:36 +00:00
Kai Fricke	3cf8116df2	[ci/release] Re-enable commit sanity check (#23327 ) Commit sanity checks are currently seemingly disabled. This PR re-enables them by parsing wheel URLs.	2022-03-18 12:57:41 +00:00
Kai Fricke	da140a80e9	[ci/release] Legacy field should be optional (#23326 ) #22749 broke release unit tests by not providing a legacy key - that key should be optional because we will be dealing with non-legacy tests soon. Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...	2022-03-18 11:34:05 +00:00
Kai Fricke	e510d81c71	[ci/release] Save test config and results as artifacts (#23278 ) It is good to have this information readily available when checking test results, as it will reveal both the original configuration (that could change over time) as well as the achieved results. Also gets rid of the unneeded old alerts directory. https://buildkite.com/ray-project/release-tests-branch/builds/190#ef531787-412c-40ec-81e6-beb495830c60	2022-03-18 09:26:42 +00:00
Eric Liang	015181ab9a	Add random access support for Datasets (experimental feature) (#22749 ) This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset. RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``. Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.	2022-03-17 15:01:12 -07:00
mwtian	391901f86b	[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233 ) Continue to clean up Redis and other related Redis references, for - gcs_utils.py - log_monitor.py - `publish_error_to_driver()`	2022-03-16 19:34:57 -07:00
SangBin Cho	b350fe9ee8	[Nightly test] Fix additional k8s issues + add new tests (#23231 ) Fix a bug from the previous fixes. Add more tests. Stop using m5.xlarge (not supported now). There are 2 hard blockers from the infra: 1. Large disks are not supported. 2. m5.xlarge is not supported. Both are considered high priority and should be fixed soon.	2022-03-16 16:37:29 -07:00
Stephanie Wang	ce71c5bbbd	[core][tests] Mark threaded_actors_stress_test as unstable	2022-03-16 15:31:19 -07:00
Kai Fricke	e3987d85c3	[tune] Mark cloud OSS release tests as unstable (#23240 ) These tests have been flaky for a while. Until this is addressed, mark them as unstable.	2022-03-16 17:37:58 +00:00
Kai Fricke	eca5bcfc87	[ci/release] Reload modules after installing matching Ray (#23227 ) Apparently, ray gets imported somewhere before running the client runner (maybe from an anyscale package). This means that we need to reload the ray package after installing a matching local ray wheel. Additionally, job submission should also install a matching local ray to match with the job submission server.	2022-03-16 15:44:43 +00:00
Avnish Narayan	6c20e9d898	[RLlib] Change the slateq regression learning test with GPU to use torch only (#23168 )	2022-03-16 09:15:59 +01:00
Kai Fricke	15aeb33e50	[ci/release] Support PR wheels (#23084 ) This PR adds support for finding wheels for PRs to run OSS release tests on, i.e. it allows --ray-wheels user:branch to work.	2022-03-14 17:24:13 +00:00
Kai Fricke	8608b64885	[ci/release] Remove old OSS release test infrastructure (#23134 ) Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.	2022-03-14 15:10:52 +00:00
Kai Fricke	d93fa95dd5	[ci/release] Only report results for scheduled builds (#23135 ) Currently, all buildkite runs report per default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.	2022-03-14 15:10:16 +00:00
Kai Fricke	fce49694fc	[ci/release] Disable infra retries for now (#23132 ) Infra errors are tackled with concurrency groups. Thus we can disable old mitigation methods like automatic infra retry for now. We keep the script as it does other logic (e.g. checkout local test branch) and infra retry can be enabled via env variable if needed.	2022-03-14 11:51:11 +00:00
Kai Fricke	830238cce2	[ci/release] Migrate ML user tests (#22953 ) Most recent tests: https://buildkite.com/ray-project/release-tests-branch/builds/156 https://buildkite.com/ray-project/release-tests-branch/builds/158	2022-03-14 11:50:16 +00:00
SangBin Cho	2c2d96eeb1	[Nightly tests] Improve k8s testing (#23108 ) This PR improves broken k8s tests. Use exponential backoff on the unstable HTTP path (getting the job status sometimes hits a broken connection from the server; I couldn't find the relevant logs to figure out why this is happening, unfortunately). Fix the benchmark tests' resource leak check. The existing one was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never be the same as the available resources. I fixed the issue by not checking node IP resources. The K8s infra doesn't support instances with < 8 CPUs, so I used m5.2xlarge instead of xlarge. This will increase the cost a bit, but not by much.	2022-03-14 03:49:15 -07:00
Jiajun Yao	e4620669a1	[Release Test] Add perf metrics for core scalability tests (#23110 ) * Add perf metrics for core scalability tests * lint	2022-03-14 10:20:39 +09:00
Kai Fricke	430ea3e636	[ci/release] Migrate golden notebook tests (#22949 ) Migrating golden notebook tests to new release test package. Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155	2022-03-13 21:39:41 +00:00
Kai Fricke	956ad95d67	[ci/release] Fix release test config (#23122 ) Currently the test is failing due to an invalid config (merged before validation was properly enforced).	2022-03-13 19:48:34 +00:00
Kai Fricke	c7303f538c	[ci/release] Validate smoke test fields, enforce frequency (#23075 ) Of all smoke test arguments, frequency is the only required one, so we should check for it. Additionally, not all fields should be able to be overwritten (e.g. legacy or name), so we enforce this as well.	2022-03-13 18:48:03 +00:00
Kai Fricke	76a939c820	[ci/release] Migrate long running (+distributed) tests (#22955 ) Migrating to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/103 Tests pass: https://buildkite.com/ray-project/release-tests-branch/builds/143#_	2022-03-13 18:47:17 +00:00
SangBin Cho	8c1a6f9138	[Nightly Test] Fix a dataset test (#23106 ) Fix a broken dataset test (due to incorrect working dir)	2022-03-12 08:16:08 -08:00
SangBin Cho	c0f8de9c3c	[Nightly tests] Run benchmark tests on k8s as well (#23100 ) Run benchmark tests on k8s as well. Note that until k8s cluster stability is confirmed, we will run the same tests twice, on AWS and on k8s. Once all benchmark tests look stable, we will start the full migration.	2022-03-11 19:40:37 -08:00
SangBin Cho	97383e4c1b	[Nightly test] Fix a broken nightly test due to the wrong config (#23097 )	2022-03-11 16:47:06 -08:00
SangBin Cho	2b38fe89e2	[Nightly tests] Migrate rest of core tests (#23085 ) Migrate the rest of the core tests.	2022-03-11 10:41:14 -08:00
Kai Fricke	04ea180dfb	[ci/release] Add "tiny" concurrency group, change limits (#23065 ) E.g. long running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups. Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs are now small-gpu, 9+ GPUs large-gpu, instead of 7 for small-gpu and 8 for large-gpu).	2022-03-11 10:19:38 -08:00
Kai Fricke	a8bed94ed6	[ci/release] Always use full cluster address (#23067 ) Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09	2022-03-11 16:31:21 +00:00
SangBin Cho	965d609627	[Nightly test] Fix a minor syntax issue for core nightly tests (#23069 ) Add frequency to smoke tests. Remove unnecessary alerts.	2022-03-11 04:58:40 -08:00
Kai Fricke	5b2d58674b	[ci/release] Migrate horovod tests (#22951 ) Migrating horovod tests to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/125	2022-03-11 09:53:29 +00:00
SangBin Cho	ebac18d163	[Nightly test] Support Job based file manager + runner (#22860 ) This PR supports the job-based file manager and runner. It will be the backbone of k8s migration. The PR handles edge cases that originally existed in the old e2e.py job-based runners.	2022-03-10 15:03:50 -08:00
SangBin Cho	92b50ff5da	Migrate multi nightly tests (#23005 )	2022-03-11 01:32:10 +09:00
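
The random access commit in the log above (#22749) describes partitioning a dataset by a sort key and serving lookups via binary search over sorted blocks. The idea can be sketched in plain Python as follows. This is only an illustration of the technique, not Ray's implementation: `SortedBlockIndex` and its methods are hypothetical names, it runs in a single process with no worker actors, and it makes no claim about the throughput figures quoted in the commit message.

```python
import bisect
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class SortedBlockIndex:
    """Hypothetical single-process stand-in: sorted blocks plus binary search."""

    def __init__(self, records: List[Record], key: str, num_blocks: int) -> None:
        self.key = key
        ordered = sorted(records, key=lambda r: r[key])
        # Ceiling division so that at most `num_blocks` blocks are created.
        size = max(1, -(-len(ordered) // num_blocks))
        self.blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # First key of every block; used to route a lookup to one block.
        self.bounds = [block[0][key] for block in self.blocks]
        # Keys of each block, kept separately so bisect can search them.
        self.block_keys = [[r[key] for r in block] for block in self.blocks]

    def get(self, value: Any) -> Optional[Record]:
        """Return the record whose key equals `value`, or None if absent."""
        # Step 1: binary search over block boundaries to pick the block.
        b = bisect.bisect_right(self.bounds, value) - 1
        if b < 0:
            return None
        # Step 2: binary search inside the chosen block.
        keys = self.block_keys[b]
        i = bisect.bisect_left(keys, value)
        if i < len(keys) and keys[i] == value:
            return self.blocks[b][i]
        return None

    def multiget(self, values: List[Any]) -> List[Optional[Record]]:
        """Look up several keys at once; missing keys map to None."""
        return [self.get(v) for v in values]
```

For example, `SortedBlockIndex(records, "id", num_blocks=2).multiget([1, 9, 2])` returns the records with ids 1 and 9 and `None` for the missing id 2. In the feature the commit describes, each block would be held by a worker actor rather than in one local list.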

546 commits
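
Commit 2c2d96eeb1 (#23108) mentions using exponential backoff on an unstable HTTP path when fetching job status. A minimal sketch of that retry pattern follows. The function name, parameters, and default delays are assumptions made for illustration and are not taken from the Ray release test code; the `sleep` argument is injectable only so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 5,
    initial_delay_s: float = 1.0,
    factor: float = 2.0,
    max_delay_s: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds, multiplying the wait by `factor` each time."""
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            # Wait, capped at `max_delay_s`, then grow the delay.
            sleep(min(delay, max_delay_s))
            delay *= factor
    raise RuntimeError("max_attempts must be at least 1")
```

A caller would wrap the flaky request, for instance `retry_with_backoff(lambda: fetch_status(job_id))`, where `fetch_status` is a placeholder for whatever call retrieves the job status.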