Fix a bug introduced by the previous fixes.
Add more tests.
Stop using m5.xlarge (not currently supported).
There are two hard blockers from the infra side: 1. Large disks are not supported. 2. m5.xlarge instances are not supported. Both are considered high priority and should be fixed soon.
Apparently, ray gets imported somewhere before running the client runner (maybe from an anyscale package). This means that we need to reload the ray package after installing a matching local ray wheel.
Additionally, job submission should install a matching local ray so that the client version matches the job submission server.
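Roughly, the reload workaround looks like this (a sketch; the helper name and wheel path handling are placeholders, not the actual runner code):

```python
import subprocess
import sys


def install_matching_local_ray(local_wheel_path: str) -> None:
    # Install the local Ray wheel that matches the cluster / job submission server.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--force-reinstall", local_wheel_path]
    )

    # ray may already be imported (e.g. via an anyscale dependency), so drop the
    # cached modules and re-import to pick up the freshly installed wheel.
    for name in [m for m in sys.modules if m == "ray" or m.startswith("ray.")]:
        del sys.modules[name]

    import ray  # noqa: F401  # re-imported from the new wheel
```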
Currently, all buildkite runs report by default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.
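A minimal sketch of the gating logic (the override variable name is hypothetical, and the exact `BUILDKITE_SOURCE` value for scheduled builds should be treated as an assumption):

```python
import os


def should_report() -> bool:
    # Report results for scheduled builds...
    if os.environ.get("BUILDKITE_SOURCE") == "schedule":
        return True
    # ...or when explicitly overridden (hypothetical variable name).
    return os.environ.get("REPORT_RESULT", "0") == "1"
```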
Infra errors are now tackled with concurrency groups, so we can disable older mitigation methods like automatic infra retry for now.
We keep the script since it handles other logic (e.g. checking out the local test branch), and infra retry can still be enabled via an environment variable if needed.
This PR fixes broken k8s tests.
Use exponential backoff on the unstable HTTP path (getting the job status sometimes results in a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why this is happening).
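A minimal sketch of the backoff wrapper (the job status call itself is hypothetical, so it is passed in as a callable):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(fn: Callable[[], T], max_retries: int = 5, base_delay: float = 1.0) -> T:
    # Retry transient connection failures with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(max_retries):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")


# Hypothetical usage:
# status = with_backoff(lambda: client.get_job_status(job_id))
```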
Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal available_resources. I fixed the issue by excluding node IP resources from the check.
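Roughly, the adjusted check looks like this (a sketch; the real helper name and any tolerance handling may differ):

```python
import ray


def has_resource_leak() -> bool:
    # Ignore node IP resources (node:<ip>): job submission holds 0.001 of the
    # node IP resource, so the totals can never match exactly on those keys.
    total = {k: v for k, v in ray.cluster_resources().items() if not k.startswith("node:")}
    available = {k: v for k, v in ray.available_resources().items() if not k.startswith("node:")}
    return total != available
```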
The k8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of m5.xlarge. This will increase the cost a bit, but not by much.
Of all smoke test arguments, frequency is the only required one, so we should check for it. Additionally, not all fields should be overridable (e.g. legacy or name), so we enforce this as well.
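A sketch of the intended validation; the helper name is illustrative and the forbidden field set only reflects the examples above:

```python
FORBIDDEN_SMOKE_TEST_OVERRIDES = {"legacy", "name"}


def validate_smoke_test(smoke_test: dict) -> None:
    # frequency is the only required smoke test argument.
    if "frequency" not in smoke_test:
        raise ValueError("Smoke test must specify a frequency.")
    # Some fields must never be overridden by the smoke test block.
    forbidden = FORBIDDEN_SMOKE_TEST_OVERRIDES & set(smoke_test)
    if forbidden:
        raise ValueError(f"Smoke test must not override fields: {sorted(forbidden)}")
```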
Run benchmark tests on k8s as well.
Note that until k8s cluster stability is confirmed, we will run the same tests twice, on AWS and on k8s. Once all benchmark tests look stable, we will start the full migration.
For example, long-running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups.
Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs now count as small-gpu and 9+ GPUs as large-gpu, instead of 7 for small-gpu and 8+ for large-gpu).
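For illustration, the grouping could be expressed as a mapping like the one below; only the 8 vs. 9+ GPU boundary comes from the description above, while the group names and limits are placeholders:

```python
# Placeholder limits: how many tests of each group may run concurrently.
CONCURRENCY_LIMITS = {
    "small-gpu": 4,
    "large-gpu": 2,
}


def gpu_group(num_gpus: int) -> str:
    # Up to 8 GPUs -> small-gpu, 9+ GPUs -> large-gpu.
    return "small-gpu" if num_gpus <= 8 else "large-gpu"
```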
This PR adds support for the job-based file manager and runner. It will be the backbone of the k8s migration.
The PR handles edge cases that existed in the old e2e.py job-based runners.
The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.
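A minimal sketch of what such a data structure could look like (the field names are assumptions, not the actual Serve API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Application:
    # Hypothetical fields: everything needed to deploy Serve on a Ray cluster.
    name: str
    deployments: List[Any] = field(default_factory=list)
    config: Dict[str, Any] = field(default_factory=dict)
```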
This PR reduces the concurrency limit. Based on a back-of-the-envelope calculation, the current concurrency limit can easily exceed the service quota.
Given that large == 2048 vCPUs, it would use about 20K vCPUs, which is slightly above the limit.
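As a rough illustration (the per-group limit of 10 is an assumption, not the configured value): 10 concurrent large tests × 2048 vCPUs = 20,480 vCPUs, which is where the ~20K figure above comes from.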
Horovod updated the attributes of DistributedTrainableCreator and the args used to create the Horovod RayExecutor.
horovod/horovod@a729ba7
The major issue is that Horovod deprecated the "slot" concept in favor of "worker", which is more consistent with the generic Ray worker. This issue is currently blocking Uber's DL trainers from using Ray Tune.
This commit updates the Horovod RayExecutor init args.
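Roughly, the change is a rename from slot-based to worker-based arguments; the parameter names below are approximate and should be treated as assumptions rather than the exact Horovod/Ray Tune API:

```python
# Before (slot-based, deprecated upstream) -- approximate signatures:
# RayExecutor(settings, num_hosts=2, num_slots=4, use_gpu=True)
# DistributedTrainableCreator(train_fn, num_hosts=2, num_slots=4)

# After (worker-based, matching the generic Ray worker concept):
# RayExecutor(settings, num_workers=8, use_gpu=True)
# DistributedTrainableCreator(train_fn, num_workers=8)
```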
Co-authored-by: Kai Fricke <kai@anyscale.com>
The new buildkite pipeline prints out faulty results due to a mix-up of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (behavior was still correct) that is resolved with this PR.
This currently leads to builds failing with schema validation errors after #22901 was merged (the stable column was incorrectly missing from the schema before).
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
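A minimal sketch of the validation step using jsonschema (the file layout and schema shape are assumptions):

```python
import json

import jsonschema
import yaml


def validate_release_test_collection(config_path: str, schema_path: str) -> None:
    # Validate every release test definition against the JSON schema so that a
    # missing column (like `stable`) is caught at lint time, not at runtime.
    with open(schema_path) as f:
        schema = json.load(f)
    with open(config_path) as f:
        tests = yaml.safe_load(f)
    for test in tests:
        jsonschema.validate(instance=test, schema=schema)
```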
This PR migrates scalability tests to the new infra.
I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we could support a path other than /release from the tool, but I think this way is cleaner. I am open to suggestions though, cc @krfricke
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.
Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).
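As an illustration of what this means for user code (a sketch; the node failure is described in comments rather than simulated):

```python
import ray

ray.init()


@ray.remote(max_retries=3)
def produce():
    return "some large object"


ref = produce.remote()

# If the node holding the object behind `ref` dies, lineage reconstruction
# re-executes `produce` automatically (up to max_retries) instead of failing
# with a lost-object error, and the driver logs an info message about it.
print(ray.get(ref))
```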