hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

Author	SHA1	Message	Date
mwtian	02fda97c86	[CI] Re-balance concurrency groups to allow more quota for `large` tests (#24344 ) Currently nightly tests are unable to finish in a day because of concurrency group limit on `large` tests. This is an attempt to adjust the limits so buildkite can run / finish more tests. I will observe which tests fall into the `enormous` group and adjust the test resource / concurrency group limits again.	2022-04-29 22:26:16 +01:00
Kai Fricke	ac036e4fe8	[ci/release] Print local environment information (#24346 ) For debugging client environments, it is helpful to print the installed pip packages. Additionally, a fix for the environment of the ml_user_tune_rllib_connect_test is added. Additionally, anyscale import errors are reported verbosely to help debug missing packages.	2022-04-29 21:01:50 +01:00
Kai Fricke	dd87e61808	[ci/release] Fix module import errors in release tests (#24334 ) After https://github.com/ray-project/ray/pull/24066, some release tests are running into: ``` ModuleNotFoundError: No module named 'ray.train.impl' ``` This PR simply adds a `__init__.py` file to resolve this. We also add a 5 wecond delay for client runners in release test to give clusters a bit of slack to come up (and avoid ray client connection errors)	2022-04-29 17:03:17 +01:00
Kai Fricke	f3857b7aa1	[ci/release] Fix concurrency group calculation for smoke tests (#24269 ) Currently concurrency groups are always calculated based on the full test cluster compute. Instead, smoke tests should use the smoke test cluster compute.	2022-04-27 22:13:25 +01:00
Kai Fricke	6e37a48632	[ci/release] Allow for preferring smoke tests when filtering (#23887 ) What: Adds a setting "prefer_smoke_tests" to the Buildkite settings. With this, user can specify to kick off smoke tests, if available. Why: The filtering interface of the release testing dialog is a bit complicated at the moment - in order to kick off smoke tests, users have to know with which frequency they are configured to run. Instead users should usually just filter the tests they want to run (using frequency ANY) and optionally specify to run smoke tests, if available.	2022-04-14 06:12:27 +01:00
Kai Fricke	5e1218aae1	[ci/release] Quote pip installs in client runner (#23888 ) What: Quotes pip install packages in local environment setup for client runner. Why: Strings like pyarrow>=6.0.1<7.0.0 currently don't work as they are interpreted as output redirection.	2022-04-13 11:07:12 +01:00
Eric Liang	1ff874e8e8	[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817 )	2022-04-10 16:12:53 -07:00
Kai Fricke	73d1610e69	[ci/release] Fix pipeline build for empty PR repo (#23775 ) What: If BUILDKITE_PULL_REQUEST_REPO is empty string, default to DEFAULT_REPO Why: BUILDKITE_PULL_REQUEST_REPO is set to an empty string per default, thus we're currently not detecting the buildkite repo correctly in branched builds.	2022-04-07 09:29:48 -07:00
Kai Fricke	7b86a05efd	[ci/release] Parse PR github repos correctly (#23757 ) What: Correctly infer github repo from PRs iin Buildkite environments Why: For PRs, we need to checkout the correct github repo and branch so we can kick off release tests directly from PRs. Test run (from this PR!): https://buildkite.com/ray-project/release-tests-pr/builds/20#7f5a6526-0040-4896-b23a-f4896c75973d	2022-04-06 17:34:20 -07:00
Kai Fricke	40a8183e05	[ci/release] Fix job-based file download (#23657 ) have to wrap download call in a lambda to be compatible with run_with_retry	2022-04-04 08:06:31 -07:00
Kai Fricke	9071b39f3e	[ci/release] Add buildkite output groups (#23658 ) This makes the buildkite output easier to parse and interpret.	2022-04-01 13:04:22 -07:00
Kai Fricke	fe27dbcd9a	[air/release] Improve file packing/unpacking (#23621 ) We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes. Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).	2022-04-01 07:38:14 -07:00
Jiajun Yao	2959294f02	[CI] Filter release tests by attr regex (#23485 ) Support filtering tests by test attr regex filters. Multiple filters can be specified with one line for each filter. The format is attr:regex (e.g. team:serve)	2022-03-30 09:41:18 -07:00
Kai Fricke	e8abffb017	[tune/release] Improve Tune cloud release tests for durable storage (#23277 ) This PR addresses recent failures in the tune cloud tests. In particular, this PR changes the following: The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests. We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected) We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days. Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes). Local release test runs succeeded. https://buildkite.com/ray-project/release-tests-branch/builds/189 https://buildkite.com/ray-project/release-tests-branch/builds/191	2022-03-30 09:28:33 -07:00
Yi Cheng	7de751dbab	[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518 ) This PR remove enable_gcs_bootstrap flag in cpp.	2022-03-28 21:37:24 -07:00
Kai Fricke	724377163f	[ci/release] Unstable tests should only soft fail the build (#23403 ) This will leave the tests green if the test is failing but marked as unstable.	2022-03-23 09:38:56 +00:00
SangBin Cho	0cd687cc19	[Nightly test] Fix job download retry (#23401 ) Currently when we download a file to the cluster using a job, we don't do the retry.	2022-03-22 08:31:24 -07:00
Kai Fricke	02644ab4d8	[ci/release] Retry cluster env build on failure (#23378 ) Failed cluster env builds should be retried.	2022-03-22 09:45:22 +00:00
Kai Fricke	7085749d50	[tune] Adjust release test timeouts (#23362 ) Currently release tests fail because they exceed the (rather arbitrary) timeout by 1-2 seconds.	2022-03-20 17:05:20 +00:00
Dmitri Gekhtman	561e7a9677	[RELEASE] Add autoscaler env to fix nightly tests (#23345 ) The product backend doesn't yet understand that nightly Ray uses GCS-Ray. (This will be fixed when the next time the product control plane is deployed.) This PR introduces the env required to signal to the product backend that we're using GCS-Ray so that the autoscaler can startup correctly.	2022-03-18 17:48:27 -07:00
Kai Fricke	ca5354ffb1	[ci/release] Fix test_wheels (#23329 )	2022-03-18 14:39:36 +00:00
Kai Fricke	3cf8116df2	[ci/release] Re-enable commit sanity check (#23327 ) Commit sanity checks are currently seemingly disabled. This PR re-enables them by parsing wheel URLs.	2022-03-18 12:57:41 +00:00
Kai Fricke	da140a80e9	[ci/release] Legacy field should be optional (#23326 ) #22749 broke release unit tests by not providing a legacy key - that key should be optional because we will b dealing with non-legacy tests soon. Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...	2022-03-18 11:34:05 +00:00
Kai Fricke	e510d81c71	[ci/release] Save test config and results as artifacts (#23278 ) It is good to have these information readily available when checking test results, as it will reveal both the original configuration (that could change over time) as well as the achieved results. Also gets rid of the unneeded old alerts directory. https://buildkite.com/ray-project/release-tests-branch/builds/190#ef531787-412c-40ec-81e6-beb495830c60	2022-03-18 09:26:42 +00:00
mwtian	391901f86b	[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233 ) Continue to clean up Redis and other related Redis references, for - gcs_utils.py - log_monitor.py - `publish_error_to_driver()`	2022-03-16 19:34:57 -07:00
Kai Fricke	eca5bcfc87	[ci/release] Reload modules after installing matching Ray (#23227 ) Apparently, ray gets imported somewhere before running the client runner (maybe from an anyscale package). This means that we need to reload the ray package after installing a matching local ray wheel. Additionally, job submission should also install a matching local ray to match with the job submission server.	2022-03-16 15:44:43 +00:00
Kai Fricke	15aeb33e50	[ci/release] Support PR wheels (#23084 ) This PR adds support to find wheels for PRs to run OSS release tests on, i.e. --ray-wheels user:branch to work.	2022-03-14 17:24:13 +00:00
Kai Fricke	d93fa95dd5	[ci/release] Only report results for scheduled builds (#23135 ) Currently, all buildkite runs report per default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.	2022-03-14 15:10:16 +00:00
Kai Fricke	830238cce2	[ci/release] Migrate ML user tests (#22953 ) Most recent tests: https://buildkite.com/ray-project/release-tests-branch/builds/156 https://buildkite.com/ray-project/release-tests-branch/builds/158	2022-03-14 11:50:16 +00:00
SangBin Cho	2c2d96eeb1	[Nightly tests] Improve k8s testing (#23108 ) This PR improves broken k8s tests. Use exponential backoff on the unstable HTTP path (getting job status sometimes has broken connection from the server. I couldn't really find the relevant logs to figure out why this is happening, unfortunately). Fix benchmark tests resource leak check. The existing one was broken because the job submission uses 0.001 node IP resource, which means the cluster_resources can never be the same as available resources. I fixed the issue by not checking node IP resources K8s infra doesn't support instances < 8 CPUs. I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but it wouldn't be very big.	2022-03-14 03:49:15 -07:00
Kai Fricke	430ea3e636	[ci/release] Migrate golden notebook tests (#22949 ) Migrating golden notebook tests to new release test package. Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155	2022-03-13 21:39:41 +00:00
Kai Fricke	956ad95d67	[ci/release] Fix release test config (#23122 ) Currently the test is failing due to an invalid config (merged before validation was properly enforced).	2022-03-13 19:48:34 +00:00
Kai Fricke	c7303f538c	[ci/release] Validate smoke test fields, enforce frequency (#23075 ) Of all smoke test arguments, frequency is the only required one, so we should check for it. Additionally, not all fields should be able to be overwritten (e.g. legacy or name), so we enforce this as well.	2022-03-13 18:48:03 +00:00
Kai Fricke	04ea180dfb	[ci/release] Add "tiny" concurrency group, change limits (#23065 ) E.g. long running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups. Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs are now small-gpu, 9+ GPUs large-gpu, instead of 7 for small-gpu and 8 for large-gpu).	2022-03-11 10:19:38 -08:00
Kai Fricke	a8bed94ed6	[ci/release] Always use full cluster address (#23067 ) Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09	2022-03-11 16:31:21 +00:00
SangBin Cho	ebac18d163	[Nightly test] Support Job based file manager + runner (#22860 ) This PR supports the job-based file manager and runner. It will be the backbone of k8s migration. The PR handles edge cases that originally existed in the old e2e.py job-based runners.	2022-03-10 15:03:50 -08:00
SangBin Cho	d192ec30fd	[Nightly Tests] Readjust the concurrency limit. (#23002 ) This PR reduces the concurrency limit. Based on the back of envelope calculation, the current concurrency limit can easily exceed the service quota. Given large == 2048 vCPUs, it will use about 20K vCPUs, which is slightly larger than the limit.	2022-03-10 07:19:38 -08:00
Kai Fricke	ac654dbb9d	[ci/release] Fix schema validation for single tests / add `stable` field (#22947 ) This currently leads to failing builds for schema validation errors after #22901 was merged (the stable column was incorrectly not added to the schema before).	2022-03-09 15:22:49 +00:00
Kai Fricke	cac9d30909	[ci/release] Add schema validation for release test config (#22919 ) To avoid breakage like in #22905, this PR adds schema validation to the release test package. In a follow-up PR, we'll likely switch this to use pydantic instead.	2022-03-09 09:50:51 +00:00
SangBin Cho	529911ee78	[Nightly tests] Add missing patches (#22862 ) These changes are added to the old e2e.py, but not to the new infra	2022-03-07 19:48:43 +00:00
Jiajun Yao	1b5efb588e	[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844 ) This's easier to sort and compare timestamp and avoid timezone issue.	2022-03-07 04:54:19 -08:00
SangBin Cho	9d0148dbbe	[Test] Migrate the first test to the new infra (#22770 ) This migrate the simplest nightly test to the new infra. I will also explore k8s migration with this test	2022-03-06 18:24:54 -08:00
Jiajun Yao	23f2862067	[Release Test] Send release test result to db pipeline for new test infra (#22813 ) * Send release test result to db pipeline for new test infra * address comment	2022-03-05 07:34:40 +09:00
Kai Fricke	7425fa6212	[ci/release] Add support for concurrency groups (#22728 ) This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources. Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)	2022-03-02 16:35:54 +01:00
Kai Fricke	3695408a85	[release] Fix special cases in release test package (e.g. smoke test) (#22442 ) Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.	2022-02-28 21:05:01 +01:00
Kai Fricke	331b71ea8d	[ci/release] Refactor release test e2e into package (#22351 ) Adds a unit-tested and restructured ray_release package for running release tests. Relevant changes in behavior: Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior). The main subpackages are: Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster Command runner: Runs commands, e.g. as client command or sdk command File manager: Uploads/downloads files to/from session Reporter: Reports results (e.g. to database) Much of the code base is unit tested, but there are probably some pieces missing. Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_ Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023	2022-02-16 17:35:02 +00:00

46 commits