hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	ca87c37c61	[ci/release] Fix result output in Buildkite pipeline run (#22946 ) The new buildkite pipeline prints out faulty results due to a confusion of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (so behavior was still correct) that is resolved with this PR.	2022-03-09 17:29:31 +00:00
Edward Oakes	2cac49e4b0	[serve][release tests] Mark long-running failure test as non-stable (#22922 )	2022-03-09 09:42:47 -06:00
Kai Fricke	ac654dbb9d	[ci/release] Fix schema validation for single tests / add `stable` field (#22947 ) This currently leads to failing builds for schema validation errors after #22901 was merged (the stable column was incorrectly not added to the schema before).	2022-03-09 15:22:49 +00:00
Kai Fricke	cac9d30909	[ci/release] Add schema validation for release test config (#22919 ) To avoid breakage like in #22905, this PR adds schema validation to the release test package. In a follow-up PR, we'll likely switch this to use pydantic instead.	2022-03-09 09:50:51 +00:00
Edward Oakes	aa907987bf	[serve][release tests] Use m5.8xlarge instance types for 1k replica tests (#22918 )	2022-03-08 21:34:01 -06:00
SangBin Cho	549527687f	Migrate scalability tests (#22901 ) This PR migrates scalability tests to the new infra. I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we could support a path other than /release in the tool, but I think this way is cleaner. I am open to suggestions, though. cc @krfricke	2022-03-08 17:22:41 -08:00
Kai Fricke	c57abb693b	[ci/release] Add frequency to core nightly test (#22905 ) Breaks the scheduled build: https://buildkite.com/ray-project/release-tests-branch/builds/82#3994f5e1-6da3-4c70-8c30-bdcfb1fec851 We should enforce schema validation soon.	2022-03-08 17:44:20 +00:00
SangBin Cho	0137fc8e23	[Tests] Add microbenchmark to the new infra test (#22861 ) Verified it works. It also addresses the frequency comments from the previous PR	2022-03-08 05:58:49 -08:00
Stephanie Wang	cb218d03b9	[core] Enable lineage reconstruction by default (#22816 ) Enables lineage reconstruction, which allows automatic recovery of task outputs, by default. Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).	2022-03-07 17:40:30 -05:00
SangBin Cho	529911ee78	[Nightly tests] Add missing patches (#22862 ) These changes are added to the old e2e.py, but not to the new infra	2022-03-07 19:48:43 +00:00
Jiajun Yao	1b5efb588e	[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844 ) A millisecond timestamp is easier to sort and compare, and it avoids timezone issues.	2022-03-07 04:54:19 -08:00
SangBin Cho	9d0148dbbe	[Test] Migrate the first test to the new infra (#22770 ) This migrates the simplest nightly test to the new infra. I will also explore k8s migration with this test.	2022-03-06 18:24:54 -08:00
Jiajun Yao	23f2862067	[Release Test] Send release test result to db pipeline for new test infra (#22813 ) * Send release test result to db pipeline for new test infra * address comment	2022-03-05 07:34:40 +09:00
xwjiang2010	ee7a458762	[release test] fix horovod release test. (#22781 ) horovod_user_test_master is failing with the recent horovod release ([link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)). The error message is: `AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'>`. The horovod test is set up with a "driver" (a.k.a. client) part, which is the code that runs in a Buildkite agent, and a "cluster" (a.k.a. server) part, which runs in an Anyscale cluster. The driver's dependencies are specified by `release/ml_user_tests/horovod/driver_setup_master.sh`, while the cluster's dependencies are specified by `release/horovod_tests/app_config_master.yaml`. The two communicate via the Anyscale client. The error message above means that the client's horovod version has _ExecutorDriver in runner.py while the server's does not, because the two files specify mismatched versions. This PR makes both horovod dependencies point to horovod master.	2022-03-03 08:24:26 -08:00
Clark Zinzow	fa44ec82f3	Add Parquet metadata resolution nightly test to test set. (#22787 )	2022-03-02 14:56:00 -08:00
Kai Fricke	7425fa6212	[ci/release] Add support for concurrency groups (#22728 ) This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources. Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)	2022-03-02 16:35:54 +01:00
Jiajun Yao	04a1a19f6b	[Release Test] Send release test result to db pipeline (#22667 ) Send release test result to db pipeline Add perf metrics for microbenchmark so that we can alert on them	2022-03-02 06:19:31 -08:00
Kai Fricke	d06c3ffd6f	[release] Migrate Tune + XGBoost tests to new infrastructure (#22705 ) Migrate XGBoost and Tune tests to new release testing infrastructure. https://buildkite.com/ray-project/release-tests-branch/builds/50	2022-03-01 08:10:06 +01:00
SangBin Cho	2c1184592e	mark threaded actor test unstable (#22696 )	2022-02-28 15:25:14 -08:00
Clark Zinzow	cf3577f0ee	[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665 )	2022-02-28 15:15:30 -08:00
Chen Shen	7e90700521	[Dataset][nightly-test] promote data ingestion test to stable #22702	2022-02-28 14:00:18 -08:00
Kai Fricke	3695408a85	[release] Fix special cases in release test package (e.g. smoke test) (#22442 ) Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.	2022-02-28 21:05:01 +01:00
SangBin Cho	1cedb1b6e4	[Test] Increase timeout for microbenchmark (#22655 )	2022-02-25 17:29:12 -08:00
Sven Mika	7b687e6cd8	[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544 )	2022-02-25 21:58:16 +01:00
Archit Kulkarni	31332f8930	[serve] [release tests] Add health check grace period for 1k deployment (#22651 )	2022-02-25 12:13:44 -06:00
Archit Kulkarni	1165f99b0b	[CI] disable Serve microbenchmark k8s (#22631 )	2022-02-24 16:50:06 -08:00
Yi Cheng	de76d86bcb	[nightly] Stop GCS HA related nightly test (#22636 ) Since we've already turned it on on master, we should stop these tests for now.	2022-02-24 16:40:08 -08:00
Jun Gong	99b7be5e22	[rllib] Fix impala long running test (#22619 ) fix impala long running test. Bandits is the first agent that requires torch import at registration time.	2022-02-24 09:03:55 -08:00
SangBin Cho	5e847f7e09	[Usage Stats] Usage stats only enabled on nightly test infra (#22591 ) This PR enables the usage stats only on the release test infrastructure (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.	2022-02-23 22:11:48 -08:00
Eric Liang	e15a419028	Enable stage fusion by default for dataset pipelines (#22476 ) This PR enables stage fusion for dataset pipelines. This also requires: 1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage. 2. Removing spread_resource_prefix (not supported for now).	2022-02-23 17:34:05 -08:00
Max Pumperla	29d94a2211	[docs] sphinx gallery removal, migrate to ipynb (#22467 )	2022-02-19 01:19:07 -08:00
Jiajun Yao	baa14d695a	Round robin during spread scheduling (#21303 ) - Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently. - Simple round robin for spread scheduling: this is just a starting implementation and can be optimized later. - Prefer not to spill back tasks that are waiting for args, since the pull is already in progress.	2022-02-18 15:05:35 -08:00
Stephanie Wang	03a5589591	[core] Enable lineage reconstruction in CI (#21519 ) Enables lineage reconstruction in all CI and release tests.	2022-02-18 11:04:20 -08:00
Chen Shen	17f589a05d	[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479	2022-02-17 15:20:39 -08:00
mwtian	05dd72101b	[Release 1.11.0] Release logs for 1.11.0rc1 (#22443 ) This is the release log for 1.11.0rc1, with GCS-Ray enabled. The diff is against 1.11.0rc0, without GCS-Ray.	2022-02-16 17:03:49 -08:00
Chen Shen	30ec0df9cc	[placement group] fix pg benchmark regression #22441 We added a warmup time in timeit, which affects the pg benchmark time accounting. This PR adds an option to skip the warmup.	2022-02-16 16:24:51 -08:00
Jun Gong	a9147bb62c	[Release Test] Fix AnyscaleSDK construction so we can run CI on staging instance. (#22325 )	2022-02-16 09:56:02 -08:00
SangBin Cho	42361a1801	[Test] Fix Dask on Ray 1 TB bug #22431 Fixes a bug. It seems that `not df` does not work with a dataframe.	2022-02-17 02:44:36 +09:00
Kai Fricke	331b71ea8d	[ci/release] Refactor release test e2e into package (#22351 ) Adds a unit-tested and restructured ray_release package for running release tests. Relevant changes in behavior: By default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can specify a) a different commit hash, b) a wheels URL (which we will also wait for to be available), or c) a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches the old default behavior). The main subpackages are: Cluster manager (creates cluster envs/computes, starts cluster, terminates cluster); Command runner (runs commands, e.g. as client command or sdk command); File manager (uploads/downloads files to/from session); Reporter (reports results, e.g. to database). Much of the code base is unit tested, but there are probably some pieces missing. Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_ Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023	2022-02-16 17:35:02 +00:00
SangBin Cho	2ed5bb7a5f	[Nightly Test] Addressed client failure properly (#22438 ) When the client returns the code that's not 0, we should raise RuntimeError to properly propagate errors	2022-02-16 09:03:17 -08:00
Jun Gong	04dd536987	[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426 )	2022-02-16 13:06:09 +01:00
Kai Fricke	c866131cc0	[tune] Retry cloud sync up/down/delete on fail (#22029 )	2022-02-15 12:27:29 +00:00
SangBin Cho	640d92c385	It seems like the S3 read sometimes fails; #22214. I found out the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.	2022-02-12 11:58:58 +09:00
Jun Gong	cbd24503b6	[RLlib] Add A3C to RLlib performance regression tests. (#22316 )	2022-02-11 21:18:53 +01:00
Archit Kulkarni	da57012cbc	Add comment to periodic CI pipeline to update release process doc when updating test suites (#22037 ) This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary. This is an action item from the Ray 1.10.0 release retrospective.	2022-02-11 11:14:24 -06:00
Chen Shen	0866a5558f	[Dataset][nightly-test] pin pyarrow==4.0.1 for dataset related tests (#22277 ) * pin pyarrow==4.0.1 * address comments	2022-02-10 14:22:41 -08:00
Sven Mika	04a5c72ea3	Revert "Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test."" (#18708 )	2022-02-10 13:44:22 +01:00
mwtian	47a56ca062	[Release] Add release logs for 1.11.0rc0 (GCS KV & pubsub not enabled) (#22041 )	2022-02-10 00:03:31 -08:00
SangBin Cho	30000ff8ae	Fix a bug from many drivers. (#22248 ) After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains a string that cannot be encoded with ascii. It seems like using utf-8 solves the problem.	2022-02-09 15:17:15 -08:00
Alex Wu	b122f093c1	Revert "[RLlib] Speedup A3C up to 3x (new `training_iteration` function instead of `execution_plan`) and re-instate Pong learning test." (#22250 ) Reverts ray-project/ray#22126 Breaks rllib:tests/test_io	2022-02-09 09:26:36 -08:00
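
Several commits above (cac9d30909, ac654dbb9d, c57abb693b) concern schema validation of the release test config. The following is a minimal sketch of that idea, not the actual ray_release implementation: the function name `validate_test_config`, the `name` field, and all field types are assumptions for illustration. Only the `stable` and `frequency` fields are mentioned in the commit messages.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> expected Python type.
# Only `stable` and `frequency` appear in the commit messages; the
# `name` field and every type below are assumptions.
REQUIRED_FIELDS: Dict[str, type] = {
    "name": str,
    "frequency": str,
}
OPTIONAL_FIELDS: Dict[str, type] = {
    "stable": bool,
}


def validate_test_config(test: Dict[str, Any]) -> List[str]:
    """Return a list of schema errors for one test entry (empty if valid)."""
    errors: List[str] = []
    # Required fields must be present and have the expected type.
    for field, expected in REQUIRED_FIELDS.items():
        if field not in test:
            errors.append(f"missing required field: {field}")
        elif not isinstance(test[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    # Optional fields are only type-checked when present.
    for field, expected in OPTIONAL_FIELDS.items():
        if field in test and not isinstance(test[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    # Fields the schema does not know about are rejected.
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    for field in test:
        if field not in known:
            errors.append(f"unknown field: {field}")
    return errors
```

Rejecting unknown fields is what makes a strict schema brittle in the way ac654dbb9d describes: a config that starts using a new field fails validation until the schema itself lists that field.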

533 commits