Commit graph

716 commits

Author SHA1 Message Date
Kai Fricke
d06c3ffd6f
[release] Migrate Tune + XGBoost tests to new infrastructure (#22705)
Migrate XGBoost and Tune tests to new release testing infrastructure.

https://buildkite.com/ray-project/release-tests-branch/builds/50
2022-03-01 08:10:06 +01:00
SangBin Cho
2c1184592e
mark threaded actor test unstable (#22696) 2022-02-28 15:25:14 -08:00
Clark Zinzow
cf3577f0ee
[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665) 2022-02-28 15:15:30 -08:00
Chen Shen
7e90700521
[Dataset][nightly-test] promote data ingestion test to stable #22702 2022-02-28 14:00:18 -08:00
Kai Fricke
3695408a85
[release] Fix special cases in release test package (e.g. smoke test) (#22442)
Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.
2022-02-28 21:05:01 +01:00
SangBin Cho
1cedb1b6e4
[Test] Increase timeout for microbenchmark (#22655) 2022-02-25 17:29:12 -08:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Archit Kulkarni
31332f8930
[serve] [release tests] Add health check grace period for 1k deployment (#22651) 2022-02-25 12:13:44 -06:00
Archit Kulkarni
1165f99b0b
[CI] disable Serve microbenchmark k8s (#22631) 2022-02-24 16:50:06 -08:00
Yi Cheng
de76d86bcb
[nightly] Stop GCS HA related nightly test (#22636)
Since we've already turned it on on master, we should stop these tests for now.
2022-02-24 16:40:08 -08:00
Jun Gong
99b7be5e22
[rllib] Fix impala long running test (#22619)
Fix the IMPALA long-running test.
Bandits is the first agent that requires a torch import at registration time.
2022-02-24 09:03:55 -08:00
SangBin Cho
5e847f7e09
[Usage Stats] Usage stats only enabled on nightly test infra (#22591)
This PR **enables the usage stats only on the release test infrastructure** (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.
2022-02-23 22:11:48 -08:00
Eric Liang
e15a419028
Enable stage fusion by default for dataset pipelines (#22476)
This PR enables stage fusion for dataset pipelines. This also requires:
1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage.
2. Removing spread_resource_prefix (not supported for now).
2022-02-23 17:34:05 -08:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Jiajun Yao
baa14d695a
Round robin during spread scheduling (#21303)
- Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling: this is just a starting implementation, can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
2022-02-18 15:05:35 -08:00
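
As a rough illustration of the simple round-robin idea described in the commit above: the real scheduler is part of Ray's core, so the class name, node names, and feasibility callback below are hypothetical and only sketch the policy, not the actual implementation.

```python
from typing import Callable, List, Optional


class RoundRobinSpreadPolicy:
    """Hypothetical sketch: pick nodes in rotation, skipping infeasible ones."""

    def __init__(self, nodes: List[str]) -> None:
        self._nodes = list(nodes)
        self._next_index = 0  # position where the next search starts

    def schedule(self, is_feasible: Callable[[str], bool]) -> Optional[str]:
        """Return the next feasible node, or None if no node can run the task."""
        n = len(self._nodes)
        for offset in range(n):
            index = (self._next_index + offset) % n
            node = self._nodes[index]
            if is_feasible(node):
                # Resume after the chosen node so consecutive tasks spread out.
                self._next_index = (index + 1) % n
                return node
        return None


policy = RoundRobinSpreadPolicy(["node-a", "node-b", "node-c"])
# Assume node-b has no free resources, so it is skipped on every pass.
placements = [policy.schedule(lambda node: node != "node-b") for _ in range(4)]
```

Here the four tasks alternate between the two feasible nodes instead of piling onto the first one.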
Stephanie Wang
03a5589591
[core] Enable lineage reconstruction in CI (#21519)
Enables lineage reconstruction in all CI and release tests.
2022-02-18 11:04:20 -08:00
Chen Shen
17f589a05d
[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479 2022-02-17 15:20:39 -08:00
mwtian
05dd72101b
[Release 1.11.0] Release logs for 1.11.0rc1 (#22443)
This is the release log for 1.11.0rc1, with GCS-Ray enabled. The diff is against 1.11.0rc0, without GCS-Ray.
2022-02-16 17:03:49 -08:00
Chen Shen
30ec0df9cc
[placement group] fix pg benchmark regression #22441
We added a warmup time in timeit, which affects the pg benchmark time accounting. This adds an option to cancel the warmup.
2022-02-16 16:24:51 -08:00
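
A minimal sketch of how a warmup phase can skew timing, and how an opt-out might look. The function signature and parameter names here are assumptions for illustration and are not taken from Ray's benchmark helpers.

```python
import time
from typing import Callable


def timeit(
    fn: Callable[[], None],
    duration_sec: float = 0.1,
    warmup_time_sec: float = 0.05,
) -> float:
    """Return calls per second of fn; warmup_time_sec=0 skips the warmup."""
    # Warmup phase: run fn without counting it towards the result.
    warmup_end = time.perf_counter() + warmup_time_sec
    while warmup_time_sec > 0 and time.perf_counter() < warmup_end:
        fn()

    # Measured phase: count calls until the duration has elapsed.
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_sec:
        fn()
        count += 1
    return count / (time.perf_counter() - start)


calls = []
# Passing warmup_time_sec=0 means every call to fn is part of the measurement.
rate = timeit(lambda: calls.append(1), duration_sec=0.05, warmup_time_sec=0)
```

A benchmark that does its own accounting of how many operations ran (as a placement group benchmark might) would want the warmup disabled so that uncounted calls do not distort the totals.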
Jun Gong
a9147bb62c
[Release Test] Fix AnyscaleSDK construction so we can run CI on staging instance. (#22325) 2022-02-16 09:56:02 -08:00
SangBin Cho
42361a1801
[Test] Fix Dask on Ray 1 TB bug #22431
Fixes a bug. It seems that `not df` does not work with a dataframe.
2022-02-17 02:44:36 +09:00
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

By default, Buildkite waits for the wheels of the current commit to be available. Alternatively, users can specify a) a different commit hash, b) a wheels URL (which we will also wait for to be available), or c) a branch (or user/branch combination), in which case the latest available wheels are used (e.g. if master is passed, the behavior matches the old default).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00
SangBin Cho
2ed5bb7a5f
[Nightly Test] Addressed client failure properly (#22438)
When the client returns a non-zero code, we should raise RuntimeError to properly propagate errors.
2022-02-16 09:03:17 -08:00
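
The pattern in the commit above can be sketched with the standard library alone. The helper name `run_client_command` is hypothetical, and the child processes are stand-ins for a real client command.

```python
import subprocess
import sys
from typing import List


def run_client_command(args: List[str]) -> str:
    """Run a command and raise RuntimeError if it exits with a non-zero code."""
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command {args} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


# A command that succeeds returns its output.
ok_output = run_client_command([sys.executable, "-c", "print('ok')"])

# A command that exits with code 3 is surfaced as an exception.
try:
    run_client_command([sys.executable, "-c", "import sys; sys.exit(3)"])
    failure_message = None
except RuntimeError as exc:
    failure_message = str(exc)
```

Raising instead of returning the code means a caller cannot silently ignore a failed client run.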
Jun Gong
04dd536987
[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426) 2022-02-16 13:06:09 +01:00
Kai Fricke
c866131cc0
[tune] Retry cloud sync up/down/delete on fail (#22029) 2022-02-15 12:27:29 +00:00
SangBin Cho
640d92c385
It seems like the S3 read sometimes fails; #22214. I found out the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.
2022-02-12 11:58:58 +09:00
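
A retry mechanism for transient read errors, as described in the commit above, might look like the following sketch. The helper name, the choice of `OSError` as the retryable error, and the backoff values are assumptions, and `flaky_read` stands in for the real S3 read.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_retry(
    read_fn: Callable[[], T],
    max_attempts: int = 3,
    backoff_sec: float = 0.01,
) -> T:
    """Call read_fn, retrying on OSError up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(backoff_sec * attempt)  # wait a little longer each time
    raise ValueError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_read() -> bytes:
    """Stand-in for a read that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("transient read failure")
    return b"payload"


data = read_with_retry(flaky_read)
```

Because the file does exist, a small bounded number of retries is enough to hide the transient failure while still surfacing a persistent one.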
Jun Gong
cbd24503b6
[RLlib] Add A3C to RLlib performance regression tests. (#22316) 2022-02-11 21:18:53 +01:00
Archit Kulkarni
da57012cbc
Add comment to periodic CI pipeline to update release process doc when updating test suites (#22037)
This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary.

This is an action item from the Ray 1.10.0 release retrospective.
2022-02-11 11:14:24 -06:00
Chen Shen
0866a5558f
[Dataset][nightly-test] pin pyarrow==4.0.1 for dataset related tests (#22277)
* pin pyarrow==4.0.1

* address comments
2022-02-10 14:22:41 -08:00
Sven Mika
04a5c72ea3
Revert "Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test."" (#18708) 2022-02-10 13:44:22 +01:00
mwtian
47a56ca062
[Release] Add release logs for 1.11.0rc0 (GCS KV & pubsub not enabled) (#22041) 2022-02-10 00:03:31 -08:00
SangBin Cho
30000ff8ae
Fix a bug from many drivers. (#22248)
After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains a string that cannot be encoded as ASCII. Using UTF-8 seems to solve the problem.
2022-02-09 15:17:15 -08:00
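
The encoding failure mentioned in the commit above can be reproduced in a few lines. The script content below is a made-up example containing one non-ASCII character.

```python
# Hypothetical driver script text with a non-ASCII character (i with diaeresis).
script = "print('na\u00efve driver')\n"

# Encoding as ASCII fails because the character has no ASCII representation.
try:
    script.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode character, so the round trip is lossless.
utf8_bytes = script.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```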
Alex Wu
b122f093c1
Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test." (#22250)
Reverts ray-project/ray#22126

Breaks rllib:tests/test_io
2022-02-09 09:26:36 -08:00
Yi Cheng
8b1bbfe8e4
[e2e] Fix an error when "env_vars" is not set. (#22234)
To fix error in session https://buildkite.com/ray-project/periodic-ci/builds/2699#c532ed2b-ee89-48ad-a7db-fd4211ef8bd9
2022-02-08 22:05:53 -08:00
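
The commit above does not show how the fix was made. One common way to tolerate an optional key, sketched here under the assumption that the configuration is a plain dictionary (the helper name and sample keys are hypothetical), is to fall back to an empty mapping:

```python
from typing import Any, Dict


def get_env_vars(app_config: Dict[str, Any]) -> Dict[str, str]:
    """Return a copy of env_vars, treating a missing or null entry as empty."""
    return dict(app_config.get("env_vars") or {})


with_vars = get_env_vars({"env_vars": {"EXAMPLE_FLAG": "1"}})
missing = get_env_vars({"base_image": "example"})
null_value = get_env_vars({"env_vars": None})
```

Using `or {}` also covers the case where the key is present but set to null, which a plain `.get("env_vars", {})` would not.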
Yi Cheng
d8ac01bd5c
[e2e] Update e2e test to use redisless ray by default. (#22189)
As the title says: now that the infra has been updated, we need to merge this PR so that the tests can run Ray without Redis.
2022-02-08 19:46:48 -08:00
Sven Mika
ac3e6ab411
[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test. (#22126) 2022-02-08 19:04:13 +01:00
SangBin Cho
ac00389cbe
[Nightly test] Bring back the old way of running commands. (#22209)
Bring back the old way of running commands for non-k8s tests.

This also fixes the regression from many_drivers.py
2022-02-08 01:44:07 -08:00
Jiajun Yao
56c7b74072
Delete nightly shuffle_data_loader (#22185) 2022-02-07 15:23:34 -08:00
Eric Liang
00b5801d71
Fix datasets leaking worker processes due to closure capture of stats actor handle (#22156) 2022-02-07 14:05:44 -08:00
Jiajun Yao
355ee4a02c
Fix nightly shuffle_data_loader by pinning down dependencies versions (#22183) 2022-02-07 11:25:30 -08:00
Chen Shen
13819304d4
[Core][nightly-test] better way of calculating num features (#22158)
* better filter of column length

* address comments

* more
2022-02-07 02:13:40 -08:00
Kai Fricke
dd935874ee
[ci/release] Fix job submission command (#22093)
Ray job submission does not accept quoted commands anymore (#22011). This PR updates the command to fix job submission within e2e tests.
2022-02-04 00:05:52 +01:00
mwtian
b528bf9202
Revert "[e2e] Remove unnecessary logic around copying results (#22034)" (#22088)
This reverts commit 92d7e9bf98.
2022-02-03 13:42:40 -08:00
mwtian
92d7e9bf98
[e2e] Remove unnecessary logic around copying results (#22034)
After #21905, some of the logic around handling result artifacts became unnecessary or incorrect (in generating error logs). It is removed here.
2022-02-03 12:15:06 -08:00
SangBin Cho
3c056a6b92
Revert "[Nightly Test] Add more metadata to test result (#21990)" (#22052)
This reverts commit fd20cf3239.
2022-02-02 12:56:42 -08:00
SangBin Cho
fd20cf3239
[Nightly Test] Add more metadata to test result (#21990)
Add columns: error code, commit URL, stable, session URL, and runtime.
2022-01-31 22:33:30 -08:00
Yi Cheng
0659d4a472
[nightly] Limit many drivers iteration to 4000 iterations (#21958)
Because the many drivers test now runs faster, we limit it to 4k iterations.
2022-01-31 13:26:02 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Yi Cheng
570f67798a
[nightly] Move scheduling tests into one suite (#21959)
For future convenience, we are moving scheduling-related tests into one suite for easier monitoring and benchmarking.
2022-01-28 13:32:34 -08:00
Chen Shen
bfe3e5f4a8
add check on shape (#21947) 2022-01-28 12:27:43 -08:00