Commit graph

556 commits

Author SHA1 Message Date
Jun Gong
99b7be5e22
[rllib] Fix impala long running test (#22619)
Fix the impala long running test.
Bandits is the first agent that requires torch import at registration time.
2022-02-24 09:03:55 -08:00
SangBin Cho
5e847f7e09
[Usage Stats] Usage stats only enabled on nightly test infra (#22591)
This PR **enables the usage stats only on the release test infrastructure** (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.
2022-02-23 22:11:48 -08:00
Eric Liang
e15a419028
Enable stage fusion by default for dataset pipelines (#22476)
This PR enables stage fusion for dataset pipelines. This also requires:
1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage.
2. Removing spread_resource_prefix (not supported for now).
2022-02-23 17:34:05 -08:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Jiajun Yao
baa14d695a
Round robin during spread scheduling (#21303)
- Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling: this is just a starting implementation, can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
2022-02-18 15:05:35 -08:00
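The "simple round robin" idea from #21303 above can be sketched roughly as follows. This is a minimal illustration only, not the scheduler's actual implementation: the class `RoundRobinSpreadPicker`, the `available` capacity map, and the node ids are all hypothetical names introduced here.

```python
from typing import Dict, List, Optional


class RoundRobinSpreadPicker:
    """Hypothetical helper: cycle through nodes, skipping those without capacity."""

    def __init__(self, node_ids: List[str]) -> None:
        self._node_ids = list(node_ids)
        self._next_index = 0  # where the next search starts

    def pick(self, available: Dict[str, int]) -> Optional[str]:
        """Return the next node with free capacity, or None if no node has any."""
        n = len(self._node_ids)
        for offset in range(n):
            idx = (self._next_index + offset) % n
            node = self._node_ids[idx]
            if available.get(node, 0) > 0:
                # Resume after the chosen node so consecutive picks spread out.
                self._next_index = (idx + 1) % n
                return node
        return None
```

Starting each search right after the previously chosen node is what spreads consecutive tasks across nodes instead of packing them onto the first feasible one.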
Stephanie Wang
03a5589591
[core] Enable lineage reconstruction in CI (#21519)
Enables lineage reconstruction in all CI and release tests.
2022-02-18 11:04:20 -08:00
Chen Shen
17f589a05d
[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479 2022-02-17 15:20:39 -08:00
mwtian
05dd72101b
[Release 1.11.0] Release logs for 1.11.0rc1 (#22443)
This is the release log for 1.11.0rc1, with GCS-Ray enabled. The diff is against 1.11.0rc0, without GCS-Ray.
2022-02-16 17:03:49 -08:00
Chen Shen
30ec0df9cc
[placement group] fix pg benchmark regression #22441
We added a warmup time in timeit, which affects the pg benchmark time accounting. Add an option to cancel the warmup.
2022-02-16 16:24:51 -08:00
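To illustrate the warmup option described in #22441 above, here is a minimal sketch of a timing helper with an optional warmup phase. The function name `timeit` matches the commit message, but the signature and the parameter `warmup_seconds` are assumptions made for this example, not the real helper's API.

```python
import time
from typing import Callable


def timeit(fn: Callable[[], None], iterations: int = 10,
           warmup_seconds: float = 1.0) -> float:
    """Return the mean seconds per call, excluding an optional warmup phase."""
    if warmup_seconds > 0:
        # Warmup calls are executed but never counted in the measurement.
        deadline = time.perf_counter() + warmup_seconds
        while time.perf_counter() < deadline:
            fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations
```

Passing `warmup_seconds=0` skips the warmup entirely, which matters when `fn` has side effects (such as creating placement groups) that the benchmark accounts for.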
Jun Gong
a9147bb62c
[Release Test] Fix AnyscaleSDK construction so we can run CI on staging instance. (#22325) 2022-02-16 09:56:02 -08:00
SangBin Cho
42361a1801
[Test] Fix Dask on Ray 1 TB bug #22431 Open
Fixes a bug. It seems that `not df` does not work with a dataframe.
2022-02-17 02:44:36 +09:00
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

By default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) specify a wheels URL (which we will also wait for to be available), or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches the old default behavior).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00
SangBin Cho
2ed5bb7a5f
[Nightly Test] Addressed client failure properly (#22438)
When the client returns the code that's not 0, we should raise RuntimeError to properly propagate errors
2022-02-16 09:03:17 -08:00
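The error propagation described in #22438 above might look roughly like this. It is a sketch under assumptions: `run_client_command` is a hypothetical helper name, and the real test runner may invoke the client differently.

```python
import subprocess
import sys
from typing import List


def run_client_command(cmd: List[str]) -> str:
    """Run a command and raise RuntimeError if it exits with a non-zero code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command {cmd} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # Succeeds and prints the child's output.
    print(run_client_command([sys.executable, "-c", "print('ok')"]).strip())
```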
Jun Gong
04dd536987
[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426) 2022-02-16 13:06:09 +01:00
Kai Fricke
c866131cc0
[tune] Retry cloud sync up/down/delete on fail (#22029) 2022-02-15 12:27:29 +00:00
SangBin Cho
640d92c385
It seems like the S3 read sometimes fails; #22214. I found out the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.
2022-02-12 11:58:58 +09:00
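A retry mechanism for transient read failures, as mentioned in the commit above, could be sketched like this. The helper `retry`, its parameters, and the choice of `OSError` as the retryable error are assumptions for illustration; the commit does not say how its retry is implemented.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], max_attempts: int = 3, base_delay_s: float = 1.0,
          retry_on: Tuple[Type[BaseException], ...] = (OSError,)) -> T:
    """Call fn until it succeeds, sleeping with exponential backoff between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only the listed exception types are retried, so a genuine bug (for example a `KeyError`) still fails immediately rather than being masked.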
Jun Gong
cbd24503b6
[RLlib] Add A3C to RLlib performance regression tests. (#22316) 2022-02-11 21:18:53 +01:00
Archit Kulkarni
da57012cbc
Add comment to periodic CI pipeline to update release process doc when updating test suites (#22037)
This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary.

This is an action item from the Ray 1.10.0 release retrospective.
2022-02-11 11:14:24 -06:00
Chen Shen
0866a5558f
[Dataset][nightly-test] pin pyarrow==4.0.1 for dataset related tests (#22277)
* pin pyarrow==4.0.1

* address comments
2022-02-10 14:22:41 -08:00
Sven Mika
04a5c72ea3
Revert "Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test."" (#18708) 2022-02-10 13:44:22 +01:00
mwtian
47a56ca062
[Release] Add release logs for 1.11.0rc0 (GCS KV & pubsub not enabled) (#22041) 2022-02-10 00:03:31 -08:00
SangBin Cho
30000ff8ae
Fix a bug from many drivers. (#22248)
After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains a string that cannot be encoded as ASCII. Using UTF-8 seems to solve the problem.
2022-02-09 15:17:15 -08:00
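The encoding problem described in #22248 above can be reproduced in isolation. This sketch only demonstrates the general behaviour of Python string encoding; `encode_script` is a hypothetical helper, not code from the PR.

```python
def encode_script(text: str) -> bytes:
    """Encode a driver script as UTF-8, which can represent any Unicode text."""
    return text.encode("utf-8")


if __name__ == "__main__":
    script = 'print("h\u00e9llo")'  # contains a non-ASCII character
    try:
        script.encode("ascii")
    except UnicodeEncodeError as exc:
        print(f"ascii failed: {exc.reason}")
    print(len(encode_script(script)), "bytes as UTF-8")
```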
Alex Wu
b122f093c1
Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test." (#22250)
Reverts ray-project/ray#22126

Breaks rllib:tests/test_io
2022-02-09 09:26:36 -08:00
Yi Cheng
8b1bbfe8e4
[e2e] Fix an error when "env_vars" is not set. (#22234)
To fix error in session https://buildkite.com/ray-project/periodic-ci/builds/2699#c532ed2b-ee89-48ad-a7db-fd4211ef8bd9
2022-02-08 22:05:53 -08:00
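A common way to avoid this kind of error is to treat a missing or empty key as an empty mapping. The sketch below assumes the configuration is a plain dict; `get_env_vars` and `app_config` are hypothetical names, and the actual fix in #22234 may differ.

```python
from typing import Any, Dict


def get_env_vars(app_config: Dict[str, Any]) -> Dict[str, str]:
    """Return the configured env vars, or an empty dict if none are set."""
    # `or {}` also covers the case where the key exists but holds None.
    return dict(app_config.get("env_vars") or {})
```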
Yi Cheng
d8ac01bd5c
[e2e] Update e2e test to use redisless ray by default. (#22189)
As the title says: now that the infra has been updated, we need to merge this PR so that tests can run Ray without Redis.
2022-02-08 19:46:48 -08:00
Sven Mika
ac3e6ab411
[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test. (#22126) 2022-02-08 19:04:13 +01:00
SangBin Cho
ac00389cbe
[Nightly test] Bring back the old way of running commands. (#22209)
Bring back the old way of running commands for non-k8s tests.

This also fixes the regression from many_drivers.py
2022-02-08 01:44:07 -08:00
Jiajun Yao
56c7b74072
Delete nightly shuffle_data_loader (#22185) 2022-02-07 15:23:34 -08:00
Eric Liang
00b5801d71
Fix datasets leaking worker processes due to closure capture of stats actor handle (#22156) 2022-02-07 14:05:44 -08:00
Jiajun Yao
355ee4a02c
Fix nightly shuffle_data_loader by pinning down dependencies versions (#22183) 2022-02-07 11:25:30 -08:00
Chen Shen
13819304d4
[Core][nightly-test] better way of calculating num features (#22158)
* better filter of column length

* address comments

* more
2022-02-07 02:13:40 -08:00
Kai Fricke
dd935874ee
[ci/release] Fix job submission command (#22093)
Ray job submission does not accept quoted commands anymore (#22011). This PR updates the command to fix job submission within e2e tests.
2022-02-04 00:05:52 +01:00
mwtian
b528bf9202
Revert "[e2e] Remove unnecessary logic around copying results (#22034)" (#22088)
This reverts commit 92d7e9bf98.
2022-02-03 13:42:40 -08:00
mwtian
92d7e9bf98
[e2e] Remove unnecessary logic around copying results (#22034)
After #21905, some of the logic around handling result artifacts became unnecessary or incorrect (in generating error logs). It is removed.
2022-02-03 12:15:06 -08:00
SangBin Cho
3c056a6b92
Revert "[Nightly Test] Add more metadata to test result (#21990)" (#22052)
This reverts commit fd20cf3239.
2022-02-02 12:56:42 -08:00
SangBin Cho
fd20cf3239
[Nightly Test] Add more metadata to test result (#21990)
Add columns for error code, commit URL, stable, session URL, and runtime.
2022-01-31 22:33:30 -08:00
Yi Cheng
0659d4a472
[nightly] Limit many drivers iteration to 4000 iterations (#21958)
Because many drivers now runs faster, we limit the test to 4k iterations.
2022-01-31 13:26:02 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Yi Cheng
570f67798a
[nightly] Move scheduling tests into one suite (#21959)
For future convenience, we are moving scheduling-related tests into one suite for easier monitoring and benchmarking.
2022-01-28 13:32:34 -08:00
Chen Shen
bfe3e5f4a8
add check on shape (#21947) 2022-01-28 12:27:43 -08:00
Archit Kulkarni
1f58ee3731
[1.10.0 Release] Add release logs for 1.10.0 (#21908)
* Copy logs from 1.9.0

* Replace 1.9.0 data with 1.10.0 data

* update with non-smoke-test results
2022-01-28 11:59:03 -08:00
Amog Kamsetty
bd726aab02
[Release] Disable caching for ray_lightning (#21886)
Passing tests: https://buildkite.com/ray-project/periodic-ci/builds/2560#_

Add an echo timestamp to the post build commands of the ray lightning release tests to trigger a cluster env rebuild and get the latest versions of ray lightning. Without this, the cluster env gets cached so an outdated version is installed on the cluster that is different than the one on the driver, resulting in the below failures.

Closes #21871
Closes #21863

Also reinstalls the dependencies in the post build commands so old versions are not cached in the Docker images
2022-01-27 17:56:32 -08:00
mwtian
97f7e3d0e6
[e2e] do not terminate in serve_failure smoke test (#21925)
When the script terminates, it will also terminate its cluster including dashboard, which will prevent subsequent job submissions. Other long running e2e tests do not terminate in smoke test mode, so make `serve_failure` behave the same.
2022-01-27 15:36:46 -08:00
Jiajun Yao
cea80b1a5b
Don't advertise cpus on gpu nodes for pipelined ingestion tests (#21899)
* Don't advertise cpus on gpu nodes for pipelined ingestion tests
2022-01-27 09:17:01 -08:00
mwtian
634f897cb6
[e2e] improve output dir handling (#21906)
Try to clear the result dir before running the e2e.py script, to avoid failures where the directory already exists or a file cannot be overwritten due to a permission issue.
2022-01-26 23:56:08 -08:00
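Clearing an output directory before a run, as described in #21906 above, can be done along these lines. This is a sketch, not the code in e2e.py; `reset_result_dir` is a hypothetical helper name.

```python
import os
import shutil


def reset_result_dir(path: str) -> None:
    """Remove any previous contents of path and recreate it empty."""
    # ignore_errors=True: a missing directory or an undeletable file
    # should not abort the run at this point.
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
```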
Yi Cheng
3560211ab5
[nightly] Temporarily stop the two scheduling pipelines until they have a good setup. (#21922)
Right now these two tests always run out of time. We disable them for now and will re-enable them with good parameters after solid testing.
2022-01-26 20:15:59 -08:00
Kai Fricke
3b73a62dad
[ci/release] Increase long running timeout, fix artifacts copy (#21905)
With the new job-based file copy, fetching results takes longer. We thus have to increase the long running update test check times in order not to run into bogus release test failures.
Also fixes artifact uploading issues.
2022-01-26 21:25:03 +00:00
Archit Kulkarni
11e2a07752
[release] Fix broken pip_download_test.sh script for non-M1 Macs (#21542)
Fixes a typo that caused the script to exit early without running any sanity checks when not using an M1 Mac.
2022-01-26 10:38:52 -08:00
mwtian
1674a17e6f
[e2e] use alternative copy tree function to tolerate output directory that already exists (#21869)
Many release tests have error messages when copying results with `shutil.copytree()`. e.g.
https://buildkite.com/ray-project/periodic-ci/builds/2511#131c0d22-61a3-4dcf-b80a-de37b68ec591/139-450

This PR tries to make the copying process tolerate an existing destination directory. There is logic to remove the destination directory, but I'm not sure why it failed.

This error should not be failing the tests though.
2022-01-26 05:10:22 -08:00
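For reference, on Python 3.8 and later `shutil.copytree()` itself can tolerate an existing destination through its `dirs_exist_ok` argument. The sketch below shows that option; it is not necessarily the alternative function used in #21869, and `copy_results` is a hypothetical name.

```python
import shutil


def copy_results(src: str, dst: str) -> None:
    """Copy src into dst, merging into dst if it already exists (Python 3.8+)."""
    # Without dirs_exist_ok=True, copytree raises FileExistsError
    # when dst is already present.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```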
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
2022-01-25 13:42:12 -08:00
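The difference matters because `tempfile.mktemp` only returns a name, leaving a window in which another process can create the file first, whereas `tempfile.mkstemp` creates the file and returns an open descriptor. A minimal usage sketch follows; `write_temp_file` is a hypothetical helper, not code from the PR.

```python
import os
import tempfile


def write_temp_file(data: bytes, suffix: str = "") -> str:
    """Create a temp file securely, write data to it, and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    # mkstemp returns an OS-level descriptor; wrap it so it gets closed.
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path  # the caller is responsible for deleting the file
```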