533 commits

Author SHA1 Message Date
Yi Cheng
8b1bbfe8e4
[e2e] Fix an error when "env_vars" is not set. (#22234)
To fix error in session https://buildkite.com/ray-project/periodic-ci/builds/2699#c532ed2b-ee89-48ad-a7db-fd4211ef8bd9
2022-02-08 22:05:53 -08:00
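A minimal sketch of the kind of guard this fix implies, assuming the test configuration is a plain dictionary in which the "env_vars" key may be missing or set to None. The function name `get_env_vars` and the `app_config` layout are hypothetical and are not taken from e2e.py.

```python
from typing import Any, Dict


def get_env_vars(app_config: Dict[str, Any]) -> Dict[str, str]:
    """Return a copy of the env_vars mapping, or {} when it is absent or None.

    Hypothetical helper: the real e2e.py config layout may differ.
    """
    # `or {}` covers both a missing key and an explicit `env_vars: null`.
    return dict(app_config.get("env_vars") or {})


if __name__ == "__main__":
    print(get_env_vars({"env_vars": {"EXAMPLE_FLAG": "1"}}))
    print(get_env_vars({}))
```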
Yi Cheng
d8ac01bd5c
[e2e] Update e2e test to use redisless ray by default. (#22189)
As the title says: now that the infra has been updated, we need to merge this PR so that the tests can run Ray without Redis.
2022-02-08 19:46:48 -08:00
Sven Mika
ac3e6ab411
[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test. (#22126)
2022-02-08 19:04:13 +01:00
SangBin Cho
ac00389cbe
[Nightly test] Bring back the old way of running commands. (#22209)
Bring back the old way of running commands for non-k8s tests.

This also fixes the regression from many_drivers.py
2022-02-08 01:44:07 -08:00
Jiajun Yao
56c7b74072
Delete nightly shuffle_data_loader (#22185)
2022-02-07 15:23:34 -08:00
Eric Liang
00b5801d71
Fix datasets leaking worker processes due to closure capture of stats actor handle (#22156)
2022-02-07 14:05:44 -08:00
Jiajun Yao
355ee4a02c
Fix nightly shuffle_data_loader by pinning down dependencies versions (#22183)
2022-02-07 11:25:30 -08:00
Chen Shen
13819304d4
[Core][nightly-test] better way of calculating num features (#22158)
* better filter of column length

* address comments

* more
2022-02-07 02:13:40 -08:00
Kai Fricke
dd935874ee
[ci/release] Fix job submission command (#22093)
Ray job submission does not accept quoted commands anymore (#22011). This PR updates the command to fix job submission within e2e tests.
2022-02-04 00:05:52 +01:00
mwtian
b528bf9202
Revert "[e2e] Remove unnecessary logic around copying results (#22034)" (#22088)
This reverts commit 92d7e9bf98.
2022-02-03 13:42:40 -08:00
mwtian
92d7e9bf98
[e2e] Remove unnecessary logic around copying results (#22034)
After #21905, some of the logic around handling result artifacts became unnecessary or incorrect (in generating error logs). This logic is removed.
2022-02-03 12:15:06 -08:00
SangBin Cho
3c056a6b92
Revert "[Nightly Test] Add more metadata to test result (#21990)" (#22052)
This reverts commit fd20cf3239.
2022-02-02 12:56:42 -08:00
SangBin Cho
fd20cf3239
[Nightly Test] Add more metadata to test result (#21990)
Add columns: error code, commit URL, stable, session URL, and runtime.
2022-01-31 22:33:30 -08:00
Yi Cheng
0659d4a472
[nightly] Limit many drivers iteration to 4000 iterations (#21958)
Because many drivers now runs faster, we limit the test to 4k iterations.
2022-01-31 13:26:02 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Yi Cheng
570f67798a
[nightly] Move scheduling tests into one suite (#21959)
For future convenience, we are moving scheduling-related tests into one suite for easier monitoring and benchmarking.
2022-01-28 13:32:34 -08:00
Chen Shen
bfe3e5f4a8
add check on shape (#21947)
2022-01-28 12:27:43 -08:00
Archit Kulkarni
1f58ee3731
[1.10.0 Release] Add release logs for 1.10.0 (#21908)
* Copy logs from 1.9.0

* Replace 1.9.0 data with 1.10.0 data

* update with non-smoke-test results
2022-01-28 11:59:03 -08:00
Amog Kamsetty
bd726aab02
[Release] Disable caching for ray_lightning (#21886)
Passing tests: https://buildkite.com/ray-project/periodic-ci/builds/2560#_

Add an echo timestamp to the post build commands of the ray lightning release tests to trigger a cluster env rebuild and get the latest versions of ray lightning. Without this, the cluster env gets cached so an outdated version is installed on the cluster that is different than the one on the driver, resulting in the below failures.

Closes #21871
Closes #21863

Also reinstalls the dependencies in the post build commands so old versions are not cached in the Docker images
2022-01-27 17:56:32 -08:00
mwtian
97f7e3d0e6
[e2e] do not terminate in serve_failure smoke test (#21925)
When the script terminates, it will also terminate its cluster including dashboard, which will prevent subsequent job submissions. Other long running e2e tests do not terminate in smoke test mode, so make `serve_failure` behave the same.
2022-01-27 15:36:46 -08:00
Jiajun Yao
cea80b1a5b
Don't advertise cpus on gpu nodes for pipelined ingestion tests (#21899)
* Don't advertise cpus on gpu nodes for pipelined ingestion tests
2022-01-27 09:17:01 -08:00
mwtian
634f897cb6
[e2e] improve output dir handling (#21906)
Try to clear the result dir before running the e2e.py script, to avoid failures where the directory already exists, or a file cannot be overwritten due to permission issue.
2022-01-26 23:56:08 -08:00
Yi Cheng
3560211ab5
[nightly] Temporarily stops the two pipelines for scheduling until with good setup. (#21922)
Right now these two tests always time out. We are disabling them for now and will re-enable them with good parameters after solid testing.
2022-01-26 20:15:59 -08:00
Kai Fricke
3b73a62dad
[ci/release] Increase long running timeout, fix artifacts copy (#21905)
With the new job-based file copy, fetching results takes longer. We thus have to increase the long running update test check times in order not to run into bogus release test failures.
Also fixes artifact uploading issues.
2022-01-26 21:25:03 +00:00
Archit Kulkarni
11e2a07752
[release] Fix broken pip_download_test.sh script for non-M1 Macs (#21542)
Fixes a typo that caused the script to exit early without running any sanity checks when not using an M1 Mac.
2022-01-26 10:38:52 -08:00
mwtian
1674a17e6f
[e2e] use alternative copy tree function to tolerate output directory that already exists (#21869)
Many release tests have error messages when copying results with `shutil.copytree()`. e.g.
https://buildkite.com/ray-project/periodic-ci/builds/2511#131c0d22-61a3-4dcf-b80a-de37b68ec591/139-450

This PR tries to make the copying process tolerate existing destination directory. There is logic to remove the destination directory, but I'm not sure why it failed.

This error should not be failing the tests though.
2022-01-26 05:10:22 -08:00
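One way to copy a result directory into a destination that may already exist, shown as a sketch only. It assumes Python 3.8 or newer, where `shutil.copytree()` accepts `dirs_exist_ok=True`; this is not necessarily the alternative function the PR above adopted, and the helper name `copy_tree_tolerant` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_tree_tolerant(src: str, dst: str) -> None:
    """Copy src into dst, merging into dst if it already exists.

    Files with the same relative path are overwritten; other files
    already present in dst are left untouched.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "results")
        dst = Path(tmp, "artifacts")
        src.mkdir()
        dst.mkdir()  # destination already exists, as in the failing runs
        (src / "result.json").write_text("{}")
        copy_tree_tolerant(str(src), str(dst))
        print(sorted(p.name for p in dst.iterdir()))
```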
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
2022-01-25 13:42:12 -08:00
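For reference, a short sketch of how `tempfile.mkstemp` is typically used in place of `tempfile.mktemp`. The helper name `write_temp_file` is hypothetical and does not come from the PR. Unlike `mktemp`, `mkstemp` creates the file and returns an open file descriptor together with the path, so no other process can claim the name in between.

```python
import os
import tempfile


def write_temp_file(data: str, suffix: str = ".txt") -> str:
    """Write data to a newly created temporary file and return its path.

    The caller is responsible for deleting the file afterwards.
    """
    # mkstemp returns (file descriptor, absolute path); the file already exists.
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w") as f:  # closes the descriptor on exit
        f.write(data)
    return path


if __name__ == "__main__":
    path = write_temp_file("hello")
    print(path)
    os.remove(path)
```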
shrekris-anyscale
e4370720cc
[Serve] Add "Serve" team tag to untagged release tests (#21861)
2022-01-25 11:46:03 -08:00
SangBin Cho
7d4287a6ab
[Test] Move long running tests to run everyday (#21813)
Long running tests are cheap and low overhead (they use a small number of nodes). We should just promote them to run every day so we can catch regressions quickly.
2022-01-24 15:10:27 -08:00
SangBin Cho
ac5f38d7fd
[Test] Fix dask on ray test on K8s (#21816)
Fix the dask on ray large scale test on K8s. Basically, chmod requires root access, which we don't have by default in the k8s cluster. I don't think we need chmod (I verified the test passes without it).
2022-01-24 15:09:22 -08:00
SangBin Cho
6b4aac7a08
Promote unstable tests to stable (#21811)
Promote tests that have passed 100% of the time over the last week to stable.
2022-01-24 02:10:37 -08:00
SangBin Cho
babc03edf2
Add a threaded actor k8s test (#21739)
Add threaded actor flaky test to k8s.
2022-01-23 20:12:57 -08:00
Max Pumperla
f9b71a8bf6
[docs] new structure (#21776)
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way:

- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
2022-01-21 15:42:05 -08:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208)
2022-01-21 12:08:25 -08:00
Clark Zinzow
2cd3045b16
[Test Infra] Fix e2e.py help info for --report (#21757)
This momentarily confused me as to whether --report would enable or disable reporting.
2022-01-21 03:29:50 -08:00
Yi Cheng
90093769df
[nightly] Add more many tasks tests (#21727)
This PR adds four tests for many tasks:

* many short tasks sent from a single node
* many short tasks sent from multiple nodes
* many long tasks sent from multiple nodes
* many long tasks sent from a single node

TODO: migrate many nodes actor tests to this one.

The scheduling envelope should contain:

* (tasks): scheduling_test_many_xx_tasks_yy_nodes
* (actors): many_nodes_actor_test (to be combined with this one)
* (shuffle): pipelined_ingestion_1500_gb_15_windows
* (shuffle): dask_on_ray_1tb_sort
2022-01-20 14:52:26 -08:00
SangBin Cho
02af73a571
[Test] First core nightly test migration to k8s (#21698)
The first migration of a test to k8s. We are adopting a conservative approach (migrating slowly while we keep the existing test suites). Once things are confirmed to be stable, we will migrate faster.
2022-01-19 13:31:49 -08:00
SangBin Cho
b1308b1c8c
[Test Infra] Unrevert team col (#21700)
This fixes the previous problems from team column revert.

This has 2 additional changes:

* The alert handler receives the team argument; this was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289

* Previously, tests without a team column raised an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for a smoother transition we will log a warning instead for a short time.
2022-01-19 13:29:53 -08:00
Kai Fricke
e233f8172d
[ci/release] Terminate session on session startup timeout (#21703)
When a session startup times out due to resources not being available, the session may still come up after that timeout. At that time the control script (e2e.py) is already terminated, so the session runs until the autosuspend limit is hit, incurring unnecessary costs. Instead, we should always trigger session termination on session timeout.
2022-01-19 10:01:03 -08:00
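The behavior described in #21703 can be sketched roughly as follows. This is an illustration under stated assumptions, not the e2e.py implementation: `is_ready` and `terminate` are hypothetical callbacks standing in for whatever session API the script uses, and `SessionStartupTimeout` is a made-up exception name.

```python
import time
from typing import Callable


class SessionStartupTimeout(Exception):
    """Raised when the session does not come up in time (hypothetical name)."""


def wait_for_session(
    is_ready: Callable[[], bool],
    terminate: Callable[[], None],
    timeout_s: float,
    poll_s: float = 0.01,
) -> None:
    """Poll until the session is ready; on timeout, terminate it, then raise."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return
        time.sleep(poll_s)
    # The session may still come up after we stop waiting, so always request
    # termination before giving up; otherwise it runs until autosuspend.
    terminate()
    raise SessionStartupTimeout(f"session not ready after {timeout_s}s")
```

Calling `terminate()` before raising keeps the cleanup tied to the timeout path itself, so it still happens if the control script exits right after the exception.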
Kai Fricke
4ef0c6c434
[tune/release] Demote xgboost_sweep to weekly testing (#21704)
XGBoost functionality is tested daily in the xgboost release test suite. The expensive XGBoost sweep test can thus be run weekly.
2022-01-19 09:15:04 -08:00
Chen Shen
74d4e7c20c
install botocore with s3fs to ensure no confliction (#21680)
2022-01-18 23:09:16 -08:00
Jiajun Yao
bb04cc9d80
Use latest cmake for pipelined_ingestion and pipelined_training tests (#21674)
2022-01-18 12:03:43 -08:00
Jun Gong
1315293dd8
[RLlib] Fix offline RL(BC & MARWIL) weekly learning tests. (#21643)
2022-01-18 09:29:01 +01:00
Kai Fricke
0e9e8824e4
[ci/release] use s3 sync (#21626)
Previous changes failed because of a) permission errors and b) unzip being unavailable on remote nodes. We are now using tar gzip archives instead.

This reverts commit 42bcab27e8.
2022-01-15 17:53:19 -08:00
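A minimal sketch of packing and unpacking a directory as a tar gzip archive with Python's standard `tarfile` module, which avoids depending on an `unzip` binary on remote nodes. The helper names `pack_dir` and `unpack_archive` are hypothetical, and the S3 transfer step itself is left out; this is not the code from #21626.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_dir(src_dir: str, archive_path: str) -> None:
    """Create a gzip-compressed tar archive of the contents of src_dir."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname="." stores paths relative to src_dir instead of absolute ones.
        tar.add(src_dir, arcname=".")


def unpack_archive(archive_path: str, dst_dir: str) -> None:
    """Extract a gzip-compressed tar archive into dst_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dst_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "src")
        src.mkdir()
        (src / "result.json").write_text("{}")
        archive = str(Path(tmp, "payload.tar.gz"))
        pack_dir(str(src), archive)
        unpack_archive(archive, str(Path(tmp, "out")))
        print(Path(tmp, "out", "result.json").read_text())
```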
Kai Fricke
42bcab27e8
Revert "[Release Test] Opt-in tests to use K8s based cloud. (#21583)" (#21605)
This reverts commit 0d5fbcc7bb.
2022-01-14 11:46:52 -08:00
Jun Gong
7517aefe05
[RLlib] Bring back BC and Marwil learning tests. (#21574)
2022-01-14 14:35:32 +01:00
Simon Mo
0d5fbcc7bb
[Release Test] Opt-in tests to use K8s based cloud. (#21583)
2022-01-13 17:20:36 -08:00
Jun Gong
83955a9407
[RLlib] Extend CQL perf test to 1hr. (#21449)
2022-01-07 11:35:16 +01:00
Jiajun Yao
76b91efd9b
Fix wrong many_nodes_actor_test app config (#21404)
RAY_GCS_ACTOR_SCHEDULING_ENABLED is wrong; it should be RAY_gcs_actor_scheduling_enabled. Since gcs based actor scheduling is not enabled yet, I just removed this flag.
2022-01-05 11:52:13 -08:00
Kai Fricke
aa35045b6f
[ci/release] Update to recent anyscale API changes (#21149)
Recent changes in the anyscale API rendered the current e2e script incompatible. This PR resolves these subtle API changes.
2022-01-04 11:21:47 +00:00