Commit graph

26 commits

Author SHA1 Message Date
Sven Mika
09886d7ab8
[RLlib] Upgrade gym 0.23 (#24171) 2022-05-23 08:18:44 +02:00
SangBin Cho
ec653e3196
[Nightly test] Move two line downloads to one line. (#25061)
It fixes the mysterious error when all cluster env build is failing when pip uninstall / pip install is written in 2 lines. The root cause will be fixed later
2022-05-22 00:07:03 -07:00
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions (#24843)
OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. 
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. 

Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.
2022-05-17 17:03:12 +01:00
Kai Fricke
bb341eb1e4
Revert "Revert "[tune] Also interrupt training when SIGUSR1 received"" (#24101)
* Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)"

This reverts commit 00595653ed.

Failure in windows has been addressed by conditionally registering the signal handler if available.
2022-04-22 11:27:38 +01:00
xwjiang2010
00595653ed
Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085) 2022-04-21 13:27:34 -07:00
Kai Fricke
f376dd8902
[tune] Also interrupt training when SIGUSR1 received (#24015)
Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) to be processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted.

In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this lead to test flakiness, as final experiment state could not be gracefully persisted to cloud storage.

This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.
2022-04-21 13:07:29 +01:00
Kai Fricke
e3bd59882d
[air] Move storage handling to pyarrow.fs.FileSystem (#23370) 2022-04-13 14:31:30 -07:00
Eric Liang
1ff874e8e8
[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817) 2022-04-10 16:12:53 -07:00
Kai Fricke
fe27dbcd9a
[air/release] Improve file packing/unpacking (#23621)
We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes.

Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).
2022-04-01 07:38:14 -07:00
Kai Fricke
e8abffb017
[tune/release] Improve Tune cloud release tests for durable storage (#23277)
This PR addresses recent failures in the tune cloud tests.

In particular, this PR changes the following:

    The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
    We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected)
    We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
    Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes).

Local release test runs succeeded.

https://buildkite.com/ray-project/release-tests-branch/builds/189
https://buildkite.com/ray-project/release-tests-branch/builds/191
2022-03-30 09:28:33 -07:00
Avnish Narayan
754bcd16f8
[rllib] Pin gym everywhere (#23384)
This PR Pins gym in the app config.yaml's for rllib and tune so that release tests are no longer broken by the new gym version.
2022-03-22 09:44:22 +00:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Kai Fricke
430ea3e636
[ci/release] Migrate golden notebook tests (#22949)
Migrating golden notebook tests to new release test package.
Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155
2022-03-13 21:39:41 +00:00
Kai Fricke
c866131cc0
[tune] Retry cloud sync up/down/delete on fail (#22029) 2022-02-15 12:27:29 +00:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
2022-01-25 13:42:12 -08:00
SangBin Cho
b1308b1c8c
[Test Infra] Unrevert team col (#21700)
This fixes the previous problems from team column revert.

This has 2 additional changes;

alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289

Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time
2022-01-19 13:29:53 -08:00
mwtian
0b3fed5ef3
Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289)
This reverts commit b5b11b2d06.
2021-12-30 06:44:51 +09:00
SangBin Cho
b5b11b2d06
[Nightly Test] Add a team column to each test config. (#21198)
Please review **e2e.py and test_suite belonging to your team**! 

This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#

This PR adds a team name to each test suite.

If the name is not specified, it will be reported as unspecified. 

If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).

Note that we will aggregate all of test config into a single file, nightly_test.yaml.
2021-12-27 14:42:41 -08:00
Kai Fricke
7446269ac9
[tune/rllib] Fix tune cloud tests for function and rllib trainables (#20536)
Fixes some race conditions and softens some constraints around checkpoint numbers.
2021-11-24 09:29:12 +00:00
Kai Fricke
05d21497db
[rllib/tune] Fix durable trainable in trainer template, add release test (#20422) 2021-11-16 20:52:42 +00:00
Yiran Wang
f4e8319eaa
Remove .boto files that are no longer needed during docker build (#20407)
## Why are these changes needed?

The .boto files are already added to the base image and ACL'ed to root, adding them again during app config build causes permission issues.

## Related issue number
2021-11-15 20:49:33 -08:00
Kai Fricke
d88fdd6e38
[tune] refactor SyncConfig (#20155) 2021-11-12 09:36:15 +00:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
Kai Fricke
4e3e213549
[tune] Allow more versatile experiment analysis loading (#20181) 2021-11-10 11:46:27 +00:00
Kai Fricke
fa0158abe5
[tune] Cloud checkpointing release tests (#19638) 2021-10-29 12:12:01 +02:00