hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	149c031c4b	[tune/release] Do not use spot instances in k8s tests (#27250) Spot instances are not being booted up, so let's go without them. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-02 11:30:41 +01:00
Kai Fricke	3cd9a0446b	[tune/rllib/release] Load correct metadata file in rllib cloud tests (#27164) Currently this tries to load a stale metadata file that doesn't exist anymore after internal refactoring. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-28 15:51:09 +01:00
xwjiang2010	eb69c1ca28	[air] Add annotation for Tune module. (#27060) Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-27 13:53:46 -07:00
Steven Morad	259429bdc3	Bump gym dep to 0.24 (#26190) Co-authored-by: Steven Morad <smorad@anyscale.com> Co-authored-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>	2022-07-22 12:37:16 -07:00
Avnish Narayan	5433c11650	[RLlib] Pin gym to 0.23.1 (#26752)	2022-07-20 11:49:01 -07:00
Kai Fricke	0959f44b6f	[tune/structure] Introduce execution package (#26015) Execution-specific packages are moved to tune.execution. Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2022-06-23 11:13:19 +01:00
Sven Mika	130b7eeaba	[RLlib] `Trainer` to `Algorithm` renaming. (#25539)	2022-06-11 15:10:39 +02:00
Kai Fricke	c3b608f757	[tune] Fix cloud tests, mark as stable (#25583) #25063 broke release tests, but they've been consistently stable before. This PR fixes the tests and marks tune cloud tests as stable.	2022-06-08 17:47:54 +01:00
Sven Mika	b5bc2b93c3	[RLlib] Move all remaining algos into `algorithms` directory. (#25366)	2022-06-04 07:35:24 +02:00
Yi Cheng	fd0f967d2e	Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to `algorithms` dir and rename policy and trainer classes. (#25346)" (#25420) This reverts commit `e4ceae19ef`. Reverts #25346. linux://python/ray/tests:test_client_library_integration never failed before this PR. In the CI of the reverted PR, it also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128), so it is highly likely that the failure is caused by this PR. The test failure output seems related as well (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b).	2022-06-02 20:38:44 -07:00
Sven Mika	e4ceae19ef	[RLlib] Move (A/DD)?PPO and IMPALA algos to `algorithms` dir and rename policy and trainer classes. (#25346)	2022-06-02 16:47:05 +02:00
Sven Mika	09886d7ab8	[RLlib] Upgrade gym 0.23 (#24171)	2022-05-23 08:18:44 +02:00
SangBin Cho	ec653e3196	[Nightly test] Move two line downloads to one line. (#25061) This fixes the mysterious error where all cluster env builds fail when pip uninstall / pip install is written on 2 lines. The root cause will be fixed later.	2022-05-22 00:07:03 -07:00
Kai Fricke	6c5229295e	[ci/release] Support running tests with different python versions (#24843) OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.	2022-05-17 17:03:12 +01:00
Kai Fricke	bb341eb1e4	Revert "Revert "[tune] Also interrupt training when SIGUSR1 received"" (#24101) * Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)" This reverts commit `00595653ed`. Failure in windows has been addressed by conditionally registering the signal handler if available.	2022-04-22 11:27:38 +01:00
xwjiang2010	00595653ed	Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)	2022-04-21 13:27:34 -07:00
Kai Fricke	f376dd8902	[tune] Also interrupt training when SIGUSR1 received (#24015) Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) from being processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted. In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this led to test flakiness, as the final experiment state could not be gracefully persisted to cloud storage. This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.	2022-04-21 13:07:29 +01:00
Kai Fricke	e3bd59882d	[air] Move storage handling to pyarrow.fs.FileSystem (#23370)	2022-04-13 14:31:30 -07:00
Eric Liang	1ff874e8e8	[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817)	2022-04-10 16:12:53 -07:00
Kai Fricke	fe27dbcd9a	[air/release] Improve file packing/unpacking (#23621) We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes. Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).	2022-04-01 07:38:14 -07:00
Kai Fricke	e8abffb017	[tune/release] Improve Tune cloud release tests for durable storage (#23277) This PR addresses recent failures in the tune cloud tests. In particular, this PR changes the following: The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests. We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected). We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days. Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes). Local release test runs succeeded. https://buildkite.com/ray-project/release-tests-branch/builds/189 https://buildkite.com/ray-project/release-tests-branch/builds/191	2022-03-30 09:28:33 -07:00
Avnish Narayan	754bcd16f8	[rllib] Pin gym everywhere (#23384) This PR Pins gym in the app config.yaml's for rllib and tune so that release tests are no longer broken by the new gym version.	2022-03-22 09:44:22 +00:00
Kai Fricke	8608b64885	[ci/release] Remove old OSS release test infrastructure (#23134) Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.	2022-03-14 15:10:52 +00:00
Kai Fricke	430ea3e636	[ci/release] Migrate golden notebook tests (#22949) Migrating golden notebook tests to new release test package. Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155	2022-03-13 21:39:41 +00:00
Kai Fricke	c866131cc0	[tune] Retry cloud sync up/down/delete on fail (#22029)	2022-02-15 12:27:29 +00:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
Ian Rodney	257bd2d1e7	[Cleanup] Use `mkstemp` (#21676) `tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.	2022-01-25 13:42:12 -08:00
SangBin Cho	b1308b1c8c	[Test Infra] Unrevert team col (#21700) This fixes the previous problems from the team column revert. It has 2 additional changes. First, the alert handler receives the team argument, which was the root cause of the breakage; see https://github.com/ray-project/ray/pull/21289. Second, tests without a team column previously raised an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for a smoother transition, we will log a warning instead for a short time.	2022-01-19 13:29:53 -08:00
mwtian	0b3fed5ef3	Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289) This reverts commit `b5b11b2d06`.	2021-12-30 06:44:51 +09:00
SangBin Cho	b5b11b2d06	[Nightly Test] Add a team column to each test config. (#21198) Please review e2e.py and test_suite belonging to your team! This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit# This PR adds a team name to each test suite. If the name is not specified, it will be reported as unspecified. If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future). Note that we will aggregate all of test config into a single file, nightly_test.yaml.	2021-12-27 14:42:41 -08:00
Kai Fricke	7446269ac9	[tune/rllib] Fix tune cloud tests for function and rllib trainables (#20536) Fixes some race conditions and softens some constraints around checkpoint numbers.	2021-11-24 09:29:12 +00:00
Kai Fricke	05d21497db	[rllib/tune] Fix durable trainable in trainer template, add release test (#20422)	2021-11-16 20:52:42 +00:00
Yiran Wang	f4e8319eaa	Remove .boto files that are no longer needed during docker build (#20407) The .boto files are already added to the base image and ACL'ed to root; adding them again during app config build causes permission issues.	2021-11-15 20:49:33 -08:00
Kai Fricke	d88fdd6e38	[tune] refactor SyncConfig (#20155)	2021-11-12 09:36:15 +00:00
Amog Kamsetty	18dcf1ac25	[Release] Use nightly Docker images (#20001) * use nightly * switch ml cpu to ray cpu * fix * add pytest * add more pytest * add constraint * add tensorflow * fix merge conflict * add tblib * fix * add back uninstall	2021-11-10 18:00:16 -08:00
Kai Fricke	4e3e213549	[tune] Allow more versatile experiment analysis loading (#20181)	2021-11-10 11:46:27 +00:00
Kai Fricke	fa0158abe5	[tune] Cloud checkpointing release tests (#19638)	2021-10-29 12:12:01 +02:00

37 commits
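
Commit `fe27dbcd9a` (#23621) describes packing and unpacking directories with `tarfile` through `io.BytesIO` instead of temporary files. A minimal sketch of that idea follows; the function names `pack_dir` and `unpack_dir` are hypothetical and are not taken from the Ray source, so this should be read as an illustration of the technique rather than the code in that commit.

```python
import io
import os
import tarfile


def pack_dir(source_dir: str) -> bytes:
    """Pack a directory into a gzipped tarball held entirely in memory.

    Hypothetical helper: illustrates the BytesIO approach, not Ray's code.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        # arcname="." stores member paths relative to source_dir.
        tar.add(source_dir, arcname=".")
    return buffer.getvalue()


def unpack_dir(data: bytes, target_dir: str) -> None:
    """Unpack tarball bytes produced by pack_dir into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        # Only unpack data from a trusted source; tar archives can
        # contain paths that escape the target directory.
        tar.extractall(target_dir)
```

The bytes returned by `pack_dir` can be sent over the network or uploaded to storage directly, so no intermediate archive file is written to disk on either side.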