hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Stephanie Wang	f7692e4602	[core] Remove more expensive shuffle tests (#25165 ) Now that the "smaller_instances" versions of these tests are stable, we can stop running the version that uses bigger instances.	2022-05-24 18:05:18 -07:00
Jiajun Yao	00cdd8dce5	Add chaos test for dataset shuffle (#25161 ) Add chaos tests for dataset shuffle: both push-based and non-push-based.	2022-05-24 15:12:20 -07:00
Jiajun Yao	b825a839f9	Mark dataset_shuffle_push_based_sort_1tb as stable (#25162 ) dataset_shuffle_push_based_sort_1tb is consistently passing for weeks.	2022-05-24 11:07:27 -07:00
mwtian	7013b32d15	[Release] prefer last cluster env version in release tests (#24950 ) Currently the release test runner prefers the first successfully version of a cluster env, instead of the last version. But sometimes a cluster env may build successfully on Anyscale but cannot launch cluster successfully (e.g. version 2 here) or new dependencies need to be installed, so a new version needs to be built. The existing logic always picks up the 1st successful build and cannot pick up the new cluster env version. Although this is an edge case (tweaking cluster env versions, with the same Ray wheel or cluster env name), I believe it is possible for others to run into it. Also, avoid running most of the CI tests for changes under release/ray_release/.	2022-05-24 13:26:54 +01:00
Sven Mika	09886d7ab8	[RLlib] Upgrade gym 0.23 (#24171 )	2022-05-23 08:18:44 +02:00
Steven Morad	501d932449	[RLlib] SAC, RNNSAC, and CQL TrainerConfig objects (#25059 )	2022-05-22 19:58:47 +02:00
SangBin Cho	ec653e3196	[Nightly test] Move two line downloads to one line. (#25061 ) It fixes the mysterious error when all cluster env build is failing when pip uninstall / pip install is written in 2 lines. The root cause will be fixed later	2022-05-22 00:07:03 -07:00
SangBin Cho	5ac29c0105	Update pip download test.sh #24997 Update pip download test.sh to test python 31.0	2022-05-19 16:46:15 -07:00
Simon Mo	c3ac6fcf3f	Bump Ray Version from 2.0.0.dev0 to 3.0.0.dev0 (#24894 )	2022-05-17 19:31:05 -07:00
Kai Fricke	a0bba30153	[tune/release] Make long running distributed PBT cheaper (#24782 ) The test currently uses 6 GPUs out of 8 available, so we can get rid of one instance. Savings will be 25% for one instance less (3 instead of 4).	2022-05-17 18:23:31 +01:00
mwtian	36d6d59169	[Datasets] mark nightly test dataset_shuffle_random_shuffle_1tb_small_instances stable #24861	2022-05-17 09:56:05 -07:00
Kai Fricke	6c5229295e	[ci/release] Support running tests with different python versions (#24843 ) OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.	2022-05-17 17:03:12 +01:00
Artur Niederfahrenhorst	fb2915d26a	[RLlib] Replay Buffer API and Ape-X. (#24506 )	2022-05-17 13:43:49 +02:00
Clark Zinzow	ef870e936c	[Datasets] Change `range_arrow()` API to `range_table()` (#24704 ) This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.	2022-05-17 01:09:45 -07:00
Sven Mika	0cd7bc4054	[RLlib] Re-establish dashboard performance tests. (#24728 )	2022-05-16 13:13:49 +02:00
Jiajun Yao	6f14b6a9c3	[Release Test] Add smoke_test field to release test report (#24749 ) Distinguish smoke test and normal test.	2022-05-16 10:38:54 +01:00
Jiajun Yao	863943a540	Add perf alert for shuffle tests (#24798 ) Add perf alert for shuffle tests so we can catch #24740 earlier.	2022-05-15 21:50:18 -07:00
Kai Fricke	de69b0d6d6	[train/release] Fix horovod user test master app config (#24734 )	2022-05-14 21:20:45 -07:00
Chen Shen	2be45fed5e	Revert "[dataset] Use polars for sorting (#24523 )" (#24781 ) This reverts commit `c62e00e`. See if reverts this resolve linux://python/ray/tests:test_actor_advanced failure.	2022-05-13 12:09:12 -07:00
Jian Xiao	ba500133af	lower the utilization threshold in many tasks scheduling test by 5% (#24758 ) Fix the failure to unbreak nightly and unblock 1.13 release. The root cause is the upgrade of GRPC to 1.45.2 made it slightly slow; this is an acceptable regression which is needed to make this upgrade. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>	2022-05-13 10:44:58 -07:00
Stephanie Wang	c62e00ed6d	[dataset] Use polars for sorting (#24523 ) Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed. On my laptop, this makes sorting 1GB about 2x faster: without polars $ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100 Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total Finished in 50.23415923118591 ... Stage 2 sort: executed in 38.59s Substage 0 sort_map: 100/100 blocks executed * Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total * Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Substage 1 sort_reduce: 100/100 blocks executed * Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total * Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total * Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total * Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used with polars $ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100 Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total Finished in 24.097432136535645 ... Stage 2 sort: executed in 14.02s Substage 0 sort_map: 100/100 blocks executed * Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total * Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Substage 1 sort_reduce: 100/100 blocks executed * Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total * Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total * Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total * Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Related issue number Closes #23612.	2022-05-12 18:35:50 -07:00
Amog Kamsetty	a36e2a8f51	[Tune] Deprecate DistributedTrainableCreator (#24453 ) Fully deprecate DistributedTrainableCreator for Ray 2.0 Closes #24453	2022-05-10 11:06:43 -07:00
Kai Fricke	67d602e7a6	[ci] Fix automatic buildkite token fetching in fetch_release_logs.py (#24606 ) The script expected a return string, not an environment variable.	2022-05-10 09:24:10 +02:00
mwtian	918d3601c6	[Datasets] mark nightly test dataset_shuffle_sort_1tb_small_instances stable (#24481 )	2022-05-06 15:55:59 -07:00
Kai Fricke	d6096df742	[release] Add utility script to fetch release logs (#24508 ) This PR adds a utility script to automatically fetch release test results from the Buildkite pipeline for a release branch. This was previously a manual process.	2022-05-05 19:32:34 +01:00
Sven Mika	70d3bfcf9c	[RLlib] Provide more time for APPO Pong release and performance tests. (#24503 )	2022-05-05 18:19:38 +02:00
Kai Fricke	e1eec5507a	[ci/release] Fix ray version from init test (#24510 ) This release package unit test fails on release branches. Instead of checking for a hard-coded version number, we should just require the value to be non-empty. See e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/31295#b6c6c952-ce34-4521-9342-429e92560dd3	2022-05-05 16:05:23 +01:00
SangBin Cho	295b4436b3	[Nightly tests] Increase wait for nodes timeout (#24457 ) Although there's enough quota, it is possible the AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate)	2022-05-04 19:42:21 -07:00
SangBin Cho	168790c276	[Test] Add grace period to long running actor test failure (#24469 ) Add 30 seconds grace period before raising an exception from this test failure (https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_1FL4g3cMg1wYifWf52tAaWtJ?command-history-section=command_history). What I'd like to see is some sort of error messages are propagated to the driver if this is due to some unexpected issues. Note that this PR also adds more detailed exit information to all worker failures, but this is still WIP #24468	2022-05-04 16:00:22 -07:00
Sihan Wang	3f5da8af7a	[Serve] Add serve handle graph workload nightly tests (#24435 )	2022-05-04 09:07:50 -07:00
Sven Mika	b48f63113b	[RLlib] SlateQ fixes: Release learning tests wrong yaml structure + TD-error torch issue (#24429 )	2022-05-04 13:37:14 +02:00
Jiao	9d31f5f7b2	[Serve] Change deployment graph long chain test (#24418 )	2022-05-03 10:38:47 -07:00
Stephanie Wang	fbbc9c33d6	Add nightly tests for push-based shuffle (#24352 ) Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.	2022-05-02 11:35:14 -07:00
Sven Mika	f066180ed5	[RLlib] Deprecate `timesteps_per_iteration` config key (in favor of `min_[sample\|train]_timesteps_per_reporting`. (#24372 )	2022-05-02 12:51:14 +02:00
Kai Fricke	8a578c191f	[ci/release] Re-install anyscale package after local env setup (#24373 ) The local environment setup of release tests (in client tests) can sometimes update dependencies of the `anyscale` package to an unsupported version. By re-installing the `anyscale` package after local env setup, we make sure that we can connect to the cluster. Note that this may lead to incompatibilities of the test script, however.	2022-05-01 16:51:55 +01:00
Jiao	ba7cc1803a	[Deployment Graph] Add release test for long chain & wide fanout pattern (#24246 )	2022-04-29 17:03:33 -07:00
mwtian	02fda97c86	[CI] Re-balance concurrency groups to allow more quota for `large` tests (#24344 ) Currently nightly tests are unable to finish in a day because of concurrency group limit on `large` tests. This is an attempt to adjust the limits so buildkite can run / finish more tests. I will observe which tests fall into the `enormous` group and adjust the test resource / concurrency group limits again.	2022-04-29 22:26:16 +01:00
Sven Mika	3052193c9e	[RLlib] Fix CQL getting stuck when deprecated `timesteps_per_iteration` is used (use `min_train_timesteps_per_reporting` instead). (#24345 ) Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). CQL does not perform sampling timesteps and the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting, but should be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting. If timesteps_per_iteration, CQL lever leaves the first iteration as it thinks it's not done yet (sample timesteps always remain at 0).	2022-04-29 21:02:34 +01:00
Kai Fricke	ac036e4fe8	[ci/release] Print local environment information (#24346 ) For debugging client environments, it is helpful to print the installed pip packages. Additionally, a fix for the environment of the ml_user_tune_rllib_connect_test is added. Additionally, anyscale import errors are reported verbosely to help debug missing packages.	2022-04-29 21:01:50 +01:00
Kai Fricke	dd87e61808	[ci/release] Fix module import errors in release tests (#24334 ) After https://github.com/ray-project/ray/pull/24066, some release tests are running into: ``` ModuleNotFoundError: No module named 'ray.train.impl' ``` This PR simply adds a `__init__.py` file to resolve this. We also add a 5 wecond delay for client runners in release test to give clusters a bit of slack to come up (and avoid ray client connection errors)	2022-04-29 17:03:17 +01:00
SangBin Cho	46cd7f1830	Make large multi tests to nightly + remove k8s tests (#24302 ) As discussed, to reduce backlog for large tests, we will (1) remove k8s tests (2) make large multi daily tests to nightly tests	2022-04-29 03:40:12 -07:00
Kai Fricke	f3857b7aa1	[ci/release] Fix concurrency group calculation for smoke tests (#24269 ) Currently concurrency groups are always calculated based on the full test cluster compute. Instead, smoke tests should use the smoke test cluster compute.	2022-04-27 22:13:25 +01:00
mwtian	afdfd20a5b	[Release tests] Create compute config for new dataset shuffle tests (#24239 ) Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests. Related issue number #24176: the added tests would verify the instance types which support the new shuffle implementations.	2022-04-27 11:50:12 -07:00
Chen Shen	5c461519f3	Revert "[core] Use cheaper AWS m5 instances for shuffle tests (#23781 )" This reverts commit `717e60c` and `4aa854a`	2022-04-25 17:56:08 -07:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported. Closes #16435	2022-04-25 16:12:57 -07:00
Stephanie Wang	1de9f3457e	[nightly tests] Mark Datasets shuffle tests stable (#24175 ) dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.	2022-04-25 09:01:37 -07:00
Kai Fricke	bb341eb1e4	Revert "Revert "[tune] Also interrupt training when SIGUSR1 received"" (#24101 ) * Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)" This reverts commit `00595653ed`. Failure in windows has been addressed by conditionally registering the signal handler if available.	2022-04-22 11:27:38 +01:00
shrekris-anyscale	b51d0aa8b1	[serve] Introduce `context.py` and `client.py` (#24067 ) Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files: * `context.py` * Intended to expose process-wide state to internal Serve code as well as `api.py` * Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables * `client.py` * Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`	2022-04-21 18:35:09 -05:00
xwjiang2010	00595653ed	Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085 )	2022-04-21 13:27:34 -07:00
Kai Fricke	f376dd8902	[tune] Also interrupt training when SIGUSR1 received (#24015 ) Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) to be processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted. In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this lead to test flakiness, as final experiment state could not be gracefully persisted to cloud storage. This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.	2022-04-21 13:07:29 +01:00

1 2 3 4 5 ...

628 commits