
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	de69b0d6d6	[train/release] Fix horovod user test master app config (#24734 )	2022-05-14 21:20:45 -07:00
Chen Shen	2be45fed5e	Revert "[dataset] Use polars for sorting (#24523 )" (#24781 ) This reverts commit `c62e00e`. See if reverting this resolves the linux://python/ray/tests:test_actor_advanced failure.	2022-05-13 12:09:12 -07:00
Jian Xiao	ba500133af	lower the utilization threshold in many tasks scheduling test by 5% (#24758 ) Fix the failure to unbreak nightly and unblock the 1.13 release. The root cause is that the upgrade of GRPC to 1.45.2 made it slightly slower; this is an acceptable regression that is needed to make this upgrade. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>	2022-05-13 10:44:58 -07:00
Stephanie Wang	c62e00ed6d	[dataset] Use polars for sorting (#24523 ) Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed. On my laptop, this makes sorting 1GB about 2x faster. Without polars: $ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100 Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total Finished in 50.23415923118591 ... Stage 2 sort: executed in 38.59s Substage 0 sort_map: 100/100 blocks executed * Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total * Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Substage 1 sort_reduce: 100/100 blocks executed * Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total * Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total * Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total * Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used. With polars: $ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100 Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total Finished in 24.097432136535645 ... Stage 2 sort: executed in 14.02s Substage 0 sort_map: 100/100 blocks executed * Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total * Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Substage 1 sort_reduce: 100/100 blocks executed * Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total * Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total * Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total * Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used. Related issue number: Closes #23612.	2022-05-12 18:35:50 -07:00
Amog Kamsetty	a36e2a8f51	[Tune] Deprecate DistributedTrainableCreator (#24453 ) Fully deprecate DistributedTrainableCreator for Ray 2.0 Closes #24453	2022-05-10 11:06:43 -07:00
Kai Fricke	67d602e7a6	[ci] Fix automatic buildkite token fetching in fetch_release_logs.py (#24606 ) The script expected a return string, not an environment variable.	2022-05-10 09:24:10 +02:00
mwtian	918d3601c6	[Datasets] mark nightly test dataset_shuffle_sort_1tb_small_instances stable (#24481 )	2022-05-06 15:55:59 -07:00
Kai Fricke	d6096df742	[release] Add utility script to fetch release logs (#24508 ) This PR adds a utility script to automatically fetch release test results from the Buildkite pipeline for a release branch. This was previously a manual process.	2022-05-05 19:32:34 +01:00
Sven Mika	70d3bfcf9c	[RLlib] Provide more time for APPO Pong release and performance tests. (#24503 )	2022-05-05 18:19:38 +02:00
Kai Fricke	e1eec5507a	[ci/release] Fix ray version from init test (#24510 ) This release package unit test fails on release branches. Instead of checking for a hard-coded version number, we should just require the value to be non-empty. See e.g. https://buildkite.com/ray-project/ray-builders-pr/builds/31295#b6c6c952-ce34-4521-9342-429e92560dd3	2022-05-05 16:05:23 +01:00
SangBin Cho	295b4436b3	[Nightly tests] Increase wait for nodes timeout (#24457 ) Although there's enough quota, it is possible that AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait-for-nodes timeout is too short. This PR increases the timeout from 600 seconds to 3000 seconds (50 minutes). Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate).	2022-05-04 19:42:21 -07:00
SangBin Cho	168790c276	[Test] Add grace period to long running actor test failure (#24469 ) Add a 30-second grace period before raising an exception from this test failure (https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_1FL4g3cMg1wYifWf52tAaWtJ?command-history-section=command_history). What I'd like to see is some sort of error message propagated to the driver if this is due to an unexpected issue. Note that this PR also adds more detailed exit information to all worker failures, but this is still WIP (#24468).	2022-05-04 16:00:22 -07:00
Sihan Wang	3f5da8af7a	[Serve] Add serve handle graph workload nightly tests (#24435 )	2022-05-04 09:07:50 -07:00
Sven Mika	b48f63113b	[RLlib] SlateQ fixes: Release learning tests wrong yaml structure + TD-error torch issue (#24429 )	2022-05-04 13:37:14 +02:00
Jiao	9d31f5f7b2	[Serve] Change deployment graph long chain test (#24418 )	2022-05-03 10:38:47 -07:00
Stephanie Wang	fbbc9c33d6	Add nightly tests for push-based shuffle (#24352 ) Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.	2022-05-02 11:35:14 -07:00
Sven Mika	f066180ed5	[RLlib] Deprecate `timesteps_per_iteration` config key (in favor of `min_[sample|train]_timesteps_per_reporting`). (#24372 )	2022-05-02 12:51:14 +02:00
Kai Fricke	8a578c191f	[ci/release] Re-install anyscale package after local env setup (#24373 ) The local environment setup of release tests (in client tests) can sometimes update dependencies of the `anyscale` package to an unsupported version. By re-installing the `anyscale` package after local env setup, we make sure that we can connect to the cluster. Note that this may lead to incompatibilities of the test script, however.	2022-05-01 16:51:55 +01:00
Jiao	ba7cc1803a	[Deployment Graph] Add release test for long chain & wide fanout pattern (#24246 )	2022-04-29 17:03:33 -07:00
mwtian	02fda97c86	[CI] Re-balance concurrency groups to allow more quota for `large` tests (#24344 ) Currently nightly tests are unable to finish in a day because of concurrency group limit on `large` tests. This is an attempt to adjust the limits so buildkite can run / finish more tests. I will observe which tests fall into the `enormous` group and adjust the test resource / concurrency group limits again.	2022-04-29 22:26:16 +01:00
Sven Mika	3052193c9e	[RLlib] Fix CQL getting stuck when deprecated `timesteps_per_iteration` is used (use `min_train_timesteps_per_reporting` instead). (#24345 ) CQL does not perform sampling timesteps, and the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting, but it should be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting. If timesteps_per_iteration is used, CQL never leaves the first iteration, as it thinks it's not done yet (sample timesteps always remain at 0).	2022-04-29 21:02:34 +01:00
Kai Fricke	ac036e4fe8	[ci/release] Print local environment information (#24346 ) For debugging client environments, it is helpful to print the installed pip packages. Additionally, a fix for the environment of the ml_user_tune_rllib_connect_test is added. Additionally, anyscale import errors are reported verbosely to help debug missing packages.	2022-04-29 21:01:50 +01:00
Kai Fricke	dd87e61808	[ci/release] Fix module import errors in release tests (#24334 ) After https://github.com/ray-project/ray/pull/24066, some release tests are running into: `ModuleNotFoundError: No module named 'ray.train.impl'`. This PR simply adds an `__init__.py` file to resolve this. We also add a 5 second delay for client runners in release tests to give clusters a bit of slack to come up (and avoid ray client connection errors).	2022-04-29 17:03:17 +01:00
SangBin Cho	46cd7f1830	Make large multi tests to nightly + remove k8s tests (#24302 ) As discussed, to reduce backlog for large tests, we will (1) remove k8s tests (2) make large multi daily tests to nightly tests	2022-04-29 03:40:12 -07:00
Kai Fricke	f3857b7aa1	[ci/release] Fix concurrency group calculation for smoke tests (#24269 ) Currently concurrency groups are always calculated based on the full test cluster compute. Instead, smoke tests should use the smoke test cluster compute.	2022-04-27 22:13:25 +01:00
mwtian	afdfd20a5b	[Release tests] Create compute config for new dataset shuffle tests (#24239 ) Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests. Related issue number #24176: the added tests would verify the instance types which support the new shuffle implementations.	2022-04-27 11:50:12 -07:00
Chen Shen	5c461519f3	Revert "[core] Use cheaper AWS m5 instances for shuffle tests (#23781 )" This reverts commit `717e60c` and `4aa854a`	2022-04-25 17:56:08 -07:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised on any attempt to import the ray.util.sgd package. Closes #16435	2022-04-25 16:12:57 -07:00
Stephanie Wang	1de9f3457e	[nightly tests] Mark Datasets shuffle tests stable (#24175 ) dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.	2022-04-25 09:01:37 -07:00
Kai Fricke	bb341eb1e4	Revert "Revert "[tune] Also interrupt training when SIGUSR1 received"" (#24101 ) * Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)" This reverts commit `00595653ed`. Failure in windows has been addressed by conditionally registering the signal handler if available.	2022-04-22 11:27:38 +01:00
shrekris-anyscale	b51d0aa8b1	[serve] Introduce `context.py` and `client.py` (#24067 ) Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client`, in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files: `context.py`, which is intended to expose process-wide state to internal Serve code as well as `api.py` and stores the `_INTERNAL_REPLICA_CONTEXT` and `_global_client` global variables; and `client.py`, which stores the definition for the Serve `Client` object, now called the `ServeControllerClient`.	2022-04-21 18:35:09 -05:00
xwjiang2010	00595653ed	Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085 )	2022-04-21 13:27:34 -07:00
Kai Fricke	f376dd8902	[tune] Also interrupt training when SIGUSR1 received (#24015 ) Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) from being processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted. In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this led to test flakiness, as the final experiment state could not be gracefully persisted to cloud storage. This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.	2022-04-21 13:07:29 +01:00
Simon Mo	7b0c77dd38	[Serve] Fix torch_tune_serve_test client test (#24031 )	2022-04-20 16:52:27 -07:00
Amog Kamsetty	47243ace7c	[Release] Upgrade instance types for xgboost gpu release tests (#24002 ) In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767). This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6. Closes #24048	2022-04-20 15:18:22 -07:00
Chen Shen	717e60cb4d	[Core][nightly-test] fix shuffle 5000 partition OOM #23997 Closes #23992. #23781 changed the machine type, which dropped the memory capacity from 128GB to 64GB, and thus shuffle_1tb_5000_partitions started OOMing.	2022-04-18 23:49:51 -07:00
Amog Kamsetty	9ec5793bea	[Release] Fix XGBoost Golden Notebook Tests (#23996 ) Xgboost released a new version a few days ago. Due to caching of the Anyscale cluster env, this resulted in the server having an outdated xgboost version while the client has the most recent version causing the test to fail. Instead, we reinstall xgboost-ray and xgboost in the post build commands so that these dependencies are not being cached in the cluster env.	2022-04-18 21:44:47 -07:00
Dmitri Gekhtman	fc4ac71deb	[minor] Fix legacy OSS operator test (#23540 ) A legacy K8s test fails due to incorrect usage of @ray.method, which only started raising errors after the Ray 1.12.0 branch cut. This PR removes the use of @ray.method in the test. Some context is in #23271 and #23471. In addition, I noticed some of the tests were flaky due to out-of-memory issues. For that reason, I've doubled the memory requests and limits in the legacy operator's example files. I've also added CPU limits in an example file that was missing them; for consistency with Ray's resource model, it makes the most sense to use CPU limits in K8s configs. Finally, I added an extra note to the instructions for running the tests.	2022-04-18 17:47:42 -07:00
Kai Fricke	6e37a48632	[ci/release] Allow for preferring smoke tests when filtering (#23887 ) What: Adds a setting "prefer_smoke_tests" to the Buildkite settings. With this, users can specify to kick off smoke tests, if available. Why: The filtering interface of the release testing dialog is a bit complicated at the moment: in order to kick off smoke tests, users have to know the frequency with which they are configured to run. Instead, users should usually just filter the tests they want to run (using frequency ANY) and optionally specify to run smoke tests, if available.	2022-04-14 06:12:27 +01:00
Kai Fricke	e3bd59882d	[air] Move storage handling to pyarrow.fs.FileSystem (#23370 )	2022-04-13 14:31:30 -07:00
Kai Fricke	65d9a410f7	[ci] Clean up ci/ directory (refactor ci/travis) (#23866 ) Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories. Details: moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.; makes minor adjustments to some scripts (variable renames); removes the outdated (unused) asan tests.	2022-04-13 18:11:30 +01:00
Kai Fricke	5e1218aae1	[ci/release] Quote pip installs in client runner (#23888 ) What: Quotes pip install packages in local environment setup for client runner. Why: Strings like pyarrow>=6.0.1<7.0.0 currently don't work as they are interpreted as output redirection.	2022-04-13 11:07:12 +01:00
Edward Oakes	de227ac407	[serve] Add component logger + basic access logging (#23558 ) Adds a "component logger" to standardize logging across the HTTP proxy, controller, and deployment replicas.	2022-04-12 18:16:58 -05:00
Stephanie Wang	71e142b1fa	[core][tests] Add nightly test for datasets random_shuffle and sort (#23807 ) Copied from #23784. Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory. Modified to fix lint.	2022-04-12 12:53:57 -07:00
Tao Wang	a051e693c1	[Test]Add a time check for task benchmark (#23170 ) The test_many_tasks.py case often failed, and we found the reason. We sleep for sleep_time seconds to wait for all tasks to finish, but the actual sleep time is computed as 0.1 * #rounds, where 0.1 is the sleep time per round. This looks correct, but it misses one factor: the computation time elapsed in each round. In this case, that is the time consumed by `cur_cpus = ray.available_resources().get("CPU", 0)` and `min_cpus_available = min(min_cpus_available, cur_cpus)`; ray.available_resources() in particular takes quite a while when the cluster is large (in our case, more than 1s with 1500 nodes). The situation we expected: `for _ in range(sleep_time / 0.1): sleep(0.1)`. What actually happens: `for _ in range(sleep_time / 0.1): do_something(); sleep(0.1)`, where do_something() costs time, sometimes quite a lot. We don't know why ray.available_resources() is slow or whether that is expected, but we can add a time check to make the sleep time precise.	2022-04-11 06:27:04 -07:00
Eric Liang	1ff874e8e8	[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817 )	2022-04-10 16:12:53 -07:00
Archit Kulkarni	7a1a7e1844	Revert "[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 )" (#23805 ) This reverts commit `ba484feac0`. Broke lint.	2022-04-08 13:18:13 -07:00
Stephanie Wang	ba484feac0	[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 ) Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.	2022-04-08 11:31:10 -07:00
Stephanie Wang	4aa854aa23	[core] Use cheaper AWS m5 instances for shuffle tests (#23781 )	2022-04-07 19:05:42 -07:00
Jian Xiao	c23cae660d	[Release 1.12.0] Add release logs for 1.12.0rc1 (#23508 ) Add release logs for 1.12.0rc1. The base is 1.11.0rc1.	2022-04-07 11:23:04 -07:00
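
The timing problem described in commit a051e693c1 (a fixed number of `sleep(0.1)` rounds overshooting the intended wait because each round also does slow work) can be illustrated with a deadline-based loop. This is only a minimal sketch of the general technique, not the code from that commit: `poll_min_value`, `probe`, and the injectable `clock`/`sleep` parameters are hypothetical names introduced here for illustration.

```python
import time


def poll_min_value(probe, duration_s, interval_s=0.1,
                   clock=time.monotonic, sleep=time.sleep):
    """Call `probe` repeatedly for about `duration_s` seconds; return the minimum seen.

    The loop is bounded by a wall-clock deadline instead of a precomputed
    round count, so time spent inside `probe` counts toward the total wait.
    (Hypothetical helper; names are assumptions, not Ray APIs.)
    """
    deadline = clock() + duration_s
    minimum = float("inf")
    while True:
        minimum = min(minimum, probe())
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
    return minimum
```

In the scenario from the commit message, `probe` would stand in for something like `lambda: ray.available_resources().get("CPU", 0)`. With a slow probe, the loop simply performs fewer rounds rather than running longer than intended.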


661 commits
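
Commit 5e1218aae1 above notes that unquoted requirement strings such as `pyarrow>=6.0.1<7.0.0` are interpreted by the shell as redirections. A minimal sketch of the quoting idea, using the standard library's `shlex.quote`, might look as follows. The function name `build_pip_install_command` is hypothetical, and this is not claimed to be how the release tooling itself builds the command.

```python
import shlex


def build_pip_install_command(requirements):
    """Build a shell-safe `pip install` command line (hypothetical helper).

    Each requirement is passed through shlex.quote so that characters such
    as `<` and `>` reach pip literally instead of being treated by the
    shell as input/output redirection.
    """
    quoted = " ".join(shlex.quote(req) for req in requirements)
    return f"pip install {quoted}"


command = build_pip_install_command(["pyarrow>=6.0.1<7.0.0", "ray"])
print(command)  # pip install 'pyarrow>=6.0.1<7.0.0' ray
```

Requirements without shell-special characters (such as `ray`) are left unquoted by `shlex.quote`, so the resulting command stays readable.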