hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00

Author	SHA1	Message	Date
Chen Shen	5c461519f3	Revert "[core] Use cheaper AWS m5 instances for shuffle tests (#23781 )" This reverts commit `717e60c` and `4aa854a`	2022-04-25 17:56:08 -07:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported. Closes #16435	2022-04-25 16:12:57 -07:00
Stephanie Wang	1de9f3457e	[nightly tests] Mark Datasets shuffle tests stable (#24175 ) dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.	2022-04-25 09:01:37 -07:00
Kai Fricke	bb341eb1e4	Revert "Revert "[tune] Also interrupt training when SIGUSR1 received"" (#24101 ) * Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)" This reverts commit `00595653ed`. Failure in windows has been addressed by conditionally registering the signal handler if available.	2022-04-22 11:27:38 +01:00
shrekris-anyscale	b51d0aa8b1	[serve] Introduce `context.py` and `client.py` (#24067 ) Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files: * `context.py` * Intended to expose process-wide state to internal Serve code as well as `api.py` * Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables * `client.py` * Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`	2022-04-21 18:35:09 -05:00
xwjiang2010	00595653ed	Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085 )	2022-04-21 13:27:34 -07:00
Kai Fricke	f376dd8902	[tune] Also interrupt training when SIGUSR1 received (#24015 ) Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) to be processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted. In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this lead to test flakiness, as final experiment state could not be gracefully persisted to cloud storage. This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.	2022-04-21 13:07:29 +01:00
Simon Mo	7b0c77dd38	[Serve] Fix torch_tune_serve_test client test (#24031 )	2022-04-20 16:52:27 -07:00
Amog Kamsetty	47243ace7c	[Release] Upgrade instance types for xgboost gpu release tests (#24002 ) In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767). This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6. Closes #24048	2022-04-20 15:18:22 -07:00
Chen Shen	717e60cb4d	[Core][nightly-test] fix shuffle 5000 partition OOM #23997 closes #23992 #23781 changed the machine type where the memory capacity dropped from 128GB to 64GB and thus shuffle_1tb_5000_partitions starts OOMing.	2022-04-18 23:49:51 -07:00
Amog Kamsetty	9ec5793bea	[Release] Fix XGBoost Golden Notebook Tests (#23996 ) Xgboost released a new version a few days ago. Due to caching of the Anyscale cluster env, this resulted in the server having an outdated xgboost version while the client has the most recent version causing the test to fail. Instead, we reinstall xgboost-ray and xgboost in the post build commands so that these dependencies are not being cached in the cluster env.	2022-04-18 21:44:47 -07:00
Dmitri Gekhtman	fc4ac71deb	[minor] Fix legacy OSS operator test (#23540 ) A legacy K8s test fails due to incorrect usage of @ray.method which only started raising errors after the Ray 1.12.0 branch cut. This PR removes the use of @ray.method in the test. Some context in #23271 and #23471 In addition, I noticed some of the test were flakey due to out-of-memory issues. For that reason, I've doubled the memory request and limits in the legacy operator's example files. I've also added CPU limits in an example file that was missing them -- it makes the most sense for consistency with Ray's resource model to use CPU limits in K8s configs. Finally, I added an extra note to the instructions for running the tests.	2022-04-18 17:47:42 -07:00
Kai Fricke	6e37a48632	[ci/release] Allow for preferring smoke tests when filtering (#23887 ) What: Adds a setting "prefer_smoke_tests" to the Buildkite settings. With this, user can specify to kick off smoke tests, if available. Why: The filtering interface of the release testing dialog is a bit complicated at the moment - in order to kick off smoke tests, users have to know with which frequency they are configured to run. Instead users should usually just filter the tests they want to run (using frequency ANY) and optionally specify to run smoke tests, if available.	2022-04-14 06:12:27 +01:00
Kai Fricke	e3bd59882d	[air] Move storage handling to pyarrow.fs.FileSystem (#23370 )	2022-04-13 14:31:30 -07:00
Kai Fricke	65d9a410f7	[ci] Clean up ci/ directory (refactor ci/travis) (#23866 ) Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories. Details: - Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc. - Minor adjustments to some scripts (variable renames) - Removes the outdated (unused) asan tests	2022-04-13 18:11:30 +01:00
Kai Fricke	5e1218aae1	[ci/release] Quote pip installs in client runner (#23888 ) What: Quotes pip install packages in local environment setup for client runner. Why: Strings like pyarrow>=6.0.1<7.0.0 currently don't work as they are interpreted as output redirection.	2022-04-13 11:07:12 +01:00
Edward Oakes	de227ac407	[serve] Add component logger + basic access logging (#23558 ) Adds a "component logger" to standardize logging across the HTTP proxy, controller, and deployment replicas.	2022-04-12 18:16:58 -05:00
Stephanie Wang	71e142b1fa	[core][tests] Add nightly test for datasets random_shuffle and sort (#23807 ) Copied from #23784. Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory. Modified to fix lint.	2022-04-12 12:53:57 -07:00
Tao Wang	a051e693c1	[Test]Add a time check for task benchmark (#23170 ) In test_many_tasks.py case, we usually found the case failing and found the reason. We sleep for sleep_time seconds to wait all tasks to be finished, but the computation of actual sleep time is done by 0.1 * #rounds, where 0.1 is the sleep time every round. It looks perfect but one factor was missed, and that's the computation time elapsed. In this case, it is the time consumed by cur_cpus = ray.available_resources().get("CPU", 0) min_cpus_available = min(min_cpus_available, cur_cpus) especially the ray.available_resources() took a quite time when the cluster is large. (in our case it took beyond 1s with 1500 nodes). The situation we thought it would be: for _ in range(sleep_time / 0.1): sleep(0.1) The actual situation happens: for _ in range(sleep_time / 0.1): do_something(); # it costs time, sometimes pretty much sleep(0.1) We don't know why ray.available_resources() is slow and if it's logical, but we can add a time checker to make the sleep time precise.	2022-04-11 06:27:04 -07:00
Eric Liang	1ff874e8e8	[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817 )	2022-04-10 16:12:53 -07:00
Archit Kulkarni	7a1a7e1844	Revert "[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 )" (#23805 ) This reverts commit `ba484feac0`. Broke lint.	2022-04-08 13:18:13 -07:00
Stephanie Wang	ba484feac0	[core][tests] Add nightly test for datasets random_shuffle and sort (#23784 ) Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.	2022-04-08 11:31:10 -07:00
Stephanie Wang	4aa854aa23	[core] Use cheaper AWS m5 instances for shuffle tests (#23781 )	2022-04-07 19:05:42 -07:00
Jian Xiao	c23cae660d	[Release 1.12.0] Add release logs for 1.12.0rc1 (#23508 ) Add release logs for 1.12.0rc1. The base is 1.11.0rc1.	2022-04-07 11:23:04 -07:00
Kai Fricke	73d1610e69	[ci/release] Fix pipeline build for empty PR repo (#23775 ) What: If BUILDKITE_PULL_REQUEST_REPO is empty string, default to DEFAULT_REPO Why: BUILDKITE_PULL_REQUEST_REPO is set to an empty string per default, thus we're currently not detecting the buildkite repo correctly in branched builds.	2022-04-07 09:29:48 -07:00
Kai Fricke	7b86a05efd	[ci/release] Parse PR github repos correctly (#23757 ) What: Correctly infer github repo from PRs iin Buildkite environments Why: For PRs, we need to checkout the correct github repo and branch so we can kick off release tests directly from PRs. Test run (from this PR!): https://buildkite.com/ray-project/release-tests-pr/builds/20#7f5a6526-0040-4896-b23a-f4896c75973d	2022-04-06 17:34:20 -07:00
SangBin Cho	47ff1241f9	[Test] Use spot instances for chaos tests. (#23679 ) Use spot instances for chaos tests. We can also experiment with other tests that don't suppose to have dead nodes, but let's do it once the nightly infra is stabilized	2022-04-06 15:56:31 -07:00
Avnish Narayan	fdc6e02c29	[RLlib; testing] Move `num_workers` to RLlib config (#23750 )	2022-04-06 20:06:48 +02:00
Kai Fricke	0b804e5162	[ci/release] Move ML long running tests to sdk file manager (#23745 ) What: Long running tests should use sdk file manager Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.	2022-04-06 10:50:49 -07:00
Archit Kulkarni	582bf4e8f8	Add basic jobs release test with Tune script (#23474 ) Adds basic jobs release tests that connects to the test cluster and runs a basic tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests: (1) Uses a local `working_dir` (2) Uses a remote working_dir from a zip github URL.	2022-04-05 13:31:11 -05:00
Jiajun Yao	a668e5d8db	Add perf metrics for stress tests (#23648 ) Added perf metrics for stress tests so they can be alerted on.	2022-04-05 08:09:27 +09:00
Kai Fricke	40a8183e05	[ci/release] Fix job-based file download (#23657 ) have to wrap download call in a lambda to be compatible with run_with_retry	2022-04-04 08:06:31 -07:00
Kai Fricke	9071b39f3e	[ci/release] Add buildkite output groups (#23658 ) This makes the buildkite output easier to parse and interpret.	2022-04-01 13:04:22 -07:00
Kai Fricke	fe27dbcd9a	[air/release] Improve file packing/unpacking (#23621 ) We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes. Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).	2022-04-01 07:38:14 -07:00
Avnish Narayan	161d95c31b	[RLlib] Increase slateq workers to decrease runtime on prod (#23609 )	2022-03-30 17:38:21 -07:00
Chen Shen	3e80da7e9f	[ci/release] long running / change failed test to sdk (#23602 ) close #23592. Talking with @krfricke and he suggested we move to use sdk for those long running tasks.	2022-03-30 12:57:21 -07:00
Jiajun Yao	2959294f02	[CI] Filter release tests by attr regex (#23485 ) Support filtering tests by test attr regex filters. Multiple filters can be specified with one line for each filter. The format is attr:regex (e.g. team:serve)	2022-03-30 09:41:18 -07:00
Kai Fricke	e8abffb017	[tune/release] Improve Tune cloud release tests for durable storage (#23277 ) This PR addresses recent failures in the tune cloud tests. In particular, this PR changes the following: The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests. We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected) We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days. Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes). Local release test runs succeeded. https://buildkite.com/ray-project/release-tests-branch/builds/189 https://buildkite.com/ray-project/release-tests-branch/builds/191	2022-03-30 09:28:33 -07:00
Kai Fricke	922367d158	[ci/release] Fix smoke test compute templates (#23561 ) The smoke test definitions of a few tests were faulty for compute template override. Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294	2022-03-29 13:48:09 -07:00
Artur Niederfahrenhorst	9a64bd4e9b	[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842 )	2022-03-29 14:44:40 +02:00
Yi Cheng	7de751dbab	[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518 ) This PR remove enable_gcs_bootstrap flag in cpp.	2022-03-28 21:37:24 -07:00
Chen Shen	c3e04ab275	[nighly-test] try out spot instances for chaos test #23507	2022-03-27 20:10:21 -07:00
Sven Mika	22c9c4aa39	[RLlib] Slate-Q +GPU torch bug fix. (#23464 )	2022-03-24 17:39:33 +01:00
Avnish Narayan	9040f54060	[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452 )	2022-03-24 09:17:30 +01:00
Stephanie Wang	aa6f773283	Switch long running tests to SDK (#23433 ) These tests are flakey on the job-based test submission system. Switching them to the SDK-based test runner for now.	2022-03-23 17:44:26 -07:00
Kai Fricke	724377163f	[ci/release] Unstable tests should only soft fail the build (#23403 ) This will leave the tests green if the test is failing but marked as unstable.	2022-03-23 09:38:56 +00:00
Amog Kamsetty	6d776976c1	[Train] Fix multi node horovod bug (#22564 ) Closes #20956	2022-03-22 16:22:53 -07:00
Jiajun Yao	bab19e8e68	Add perf metrics for test_many_tasks.py (#23318 ) Add perf metrics for test_many_tasks.py Use the new smoke test structure	2022-03-22 16:16:42 -07:00
SangBin Cho	0cd687cc19	[Nightly test] Fix job download retry (#23401 ) Currently when we download a file to the cluster using a job, we don't do the retry.	2022-03-22 08:31:24 -07:00
Kai Fricke	02644ab4d8	[ci/release] Retry cluster env build on failure (#23378 ) Failed cluster env builds should be retried.	2022-03-22 09:45:22 +00:00

1 2 3 4 5 ...

585 commits