hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

Author	SHA1	Message	Date
Kai Fricke	cd95569b01	[tune/release] Add up/down scaling release test (#25392 ) This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 22:57:24 +01:00
Sihan Wang	b606169cb5	[Serve] Promote autoscaling feature (#26393 ) 1. get rid of the private attribute 2. fix unit test 3. docs and workflows	2022-07-13 14:38:38 -05:00
Sven Mika	ab10890e90	Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517 )	2022-07-13 11:19:30 -07:00
Antoni Baum	a8fb194c8b	[CI] Fix nightly horovod test (#26447 ) Removes usage of deprecated Train APIs and uses Ray AIR HorovodTrainer instead.	2022-07-13 16:51:50 +01:00
Kai Fricke	e4a4f7de70	[ci/release] Fix fetching logs from staging clusters (#26515 ) Replaces a formerly hard-coded URI to anyscale prod with the respective env variable. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 11:15:50 +01:00
Kai Fricke	cf75cf7232	[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436 ) This PR adds a distributed benchmark test for Pytorch MNIST training. It compares training with Ray AIR with training with vanilla PyTorch. In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both cases, we collect the end to end runtime. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 10:53:24 +01:00
Riatre	2cdb76789e	Bump pytest from 5.4.3 to 7.0.1 (#26334 ) See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828. Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future. The change contains three commits with rather descriptive commit message, which I repeat here: Pass deps to py_test in py_test_module_list Bazel macro py_test_module_list takes a `deps` argument, but completely ignores it instead of passes it to `native.py_test`. Fixing that as we are going to use deps of py_test_module_list in BUILD in later changes. cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library from a py_test, which isn't working, see upstream issue: https://github.com/bazelbuild/bazel/issues/701. This is fixed by simply removing the (non-working) deps. Depend on conftest and data files in Python tests BUILD files Bazel requires that all the files used in a test run should be represented in the transitive dependencies specified for the test target. For py_test, it means srcs, deps and data. Bazel enforces this constraint by creating a "runfiles" directory, symbolic links files in the dependency closure and run the test in the "runfiles" directory, so that the test shouldn't see files not in the dependency graph. Unfortunately, the constraint does not apply for a large number of Python tests, due to pytest (>=3.9.0, <6.0) resolving these symbolic links during test collection and effectively "breaks out" of the runfiles tree. pytest >= 6.0 introduces a breaking change and removed the symbolic link resolving behaviour, see pytest pull request https://github.com/pytest-dev/pytest/pull/6523 for more context. Currently, we are underspecifying dependencies in a lot of BUILD files and thus blocking us from updating to newer pytest (for Python 3.10 support). This change hopefully fixes all of them, and at least those in CI, by adding data or source dependencies (mostly for conftest.py-s) where needed. Bump pytest version from 5.4.3 to 7.0.1 We want at least pytest 6.2.5 for Python 3.10 support, but not past 7.1.0 since it drops Python 3.6 support (which Ray still supports), thus the version constraint is set to <7.1. Updating pytest, combined with earlier BUILD fixes, changed the ground truth of a few error message based unit test, these tests are updated to reflect the change. There are also two small drive-by changes for making test_traceback and test_cli pass under Python 3.10. These are discovered while debugging CI failures (on earlier Python) with a Python 3.10 install locally. Expect more such issues when adding Python 3.10 to CI.	2022-07-12 21:14:35 -07:00
Kai Fricke	753f5feaf4	[tune] Remove TrialCheckpoint class (#25406 ) The old user-facing TrialCheckpoint class has been deprecated in favor of `ray.ml.Checkpoint` and will be removed with this PR. The main change in this PR is to delete the old `TrialCheckpoint` class and replace remaining API calls (e.g. `checkpoint.local_path`) with the correct AIR equivalents. One issue that comes up is that with Ray client usage, checkpoint directories are not available on the local node (the client). Thus, we can't construct `Checkpoint` objects easily. (Previously, the TrialCheckpoint object held a reference to the location, even if it is not locally available). There are ongoing discussions on how to resolve this in the future. For now, we print an error when such a checkpoint is requested. Depends on #25805 Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-11 20:08:10 +01:00
Jian Xiao	923209895d	Pipelined training test: change num of windows; log the ingestion perf (#26429 ) Why are these changes needed? Improve test perf Log the perf stats With 2 windows there are a lot of spilling, slowing down the throughput.	2022-07-11 11:03:35 -07:00
Jiajun Yao	743e2f403a	Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (#26366 ) - Record the test name for the usage stats. - Change the cluster name to indicate if it's smoke test or not.	2022-07-07 21:17:34 -07:00
Antoni Baum	ea94cda1f3	[AIR] Replace `train.` with `session.` (#26303 ) This PR replaces legacy API calls to `train.` with AIR `session.` in Train code, examples and docs. Depends on https://github.com/ray-project/ray/pull/25735	2022-07-07 16:29:04 -07:00
Antoni Baum	b9a4f64f32	[AIR/train] Use new Train API (#25735 ) Uses the new AIR Train API for examples and tests. The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers. This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs. Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled. Requires https://github.com/ray-project/ray/pull/25943 to be merged in first	2022-07-07 12:28:37 -07:00
xwjiang2010	40f9561f78	[ml/release] fix ptl ml user test. (#26365 ) Between version1 and 2 of [this](https://console.anyscale-staging.com/o/anyscale-internal/configurations/app-config-versions/apt_TsCpJCRjMJDpNFhNgJmyCniS) cluster_env, 1 fails and 2 succeeds. btw, we really should start to think about a systematic approach towards our python dependency story. - between client and server - but more importantly server side, and any conflicts among requirements - how are pip freeze result evolving over time	2022-07-07 11:45:46 -07:00
Stephanie Wang	dcc913073f	[testing] Run 100TB shuffle test nightly (#26306 ) Run this test nightly to collect more datapoints on stability and performance of 100TB shuffle.	2022-07-07 09:59:54 -07:00
Amog Kamsetty	6f683c8d1c	[Release] Use nightly base images for release tests (#25373 ) Revert back to using nightly base images instead of pinning to 1.12.1. Pinning the docker image had led to uncaught errors in the past. Instead, we should be using nightly to make sure release tests will work on the most up to date versions of docker/cluster envs. If there are any test failures, the underlying issues should be fixed rather than pinning the docker image. Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-05 10:58:53 -07:00
Stephanie Wang	a90e53b76f	[core] Add weekly test for 100TB random shuffle (#25908 ) Adds a CI test for 100TB shuffle. There is a custom config for this nightly test to: (1) make sure each node gets 4TB of storage, (2) head node has 0 CPUs, (3) worker nodes have half their actual vCPU count. Related issue number Closes #24480.	2022-07-01 13:30:07 -07:00
Alex Wu	76c5122357	[ci/release] Fix wait_cluster (#26236 ) Fixes a bug in wait_cluster where we count the total number of nodes ever in the cluster rather than the alive nodes. This has causes infra/autoscaler failures (e.g. #26138) to be mislabeled as test failures (and probably messes with timing too). Co-authored-by: Alex Wu <alex@anyscale.com>	2022-06-30 16:37:32 -07:00
Kai Fricke	e2d8e7a6ae	[ci/release/ml] Run ML release tests on staging (#26168 ) This moves all ML release tests to staging. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-06-30 13:24:28 -07:00
Kai Fricke	f9e787115f	[ci/release/core] Run `many_nodes` test on staging (#26164 ) This moves many_nodes to anyscale staging.	2022-06-29 11:07:32 -07:00
Kai Fricke	7091a32fe1	[ci/release] Support running tests on staging (#25889 ) This adds "environments" to the release package that can be used to configure some environment variables. These variables will be loaded either by an `--env` argument or a `env` definition in the test definition and can be used to e.g. run release tests on staging.	2022-06-28 10:14:01 -07:00
mwtian	513881584d	[Core] install jemalloc in Ray docker and use jemalloc in `benchmark` release tests (#26112 ) There are mysterious memory usage growth in Ray clusters that disappear when running with jemalloc. Before we are able to figure out the root cause, it seems using jemalloc by default can be a good walkaround. Because of its efficiency, using jemalloc by default can be beneficial, but we need to run more benchmarks to verify.	2022-06-27 23:26:56 -07:00
Sihan Wang	478733d751	[Serve] Bump min_workers configuration for Serve nightly tests (#25892 )	2022-06-23 15:38:07 -07:00
Kai Fricke	0959f44b6f	[tune/structure] Introduce execution package (#26015 ) Execution-specific packages are moved to tune.execution. Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2022-06-23 11:13:19 +01:00
Kai Fricke	fb3dd0ea40	[release/1.13.0] Add release logs (#24509 ) Preliminary release logs for review and approval.	2022-06-21 23:51:25 +01:00
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
Chen Shen	97582a802d	[Core] update protobuf to 3.19.4 (#25648 ) The error message in #25638 indicates we should use protobuf>3.19.0 to generated code so that we can work with python protobuf >= 4.21.1. Try generating wheels to see if this works.	2022-06-18 16:06:56 -07:00
Stephanie Wang	293c122302	[dataset] Use polars for sorting (#25454 )	2022-06-17 12:26:46 -07:00
Simon Mo	438b6c78c8	[Release Tests] Add memory monitoring for Serve release test (#25868 )	2022-06-17 11:11:56 -07:00
matthewdeng	5c6b91d375	[Release] fix Horovod release tests (#25873 ) Error message suggests: Wait timeout after 30 seconds for key(s): 0. You may want to increase the timeout via HOROVOD_GLOO_TIMEOUT_SECONDS Bumped up to 120 seconds. Tests run successfully: https://buildkite.com/ray-project/release-tests-pr/builds/6906	2022-06-17 14:52:54 +01:00
Jiao	f6735f90c7	[Ray DAG] Move `dag` project folder out of `experimental` (#25532 )	2022-06-16 19:15:39 -07:00
Jun Gong	c026374acb	[RLlib] Fix the 2 failing RLlib release tests. (#25603 )	2022-06-14 14:51:08 +02:00
Sven Mika	130b7eeaba	[RLlib] `Trainer` to `Algorithm` renaming. (#25539 )	2022-06-11 15:10:39 +02:00
mwtian	1483c4553c	use smaller instance for scheduling tests (#25635 ) m5.16xlarge instances have 64 CPU and 256GB memory, which are overkill for scheduling tests that do not have a lot of computations. Use smaller instance m5.4xlarge to save cost and make allocating instances easier.	2022-06-10 17:09:35 +00:00
Sven Mika	7c39aa5fac	[RLlib] Trainer.training_iteration -> Trainer.training_step; Iterations vs reportings: Clarification of terms. (#25076 )	2022-06-10 17:09:18 +02:00
Amog Kamsetty	1316a2d05e	[AIR/Train] Move `ray.air.train` to `ray.train` (#25570 )	2022-06-08 21:34:18 -07:00
Kai Fricke	c3b608f757	[tune] Fix cloud tests, mark as stable (#25583 ) #25063 broke release tests, but they've been consistently stable before. This PR fixes the tests and marks tune cloud tests as stable.	2022-06-08 17:47:54 +01:00
Kai Fricke	45bf925ef0	[train/serve] Fix torch tune serve test (#25547 ) #24772 broke the smoke test as it was not run on CI - this PR hotfixes this	2022-06-07 15:54:37 +01:00
Rohan Potdar	a9d8da0100	[RLlib]: Doubly Robust Off-Policy Evaluation. (#25056 )	2022-06-07 12:52:19 +02:00
Kai Fricke	a0c8db1b5e	[release] Update download_wheels.sh to include Python 3.10 (#25508 ) Currently the download script does not contain python 3.10	2022-06-06 22:42:50 +01:00
Eric Liang	94dec83a60	[data] Rename data.impl to data._internal (#25486 )	2022-06-06 11:39:53 -07:00
Sven Mika	b5bc2b93c3	[RLlib] Move all remaining algos into `algorithms` directory. (#25366 )	2022-06-04 07:35:24 +02:00
Kai Fricke	4b9a89ad90	[air] Move python/ray/ml to python/ray/air (#25449 ) The package "ml" should be renamed to "air". Main question: Keep a `ml.py` with `from ray.air import *` for some level of backwards compatibility? I'd go for no to force people to use the new structure.	2022-06-03 21:53:44 +01:00
Yi Cheng	fd0f967d2e	Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to `algorithms` dir and rename policy and trainer classes. (#25346 )" (#25420 ) This reverts commit `e4ceae19ef`. Reverts #25346 linux://python/ray/tests:test_client_library_integration never fail before this PR. In the CI of the reverted PR, it also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128). So high likely it's because of this PR. And test output failure seems related as well (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b)	2022-06-02 20:38:44 -07:00
Sihan Wang	b024a9543e	[Serve] Support scale replica down to 0 (#24892 )	2022-06-02 08:06:46 -07:00
Sven Mika	e4ceae19ef	[RLlib] Move (A/DD)?PPO and IMPALA algos to `algorithms` dir and rename policy and trainer classes. (#25346 )	2022-06-02 16:47:05 +02:00
Kai Fricke	8a9512bf62	[ci/release] Install local wheels in release test shell script (#25227 ) We're currently installing matching wheels on the fly in the python script for Ray client tests. However, we can't reload modules with changed protobuf configurations, and thus can't reload ray completely. Since the `anyscale` pacakge depends on Ray, this effectively prevents us from installing matching wheels within the python script. There are a few possible solutions to this. First, we could separate out the local environment preparation from the test running - this will duplicate some logic and is thus a bit more involved, but should be considered in the future. For now, we adjust the `run_release_tests.sh` shell script to install any passed `--ray-wheels` wheels locally. We only do this in CI instances per default as these wheels will not be compatible with e.g. MacOS. Link to successful build: https://buildkite.com/ray-project/release-tests-branch/builds/619#_	2022-06-02 10:32:51 +01:00
Eric Liang	905258dbc1	Clean up docstyle in python modules and add LINT rule (#25272 )	2022-06-01 11:27:54 -07:00
SangBin Cho	ca75570f51	Revert "Revert "Revert "[dataset] Use polars for sorting (#24523 )" (#24781 )" (#25173 )" (#25341 ) This reverts commit `61676f26d3`.	2022-06-01 10:49:12 -07:00
Kai Fricke	1ed8bd0345	[release/xgboost/lightgbm] Fix app config dependency install overwriting ray (#25307 ) This line: ``` pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm ``` also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested: ``` [INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ... [INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm (ff6ae9f9) ``` Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after installing additional packages in order to catch these errors earlier.	2022-05-31 13:46:17 +02:00
Stephanie Wang	61676f26d3	Revert "Revert "[dataset] Use polars for sorting (#24523 )" (#24781 )" (#25173 ) Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed. On my laptop, this makes sorting 1GB about 2x faster: without polars $ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100 Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total Finished in 50.23415923118591 ... Stage 2 sort: executed in 38.59s Substage 0 sort_map: 100/100 blocks executed * Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total * Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Substage 1 sort_reduce: 100/100 blocks executed * Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total * Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total * Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total * Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used with polars $ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100 Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total Finished in 24.097432136535645 ... Stage 2 sort: executed in 14.02s Substage 0 sort_map: 100/100 blocks executed * Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total * Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Substage 1 sort_reduce: 100/100 blocks executed * Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total * Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total * Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total * Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used Related issue number Closes #23612.	2022-05-27 10:43:51 -07:00

1 2 3 4 5 ...

681 commits