hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

Author	SHA1	Message	Date
Antoni Baum	c74886a55e	[CI] Run doc notebooks in CI (#24816 ) Currently, we are not running doc notebooks in CI due to a bazel misconfiguration - we are using `glob` in a top level package in order to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by: * Running the `py_test_run_all_notebooks` macro inside the relevant subpackages * Editing the `test_myst_doc.py` script to allow for recursive search for the target file, allowing to deal with mismatches between `name` and `data` arguments in `py_test_run_all_notebooks` * Setting the `allow_empty=False` flag inside `glob` calls in our macros to ensure that this oversight is caught early * Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files This PR also adds a GPU runner for doc tests, allowing one of our examples to pass - and setting the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.	2022-05-17 09:50:42 +01:00
Edward Oakes	f99aa5cb40	[serve][docs] Unify `doc_code` directories and add bazel target (#24736 ) Split off from https://github.com/ray-project/ray/pull/24693/, unifying the redundant directories we had and making sure all `serve/doc_code` snippets are run in CI.	2022-05-16 09:49:42 -05:00
Yi Cheng	68384ec745	[ci] Add flag for staging tests and disable the unstable one. (#24745 ) This PR tries to add a prefix for the staging ci test. This is useful to separate staging tests from stable tests in https://flakey-tests.ray.io/	2022-05-13 13:48:14 -07:00
Kai Fricke	b0fa9d6766	[air] Example for Comet ML (#24603 ) After #24459, this PR will add similar support for model artifact saving and an example for experiment tracking with Ray AIR for Comet ML.	2022-05-12 12:12:30 +01:00
Yi Cheng	a7d552ca25	[ci] Fix syncer staging tests error (#24681 ) The staging tests failed due to using the wrong file. This PR fixed it. https://buildkite.com/ray-project/ray-builders-branch/builds/7458#d6c28480-4c99-4a69-908c-9b0b5af9ce1f	2022-05-10 23:23:50 -07:00
Simon Mo	791ce22feb	[CI] Add conditional build to macOS pipeline (#24671 )	2022-05-10 16:49:03 -07:00
Yi Cheng	6c60dbb242	[scheduler][6] Integrate ray with syncer. (#23660 ) The new syncer comes with the feature of long-polling and versioning. This PR integrates it with ray.	2022-05-10 13:12:22 -07:00
Kai Yang	4a999777fa	[Core] Allow accepting gRPC HTTP proxy via env variable (#23526 )	2022-05-10 11:30:46 +08:00
Kai Fricke	5d9bf4234a	[air] Example to track runs with Weights & Biases (#24459 ) This PR - adds an example on how to run Ray Train and log results to weights & biases - adds functionality to the W&B plugin to store checkpoints - fixes a bug introduced in #24017 - Adds a CI utility script to setup credentials - Adds a CI utility script to remove test state from external services cc @simon-mo	2022-05-06 15:52:37 +01:00
mwtian	b02029b29f	[Core] allow using grpcio > 1.44.0 (#23722 )	2022-05-04 19:06:11 -07:00
Kai Fricke	c01681cf34	[ci] Exclude flaky test headers from pytest summaries (#24365 ) Failing pytest summaries for flaky tests that eventually succeed are not always cleaned up properly: https://buildkite.com/ray-project/ray-builders-branch/builds/7292#_ This PR ensures we only print summaries when we have at least one summary file (and not just the header file).	2022-05-01 12:51:22 +01:00
Kai Fricke	6282090401	[ci] Fix GPU docker builds (#24336 ) NVIDIA Docker builds are currently broken, e.g.: https://buildkite.com/ray-project/ray-builders-branch/builds/7239#e9dea1d6-7dea-4323-801c-b7efe917be03 Following this workaround: https://forums.developer.nvidia.com/t/invalid-public-key-for-cuda-apt-repository/212901/11 to hopefully fix this for now.	2022-04-29 17:10:18 +01:00
Dmitri Gekhtman	d68c1ecaf9	[kuberay] Test Ray client and update autoscaler image (#24195 ) This PR adds KubeRay e2e testing for Ray client and updates the suggested autoscaler image to one running the merge commit of PR #23883 .	2022-04-27 18:02:12 -07:00
Kai Fricke	fc1cd89020	[ci] Add short failing test summary for pytests (#24104 ) It is sometimes hard to find all failing tests in buildkite output logs - even filtering for "FAILED" is cumbersome as the output can be overloaded. This PR adds a small utility to add a short summary log in a separate output section at the end of the buildkite job. The only shared directory between the Buildkite host machine and the test docker container is `/tmp/artifacts:/artifact-mount`. Thus, we write the summary file to this directory, and delete it before actually uploading it as an artifact in the `post-commands` hook.	2022-04-26 22:18:07 +01:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been marked as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised on any attempt to import the ray.util.sgd package. Closes #16435	2022-04-25 16:12:57 -07:00
Kai Fricke	b86d420a3c	[ci] Only upload wheels to S3 once (#24072 ) Currently all jobs that build wheels put them into the artifacts directory and upload them. This leads to the wheels being overwritten on S3 multiple times. This is not a huge problem as ingress is free, but in order to have a single point of reference, it might be beneficial to limit the wheels uploading to a single Buildkite job. Recently, this has led to interference with stale artifact directories. The downside here is that if the "Wheels & Jars" build fails randomly, the wheels will not be available on S3 - previously they've been also uploaded by several other jobs.	2022-04-25 21:19:11 +01:00
Dmitri Gekhtman	8c5fe44542	[KubeRay] Fix autoscaling with GPUs and custom resources, with e2e tests (#23883 ) - Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus"). - Adds end-to-end test logic confirming the fix. - Adds end-to-end test logic confirming autoscaling with custom resources works. - Slightly refines developer instructions. - Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.	2022-04-21 14:54:37 -07:00
Yi Cheng	04611edf5a	[scheduler] Update syncer API and add reconnect feature. (#23929 ) This PR focuses on updating syncer-related code and comments from #23660 to reduce the code size. - Update Snapshot/Update -> CreateSyncMessage/ConsumeSyncMessage - Make ray syncer test work even when we add more components in the protobuf - Make ray syncer able to reconnect to a new node.	2022-04-20 14:31:24 -07:00
Kai Fricke	65d9a410f7	[ci] Clean up ci/ directory (refactor ci/travis) (#23866 ) Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories. Details: - Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc. - Minor adjustments to some scripts (variable renames) - Removes the outdated (unused) asan tests	2022-04-13 18:11:30 +01:00
Sven Mika	a8494742a3	[RLlib] Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412 )	2022-04-12 07:50:09 +02:00
Sven Mika	c82f6c62c8	[RLlib] Make RolloutWorkers (optionally) recoverable after failure. (#23739 )	2022-04-08 15:33:28 +02:00
Kai Fricke	40a8183e05	[ci/release] Fix job-based file download (#23657 ) have to wrap download call in a lambda to be compatible with run_with_retry	2022-04-04 08:06:31 -07:00
xwjiang2010	6443f3da84	[air] Add horovod trainer (#23437 )	2022-03-29 18:12:32 -07:00
Kai Fricke	afd287eb93	[ci] linkcheck should soft fail (#23559 ) Linkcheck failures should not break the build.	2022-03-29 10:57:03 -07:00
Eric Liang	990b0ec934	Move linkcheck into a separate CI build Why are these changes needed? Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.	2022-03-29 01:08:53 -07:00
Matti Picus	77c4c1e48e	WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449 )	2022-03-29 00:56:42 -07:00
Yi Cheng	7de751dbab	[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518 ) This PR removes the enable_gcs_bootstrap flag in cpp.	2022-03-28 21:37:24 -07:00
Kai Fricke	940c028540	[ci] Clean up artifacts before/after jobs (#23463 ) We sometimes end up with stale wheel uploads from previous runs of a Buildkite agent. The result is that commit wheels are being overwritten by old build jobs - effectively breaking the wheel build logic. Example: This Agent: https://buildkite.com/organizations/ray-project/agents/4b955117-2f6c-4849-b703-3457daf69f89 - builds wheels (in post-wheels tests) for a35ebc945b - and then runs both the Ray CPP worker and the Train + Tune tests in 6746e9f - Usually these two tests shouldn't provide artifacts at all, but they do - these are the wheels from a35ebc945b though! Meaning these are uncleaned leftovers from the first build task. - See here for proof of artifact upload: https://buildkite.com/ray-project/ray-builders-pr/builds/27622#d11bc514-ebd8-4e0c-a2ce-826b9bad27de The solution is thus to always clean up the artifacts directory in the worker, i.e. `rm -rf /artifact-mount/*`. This PR adds two such cleanup instructions - once before commands are run and once after artifacts are uploaded. We can probably just do either, but it doesn't hurt to have both.	2022-03-25 13:07:20 +00:00
Dmitri Gekhtman	bc98afcdf8	Test of KubeRay autoscaler integration (#23365 ) This PR adds a test of KubeRay autoscaler integration to the Ray CI. - Tests scaling with autoscaler.sdk.request_resources - Tests autoscaler response to RayCluster CR change	2022-03-23 18:18:48 -07:00
shrekris-anyscale	b00977b1b1	[serve] Remove dashboard's dependency on Serve (#23389 )	2022-03-21 22:14:41 -07:00
Jialing He	4a83bc3dc2	[runtime env] Support set timeout for runtime env setup (#23082 ) Interface example: ```python @ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10))) def f(): pass @ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}}) def f(): pass ``` Supports setting a timeout, in seconds, for runtime environment creation. Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>	2022-03-18 12:52:59 -05:00
Kai Fricke	da140a80e9	[ci/release] Legacy field should be optional (#23326 ) #22749 broke release unit tests by not providing a legacy key - that key should be optional because we will be dealing with non-legacy tests soon. Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...	2022-03-18 11:34:05 +00:00
Amog Kamsetty	bb4ff42eec	[ml] `TorchTrainer` bug fixes + GPU test (#23293 ) Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-03-17 23:49:42 -07:00
Simon Mo	78d6ed7029	[Serve] [CI] Split Serve tests into multiple shards (#23145 )	2022-03-15 16:32:30 -07:00
mwtian	6eb805b357	[CI] remove GCS-Ray CI tests (#23149 ) * remove redis ci tests * remove mac	2022-03-14 18:18:59 -07:00
matthewdeng	6b0169b23d	[ml] enable CI tests (#22926 ) Follow-up to #22748, enabling tests in CI. Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly. Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.	2022-03-09 14:31:53 +00:00
mwtian	f67ff312a8	run mac c++ tests with static linking (#22829 ) There are problems with running C++ tests in MacOS 10.15 Catalina, when upgrading to the newest grpc due to dynamic linking: #22384 (comment). The problem does not exist for Python tests in Catalina, or in C++ tests of other systems. Upgrading MacOS CI from Catalina is also blocked in the short term: ray-project/buildkite-ci-stack#24 (comment) So working around the issue by using static linking for C++ tests on Mac.	2022-03-05 10:39:32 +09:00
Kai Fricke	a9bf5e9e2f	[ci] Update GPU docker image to Ubuntu 20.04 (#22759 ) This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.	2022-03-02 10:28:26 +01:00
Sven Mika	e50bd212a1	[RLlib] Disable flakey Pendulum-v1 tests (until further investigation). (#22686 )	2022-03-01 16:44:17 +01:00
Sven Mika	7b687e6cd8	[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544 )	2022-02-25 21:58:16 +01:00
Simon Mo	3d3218d153	[CI] Add K8s Builder Step (#22035 )	2022-02-24 13:11:38 -08:00
Yi Cheng	e3051ebf67	[ci] Fix grpcio 1.44 break test_output (#22494 ) This PR limits grpc to be <= 1.42. This will fix test_output.	2022-02-22 13:59:25 -08:00
Sven Mika	6522935291	[RLlib] Slate-Q tf implementation and tests/benchmarks. (#22389 )	2022-02-22 09:36:44 +01:00
Simon Mo	3e7511e84f	[CI] Disable privileged test (#22484 )	2022-02-17 15:34:02 -08:00
Kai Fricke	331b71ea8d	[ci/release] Refactor release test e2e into package (#22351 ) Adds a unit-tested and restructured ray_release package for running release tests. Relevant changes in behavior: By default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior). The main subpackages are: - Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster - Command runner: Runs commands, e.g. as client command or sdk command - File manager: Uploads/downloads files to/from session - Reporter: Reports results (e.g. to database) Much of the code base is unit tested, but there are probably some pieces missing. Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_ Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023	2022-02-16 17:35:02 +00:00
Eric Liang	92550500bc	Split workflow and dataset tests (#22415 )	2022-02-16 01:47:55 -08:00
matthewdeng	2c204a755b	[train] add minimal installation test suite (#22300 ) Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path. Example: If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.	2022-02-11 10:09:00 -08:00
SangBin Cho	20ab9188c6	[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170 ) This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details. The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files. The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. After this PR, we will add code to enable usage report "off by default".	2022-02-08 22:12:36 -08:00
Avnish Narayan	0d2ba41e41	[RLlib] [CI] Deflake longer running RLlib learning tests for off policy algorithms. Fix seeding issue in TransformedAction Environments (#21685 )	2022-02-04 14:59:56 +01:00
Kai Fricke	b51b5afaea	[ci/gpu] Move ML dependency install to Dockerfile (#21711 ) Instead of installing dependencies in each Buildkite job, let's move this to the Dockerfile instead. This will update GPU tests to always use Python 3.7.	2022-02-01 12:04:55 +00:00

225 commits