Commit graph

234 commits

Author SHA1 Message Date
Amog Kamsetty
983d8b3db2
[AIR] Fix failing CI on master (#25201)
The AIR CI build has been failing on master since #25022.

#25022 moved the tests that require credentials, but the bazel command for them was left in the build pipeline. So even though all the tests pass, the buildkite stage itself fails because it tries to run credential-requiring tests that no longer exist in the directory. This is only a problem for master builds, since we don't run this command for PR builds.
2022-05-26 11:34:57 +02:00
Antoni Baum
2b6c6301e2
[CI] Fix typo in CI label (#25185) 2022-05-25 17:31:29 +02:00
Kai Fricke
d57ba750f5
[docs/air] Move upload example to docs (#25022) 2022-05-21 12:16:33 -07:00
Kai Fricke
e76efffec6
[air/docs] Move RL examples to docs (#24962)
Following #24959, this PR moves the RL examples (online/offline/serving) into the Ray ML docs. It also splits the online and offline parts.
2022-05-20 14:55:01 +01:00
Yi Cheng
8ec558dcb9
[core] Reenable GCS test with redis as backend. (#23506)
Since ray supports Redis as a storage backend, we should ensure the code path with Redis as storage is still being covered e2e.

The tests haven't run for a while, ever since we switched to memory mode by default. This PR tries to fix this and make them run with every commit.

In the future, if we support more and more storage backends, this should be revised to be more efficient and selective. But now I think the cost should be ok.

This PR is part of GCS HA testing-related work.
2022-05-19 21:46:55 -07:00
mwtian
502c3e132d
Revert "[Core] allow using grpcio > 1.44.0 (#23722)" (#24935)
This reverts commit b02029b29f.
2022-05-18 18:16:39 -07:00
SangBin Cho
fb60d68bbb
[WIP] Run minimal tests against all supported python version (#24830)
Run minimal CI tests against all Python versions.
2022-05-18 09:42:26 -07:00
Antoni Baum
1d5e6d908d
[AIR] HuggingFace Text Classification example (#24402) 2022-05-18 09:35:12 -07:00
Chen Shen
1325cf7876
[python3.10] Build py310 images (#24859)
Build python 3.10 images so we can run release tests.
2022-05-18 08:48:20 -07:00
Antoni Baum
c74886a55e
[CI] Run doc notebooks in CI (#24816)
Currently, we are not running doc notebooks in CI due to a bazel misconfiguration - we are using `glob` in a top level package in order to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by:
* Running the `py_test_run_all_notebooks` macro inside the relevant subpackages
* Editing the `test_myst_doc.py` script to search recursively for the target file, so that it can handle mismatches between the `name` and `data` arguments in `py_test_run_all_notebooks`
* Setting the `allow_empty=False` flag inside `glob` calls in our macros to ensure that this oversight is caught early
* Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files

This PR also adds a GPU runner for doc tests, allowing one of our examples to pass - and setting the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.
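
The package-boundary behaviour described above can be illustrated with a toy model in plain Python. This is not Bazel code: `package_glob` is a hypothetical helper written only for this sketch, and it assumes that any nested directory containing a `BUILD` file counts as a subpackage that a glob from the parent never enters.

```python
import os
import tempfile
from fnmatch import fnmatch


def package_glob(package_dir, pattern, allow_empty=True):
    """Toy model of a package-scoped glob: it never descends into subpackages."""
    matches = []
    for root, dirs, files in os.walk(package_dir):
        if root != package_dir and "BUILD" in files:
            # A nested BUILD file marks a subpackage; skip it and everything below.
            dirs[:] = []
            continue
        matches += [
            os.path.relpath(os.path.join(root, name), package_dir)
            for name in files
            if fnmatch(name, pattern)
        ]
    if not matches and not allow_empty:
        # Failing loudly is what surfaces an accidentally empty test list.
        raise ValueError(f"glob for {pattern!r} matched nothing in {package_dir}")
    return sorted(matches)


with tempfile.TemporaryDirectory() as top:
    sub = os.path.join(top, "examples")
    os.makedirs(sub)
    for path in (
        os.path.join(top, "BUILD"),
        os.path.join(sub, "BUILD"),
        os.path.join(sub, "demo.ipynb"),
    ):
        open(path, "w").close()

    from_top = package_glob(top, "*.ipynb")  # the subpackage hides the notebook
    from_sub = package_glob(sub, "*.ipynb")  # globbing inside the subpackage finds it
```

Under these assumptions the top-level call silently returns an empty list, which is the failure mode the PR describes; passing `allow_empty=False` turns that into an error instead.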
2022-05-17 09:50:42 +01:00
Edward Oakes
f99aa5cb40
[serve][docs] Unify doc_code directories and add bazel target (#24736)
Split off from https://github.com/ray-project/ray/pull/24693/, unifying the redundant directories we had and making sure all `serve/doc_code` snippets are run in CI.
2022-05-16 09:49:42 -05:00
Yi Cheng
68384ec745
[ci] Add flag for staging tests and disable the unstable one. (#24745)
This PR tries to add a prefix for the staging ci test. This is useful to separate staging tests from stable tests in https://flakey-tests.ray.io/
2022-05-13 13:48:14 -07:00
Kai Fricke
b0fa9d6766
[air] Example for Comet ML (#24603)
After #24459, this PR will add similar support for model artifact saving and an example for experiment tracking with Ray AIR for Comet ML.
2022-05-12 12:12:30 +01:00
Yi Cheng
a7d552ca25
[ci] Fix syncer staging tests error (#24681)
The staging tests failed due to using the wrong file. This PR fixed it.

https://buildkite.com/ray-project/ray-builders-branch/builds/7458#d6c28480-4c99-4a69-908c-9b0b5af9ce1f
2022-05-10 23:23:50 -07:00
Simon Mo
791ce22feb
[CI] Add conditional build to macOS pipeline (#24671) 2022-05-10 16:49:03 -07:00
Yi Cheng
6c60dbb242
[scheduler][6] Integrate ray with syncer. (#23660)
The new syncer comes with the feature of long-polling and versioning. This PR integrates it with ray.
2022-05-10 13:12:22 -07:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
Kai Fricke
5d9bf4234a
[air] Example to track runs with Weights & Biases (#24459)
This PR
- adds an example on how to run Ray Train and log results to Weights & Biases
- adds functionality to the W&B plugin to store checkpoints
- fixes a bug introduced in #24017
- adds a CI utility script to set up credentials
- adds a CI utility script to remove test state from external services cc @simon-mo
2022-05-06 15:52:37 +01:00
mwtian
b02029b29f
[Core] allow using grpcio > 1.44.0 (#23722) 2022-05-04 19:06:11 -07:00
Kai Fricke
c01681cf34
[ci] Exclude flaky test headers from pytest summaries (#24365)
Failing pytest summaries for flaky tests that eventually succeed are not always cleaned up properly: https://buildkite.com/ray-project/ray-builders-branch/builds/7292#_
This PR ensures we only print summaries when we have at least one summary file (and not just the header file).
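
A minimal sketch of that check, assuming the summaries live as text files in one directory next to a header file. The file name `000_header.txt` and the helper `collect_summaries` are hypothetical and are not taken from the Ray CI scripts.

```python
import os

HEADER_NAME = "000_header.txt"  # hypothetical name for the header file


def collect_summaries(summary_dir):
    """Return the text to print, or None if there is nothing besides the header."""
    names = sorted(os.listdir(summary_dir))
    bodies = [name for name in names if name != HEADER_NAME]
    if not bodies:
        # Only the header (or nothing at all) is present: print nothing.
        return None
    parts = []
    for name in [HEADER_NAME] + bodies:
        path = os.path.join(summary_dir, name)
        if os.path.exists(path):
            with open(path) as f:
                parts.append(f.read())
    return "\n".join(parts)
```

A caller would print the returned text only when it is not `None`, so a leftover header from a flaky test that eventually passed produces no output section.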
2022-05-01 12:51:22 +01:00
Kai Fricke
6282090401
[ci] Fix GPU docker builds (#24336)
NVIDIA Docker builds are currently broken, e.g.: https://buildkite.com/ray-project/ray-builders-branch/builds/7239#e9dea1d6-7dea-4323-801c-b7efe917be03

Following this workaround: https://forums.developer.nvidia.com/t/invalid-public-key-for-cuda-apt-repository/212901/11
to hopefully fix this for now.
2022-04-29 17:10:18 +01:00
Dmitri Gekhtman
d68c1ecaf9
[kuberay] Test Ray client and update autoscaler image (#24195)
This PR adds KubeRay e2e testing for Ray client and updates the suggested autoscaler image to one running the merge commit of PR #23883.
2022-04-27 18:02:12 -07:00
Kai Fricke
fc1cd89020
[ci] Add short failing test summary for pytests (#24104)
It is sometimes hard to find all failing tests in buildkite output logs - even filtering for "FAILED" is cumbersome as the output can be overloaded. This PR adds a small utility to add a short summary log in a separate output section at the end of the buildkite job.

The only shared directory between the Buildkite host machine and the test docker container is `/tmp/artifacts:/artifact-mount`. Thus, we write the summary file to this directory, and delete it before actually uploading it as an artifact in the `post-commands` hook.
2022-04-26 22:18:07 +01:00
Amog Kamsetty
ae9c68e75f
[Train] Fully deprecate Ray SGD v1 (#24038)
Ray SGD v1 has been marked as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error is now raised on any attempt to import the ray.util.sgd package.

Closes #16435
2022-04-25 16:12:57 -07:00
Kai Fricke
b86d420a3c
[ci] Only upload wheels to S3 once (#24072)
Currently all jobs that build wheels put them into the artifacts directory and upload them. This leads to the wheels being overwritten on S3 multiple times. This is not a huge problem as ingress is free, but in order to have a single point of reference, it might be beneficial to limit the wheels uploading to a single Buildkite job. Recently, this has led to interference with stale artifact directories.

The downside here is that if the "Wheels & Jars" build fails randomly, the wheels will not be available on S3 - previously they've been also uploaded by several other jobs.
2022-04-25 21:19:11 +01:00
Dmitri Gekhtman
8c5fe44542
[KubeRay] Fix autoscaling with GPUs and custom resources, with e2e tests (#23883)
- Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus").
- Adds end-to-end test logic confirming the fix.
- Adds end-to-end test logic confirming autoscaling with custom resources works.
- Slightly refines developer instructions.
- Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.
2022-04-21 14:54:37 -07:00
Yi Cheng
04611edf5a
[scheduler] Update syncer API and add reconnect feature. (#23929)
This PR updates the syncer-related code and comments from #23660 to reduce the code size.

- Rename Snapshot/Update to CreateSyncMessage/ConsumeSyncMessage
- Make the ray syncer test work even when more components are added to the protobuf
- Make the ray syncer able to reconnect to a new node
2022-04-20 14:31:24 -07:00
Kai Fricke
65d9a410f7
[ci] Clean up ci/ directory (refactor ci/travis) (#23866)
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.

Details:

- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
2022-04-13 18:11:30 +01:00
Sven Mika
a8494742a3
[RLlib] Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412) 2022-04-12 07:50:09 +02:00
Sven Mika
c82f6c62c8
[RLlib] Make RolloutWorkers (optionally) recoverable after failure. (#23739) 2022-04-08 15:33:28 +02:00
Kai Fricke
40a8183e05
[ci/release] Fix job-based file download (#23657)
The download call has to be wrapped in a lambda to be compatible with run_with_retry.
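
Why the lambda matters can be shown with a small sketch. The `run_with_retry` below is a hypothetical stand-in written for this example, and its signature is an assumption; the only property it shares with the helper named above is that it expects a zero-argument callable. `download` is likewise a made-up flaky function.

```python
import time


def run_with_retry(fn, max_attempts=3, delay_s=0.0):
    """Hypothetical retry helper: calls fn() with no arguments until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)


attempts = []


def download(source, target):
    """Stand-in for a flaky download that fails on its first call."""
    attempts.append((source, target))
    if len(attempts) < 2:
        raise ConnectionError("transient failure")
    return target


# Writing run_with_retry(download("remote/file", "local/file")) would execute the
# download once, before the helper is even entered. The lambda defers the call so
# that the helper can repeat it.
result = run_with_retry(lambda: download("remote/file", "local/file"))
```

`functools.partial(download, "remote/file", "local/file")` would serve the same purpose as the lambda here.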
2022-04-04 08:06:31 -07:00
xwjiang2010
6443f3da84
[air] Add horovod trainer (#23437) 2022-03-29 18:12:32 -07:00
Kai Fricke
afd287eb93
[ci] linkcheck should soft fail (#23559)
Linkcheck failures should not break the build.
2022-03-29 10:57:03 -07:00
Eric Liang
990b0ec934
Move linkcheck into a separate CI build
Why are these changes needed?
Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.
2022-03-29 01:08:53 -07:00
Matti Picus
77c4c1e48e
WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449) 2022-03-29 00:56:42 -07:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR removes the enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Kai Fricke
940c028540
[ci] Clean up artifacts before/after jobs (#23463)
We sometimes end up with stale wheel uploads from previous runs of a Buildkite agent. The result is that commit wheels are being overwritten from old build jobs - effectively breaking the wheel build logic.

Example:

This Agent: https://buildkite.com/organizations/ray-project/agents/4b955117-2f6c-4849-b703-3457daf69f89

- builds wheels (in post-wheels tests) for a35ebc945b
- and then runs both the Ray CPP worker and the Train + Tune tests in 6746e9f
- Usually these two tests shouldn't provide artifacts at all, but they do - these are the wheels from a35ebc945b though! Meaning these are uncleaned leftovers from the first build task.
- See here for proof of artifact upload: https://buildkite.com/ray-project/ray-builders-pr/builds/27622#d11bc514-ebd8-4e0c-a2ce-826b9bad27de

The solution is thus to always clean up the artifacts directory in the worker, i.e. `rm -rf /artifact-mount/*`

This PR adds two such cleanup instructions: one before commands are run and one after artifacts are uploaded. Either one alone would probably suffice, but it doesn't hurt to have both.
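
For illustration only, the same cleanup can be sketched in Python. The helper name `clean_dir_contents` is hypothetical; the actual change is the shell command quoted above. The sketch empties a directory while keeping the directory itself, which matters when the directory is a mount point.

```python
import os
import shutil


def clean_dir_contents(directory):
    """Remove everything inside `directory` but keep the directory itself."""
    for entry in os.scandir(directory):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            # Regular files and symlinks are unlinked, not followed.
            os.remove(entry.path)
```

One difference from the shell form: a `*` glob skips dotfiles by default, whereas this helper removes them as well.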
2022-03-25 13:07:20 +00:00
Dmitri Gekhtman
bc98afcdf8
Test of KubeRay autoscaler integration (#23365)
This PR adds a test of KubeRay autoscaler integration to the Ray CI.

- Tests scaling with autoscaler.sdk.request_resources
- Tests autoscaler response to RayCluster CR change
2022-03-23 18:18:48 -07:00
shrekris-anyscale
b00977b1b1
[serve] Remove dashboard's dependency on Serve (#23389) 2022-03-21 22:14:41 -07:00
Jialing He
4a83bc3dc2
[runtime env] Support set timeout for runtime env setup (#23082)
Interface example:
```python
@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10)))
def f(): pass

@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})
def f(): pass
```

Supports setting a timeout, in seconds, for runtime environment creation.

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-03-18 12:52:59 -05:00
Kai Fricke
da140a80e9
[ci/release] Legacy field should be optional (#23326)
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will be dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
2022-03-18 11:34:05 +00:00
Amog Kamsetty
bb4ff42eec
[ml] TorchTrainer bug fixes + GPU test (#23293)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-03-17 23:49:42 -07:00
Simon Mo
78d6ed7029
[Serve] [CI] Split Serve tests into multiple shards (#23145) 2022-03-15 16:32:30 -07:00
mwtian
6eb805b357
[CI] remove GCS-Ray CI tests (#23149)
* remove redis ci tests

* remove mac
2022-03-14 18:18:59 -07:00
matthewdeng
6b0169b23d
[ml] enable CI tests (#22926)
Follow-up to #22748, enabling tests in CI.

Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.

Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
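
The triggering rule described under "Conditions" can be sketched as a mapping from changed files to condition flags. Only `RAY_CI_ML_AFFECTED` comes from the text above; the path prefixes, the `RAY_CI_DATA_AFFECTED` name, and the `affected_conditions` helper are assumptions made for this example and may not match the real CI script.

```python
def affected_conditions(changed_files):
    """Return the set of CI condition flags implied by a list of changed paths."""
    flags = set()
    for path in changed_files:
        if path.startswith("python/ray/data/"):  # assumed location of Ray Data
            # The ML package depends on Ray Data, so Data changes trigger ML too.
            flags.update({"RAY_CI_DATA_AFFECTED", "RAY_CI_ML_AFFECTED"})
        elif path.startswith("python/ray/ml/"):  # assumed location of the ML package
            flags.add("RAY_CI_ML_AFFECTED")
    return flags
```

With this rule a change that touches only the ML package sets a single flag, while a change to Ray Data sets both.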
2022-03-09 14:31:53 +00:00
mwtian
f67ff312a8
run mac c++ tests with static linking (#22829)
After upgrading to the newest grpc, running C++ tests on macOS 10.15 Catalina is problematic because of dynamic linking: #22384 (comment). The problem does not exist for Python tests on Catalina, or for C++ tests on other systems.

Upgrading MacOS CI from Catalina is also blocked in the short term: ray-project/buildkite-ci-stack#24 (comment)

So this PR works around the issue by using static linking for C++ tests on Mac.
2022-03-05 10:39:32 +09:00
Kai Fricke
a9bf5e9e2f
[ci] Update GPU docker image to Ubuntu 20.04 (#22759)
This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.
2022-03-02 10:28:26 +01:00
Sven Mika
e50bd212a1
[RLlib] Disable flakey Pendulum-v1 tests (until further investigation). (#22686) 2022-03-01 16:44:17 +01:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Simon Mo
3d3218d153
[CI] Add K8s Builder Step (#22035) 2022-02-24 13:11:38 -08:00