Commit graph

716 commits

Author SHA1 Message Date
Simon Mo
791ce22feb
[CI] Add conditional build to macOS pipeline (#24671) 2022-05-10 16:49:03 -07:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
Kai Fricke
5d9bf4234a
[air] Example to track runs with Weights & Biases (#24459)
This PR:
- adds an example of how to run Ray Train and log results to Weights & Biases
- adds functionality to the W&B plugin to store checkpoints
- fixes a bug introduced in #24017
- adds a CI utility script to set up credentials
- adds a CI utility script to remove test state from external services cc @simon-mo
2022-05-06 15:52:37 +01:00
Amog Kamsetty
60ded3ef79
[Docker] Start building ray-ml CPU Docker image again (#24266) 2022-04-28 15:29:23 -07:00
xwjiang2010
d9d9fbb044
[ci] try fixing ensure pip by down pinning cryptography. (#24238)
cryptography had a major release (7 hours ago): https://pypi.org/project/cryptography/#history. We suspect that it is breaking our Docker build step in CI.
2022-04-26 17:48:29 -07:00
Kai Fricke
fc1cd89020
[ci] Add short failing test summary for pytests (#24104)
It is sometimes hard to find all failing tests in buildkite output logs - even filtering for "FAILED" is cumbersome as the output can be overloaded. This PR adds a small utility to add a short summary log in a separate output section at the end of the buildkite job.

The only shared directory between the Buildkite host machine and the test docker container is `/tmp/artifacts:/artifact-mount`. Thus, we write the summary file to this directory, and delete it before actually uploading it as an artifact in the `post-commands` hook.
2022-04-26 22:18:07 +01:00
Amog Kamsetty
ae9c68e75f
[Train] Fully deprecate Ray SGD v1 (#24038)
Ray SGD v1 has been marked as a deprecated API for a while. This PR fully deprecates Ray SGD v1: importing the ray.util.sgd package now raises an error.

Closes #16435
2022-04-25 16:12:57 -07:00
Kai Fricke
b86d420a3c
[ci] Only upload wheels to S3 once (#24072)
Currently all jobs that build wheels put them into the artifacts directory and upload them. This leads to the wheels being overwritten on S3 multiple times. This is not a huge problem as ingress is free, but in order to have a single point of reference, it might be beneficial to limit the wheels uploading to a single Buildkite job. Recently, this has led to interference with stale artifact directories.

The downside here is that if the "Wheels & Jars" build fails randomly, the wheels will not be available on S3 - previously they've been also uploaded by several other jobs.
2022-04-25 21:19:11 +01:00
jon-chuang
e6a458a31e
[CI] Create zip of ray session_latest/logs dir on test failure and upload to buildkite via /artifact-mount (#23783)
Creates a zip of session_latest dir with test name and timestamp upon python test failure. Writes to dir specified by env var `RAY_TEST_FAILURE_LOGS_DIR`. Noop if env var does not exist.

Downstream consumer (e.g. CI) can upload all created artifacts in this dir. Thereby, PR submitters can more easily debug their CI failures, especially if they can't repro locally.

Limitations:
- a conftest.py file importing the main ray conftest.py needs to be present in same dir as test. This presents a challenge for e.g. dashboard tests which are highly scattered
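The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `archive_logs_on_failure` and the archive naming scheme are assumptions, and only the environment variable `RAY_TEST_FAILURE_LOGS_DIR` and the no-op behavior come from the description.

```python
import os
import shutil
import time
from typing import Optional


def archive_logs_on_failure(test_name: str, logs_dir: str) -> Optional[str]:
    """Zip logs_dir into $RAY_TEST_FAILURE_LOGS_DIR.

    Hypothetical helper: returns the archive path, or None when the
    environment variable is not set (the no-op case).
    """
    output_dir = os.environ.get("RAY_TEST_FAILURE_LOGS_DIR")
    if not output_dir:
        return None  # no-op if the env var does not exist

    os.makedirs(output_dir, exist_ok=True)
    # Include the test name and a timestamp so archives do not collide.
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = os.path.join(output_dir, f"{test_name}_{timestamp}")
    # make_archive appends ".zip" and returns the full archive path.
    return shutil.make_archive(base_name, "zip", root_dir=logs_dir)
```

A test hook could call such a helper with the path of the session's logs directory when a test fails, and the CI job would then upload everything found in the output directory.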
2022-04-22 09:48:53 +01:00
Guyang Song
0e6c042e29
[Bugfix] fix invalid excluding of Black (#24042)
- We should use `--force-exclude` when we pass code path explicitly https://black.readthedocs.io/en/stable/usage_and_configuration/the_basics.html?highlight=--force-exclude#command-line-options
- Restore the files in `python/ray/_private/thirdparty`, which were formatted by mistake in the PR https://github.com/ray-project/ray/pull/21975.
2022-04-21 10:21:35 +08:00
Amog Kamsetty
7a3ccb93ee
[CI] Separate out banned words check from formatting script (#23998)
The recursive grep in the banned words check can get really messy when running locally depending on each person's directory structure or where the format script is being called from.

Moves the banned words check into a separate script so that it's not called by default in ./format.sh. Also adds this to the documentation.
2022-04-19 13:30:37 -07:00
Jiajun Yao
6e2f9dfe53
[CI] Upload mac wheels to buildkite artifacts (#23930)
Upload mac wheels to buildkite artifacts and s3.
2022-04-17 13:10:14 -07:00
Kai Fricke
65d9a410f7
[ci] Clean up ci/ directory (refactor ci/travis) (#23866)
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.

Details:

- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
2022-04-13 18:11:30 +01:00
Amog Kamsetty
38696b155a
[Docker/CI] Add comment on keeping Docker images in sync (#23782) 2022-04-11 18:09:10 +01:00
Eric Liang
1ff874e8e8
[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817) 2022-04-10 16:12:53 -07:00
Kai Fricke
d27e73f851
[ci] Pin prometheus_client to fix current test outages (#23749)
What: Pins prometheus_client to < 0.14.0, hopefully fixing today's CI outages
Why: New version of the python client (https://github.com/prometheus/client_python/releases) breaks our CI
2022-04-06 14:22:22 -07:00
Kai Fricke
7cf89dd686
[ci] Non-verbose llvm download in Buildkite (#23731)
What: Use wget -nv in Buildkite environments
Why: The llvm download currently clutters the log output as it's not rendered correctly, thus we should silence it.

Result: Logs are finally readable again in Buildkite without download: https://buildkite.com/ray-project/ray-builders-pr/builds/28916#25e8965a-d18b-49a1-8e29-200365b13c53
2022-04-05 21:41:51 -07:00
Jiao
ff6515b5a3
Remove requests from blacklist of minimal install test (#20584)
While working on https://github.com/ray-project/ray/pull/20577 we noticed that the `requests` module is not blacklisted in the minimal install test, but we are not sure why. As a result, we missed coverage on P0 issues like https://github.com/ray-project/ray/issues/20574.

This is an attempt to see what would happen if we blacklist it and if we're able to get any signals from CI.

Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-04-04 16:15:58 +01:00
Yi Cheng
31483a003a
[syncer] skip ray_syncer_test on windows temporarily (#23610)
ray_syncer_test is flaky on Windows, and it's not easy to investigate what's happening there. The test times out somehow.
We disable it for a short time.
2022-03-30 17:29:08 -07:00
Eric Liang
990b0ec934
Move linkcheck into a separate CI build
Why are these changes needed?
Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.
2022-03-29 01:08:53 -07:00
Matti Picus
77c4c1e48e
WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449) 2022-03-29 00:56:42 -07:00
ddelange
e109c13b83
[ci] Clean up ray-ml requirements (#23325)
In https://github.com/ray-project/ray/blob/ray-1.11.0/docker/ray-ml/Dockerfile, the order of pip install commands currently matters (potentially a lot). It would be good to run one big pip install command to avoid ending up with a broken env.

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-03-25 15:59:54 +00:00
Kai Fricke
940c028540
[ci] Clean up artifacts before/after jobs (#23463)
We sometimes end up with stale wheel uploads from previous runs of a Buildkite agent. The result is that commit wheels are being overwritten from old build jobs - effectively breaking the wheel build logic.

Example:

This Agent: https://buildkite.com/organizations/ray-project/agents/4b955117-2f6c-4849-b703-3457daf69f89

- builds wheels (in post-wheels tests) for a35ebc945b
- and then runs both the Ray CPP worker and the Train + Tune tests in 6746e9f
- Usually these two tests shouldn't provide artifacts at all, but they do - these are the wheels from a35ebc945b though! Meaning these are uncleaned leftovers from the first build task.
- See here for proof of artifact upload: https://buildkite.com/ray-project/ray-builders-pr/builds/27622#d11bc514-ebd8-4e0c-a2ce-826b9bad27de

The solution is thus to always clean up the artifacts directory in the worker, i.e. `rm -rf /artifact-mount/*`

This PR adds two such cleanup instructions: one before commands are run and one after artifacts are uploaded. We could probably do just one of them, but it doesn't hurt to have both.
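For illustration only, the same cleanup can be expressed in Python. This is a sketch under assumptions: the helper name `clean_directory` is hypothetical, and unlike the shell glob `/artifact-mount/*` it also removes hidden entries.

```python
import os
import shutil


def clean_directory(path: str) -> int:
    """Remove everything inside path but keep path itself.

    Hypothetical helper; returns the number of top-level entries removed.
    """
    if not os.path.isdir(path):
        return 0  # nothing to clean

    removed = 0
    # Materialize the listing first so we do not delete while iterating.
    for entry in list(os.scandir(path)):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
        removed += 1
    return removed
```

Keeping the directory itself matters when it is a mount point shared between the host and the container, since only its contents should be discarded.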
2022-03-25 13:07:20 +00:00
Max Pumperla
60054995e6
[docs] fix doctests and activate CI (#23418) 2022-03-24 17:04:02 -07:00
Dmitri Gekhtman
9ce221f514
Disable KubeRay tests on windows. (#23453)
This PR disables KubeRay tests on windows, because they're not relevant there.
2022-03-24 08:11:17 -07:00
shrekris-anyscale
b00977b1b1
[serve] Remove dashboard's dependency on Serve (#23389) 2022-03-21 22:14:41 -07:00
Avnish Narayan
e008a48ef2
[release tests] Pin gym everywhere (#23349) 2022-03-19 02:52:54 -07:00
Philipp Moritz
886cc4d674
Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
shrekris-anyscale
56ddea85a1
[Serve] Fix typo language (#23213) 2022-03-16 10:14:44 -07:00
mwtian
6eb805b357
[CI] remove GCS-Ray CI tests (#23149)
* remove redis ci tests

* remove mac
2022-03-14 18:18:59 -07:00
Kai Yang
e9755d87a6
[Lint] One parameter/argument per line for C++ code (#22725)
It's really annoying to deal with parameter/argument conflicts. It is even more frustrating when we merge code from the community into Ant's internal code base and hit hundreds of conflicts caused by parameters/arguments.

In this PR, I updated the clang-format style to make parameters/arguments stay on different lines if they can't fit into a single line.

There are several benefits:

* Conflict resolving is easier.
* Fewer potential human mistakes when resolving conflicts.
* Git history and Git blame are more straightforward.
* Better readability.
* Align with the new Python format style.
2022-03-13 17:05:44 +08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
kyle-chen-uber
592656ca28
[horovod] remove deprecated slot concept, use worker instead (#22708)
Horovod updated the attributes of DistributedTrainableCreator and args to create Horovod RayExecutor.
horovod/horovod@a729ba7

The major issue is that Horovod deprecated the "slot" concept in favor of "worker", which is more consistent with the generic Ray worker. This issue currently blocks Uber DL trainers from using Ray Tune.

This commit updates the Horovod RayExecutor init args.

Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-03-10 08:16:42 +00:00
Kai Fricke
b267be4758
[ml] Add Ray ML / AIR checkpoint implementation (#22691)
This PR splits up the changes in #22393 and introduces an implementation of the ML Checkpoint interface used by Ray Tune.

This means, the TuneCheckpoint class implements the to/from_[bytes|dict|directory|object_ref|uri] conversion functions, as well as more high-level functions to transition between the different TuneCheckpoint classes. It also includes test cases for Tune's main conversion modes, i.e. dict - intermediate - dict and fs - intermediate - fs.

These changes will be the basis for refactoring the tune interface to use TuneCheckpoint objects instead of TrialCheckpoints (externally) and instead of paths/objects (internally).
2022-03-09 10:02:59 -08:00
matthewdeng
6b0169b23d
[ml] enable CI tests (#22926)
Follow-up to #22748, enabling tests in CI.

Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.

Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
2022-03-09 14:31:53 +00:00
Jiajun Yao
4801e57c77
[Test] Add missing tests to bazel BUILD (#22827) 2022-03-07 19:54:49 -08:00
Kai Fricke
84a163a2c4
[RLlib] Remove atari rom install script (#22797) 2022-03-03 16:55:56 +01:00
Simon Mo
0bab8dbfe0
[Serve] Add test for controller managing Java Replica (#22628) 2022-02-28 23:13:56 -08:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Simon Mo
3d3218d153
[CI] Add K8s Builder Step (#22035) 2022-02-24 13:11:38 -08:00
Siyuan (Ryans) Zhuang
8f4f3cb79b
Make shellcheck optional 2022-02-24 12:04:05 -08:00
Siyuan (Ryans) Zhuang
ec23050df6
Error if shellcheck is not installed (#22556) 2022-02-24 09:53:03 -08:00
Yi Cheng
e3051ebf67
[ci] Fix grpcio 1.44 break test_output (#22494)
This PR limits grpc to <= 1.42. This will fix test_output.
2022-02-22 13:59:25 -08:00
Jialing He
4c73560b31
[runtime env] Support clone virtualenv from an existing virtualenv (#22309)
Before this PR, we couldn't run Ray in a virtualenv, because `runtime_env` did not support creating a new virtualenv from an existing virtualenv.

More details: https://github.com/ray-project/ray/pull/21801#discussion_r796848499

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-02-15 12:51:01 -06:00
Gagandeep Singh
a8341dfc29
Replace queue.Queue with multiprocessing.JoinableQueue (#21860)
Reason for not using `queue.Queue` for multiprocessing purposes on Windows is at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300
And reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is https://stackoverflow.com/a/30725121

AFAIK, this is because on Windows each process gets its own `Queue` and hence nothing is shared among those processes. When `multiprocessing.Queue` is used, changes in it are shared via pipes internally along with proper locks.
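A minimal sketch of the pattern, not the code from this PR: `worker` and `run_squares` are hypothetical names. A `multiprocessing.JoinableQueue` carries tasks to a worker, and `join()` blocks until every task has been acknowledged with `task_done()`. The `worker_cls` parameter is an assumption added so the same function can also be exercised with `threading.Thread`.

```python
import multiprocessing


def worker(tasks, results):
    """Consume tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()  # acknowledge the sentinel as well
            break
        results.put(item * item)
        tasks.task_done()


def run_squares(values, worker_cls=multiprocessing.Process):
    """Square each value in a worker and return the results in order."""
    tasks = multiprocessing.JoinableQueue()
    results = multiprocessing.Queue()
    proc = worker_cls(target=worker, args=(tasks, results))
    proc.start()

    for value in values:
        tasks.put(value)
    tasks.put(None)  # sentinel: tells the worker to stop

    tasks.join()  # returns once every put() was matched by task_done()
    # Drain the results before joining the worker to avoid blocking on a full pipe.
    output = [results.get() for _ in values]
    proc.join()
    return output
```

When using the default `multiprocessing.Process` on platforms that start child processes with spawn (such as Windows), call `run_squares` from under an `if __name__ == "__main__":` guard so that the child can import the module safely.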
2022-02-15 09:01:17 -08:00
Balaji Veeramani
ee1711fe41
[CI] Remove YAPF from format.sh (#21986) 2022-02-07 16:05:27 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things much easier. Please see below for the reasoning.
2022-01-26 04:03:54 -08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note that doing this naively introduced circular dependencies, because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
Matti Picus
d3d1e8559c
enable passing metric tests on windows (#21755)
Resubmitting #21705 which was merged then reverted. It seems somehow sphinx building broke in the meantime, not clear how it is connected to this PR.

Here is the original description:
>Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.
2022-01-25 09:20:16 -08:00