Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
Why are these changes needed?
Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.
We sometimes end up with stale wheel uploads from previous runs of a Buildkite agent. The result is that commit wheels are being overwritten from old build jobs - effectively breaking the wheel build logic.
Example:
This Agent: https://buildkite.com/organizations/ray-project/agents/4b955117-2f6c-4849-b703-3457daf69f89
- builds wheels (in post-wheels tests) for a35ebc945b
- and then runs both the Ray CPP worker and the Train + Tune tests in 6746e9f
- Usually these two tests shouldn't provide artifacts at all, but they do - these are the wheels from a35ebc945b though! Meaning these are uncleaned leftovers from the first build task.
- See here for proof of artifact upload: https://buildkite.com/ray-project/ray-builders-pr/builds/27622#d11bc514-ebd8-4e0c-a2ce-826b9bad27de
The solution is thus to always clean up the artifacts directory in the worker, i.e. `rm -rf /artifact-mount/*`
This PR adds two of such clean up instructions - once before commands are run and once after artifacts are uploaded. We can probably just do either, but it doesn't hurt to have both.
Reason for not using `queue.Queue` for multiprocessing purposes on Windows is at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300
And reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is https://stackoverflow.com/a/30725121
AFAIK, this is because in Windows each process gets it own `Queue` and hence nothing is shared among those processes. When `multiprocessing.Queue` is used, changes in it are shared via pipes internally along with proper locks.
Resubmitting #21705 which was merged then reverted. It seems somehow sphinx building broke in the meantime, not clear how it is connected to this PR.
Here is the original description:
>Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic are removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.
Also, re-enable a few tests which restarts GCS in GCS bootstrapping mode, by using external Redis for KV storage.
After enabling tests of test_runtime_env_plugin and test_runtime_env_env_vars (PR #21252) and python/ray/serve:* tests (PR #21107), the analysis at flaky-tests.ray.io starting showing failing tests in the windows://python/ray/test/serv:test_standalone. PR #21352 reverted 21252 (runtime_env tests), but the problem was more likely in the serve tests. Specifically `test_standalone` has a test that uses Cluster, which should be skipped on windows because it is flaky. So this PR
- re-enables the runtime_env tests for windows
- skips the Cluster test in serve/tests/test_standalone.py
Uses a direct `pip install` instead of creating a conda env to make pip installs incremental to the cluster environment.
Separates the handling of `pip` and `conda` dependencies.
The new `pip` approach still works if only the base Ray is installed on the cluster and the user specifies libraries like "ray[serve]" in the `pip` field. The mechanism is as follows:
- We don't actually want to reinstall ray via pip, since this could lead to version mismatch issues. Instead, we want to use the Ray that's already installed in the cluster.
- So if "ray" was included by the user in the pip list, remove it
- If a library "ray[serve]" or "ray[tune, rllib]" was included in the pip list, remove it and replace it by its dependencies (e.g. "uvicorn", "requests", ..)
Co-authored-by: architkulkarni <arkulkar@gmail.com>
Co-authored-by: architkulkarni <architkulkarni@users.noreply.github.com>
Why are these changes needed?
Currently clang-tidy does not run inside scripts/format.sh. Also clang-tidy can produce false positive warnings. Maybe we can disable clang-tidy until ergonomic issues are resolved.