hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Yi Cheng	33997da299	[core] Introduce a flag which allows a longer timeout for raylet when GCS restarts. (#26919 ) ## Why are these changes needed? When GCS restarts, sometimes, raylet needs a while to reconnect to the GCS, for example, in k8s env, it needs a while to move GSC to the service. This PR try to fix this by allowing a longer timeout for the first ping when GCS restarts. Once GCS get the first ping, it'll just use the regular timeout instead.	2022-07-25 16:57:19 -07:00
mwtian	6acd0a4c9b	Allow grpcio >= 1.48 (#26765 ) The previously observed Python grpc warning / logspam seems to have been fixed for grpcio >= 1.48. And users would like to upgrade beyond grpcio 1.43 for better M1 support. However, grpcio 1.48 has not been released yet, so there is still a risk this change needs to be reverted if any problem is discovered later with Ray nightly + grpcio 1.48.	2022-07-21 10:03:41 -07:00
Jiajun Yao	1b2b526a2b	Fix windows buildkite (#26615 ) - Stop using dot command to run ci.sh script: it doesn't fail the build if the command fails for windows and is generally dangerous since it will make unexpected changes to the current shell. - Fix uncovered windows build issues.	2022-07-18 09:15:49 -07:00
Tao Wang	6ddbdaa81a	[CI]Split C++, Java tests in MacOS from the big one (#26434 )	2022-07-14 18:33:47 -07:00
Antoni Baum	9b2cd29511	[CI] Install Horovod in doc tests to fix notebook (#26476 ) Fixes the Horovod notebook example as found in #26410 by installing Horovod in doc tests jobs. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>	2022-07-13 16:27:20 +01:00
xwjiang2010	03671c961e	[CI] run air related doc/example tests as part of pre-submit CI. (#26466 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2022-07-12 18:30:37 +01:00
Dmitri Gekhtman	8f8f036957	[autoscaler][kuberay] Deflake KubeRay autoscaling test (#26411 ) Improves stability of KubeRay autoscaling test.	2022-07-12 00:56:36 -07:00
Jiao	d95dc2f2e5	[AIR][GPU Batch Prediction] Add basic support for GPU batch prediction (#26251 ) This PR adds GPU support for pytorch and tensorflow predictor, as well as automatic setting `use_gpu` flag in `BatchPredictor`. Notable changes: - Added `use_gpu` flag in the constructor of `TorchPredictor` and `TensorflowPredictor` (note it's slightly different from our latest design doc that puts this flag at `predict()` call) - Added `use_gpu` flag to `SklearnPredictor` so its interface is compatible with `BatchPredictor` - Code to move both model weights and input tensor to default visible GPU at index 0 if flag is set - parametrized existing predictor tests to use GPU for both CPU & GPU coverage - Changed BUILD CI tests with an added `gpu` tag (I'm not 100% sure if that's a right way tho) Follow ups: https://github.com/ray-project/ray/issues/26249 is created in case our host has multiple GPU devices. It's a bit out of scope for this PR, but for GPU batch inference ideally we should be able to evenly use all GPU devices on host where CPU & DRAM are busy with pre-fetching + data movement to GPU. We might approximately do the same by scheduling same # of Predictor instances on the host, but that's worth verifying once benchmarks are set.	2022-07-11 13:04:15 -07:00
Amog Kamsetty	b01e11d721	[Docker] Add support for Cuda 11.3 (#26233 ) Start building Ray docker images with cuda 11.3	2022-07-10 21:50:42 -07:00
Yi Cheng	818bb78542	[ci] Stop syncer staging tests (#26273 ) The tests has been running for 1-2 months, and the overall observation is that it's not very useful to catch the actual regression. Basically, we didn't notice any regression. Stop this test for now to save some resources.	2022-07-03 11:17:10 -07:00
Sven Mika	96693055bd	[RLlib] More Trainer -> Algorithm renaming cleanups. (#25869 )	2022-06-20 15:54:00 +02:00
clarng	2b270fd9cb	apply isort uniformly for a subset of directories (#25824 ) Simplify isort filters and move it into isort cfg file. With this change, isort will not longer apply to diffs other than to files that are in whitelisted directory (isort only supports blacklist so we implement that instead) This is much simpler than building our own whitelist logic since our formatter runs multiple codepaths depending on whether it is formatting a single file / PR / entire repo in CI.	2022-06-17 13:40:32 -07:00
Jiao	f6735f90c7	[Ray DAG] Move `dag` project folder out of `experimental` (#25532 )	2022-06-16 19:15:39 -07:00
Simon Mo	503e197f8c	[CI] Upload macOS bazel test files (#25744 )	2022-06-15 10:09:04 -07:00
Antoni Baum	5e9a8eb5f6	[AIR/data] Move preprocessors to `ray.data` (#25599 ) Moves ray.air.Preprocessor and ray.air.preprocessors to ray.data to converge on the agreed upon package structure discussed internally.	2022-06-13 12:57:59 -07:00
Amog Kamsetty	1316a2d05e	[AIR/Train] Move `ray.air.train` to `ray.train` (#25570 )	2022-06-08 21:34:18 -07:00
Amog Kamsetty	80ae651f25	[Train] Clean up `ray.train` package (#25566 )	2022-06-08 10:22:36 -07:00
Yi Cheng	acf210fcac	[flakey] Skip ray_syncer_test for ubsan. (#25477 ) From the message: ``` [ OK ] SyncerTest.TestMToN (13132 ms) [----------] 5 tests from SyncerTest (43175 ms total) [----------] Global test environment tear-down [==========] 8 tests from 2 test suites ran. (43176 ms total) [ PASSED ] 8 tests. external/com_github_grpc_grpc/src/core/lib/iomgr/ev_posix.cc:314:19: runtime error: member access within null pointer of type 'const struct grpc_event_engine_vtable' ``` This can only be reproduced by running with Bazel test so far. With gdb, it won't be reproduced. It seems like some issue with the grpc maybe the reactor API. Given that the ASAN test, which is supposed to catch the issue, runs well, and a considerable time has been spent investigating this one but no progress, skip this test for now.	2022-06-04 23:06:57 -07:00
Sven Mika	b5bc2b93c3	[RLlib] Move all remaining algos into `algorithms` directory. (#25366 )	2022-06-04 07:35:24 +02:00
Kai Fricke	4b9a89ad90	[air] Move python/ray/ml to python/ray/air (#25449 ) The package "ml" should be renamed to "air". Main question: Keep a `ml.py` with `from ray.air import *` for some level of backwards compatibility? I'd go for no to force people to use the new structure.	2022-06-03 21:53:44 +01:00
Yi Cheng	fd0f967d2e	Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to `algorithms` dir and rename policy and trainer classes. (#25346 )" (#25420 ) This reverts commit `e4ceae19ef`. Reverts #25346 linux://python/ray/tests:test_client_library_integration never fail before this PR. In the CI of the reverted PR, it also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128). So high likely it's because of this PR. And test output failure seems related as well (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b)	2022-06-02 20:38:44 -07:00
Sven Mika	e4ceae19ef	[RLlib] Move (A/DD)?PPO and IMPALA algos to `algorithms` dir and rename policy and trainer classes. (#25346 )	2022-06-02 16:47:05 +02:00
Yi Cheng	cb1f08a3c1	[core] Basic end-2-end multi-node tests for GCS HA in CI. (#25114 ) In this PR we simulate the case where serve can continue to function even when GCS is down and the reconfig continue to work once GCS is back. To make it close to the real-world case, the docker is used for isolation: It starts a head node (0 cpus) and a worker node It tried the basic function and make sure it's working It kills GCS and make sure everything is working. It starts GCS and make sure reconfig continues to work. This is the basic cases for serve HA. We'll add more once we get better integrations.	2022-06-02 02:41:38 +00:00
Amog Kamsetty	983d8b3db2	[AIR] Fix failing CI on master (#25201 ) The AIR CI build has been failing on master since #25022. #25022 moved the tests that require credentials, but we left the bazel command in the build pipeline still. So even though all the tests are passing, the buildkite stage itself was failing since it tries run tests that require credentials, but these tests no longer exist in the directory. This is only a problem for master build since we don't run this command for PR builds.	2022-05-26 11:34:57 +02:00
Antoni Baum	2b6c6301e2	[CI] Fix typo in CI label (#25185 )	2022-05-25 17:31:29 +02:00
Kai Fricke	d57ba750f5	[docs/air] Move upload example to docs (#25022 )	2022-05-21 12:16:33 -07:00
Kai Fricke	e76efffec6	[air/docs] Move RL examples to docs (#24962 ) Following #24959, this PR moves the RL examples (online/offline/serving) into the Ray ML docs. It also splits the online and offline parts.	2022-05-20 14:55:01 +01:00
Yi Cheng	8ec558dcb9	[core] Reenable GCS test with redis as backend. (#23506 ) Since ray supports Redis as a storage backend, we should ensure the code path with Redis as storage is still being covered e2e. The tests don't run for a while after we switch to memory mode by default. This PR tries to fix this and make it run with every commit. In the future, if we support more and more storage backends, this should be revised to be more efficient and selective. But now I think the cost should be ok. This PR is part of GCS HA testing-related work.	2022-05-19 21:46:55 -07:00
mwtian	502c3e132d	Revert "[Core] allow using grpcio > 1.44.0 (#23722 )" (#24935 ) This reverts commit `b02029b29f`.	2022-05-18 18:16:39 -07:00
SangBin Cho	fb60d68bbb	[WIP] Run minimal tests against all supported python version (#24830 ) Run minimal CI tests to all Python versions.	2022-05-18 09:42:26 -07:00
Antoni Baum	1d5e6d908d	[AIR] HuggingFace Text Classification example (#24402 )	2022-05-18 09:35:12 -07:00
Chen Shen	1325cf7876	[python3.10] Build py310 images (#24859 ) Build python 3.10 images so we can run release tests.	2022-05-18 08:48:20 -07:00
Antoni Baum	c74886a55e	[CI] Run doc notebooks in CI (#24816 ) Currently, we are not running doc notebooks in CI due to a bazel misconfiguration - we are using `glob` in a top level package in order to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by: * Running the `py_test_run_all_notebooks` macro inside the relevant subpackages * Editing the `test_myst_doc.py` script to allow for recursive search for the target file, allowing to deal with mismatches between `name` and `data` arguments in `py_test_run_all_notebooks` * Setting the `allow_empty=False` flag inside `glob` calls in our macros to ensure that this oversight is caught early * Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files This PR also adds a GPU runner for doc tests, allowing one of our examples to pass - and setting the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.	2022-05-17 09:50:42 +01:00
Edward Oakes	f99aa5cb40	[serve][docs] Unify `doc_code` directories and add bazel target (#24736 ) Split off from https://github.com/ray-project/ray/pull/24693/, unifying the redundant directories we had and making sure all `serve/doc_code` snippets are run in CI.	2022-05-16 09:49:42 -05:00
Yi Cheng	68384ec745	[ci] Add flag for staging tests and disable the unstable one. (#24745 ) This PR tries to add a prefix for the staging ci test. This is useful to separate staging tests from stable tests in https://flakey-tests.ray.io/	2022-05-13 13:48:14 -07:00
Kai Fricke	b0fa9d6766	[air] Example for Comet ML (#24603 ) After #24459, this PR will add similar support for model artifact saving and an example for experiment tracking with Ray AIR for Comet ML.	2022-05-12 12:12:30 +01:00
Yi Cheng	a7d552ca25	[ci] Fix syncer staging tests error (#24681 ) The staging tests failed due to using the wrong file. This PR fixed it. https://buildkite.com/ray-project/ray-builders-branch/builds/7458#d6c28480-4c99-4a69-908c-9b0b5af9ce1f	2022-05-10 23:23:50 -07:00
Simon Mo	791ce22feb	[CI] Add conditional build to macOS pipeline (#24671 )	2022-05-10 16:49:03 -07:00
Yi Cheng	6c60dbb242	[scheduler][6] Integrate ray with syncer. (#23660 ) The new syncer comes with the feature of long-polling and versioning. This PR integrates it with ray.	2022-05-10 13:12:22 -07:00
Kai Yang	4a999777fa	[Core] Allow accepting gRPC HTTP proxy via env variable (#23526 )	2022-05-10 11:30:46 +08:00
Kai Fricke	5d9bf4234a	[air] Example to track runs with Weights & Biases (#24459 ) This PR - adds an example on how to run Ray Train and log results to weights & biases - adds functionality to the W&B plugin to store checkpoints - fixes a bug introduced in #24017 - Adds a CI utility script to setup credentials - Adds a CI utility script to remove test state from external services cc @simon-mo	2022-05-06 15:52:37 +01:00
mwtian	b02029b29f	[Core] allow using grpcio > 1.44.0 (#23722 )	2022-05-04 19:06:11 -07:00
Kai Fricke	c01681cf34	[ci] Exclude flaky test headers from pytest summaries (#24365 ) Failing pytest summaries for flaky tests that eventually succeed are not always cleaned up properly: https://buildkite.com/ray-project/ray-builders-branch/builds/7292#_ This PR ensures we only print summaries when we have at least one summary file (and not just the header file).	2022-05-01 12:51:22 +01:00
Kai Fricke	6282090401	[ci] Fix GPU docker builds (#24336 ) NVIDIA Docker builds are currently broken, e.g.: https://buildkite.com/ray-project/ray-builders-branch/builds/7239#e9dea1d6-7dea-4323-801c-b7efe917be03 Following this workaround: https://forums.developer.nvidia.com/t/invalid-public-key-for-cuda-apt-repository/212901/11 to hopefully fix this for now.	2022-04-29 17:10:18 +01:00
Dmitri Gekhtman	d68c1ecaf9	[kuberay] Test Ray client and update autoscaler image (#24195 ) This PR adds KubeRay e2e testing for Ray client and updates the suggested autoscaler image to one running the merge commit of PR #23883 .	2022-04-27 18:02:12 -07:00
Kai Fricke	fc1cd89020	[ci] Add short failing test summary for pytests (#24104 ) It is sometimes hard to find all failing tests in buildkite output logs - even filtering for "FAILED" is cumbersome as the output can be overloaded. This PR adds a small utility to add a short summary log in a separate output section at the end of the buildkite job. The only shared directory between the Buildkite host machine and the test docker container is `/tmp/artifacts:/artifact-mount`. Thus, we write the summary file to this directory, and delete it before actually uploading it as an artifact in the `post-commands` hook.	2022-04-26 22:18:07 +01:00
Amog Kamsetty	ae9c68e75f	[Train] Fully deprecate Ray SGD v1 (#24038 ) Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported. Closes #16435	2022-04-25 16:12:57 -07:00
Kai Fricke	b86d420a3c	[ci] Only upload wheels to S3 once (#24072 ) Currently all jobs that build wheels put them into the artifacts directory and upload them. This leads to the wheels being overwritten on S3 multiple times. This is not a huge problem as ingress is free, but in order to have a single point of reference, it might be beneficial to limit the wheels uploading to a single Buildkite job. Recently, this has led to interference with stale artifact directories. The downside here is that if the "Wheels & Jars" build fails randomly, the wheels will not be available on S3 - previously they've been also uploaded by several other jobs.	2022-04-25 21:19:11 +01:00
Dmitri Gekhtman	8c5fe44542	[KubeRay] Fix autoscaling with GPUs and custom resources, with e2e tests (#23883 ) - Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus"). - Adds end-to-end test logic confirming the fix. - Adds end-to-end test logic confirming autoscaling with custom resources works. - Slightly refines developer instructions. - Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.	2022-04-21 14:54:37 -07:00
Yi Cheng	04611edf5a	[scheduler] Update syncer API and add reconnect feature. (#23929 ) This PR focuses on updating syncer-related code and comments from this #23660 to reduce the code size. Update Snapshot/Update -> CreateSyncMessage/ConsumeSyncMessage Make ray syncer test work even when we add more components in the protobuf Make ray syncer able to reconnect to a new node.	2022-04-20 14:31:24 -07:00

1 2 3 4 5 ...

257 commits