
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
mwtian	24da654d90	[Test] Shard "Small & Large" tests (#21351 )	2022-01-05 10:49:14 -08:00
mwtian	70db5c5592	[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232 ) After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster. Co-authored-by: Yi Cheng <chengyidna@gmail.com> Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>	2022-01-04 23:06:44 -08:00
Gagandeep Singh	92bf609a08	Unskip tests in ``test_basic_3.py`` (#20433 )	2021-12-22 00:09:32 -08:00
Yi Cheng	09421a4ca6	[2/gcs] Bootstrap dashboard for gcs ha (#21179 ) This is part of the GCS HA project. This PR tries to bootstrap the dashboard with the GCS address instead of Redis. Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2021-12-21 16:58:03 -08:00
Yi Cheng	f62faca04c	[1/gcs] gcs ha bootstrap for raylet (#21174 ) This is part of #21129. This PR tries to cover the cpp/ray part of the bootstrap. Some updates there: * remove the unused function/tests * some API updates Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2021-12-21 08:50:42 -08:00
Eric Liang	6f93ea437e	Remove the flaky test tag (#21006 )	2021-12-11 01:03:17 -08:00
mwtian	b9bcd6215a	Disable two tests that are very flaky in GCS HA build (#21012 ) `//python/ray/tests:test_client_reconnect` seems to flake only under the GCS HA build. The client server starts to shut down under injected failures, unlike the behavior without GCS KV or pubsub. `//python/ray/tests:test_multi_node_3` seems to flake more often under the GCS HA build, although it is still flaky without the GCS HA feature flags. It seems raylet termination did not notify other processes properly. Disable these two tests until they are fixed.	2021-12-10 17:08:25 -08:00
mwtian	6871a72a5c	[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860 ) Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains a more complete change under src/.	2021-12-10 14:36:57 -08:00
Kai Fricke	97ec2a03b6	[ci/buildkite] Add ml pipeline to speed up ML/RLLib tests (#20895 ) ML tests will be built in a separate bootstrap step installing all required dependencies.	2021-12-09 21:14:10 +00:00
Yi Cheng	442b1025cd	[1/gcs-mem-kv] Memory mode for internal kv (#20881 ) This is part of the Redis removal work. This PR introduces a new mode for internal kv: memory mode. There are two ways to address this: 1) update the store client and use the store client in internal kv; 2) add a memory table to internal kv directly. The former is actually the better choice, since it puts everything related to storage into a lower level. But it is hard to do now: internal kv uses hset/hget and the Redis store client uses set/get, so the data would not be compatible and it would be a breaking change. So the easier way is 2), which is what this PR does. Next: use the flag for the store client.	2021-12-08 10:40:35 -08:00
matthewdeng	0de105d42f	[train] update Trainer._is_tune_enabled to work when Tune is not installed (#20767 )	2021-11-29 20:08:51 -08:00
Simon Mo	ca90c63483	[Serve] Add serve failure test to CI (#20392 )	2021-11-16 08:12:08 -08:00
mwtian	a39fd74674	disable //python/ray/tests:test_autoscaler_drain_node_api in HA GCS build (#20296 )	2021-11-12 15:47:42 -08:00
xwjiang2010	ce8504b0b2	[CI] Rebalance Tune tests a bit. (#20263 )	2021-11-12 15:30:18 +00:00
chenk008	74fa267c72	Enable worker in container CI test (#20174 )	2021-11-11 16:11:06 -08:00
mwtian	0330852baf	[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111 ) ## Why are these changes needed? This change adds a Python publisher and subscriber in `gcs_utils.py`, and a gRPC handler on GCS for publishing via GCS. Error info is migrated to use the GCS-based pubsub if the feature flag `RAY_gcs_grpc_based_pubsub=true` is set. Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set after the Redis-less bootstrapping work. Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published. GCS-based pubsub has gaps in handling deadlines, cancelled requests and GCS restarts, so 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.	2021-11-11 14:59:57 -08:00
xwjiang2010	883fbd003c	[CI; Tune] Split Tune tests and examples (#20210 ) * Split Tune tests and examples part 1 into separate tests and examples. * fix typo. * fix typo. * Add docs.	2021-11-11 10:50:51 +01:00
Sven Mika	ebd56b57db	[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215 )	2021-11-10 22:20:06 +01:00
Sven Mika	50c30f89c6	[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016 )	2021-11-04 20:40:57 +01:00
Yi Cheng	65d3054a09	[build] fix the wrong flag for gcs ha test (#20052 ) ## Why are these changes needed? It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`.	2021-11-04 09:59:11 -07:00
Sven Mika	4cb23d1c95	[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031 )	2021-11-04 10:58:34 +01:00
mwtian	f83195a1e1	[Build] Add GCS HA builds (#20008 ) ## Why are these changes needed? Add builds for Python tests with GCS pubsub enabled.	2021-11-03 11:58:16 -07:00
Avnish Narayan	026bf01071	[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535 ) * Fix QMix, SAC, and MADDPG too. * Unpin gym and deprecate pendulum v0. Many tests in RLlib depended on pendulum v0; however, in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so all of the pendulum v1 benchmarks may have to be rerun, or another environment used instead. The same applies to frozen lake v0 and frozen lake v1. Lastly, all of the RLlib tests have been moved to Python 3.7. * Add gym installation based on Python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20. * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * Migrate to new ale install method * Make parametric_actions_cartpole return float32 actions/obs * Add type conversions if obs/actions don't match space * Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-03 16:24:00 +01:00
Sven Mika	e6ae08f416	[RLlib] Optionally don't drop last ts in v-trace calculations (APPO and IMPALA). (#19601 )	2021-11-03 10:01:34 +01:00
Sven Mika	2d24ef0d32	[RLlib] Add all simple learning tests as `framework=tf2`. (#19273 ) * Unpin gym and deprecate pendulum v0. Many tests in RLlib depended on pendulum v0; however, in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so all of the pendulum v1 benchmarks may have to be rerun, or another environment used instead. The same applies to frozen lake v0 and frozen lake v1. Lastly, all of the RLlib tests and Tune tests have been moved to Python 3.7. * fix tune test_sampler::testSampleBoundsAx * fix re-install ray for py3.7 tests Co-authored-by: avnishn <avnishn@uw.edu>	2021-11-02 12:10:17 +01:00
mwtian	7afdfdc6dd	[CI] narrow down tests that run when files change (#19656 )	2021-10-29 16:47:54 -07:00
matthewdeng	bfb0ef1b08	move jsonschema to core dependencies and update default AutoscalerPrometheusMetrics (#19831 )	2021-10-28 13:04:22 -07:00
Amog Kamsetty	db863aafc0	Revert "Revert "[Docker] Support multiple CUDA Versions (#19505 )" (#19756 )" (#19763 ) This reverts commit `e58fcca404`.	2021-10-26 17:32:56 -07:00
Amog Kamsetty	e58fcca404	Revert "[Docker] Support multiple CUDA Versions (#19505 )" (#19756 ) This reverts commit `f0053d405b`.	2021-10-26 12:55:20 -07:00
Avnish Narayan	ad87ddf93e	[rllib] Add deterministic test to gpu (#19306 ) Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-10-26 10:11:39 -07:00
Amog Kamsetty	f0053d405b	[Docker] Support multiple CUDA Versions (#19505 ) * wip * wip * update * finish * deprecate * debug * fix and address comments * try catch * fix * split tests * force * merge * docs * wip * fix and check * update readme * fix * fix * fix sanity checking * format	2021-10-25 18:57:05 -07:00
Jiajun Yao	256bf0bf3a	[Release] Bump up dask to latest compatible version 2021.9.1 (#19592 ) * Bump up dask to latest compatible version 2021.9.1	2021-10-22 09:16:28 -07:00
Simon Mo	03805d4064	[Serve] Good error message when Serve not installed and ensure Serve installs ray[default] (#19570 )	2021-10-21 13:47:29 -07:00
architkulkarni	b8941338d3	[runtime env] Raise error when creating runtime env when ray[default] is not installed (#19491 )	2021-10-19 09:16:04 -05:00
matthewdeng	4674c78050	[Train] Rename Ray SGD v2 to Ray Train (#19436 )	2021-10-18 22:27:46 -07:00
Kai Fricke	d8d8901192	[ci/tune] Remove deprecated `jenkins_only` tag from test tags (#19287 )	2021-10-12 10:05:46 +01:00
SangBin Cho	0ef0d9a77d	Revert "[core] Assign tasks to the first available worker (#18167 )" (#19180 ) This reverts commit `545db13800`.	2021-10-07 10:38:37 -07:00
Stephanie Wang	545db13800	[core] Assign tasks to the first available worker (#18167 ) * Convert worker pool to queue * Start up to backlog size more workers * fixes * Prestart workers according to num available CPUs * lint * x * Update src/ray/raylet/worker_pool.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/raylet/worker_pool.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * dedicated workers * Fix tests * x * fix * asan * asan * Workers can only exec tasks with same job ID * size_t for runtime env hash, fix unit tests * include job ID in runtime env hash, remove from worker registration msg * x * conflict * debug * Schedule and dispatch periodically, skip if no new tasks * Update src/ray/common/task/task_spec.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/raylet/scheduling/cluster_task_manager.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/raylet/worker_pool.h Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2021-10-05 13:45:50 -07:00
Kai Fricke	3dc176c42e	[ci/tune] Add SGD and Tune GPU pipeline step to CI (#18469 ) * [ci/tune] Add Tune GPU pipeline step to CI * cont. * add sgd gpu tests * format yaml, fix imports * install horovod; fix line wrapping * set GPU per worker to 0.5 * fix import * move test to 4gpu machine * fix lint * lint * set visible devices * pull in tf gpu fix * Fix Tune GPU pipeline step * nit * Disable GPU tests until we have some * Re-add empty rllib tests Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>	2021-10-01 18:34:05 -07:00
architkulkarni	0f0b161ea1	Revert "Revert "[Serve] [doc] Improve runtime env doc"" (#18943 ) * Revert "Revert "[Serve] [doc] Improve runtime env doc (#18782)" (#18935)" This reverts commit `e4f4c79252`.	2021-09-30 13:28:44 -05:00
Yi Cheng	e4f4c79252	Revert "[Serve] [doc] Improve runtime env doc (#18782 )" (#18935 ) This reverts commit `d4d71985d5`.	2021-09-27 21:52:13 -07:00
architkulkarni	d4d71985d5	[Serve] [doc] Improve runtime env doc (#18782 )	2021-09-27 16:12:03 -05:00
Chen Shen	35aa944ef4	Fix thread-safety in global state accessor (#18746 )	2021-09-19 12:01:31 -07:00
mwtian	efdbfcfdfb	[Build] Generate Bazel config for compiling with clang and libc++ in CI (#18622 ) * Add Bazel config for building with llvm. Upgrade C++ std to 17. * Fix redis. Try fixing asan and tsan * Fix asan and format * Update comments. Co-authored-by: Chen Shen <scv119@gmail.com>	2021-09-17 19:01:07 -07:00
Sven Mika	8a72824c63	[RLlib Testing] Split and unflake more CI tests (make sure all jobs are < 30min). (#18591 )	2021-09-15 22:16:48 +02:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Simon Mo	497c5f56fa	[CI] Temporarily disable worker-in-container test (#18606 ) * revert again * disable tmp	2021-09-14 22:38:20 -07:00
mwtian	a3f399ef10	[Client] fix propagating errors to async calls during disconnect, and other cleanup (#18539 ) * cleanup tests and errors for clients * Fix lock and async get * rerun * Avoid running callback under lock. Make lock non-reentrant * Add all necessary apis * Removed unused APIs	2021-09-14 18:48:27 +03:00
Yi Cheng	7d1f408de9	[workflow] Move `experimental/workflow` to `workflow` (#18521 )	2021-09-13 17:45:18 -07:00
Chen Shen	5f57079041	use clang for C++ debug testing (#18343 )	2021-09-09 15:48:36 -07:00


163 commits