hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
mwtian	c0eeb36209	[Core][Pubsub] Support publishing / subscribing to Actor / Job / Node info via GCS (#19903 ) ## Why are these changes needed? This is the first step in migrating Redis pubsub to be GCS pubsub based. Changes include: - Remove `SubscribeAll()` API for Actor pubsub since it is only used in tests. Supporting both `Subscribe()` and `SubscribeAll()` APIs would be too complex without much return. - Update `Subscribe()` API to accept a done status callback. - Implement `SubscribeAll()` / `Unsubscribe()`(from channel) API in Ray pubsub. - Implement using Ray pubsub for Actor, Job, Node info and Node resource publishing / subscribing. GCS changes are tested with GCS server test in GCS pubsub mode. ## Related issue number	2021-11-02 22:47:05 -07:00
Eric Liang	9e448db731	[RFC] Add tsan build mode (#19971 )	2021-11-02 22:29:51 -07:00
Jiajun Yao	6acf276959	Listen to 127.0.0.1 if node ip is 127.0.0.1 (#19918 ) * Listen to 127.0.0.1 if node ip is 127.0.0.1 * Listen to 127.0.0.1 if node ip is 127.0.0.1 * Listen to 127.0.0.1 if node ip is 127.0.0.1	2021-11-03 12:17:55 +09:00
Lixin Wei	a369fc97cf	[scheduler] Remove isFeasible (#19931 )	2021-11-02 17:40:46 -07:00
mwtian	ef4b6e4648	[Core][GCS] remove gcs object manager (#19963 )	2021-11-02 16:20:53 -07:00
Edward Oakes	14d0889fbc	[serve] Rename BackendInfo -> DeploymentInfo (#19947 )	2021-11-02 17:09:15 -05:00
iasoon	7e6ea9e3df	[serve] split ReplicaStartupState.PENDING into PENDING_ALLOCATION and PENDING_INITIALIZATION (#19431 )	2021-11-02 17:08:52 -05:00
SangBin Cho	f1eedb15b6	Revert "[core] Fix wrong local resource view in raylet (#19911 )" (#19992 ) This reverts commit `a907168184`. ## Why are these changes needed? This PR seems to have some huge perf regression on `placement_group_test_2.py`. It took 128s before, and after this PR was merged, it took 315 seconds. ## Related issue number	2021-11-02 14:27:05 -07:00
Edward Oakes	f8a6cad0b7	[job submission] SDK prototype w/ dynamic working_dir uploads (#19843 )	2021-11-02 16:01:54 -05:00
Amog Kamsetty	f4b425f84c	[Release/Xgboost] Fix master install (#19991 )	2021-11-02 13:50:14 -07:00
Siyuan (Ryans) Zhuang	3c9f91bd1d	[Workflow] Group options into a single workflow step options dataclass (#19654 ) * group options into workflow step options * fix comments * cleanup virtual actor call options * fix default value * step_options.make() * rename	2021-11-02 12:25:30 -07:00
gjoliver	9385b6c1be	[RLlib] Make a few LRSchedule and EntropyCoeffSchedule tests more reliable. (#19934 )	2021-11-02 16:52:56 +01:00
Kai Fricke	f96078687f	[xgboost/release] Xgboost/connect gpu test (#19838 ) * [xgboost/release] Add GPU connect user test * Use scaling cluster * typo * Increase xgboost placement group timeout * Much higher timeout * Move os environment timeout * Move os environ * [dev] install xgboost-ray from master * GPU xgboost master * Remove master install after new xgboost release * Install latest * Add master test	2021-11-02 08:40:48 -07:00
SangBin Cho	857f23652f	Add more shuffle tests to CI (#17684 ) * IP * done * done	2021-11-02 08:07:59 -07:00
SangBin Cho	563eb0bca2	[Runtime env] Add a test to make sure resource deadlock message is not printed when waiting for workers (#19870 ) * ip * Add a runtime env resource deadlock msg test * Fix a bug * Skip on windows	2021-11-02 07:48:55 -07:00
Sven Mika	2d24ef0d32	[RLlib] Add all simple learning tests as `framework=tf2`. (#19273 ) * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0, however in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so will have to potentially rerun all of the pendulum v1 benchmarks, or use another environment in favor. The same applies to frozen lake v0 and frozen lake v1 Lastly, all of the RLlib tests and Tune tests have been moved to python 3.7 * fix tune test_sampler::testSampleBoundsAx * fix re-install ray for py3.7 tests Co-authored-by: avnishn <avnishn@uw.edu>	2021-11-02 12:10:17 +01:00
Will Drevo	97f04b118d	[RLlib; Docs] Added fixes to CartPole example. (#19908 ) * Added fixes to CartPole example * Apply suggestions from code review Co-authored-by: will <will@anyscale.com> Co-authored-by: Sven Mika <sven@anyscale.io>	2021-11-02 10:06:39 +01:00
Simon Mo	6040319d02	[CI] Pin aiohttp version to fix master branch (#19948 )	2021-11-01 23:00:08 -07:00
Qing Wang	da6894848d	Support Java namespace APIs (#19468 ) ## Why are these changes needed? ## Related issue number #16474	2021-11-02 11:05:40 +08:00
Kai Yang	a33466e905	[Core] Fail inflight tasks on actor restarting (#19354 ) ## Why are these changes needed? If an actor failover is triggered, but the RPC connection between the caller and the crashed actor instance is not disconnected automatically, subsequent tasks to the new actor instance may not be executed. The root cause is that the sequence numbers of tasks sent to the new actor instance do not start from 0. Details can be found in #14727. This PR fixes it by ensuring all inflight actor tasks fail immediately when actor failover is detected (via actor state notifications). ## Related issue number closes #14727	2021-11-02 11:03:12 +08:00
Yi Cheng	a907168184	[core] Fix wrong local resource view in raylet (#19911 ) ## Why are these changes needed? When GCS broadcasts a node resource change, the raylet also uses it to update the local node, which leaves the local node instance and nodes_ inconsistent. 1. the local node has used all of some pg resource 2. GCS broadcasts node resources 3. the local node now appears to have resources 4. the scheduler picks the local node 5. the local node can't schedule the task 6. since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs ## Related issue number #19438	2021-11-01 19:52:03 -07:00
xwjiang2010	c48d86e469	[CI] change git protocol to use https. (#19964 )	2021-11-01 19:38:58 -07:00
Amog Kamsetty	3a52187da8	[Release/Lightning] Add Ray lightning user test (#19812 ) * wip * wip * add ray lightning test * fix * update * merge and add * fix * fix * rename * autoscale * add tblib * gloo backend * typo * upgrade torch * latest and master	2021-11-01 18:29:48 -07:00
Amog Kamsetty	474e44f7e0	[Release/Horovod] Add user test for Horovod (#19661 ) * infra * wip * add test * typo * typo * update * rename * fix * full path * formatting * reorder * update * update * Update release/horovod_tests/workloads/horovod_user_test.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * bump num_workers * update installs * try * add pip_packages * min_workers * fix * bump pg timeout * Fix symlink * fix * fix * cmake * fix * pin filelock * final * update * fix * Update release/horovod_tests/workloads/horovod_user_test.py * fix * fix * separate compute template * test latest and master Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2021-11-01 18:28:07 -07:00
matthewdeng	e1e4a45b8d	[train] add simple Ray Train release tests (#19817 ) * [train] add simple Ray Train release tests * simplify tests * update * driver requirements * move to test * remove connect * fix * fix * fix torch * gpu * add assert * remove assert * use gloo backend * fix * finish Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-01 18:25:19 -07:00
Jiajun Yao	05c63f0208	[workflow] Mark workflow test_recovery as large test (#19950 ) ## Why are these changes needed? move test_recovery to large test ## Related issue number	2021-11-01 15:50:38 -07:00
Sven Mika	0b308719f8	[RLlib; Docs overhaul] Docstring cleanup: rllib/utils (#19829 )	2021-11-01 21:46:02 +01:00
Sven Mika	bab9c0f670	[RLlib; Docs overhaul] Redo: Docstring cleanup: Trainer, trainer_template, Callbacks."" (#19830 )	2021-11-01 21:45:11 +01:00
Alex Wu	80fb3f10ae	[ci] Script for building M1 wheels (#19925 ) This PR includes a script for building wheels for Macs with M1 processors. It roughly follows the pattern of the other scripts with a few differences. Manually installs nvm Uses miniforge conda to install python/pip instead of python foundation .pkgs Doesn't pin numpy (we probably shouldn't be pinning it in the other scripts either...) Commit detection falls back to git instead of erroring All of these changes were made so that the script works on a laptop, which comes with a subset of the dependencies that the x86 buildkite image comes with.	2021-11-01 11:44:59 -07:00
xwjiang2010	1803ca13b6	Adding release logs for 1.8.0. (#19867 )	2021-11-01 10:26:04 -07:00
Hao Zhang	a03c4363b5	[Collective] Allow send/recv partial tensors in Send/Recv primitives (#19921 )	2021-11-01 10:25:43 -07:00
Edward Oakes	ee57025be6	[serve] Rename BackendConfig -> DeploymentConfig (#19923 )	2021-11-01 10:24:02 -07:00
Sven Mika	ea2bea7e30	[RLlib; Docs overhaul] Docstring cleanup: Offline. (#19808 )	2021-11-01 10:59:53 +01:00
Tao Wang	7a2e9e00e8	[Tiny]Remove duplicated assignment (#19866 )	2021-11-01 11:44:01 +08:00
mwtian	cb8dc5c94e	Fix unused import warning in streaming.proto (#19912 ) ## Why are these changes needed? This generates a warning when calling `protoc` on the proto. ## Related issue number	2021-10-31 13:29:51 -07:00
architkulkarni	702bffe072	[runtime env] [test] Enable runtime env nightly test with working_dir reconnection (#19906 )	2021-10-31 10:48:48 -05:00
architkulkarni	de8a9b5151	[runtime env] Always print package pushing logs regardless of size (#19897 )	2021-10-31 10:47:37 -05:00
Edward Oakes	e507b7ba6e	[serve] Rename BackendVersion -> DeploymentVersion (#19798 )	2021-10-31 10:27:19 -05:00
Chen Shen	961742f8e7	[Core] deflake windows test failure (test_task_retry_mini_integration) #19916	2021-10-30 15:13:38 -07:00
xwjiang2010	4d293c4cee	Increase horovod_test disk space. (#19917 )	2021-10-30 14:41:31 -07:00
Sven Mika	4d945fe651	[RLlib] Issue 19878: Re-instate bare_metal_policy example script (#19881 )	2021-10-30 12:50:39 -07:00
Stephanie Wang	630a8cacb3	Revert "[core] Fail objects when pull/reconstruction hangs (#19789 )" (#19904 ) This reverts commit `e6d60d7376`.	2021-10-30 10:54:39 -07:00
Kim Pevey	3ff4fde0f5	[Doc] Update newsreader example (#19893 )	2021-10-29 22:25:40 -07:00
Kim Pevey	8aa61566fa	[Doc] Example docs minor wording fixes (#19890 )	2021-10-29 22:15:35 -07:00
Kim Pevey	96480d97d6	[DOC] Minor typos/fixes to Tips for First Timers (#19887 ) * fix typos * some more fixes Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2021-10-29 22:13:15 -07:00
mwtian	7afdfdc6dd	[CI] narrow down tests that run when files change (#19656 )	2021-10-29 16:47:54 -07:00
mwtian	d32facdef8	[Doc][Bazel] add comment to not use Bazel test result cache (#19842 ) To avoid confusion in the future, add a comment about why Ray is not using Bazel test result cache.	2021-10-29 16:46:22 -07:00
chenk008	57363995f3	[runtime env] Move container related code to runtime env (#19067 )	2021-10-29 16:31:11 -07:00
Jiao	bb0ebb7903	[job submission] Temporarily make pydantic imports conditional (#19827 )	2021-10-29 18:09:18 -05:00
Gagandeep Singh	f549e528c7	Bumped time limit in test_cancel::test_comprehensive (#19871 )	2021-10-29 15:51:49 -07:00

10167 commits