hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Sven Mika	50c30f89c6	[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016 )	2021-11-04 20:40:57 +01:00
Jiao	6cfb52ff1d	[job submission] Add stop API + subprocess cleanup (#19860 )	2021-11-04 13:59:47 -05:00
Alex Wu	36a214386f	[docs] PyData ray dataset talk (#20038 )	2021-11-04 10:33:18 -07:00
Yi Cheng	65d3054a09	[build] fix the wrong flag for gcs ha test (#20052 ) ## Why are these changes needed? It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`	2021-11-04 09:59:11 -07:00
Yi Cheng	7bb4c87780	[gcs] use gcs kv in internal kv (#19933 ) ## Why are these changes needed? It's part of redis removal project. This PR focus on using gcs kv in internal kv. - gcs client is introduced - internal kv is updated to use gcs rpc client based kv - related code got updated. The other PR will update components using redis to use internal kv. ## Related issue number https://github.com/ray-project/ray/issues/19443	2021-11-04 09:57:39 -07:00
Chen Shen	5c0e012ba3	[Core][Core-worker] de-escalate the error message when worker is accessed after shutdown. (#20049 ) * de-escalate * lint	2021-11-04 09:56:39 -07:00
Yi Cheng	b3b88a46f7	[pg] Fix the test case which hangs because of scheduling dead lock (#20048 ) ## Why are these changes needed? In this test case, the following could happen: 1. actor creation first uses all resources on the local node, which is a GPU node 2. the actor that needs a GPU cannot be scheduled, since we only have one GPU node. The fix is short-term only: it tries to connect only to the head node with CPU resources. ## Related issue number #19438	2021-11-04 09:56:23 -07:00
Amog Kamsetty	f67b526b7a	[Tune] Fix PTL tutorial docs (#19999 )	2021-11-04 09:21:28 -07:00
xwjiang2010	f1179cbccd	[tune] Remove unused clean_trial_placement_group. (#19960 )	2021-11-04 08:55:42 -07:00
architkulkarni	bcb63961d9	[runtime env] Add plugin name to internal URI format and add GC for py_modules (#20009 )	2021-11-04 10:16:14 -05:00
Jiajun Yao	7e013366ac	[scheduler] Update local object store usage (#20026 )	2021-11-04 08:11:32 -07:00
Sven Mika	4cb23d1c95	[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031 )	2021-11-04 10:58:34 +01:00
mwtian	a26474156d	Use GCC 9 in GPU docker (#20024 )	2021-11-03 22:53:17 -07:00
SangBin Cho	8d115b96b5	[Tests] Try deflaking test placement group mini integration. (#19886 ) * done * fix	2021-11-03 20:54:59 -07:00
Qing Wang	4373aa1e3b	Support generating a UUID string as the anonymous namespace for Java worker. (#19986 ) Why are these changes needed? For the Java worker, we generate a UUID string as the namespace if the user does not specify a namespace for the job. Related issue number #16474	2021-11-04 11:40:17 +08:00
Alex Wu	b52cc6ecdc	[scheduler] Fix dynamically increasing resources (#20036 )	2021-11-03 17:28:44 -07:00
gjoliver	2c1fa459d4	[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807 ) * Add an RLlib Tune experiment to UserTest suite. * Add ray.init() * Move example script to example/tune/, so it can be imported as module. * add __init__.py so our new module will get included in python wheel. * Add block device to RLlib test instances. * Reduce disk size a little bit. * Add metrics reporting * Allow max of 5 workers to accommodate all the worker tasks. * revert disk size change. * Minor updates * Trigger build * set max num workers * Add a compute cfg for autoscaled cpu and gpu nodes. * use 1gpu instance. * install tblib for debugging worker crashes. * Manually upgrade to pytorch 1.9.0 * -y * torch=1.9.0 * install torch on driver * bump timeout * Write a more informational result dict. * Revert changes to compute config files that are not used. * add smoke test * update * reduce timeout * Reduce the # of env per worker to 1. * Small fix for getting trial_states * Trigger build * simply result dict * lint * more lint * fix smoke test Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-03 17:04:27 -07:00
Edward Oakes	91c730efd0	[serve] Rename backend -> deployment in replica.py (#20020 )	2021-11-03 17:46:10 -05:00
Amog Kamsetty	ede9d0ed76	[CI] Pin keras (#20032 ) * try fix * try again * revert back * add todo	2021-11-03 15:32:10 -07:00
Clark Zinzow	a0841106ff	[Datasets] Follow-up to groupby standard deviation PR (#20035 )	2021-11-03 13:56:34 -07:00
SangBin Cho	eacaff5d8d	[Core] Try scheduling tasks when the local resource creation is updated (#20019 ) ## Why are these changes needed? Check https://github.com/ray-project/ray/pull/19996/files#r741963616	2021-11-03 12:29:51 -07:00
Richard Hamnett	f4256a4ddc	[Doc] Update installation.rst for nightly build (#20034 ) Ensure clean removal of previous ray nightly before updating.	2021-11-03 12:05:07 -07:00
mwtian	f83195a1e1	[Build] Add GCS HA builds (#20008 ) ## Why are these changes needed? Add builds for Python tests with GCS pubsub enabled.	2021-11-03 11:58:16 -07:00
Clark Zinzow	665954d48c	Add standard deviation aggregation. (#20010 )	2021-11-03 11:38:23 -07:00
Yi Cheng	555b87d552	[gcs] Enable grpc based broadcasting by default (#19716 ) ## Why are these changes needed? This is part of redis removal project. This PR is going to enable grpc based broadcasting by default. ## Related issue number #19438	2021-11-03 10:02:37 -07:00
Alex Wu	3d7d341dd0	[test] Fix test_actor_scheduling_not_block_with_placement_group (missing num_cpus=1) (#20006 )	2021-11-03 09:08:50 -07:00
Jiajun Yao	5de4a38948	[CI] Run Java CI on Mac (#19757 ) Why are these changes needed? Enable Java tests on Mac CI to avoid more breakages. Related issue number Closes #19700	2021-11-03 23:40:05 +08:00
Avnish Narayan	026bf01071	[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535 ) * Fix QMix, SAC, and MADDPA too. * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0, however in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so will have to potentially rerun all of the pendulum v1 benchmarks, or use another environment in favor. The same applies to frozen lake v0 and frozen lake v1 Lastly, all of the RLlib tests have been moved to python 3.7 * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20 * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * migrate to new ale install method * Make parametric_actions_cartpole return float32 actions/obs * Adding type conversions if obs/actions don't match space * Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-03 16:24:00 +01:00
Edward Oakes	e1e0cb5eaa	[serve] Rename backend tag -> deployment name (#19997 )	2021-11-03 09:49:52 -05:00
Edward Oakes	b2ddea255d	[job submission] Add job submission ID + status to /api/snapshot (#19994 )	2021-11-03 09:49:28 -05:00
SangBin Cho	73f21b83a0	[Core] improve the GCS address update failure error message. (#19930 ) * improve the error message. * improve the error msg further	2021-11-03 07:02:44 -07:00
Will Drevo	f359b21541	[RLlib; Docs] Updated RLlib training example page (#19932 )	2021-11-03 12:34:18 +01:00
Sven Mika	e6ae08f416	[RLlib] Optionally don't drop last ts in v-trace calculations (APPO and IMPALA). (#19601 )	2021-11-03 10:01:34 +01:00
Sven Mika	cf21c634a3	[RLlib] Fix deprecated warning for torch_ops.py (soft-replaced by torch_utils.py). (#19982 )	2021-11-03 10:00:46 +01:00
Eric Liang	28d4cfb039	[RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg (#19910 )	2021-11-03 01:37:06 -07:00
Yi Cheng	99034f5af5	Revert "Revert "[core] Fix wrong local resource view in raylet (#1991… (#19996 ) This reverts commit `f1eedb15b6`. ## Why are these changes needed? Self node should avoid reading any updates from gcs for node resource change since it'll maintain local view by itself. ## Related issue number #19438	2021-11-03 00:11:40 -07:00
Eric Liang	398d4cbf34	[data] Skip tests locally if moto server is not installed	2021-11-02 23:56:32 -07:00
mwtian	c0eeb36209	[Core][Pubsub] Support publishing / subscribing to Actor / Job / Node info via GCS (#19903 ) ## Why are these changes needed? This is the first step in migrating Redis pubsub to be GCS pubsub based. Changes include: - Remove `SubscribeAll()` API for Actor pubsub since it is only used in tests. Supporting both `Subscribe()` and `SubscribeAll()` APIs would be too complex without much return. - Update `Subscribe()` API to accept a done status callback. - Implement `SubscribeAll()` / `Unsubscribe()`(from channel) API in Ray pubsub. - Implement using Ray pubsub for Actor, Job, Node info and Node resource publishing / subscribing. GCS changes are tested with GCS server test in GCS pubsub mode.	2021-11-02 22:47:05 -07:00
Eric Liang	9e448db731	[RFC] Add tsan build mode (#19971 )	2021-11-02 22:29:51 -07:00
Jiajun Yao	6acf276959	Listen to 127.0.0.1 if node ip is 127.0.0.1 (#19918 )	2021-11-03 12:17:55 +09:00
Lixin Wei	a369fc97cf	[scheduler] Remove isFeasible (#19931 )	2021-11-02 17:40:46 -07:00
mwtian	ef4b6e4648	[Core][GCS] remove gcs object manager (#19963 )	2021-11-02 16:20:53 -07:00
Edward Oakes	14d0889fbc	[serve] Rename BackendInfo -> DeploymentInfo (#19947 )	2021-11-02 17:09:15 -05:00
iasoon	7e6ea9e3df	[serve] split ReplicaStartupState.PENDING into PENDING_ALLOCATION and PENDING_INITIALIZATION (#19431 )	2021-11-02 17:08:52 -05:00
SangBin Cho	f1eedb15b6	Revert "[core] Fix wrong local resource view in raylet (#19911 )" (#19992 ) This reverts commit `a907168184`. ## Why are these changes needed? This PR seems to have some huge perf regression on `placement_group_test_2.py`. It took 128s before, and after this PR was merged, it took 315 seconds.	2021-11-02 14:27:05 -07:00
Edward Oakes	f8a6cad0b7	[job submission] SDK prototype w/ dynamic working_dir uploads (#19843 )	2021-11-02 16:01:54 -05:00
Amog Kamsetty	f4b425f84c	[Release/Xgboost] Fix master install (#19991 )	2021-11-02 13:50:14 -07:00
Siyuan (Ryans) Zhuang	3c9f91bd1d	[Workflow] Group options into a single workflow step options dataclass (#19654 ) * group options into workflow step options * fix comments * cleanup virtual actor call options * fix default value * step_options.make() * rename	2021-11-02 12:25:30 -07:00
gjoliver	9385b6c1be	[RLlib] Make a few LRSchedule and EntropyCoeffSchedule tests more reliable. (#19934 )	2021-11-02 16:52:56 +01:00
Kai Fricke	f96078687f	[xgboost/release] Xgboost/connect gpu test (#19838 ) * [xgboost/release] Add GPU connect user test * Use scaling cluster * typo * Increase xgboost placement group timeout * Much higher timeout * Move os environment timeout * Move os environ * [dev] install xgboost-ray from master * GPU xgboost master * Remove master install after new xgboost release * Install latest * Add master test	2021-11-02 08:40:48 -07:00

10204 commits