hiro/ray

Mirror of https://github.com/vale981/ray (synced 2025-03-06 02:21:39 -05:00)

Author	SHA1	Message	Date
Simon Mo	4d583da7d5	[Serve] Add verbose log for nightly test only (#20088)	2021-11-04 16:15:22 -07:00
SangBin Cho	56bab61fba	[Placement group] Raise an exception when invalid resources are specified with the placement group. (#19680) * done * Make it work * Fix issues * done * try * done * Fix remaining bugs.	2021-11-04 14:41:00 -07:00
Eric Liang	585d472fdf	Add configuration context to dataset (#19907)	2021-11-04 14:36:51 -07:00
Alex Wu	4ffb7ccfac	[scheduler][cleanup] Remove one cpu optimization (#20022) * . * remove test * Update cluster_task_manager.cc * Update cluster_task_manager.cc * lint * lint * . Co-authored-by: Alex Wu <alex@anyscale.com>	2021-11-04 14:18:13 -07:00
Edward Oakes	49d308138f	[serve] Rename backend_state -> deployment_state (#20040)	2021-11-04 15:46:45 -05:00
Philipp Moritz	a64e32c53b	[docs] Fix broken links in documentation and add linkcheck to documentation (#20030) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-11-04 13:19:43 -07:00
Sven Mika	50c30f89c6	[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016)	2021-11-04 20:40:57 +01:00
Jiao	6cfb52ff1d	[job submission] Add stop API + subprocess cleanup (#19860)	2021-11-04 13:59:47 -05:00
Yi Cheng	7bb4c87780	[gcs] use gcs kv in internal kv (#19933) ## Why are these changes needed? It's part of the redis removal project. This PR focuses on using gcs kv in internal kv. - gcs client is introduced - internal kv is updated to use gcs rpc client based kv - related code is updated. A separate PR will update components using redis to use internal kv. ## Related issue number https://github.com/ray-project/ray/issues/19443	2021-11-04 09:57:39 -07:00
Yi Cheng	b3b88a46f7	[pg] Fix the test case which hangs because of scheduling deadlock (#20048) ## Why are these changes needed? In this test case, the following could happen: 1. actor creation first uses all resources in the local node, which is a GPU node 2. the actor that needs a GPU cannot be scheduled since we only have one GPU node The fix is just a short-term one and only tries to connect to the head node with CPU resources. ## Related issue number #19438	2021-11-04 09:56:23 -07:00
Amog Kamsetty	f67b526b7a	[Tune] Fix PTL tutorial docs (#19999)	2021-11-04 09:21:28 -07:00
xwjiang2010	f1179cbccd	[tune] Remove unused clean_trial_placement_group. (#19960)	2021-11-04 08:55:42 -07:00
architkulkarni	bcb63961d9	[runtime env] Add plugin name to internal URI format and add GC for py_modules (#20009)	2021-11-04 10:16:14 -05:00
SangBin Cho	8d115b96b5	[Tests] Try deflaking test placement group mini integration. (#19886) * done * fix	2021-11-03 20:54:59 -07:00
gjoliver	2c1fa459d4	[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807) * Add an RLlib Tune experiment to UserTest suite. * Add ray.init() * Move example script to example/tune/, so it can be imported as module. * add __init__.py so our new module will get included in python wheel. * Add block device to RLlib test instances. * Reduce disk size a little bit. * Add metrics reporting * Allow max of 5 workers to accommodate all the worker tasks. * revert disk size change. * Minor updates * Trigger build * set max num workers * Add a compute cfg for autoscaled cpu and gpu nodes. * use 1gpu instance. * install tblib for debugging worker crashes. * Manually upgrade to pytorch 1.9.0 * -y * torch=1.9.0 * install torch on driver * bump timeout * Write a more informational result dict. * Revert changes to compute config files that are not used. * add smoke test * update * reduce timeout * Reduce the # of env per worker to 1. * Small fix for getting trial_states * Trigger build * simply result dict * lint * more lint * fix smoke test Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-03 17:04:27 -07:00
Edward Oakes	91c730efd0	[serve] Rename backend -> deployment in replica.py (#20020)	2021-11-03 17:46:10 -05:00
Amog Kamsetty	ede9d0ed76	[CI] Pin keras (#20032) * try fix * try again * revert back * add todo	2021-11-03 15:32:10 -07:00
Clark Zinzow	a0841106ff	[Datasets] Follow-up to groupby standard deviation PR (#20035)	2021-11-03 13:56:34 -07:00
Clark Zinzow	665954d48c	Add standard deviation aggregation. (#20010)	2021-11-03 11:38:23 -07:00
Alex Wu	3d7d341dd0	[test] Fix test_actor_scheduling_not_block_with_placement_group (missing num_cpus=1) (#20006)	2021-11-03 09:08:50 -07:00
Avnish Narayan	026bf01071	[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535) * Fix QMix, SAC, and MADDPG too. * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0; however, in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so we will potentially have to rerun all of the pendulum v1 benchmarks, or use another environment instead. The same applies to frozen lake v0 and frozen lake v1. Lastly, all of the RLlib tests have been moved to python 3.7 * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20 * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * migrate to new ale install method * Make parametric_actions_cartpole return float32 actions/obs * Adding type conversions if obs/actions don't match space * Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-03 16:24:00 +01:00
Edward Oakes	e1e0cb5eaa	[serve] Rename backend tag -> deployment name (#19997)	2021-11-03 09:49:52 -05:00
Edward Oakes	b2ddea255d	[job submission] Add job submission ID + status to /api/snapshot (#19994)	2021-11-03 09:49:28 -05:00
Yi Cheng	99034f5af5	Revert "Revert "[core] Fix wrong local resource view in raylet (#1991… (#19996) This reverts commit `f1eedb15b6`. ## Why are these changes needed? The local node should avoid reading node resource updates from gcs, since it maintains its local view by itself. ## Related issue number #19438	2021-11-03 00:11:40 -07:00
Eric Liang	398d4cbf34	[data] Skip tests locally if moto server is not installed	2021-11-02 23:56:32 -07:00
Eric Liang	9e448db731	[RFC] Add tsan build mode (#19971)	2021-11-02 22:29:51 -07:00
Jiajun Yao	6acf276959	Listen to 127.0.0.1 if node ip is 127.0.0.1 (#19918)	2021-11-03 12:17:55 +09:00
mwtian	ef4b6e4648	[Core][GCS] remove gcs object manager (#19963)	2021-11-02 16:20:53 -07:00
Edward Oakes	14d0889fbc	[serve] Rename BackendInfo -> DeploymentInfo (#19947)	2021-11-02 17:09:15 -05:00
iasoon	7e6ea9e3df	[serve] split ReplicaStartupState.PENDING into PENDING_ALLOCATION and PENDING_INITIALIZATION (#19431)	2021-11-02 17:08:52 -05:00
SangBin Cho	f1eedb15b6	Revert "[core] Fix wrong local resource view in raylet (#19911)" (#19992) This reverts commit `a907168184`. ## Why are these changes needed? This PR seems to cause a huge perf regression in `placement_group_test_2.py`. It took 128 seconds before, and 315 seconds after this PR was merged. ## Related issue number	2021-11-02 14:27:05 -07:00
Edward Oakes	f8a6cad0b7	[job submission] SDK prototype w/ dynamic working_dir uploads (#19843)	2021-11-02 16:01:54 -05:00
Siyuan (Ryans) Zhuang	3c9f91bd1d	[Workflow] Group options into a single workflow step options dataclass (#19654) * group options into workflow step options * fix comments * cleanup virtual actor call options * fix default value * step_options.make() * rename	2021-11-02 12:25:30 -07:00
SangBin Cho	857f23652f	Add more shuffle tests to CI (#17684) * IP * done * done	2021-11-02 08:07:59 -07:00
SangBin Cho	563eb0bca2	[Runtime env] Add a test to make sure resource deadlock message is not printed when waiting for workers (#19870) * ip * Add a runtime env resource deadlock msg test * Fix a bug * Skip on windows	2021-11-02 07:48:55 -07:00
Sven Mika	2d24ef0d32	[RLlib] Add all simple learning tests as `framework=tf2`. (#19273) * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0; however, in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so we will potentially have to rerun all of the pendulum v1 benchmarks, or use another environment instead. The same applies to frozen lake v0 and frozen lake v1. Lastly, all of the RLlib tests and Tune tests have been moved to python 3.7 * fix tune test_sampler::testSampleBoundsAx * fix re-install ray for py3.7 tests Co-authored-by: avnishn <avnishn@uw.edu>	2021-11-02 12:10:17 +01:00
Simon Mo	6040319d02	[CI] Pin aiohttp version to fix master branch (#19948)	2021-11-01 23:00:08 -07:00
Kai Yang	a33466e905	[Core] Fail inflight tasks on actor restarting (#19354) ## Why are these changes needed? If an actor failover is triggered, but the RPC connection between the caller and the crashed actor instance is not disconnected automatically, subsequent tasks to the new actor instance may not be executed. The root cause is that the sequence numbers of tasks sent to the new actor instance do not start from 0. Details can be found in #14727. This PR fixes it by ensuring all inflight actor tasks fail immediately when actor failover is detected (via actor state notifications). ## Related issue number closes #14727	2021-11-02 11:03:12 +08:00
Yi Cheng	a907168184	[core] Fix wrong local resource view in raylet (#19911) ## Why are these changes needed? When gcs broadcasts a node resource change, raylet uses it to update the local node as well, which leaves the local node instance and nodes_ inconsistent. 1. local node has used up some pg resource 2. gcs broadcasts node resources 3. local node now has resources 4. scheduler picks local node 5. local node can't schedule the task 6. since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs ## Related issue number #19438	2021-11-01 19:52:03 -07:00
Amog Kamsetty	3a52187da8	[Release/Lightning] Add Ray lightning user test (#19812) * wip * wip * add ray lightning test * fix * update * merge and add * fix * fix * rename * autoscale * add tblib * gloo backend * typo * upgrade torch * latest and master	2021-11-01 18:29:48 -07:00
Amog Kamsetty	474e44f7e0	[Release/Horovod] Add user test for Horovod (#19661) * infra * wip * add test * typo * typo * update * rename * fix * full path * formatting * reorder * update * update * Update release/horovod_tests/workloads/horovod_user_test.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * bump num_workers * update installs * try * add pip_packages * min_workers * fix * bump pg timeout * Fix symlink * fix * fix * cmake * fix * pin filelock * final * update * fix * Update release/horovod_tests/workloads/horovod_user_test.py * fix * fix * separate compute template * test latest and master Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2021-11-01 18:28:07 -07:00
matthewdeng	e1e4a45b8d	[train] add simple Ray Train release tests (#19817) * [train] add simple Ray Train release tests * simplify tests * update * driver requirements * move to test * remove connect * fix * fix * fix torch * gpu * add assert * remove assert * use gloo backend * fix * finish Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-01 18:25:19 -07:00
Jiajun Yao	05c63f0208	[workflow] Mark workflow test_recovery as large test (#19950) ## Why are these changes needed? move test_recovery to large test ## Related issue number	2021-11-01 15:50:38 -07:00
Sven Mika	bab9c0f670	[RLlib; Docs overhaul] Redo: Docstring cleanup: Trainer, trainer_template, Callbacks."" (#19830)	2021-11-01 21:45:11 +01:00
Alex Wu	80fb3f10ae	[ci] Script for building M1 wheels (#19925) This PR includes a script for building wheels for Macs with M1 processors. It roughly follows the pattern of the other scripts with a few differences. * Manually installs nvm * Uses miniforge conda to install python/pip instead of python foundation .pkgs * Doesn't pin numpy (we probably shouldn't be pinning it in the other scripts either...) * Commit detection falls back to git instead of erroring All of these changes were made so that the script works on a laptop, which comes with a subset of the dependencies that the x86 buildkite image comes with.	2021-11-01 11:44:59 -07:00
Hao Zhang	a03c4363b5	[Collective] Allow send/recv partial tensors in Send/Recv primitives (#19921)	2021-11-01 10:25:43 -07:00
Edward Oakes	ee57025be6	[serve] Rename BackendConfig -> DeploymentConfig (#19923)	2021-11-01 10:24:02 -07:00
architkulkarni	702bffe072	[runtime env] [test] Enable runtime env nightly test with working_dir reconnection (#19906)	2021-10-31 10:48:48 -05:00
architkulkarni	de8a9b5151	[runtime env] Always print package pushing logs regardless of size (#19897)	2021-10-31 10:47:37 -05:00
Edward Oakes	e507b7ba6e	[serve] Rename BackendVersion -> DeploymentVersion (#19798)	2021-10-31 10:27:19 -05:00

5488 commits