hiro/ray - Forgejo: Beyond coding. We Forge.

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Lixin Wei	8e666ca1e9	[Core] Fix Used Memory Calculation (#20127) * fix memory * fix	2021-11-08 17:36:32 -08:00
Kai Fricke	9c2b8c8501	[tune] Deprecate DurableTrainable (#19880)	2021-11-08 20:56:07 +00:00
Amog Kamsetty	f8430e6eca	[CI] Pin shortuuid to fix CI (#20153)	2021-11-08 12:08:32 -08:00
Amog Kamsetty	b1f24768a1	[Tune] More fixes to PTL Tutorial (#20065) * ptl-fix-2 * improve * fix	2021-11-08 09:13:44 -08:00
Gagandeep Singh	31812d026c	Bumped time limit for test_worker_startup_count in test_basic_3.py (#20056) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2021-11-08 09:02:28 -08:00
Kai Yang	e84391d1d3	[Core] Encode job ID in randomized task IDs for user-created threads (#19320) ## Why are these changes needed? Currently, `WorkerContext::GetCurrentTaskID()` returns a random task ID in user-created threads, and the returned task ID doesn't include the job ID. As a result, subsequent non-actor tasks, their return values, and objects created by `ray.put()` don't include the job ID either, which makes it hard to find the correct job ID from a task or object ID. This PR updates the task ID generation code to always encode the job ID. A side effect of this PR is a higher probability of task ID collision in user-created threads, because the job ID part is now fixed. Without this PR: `sqrt(pi * 256 ^ 12 / 2)` ~= 352 trillion tasks. With this PR: `sqrt(pi * 256 ^ 8 / 2)` ~= 5 billion tasks. This should be OK because the job ID part of task IDs in non-user-created threads is always fixed, so it won't be worse than in non-user-created threads.	2021-11-08 21:00:40 +08:00
Linsong Chu	e189d8d4bc	[workflow] fix s3 storage path (#20115) ## Why are these changes needed? To fix two path-related issues when S3 is used as the storage backend: 1. A leading slash is added to the path due to the behavior of `parse.urlparse`. 2. When `step_id=""`, double slashes are added to the path. Details are explained in https://github.com/ray-project/ray/issues/20114 ## Related issue number https://github.com/ray-project/ray/issues/20114 https://github.com/ray-project/ray/issues/19027	2021-11-07 15:57:33 -08:00
dependabot[bot]	adf39941f4	[data](deps): Bump dask[complete] (#20125) Bumps [dask[complete]](https://github.com/dask/dask) from 2021.9.1 to 2021.11.0. - [Release notes](https://github.com/dask/dask/releases) - [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md) - [Commits](https://github.com/dask/dask/compare/2021.09.1...2021.11.0) --- updated-dependencies: - dependency-name: dask[complete] dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2021-11-07 11:55:39 -08:00
xwjiang2010	866fa9590f	[tune] clean up legacy branch in update_avail_resources. (#20071)	2021-11-05 10:28:46 -07:00
matthewdeng	78e9ff7c91	[train][datasets] add example for big data training (#20042) * [train][datasets] add example for big data training * add title docstring * lint and dependencies * add dask_ml requirement	2021-11-05 09:28:48 -07:00
Chen Shen	320f9dc234	[Core][CoreWorker] increase the default port range (#19541) * increase the port range * Update doc/source/configure.rst Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2021-11-05 09:25:44 -07:00
Alex Wu	146b3d6bcc	[scheduler] Include depth and function descriptor in scheduling class (#20004)	2021-11-05 08:19:48 -07:00
Simon Mo	3d5cbc6e62	[Serve] Fix HTTP error handling behavior and add tests (#20093)	2021-11-05 10:15:54 -05:00
SangBin Cho	8299aae918	[Placement Group] Add stats to pg scheduling (#19841) * Add an e2e stats to pg scheduling * Fix bugs. * fix a bug. * Revert "fix a bug." This reverts commit dd7e03d1346fa39e54898effaaf8a2771103176e. * done except unit tests. * done except unit tests. * Add unit tests. * Address code review. * done * Fix * done * Fixed the test	2021-11-05 06:51:42 -07:00
Amog Kamsetty	adb8d77b2b	[Deps] Bump tensorflow on Docker image and add Codeowners (#20041)	2021-11-05 00:58:34 -07:00
dependabot[bot]	60e9737679	[tune](deps): Bump mlflow in /python/requirements/ml (#19913) Bumps [mlflow](https://github.com/mlflow/mlflow) from 1.19.0 to 1.21.0. - [Release notes](https://github.com/mlflow/mlflow/releases) - [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.rst) - [Commits](https://github.com/mlflow/mlflow/compare/v1.19.0...v1.21.0) --- updated-dependencies: - dependency-name: mlflow dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2021-11-04 23:37:01 -07:00
dependabot[bot]	9897ee0eab	[tune](deps): Bump onnxruntime in /python/requirements/ml (#19666) Bumps [onnxruntime](https://github.com/microsoft/onnxruntime) from 1.8.0 to 1.9.0. - [Release notes](https://github.com/microsoft/onnxruntime/releases) - [Changelog](https://github.com/microsoft/onnxruntime/blob/master/docs/ReleaseManagement.md) - [Commits](https://github.com/microsoft/onnxruntime/compare/v1.8.0...v1.9.0) --- updated-dependencies: - dependency-name: onnxruntime dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2021-11-04 23:34:48 -07:00
dependabot[bot]	f214c4a4ab	[tune](deps): Bump datasets from 1.11.0 to 1.14.0 in /python/requirements/ml (#19645) * [tune](deps): Bump datasets in /python/requirements/ml Bumps [datasets](https://github.com/huggingface/datasets) from 1.11.0 to 1.14.0. - [Release notes](https://github.com/huggingface/datasets/releases) - [Commits](https://github.com/huggingface/datasets/compare/1.11.0...1.14.0) --- updated-dependencies: - dependency-name: datasets dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update requirements_tune.txt Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2021-11-04 23:33:55 -07:00
Clark Zinzow	6ade6f0be6	[Datasets] Multi-aggregations [1/3]: Add basic support for groupby multi-aggregations. (#20044)	2021-11-04 22:48:49 -07:00
mwtian	fb0ede38ba	[CI] [macOS] avoid installing latest setuptools (#20064)	2021-11-04 21:35:03 -07:00
architkulkarni	c5175073b2	[runtime env] Add garbage collection for conda envs (#20072)	2021-11-04 23:13:34 -05:00
Edward Oakes	360993612c	[serve] Remove lingering backend references (#20085)	2021-11-04 20:32:13 -05:00
Eric Liang	6102912494	Dataset doc updates (#19815)	2021-11-04 18:13:40 -07:00
SangBin Cho	44b38e9aa1	Add Chaos testing fixture + test actor tasks chaos test in CI (#19975) * Basic CI tests done * Fix an issue * shutdown to conftest * Addressed code review.	2021-11-04 16:27:35 -07:00
Simon Mo	4d583da7d5	[Serve] Add verbose log for nightly test only (#20088)	2021-11-04 16:15:22 -07:00
SangBin Cho	56bab61fba	[Placement group] Raise an exception when invalid resources are specified with the placement group. (#19680) * done * Make it work * Fix issues * done * try * done * Fix remaining bugs.	2021-11-04 14:41:00 -07:00
Eric Liang	585d472fdf	Add configuration context to dataset (#19907)	2021-11-04 14:36:51 -07:00
Alex Wu	4ffb7ccfac	[scheduler][cleanup] Remove one cpu optimization (#20022) * . * remove test * Update cluster_task_manager.cc * Update cluster_task_manager.cc * lint * lint * . Co-authored-by: Alex Wu <alex@anyscale.com>	2021-11-04 14:18:13 -07:00
Edward Oakes	49d308138f	[serve] Rename backend_state -> deployment_state (#20040)	2021-11-04 15:46:45 -05:00
Philipp Moritz	a64e32c53b	[docs] Fix broken links in documentation and add linkcheck to documentation (#20030) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-11-04 13:19:43 -07:00
Sven Mika	50c30f89c6	[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016)	2021-11-04 20:40:57 +01:00
Jiao	6cfb52ff1d	[job submission] Add stop API + subprocess cleanup (#19860)	2021-11-04 13:59:47 -05:00
Yi Cheng	7bb4c87780	[gcs] use gcs kv in internal kv (#19933) ## Why are these changes needed? This is part of the Redis removal project. This PR focuses on using GCS KV in internal KV. - A GCS client is introduced. - Internal KV is updated to use the GCS RPC client based KV. - Related code is updated. Another PR will update the components that use Redis to use internal KV instead. ## Related issue number https://github.com/ray-project/ray/issues/19443	2021-11-04 09:57:39 -07:00
Yi Cheng	b3b88a46f7	[pg] Fix the test case which hangs because of scheduling dead lock (#20048) ## Why are these changes needed? In this test case, the following could happen: 1. Actor creation first uses all resources on the local node, which is a GPU node. 2. The actor that needs a GPU then cannot be scheduled, since there is only one GPU node. This is just a short-term fix that only tries to connect to the head node with CPU resources. ## Related issue number #19438	2021-11-04 09:56:23 -07:00
Amog Kamsetty	f67b526b7a	[Tune] Fix PTL tutorial docs (#19999)	2021-11-04 09:21:28 -07:00
xwjiang2010	f1179cbccd	[tune] Remove unused clean_trial_placement_group. (#19960)	2021-11-04 08:55:42 -07:00
architkulkarni	bcb63961d9	[runtime env] Add plugin name to internal URI format and add GC for py_modules (#20009)	2021-11-04 10:16:14 -05:00
SangBin Cho	8d115b96b5	[Tests] Try deflaking test placement group mini integration. (#19886) * done * fix	2021-11-03 20:54:59 -07:00
gjoliver	2c1fa459d4	[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807) * Add an RLlib Tune experiment to UserTest suite. * Add ray.init() * Move example script to example/tune/, so it can be imported as module. * add __init__.py so our new module will get included in python wheel. * Add block device to RLlib test instances. * Reduce disk size a little bit. * Add metrics reporting * Allow max of 5 workers to accommodate all the worker tasks. * revert disk size change. * Minor updates * Trigger build * set max num workers * Add a compute cfg for autoscaled cpu and gpu nodes. * use 1gpu instance. * install tblib for debugging worker crashes. * Manually upgrade to pytorch 1.9.0 * -y * torch=1.9.0 * install torch on driver * bump timeout * Write a more informational result dict. * Revert changes to compute config files that are not used. * add smoke test * update * reduce timeout * Reduce the # of env per worker to 1. * Small fix for getting trial_states * Trigger build * simply result dict * lint * more lint * fix smoke test Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-03 17:04:27 -07:00
Edward Oakes	91c730efd0	[serve] Rename backend -> deployment in replica.py (#20020)	2021-11-03 17:46:10 -05:00
Amog Kamsetty	ede9d0ed76	[CI] Pin keras (#20032) * try fix * try again * revert back * add todo	2021-11-03 15:32:10 -07:00
Clark Zinzow	a0841106ff	[Datasets] Follow-up to groupby standard deviation PR (#20035)	2021-11-03 13:56:34 -07:00
Clark Zinzow	665954d48c	Add standard deviation aggregation. (#20010)	2021-11-03 11:38:23 -07:00
Alex Wu	3d7d341dd0	[test] Fix test_actor_scheduling_not_block_with_placement_group (missing num_cpus=1) (#20006)	2021-11-03 09:08:50 -07:00
Avnish Narayan	026bf01071	[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535) * Fix QMix, SAC, and MADDPG too. * Unpin gym and deprecate pendulum v0. Many tests in rllib depended on pendulum v0; however, in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so we will potentially have to rerun all of the pendulum v1 benchmarks, or use another environment instead. The same applies to frozen lake v0 and frozen lake v1. Lastly, all of the RLlib tests have been moved to python 3.7. * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20. * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * migrate to new ale install method * Make parametric_actions_cartpole return float32 actions/obs * Adding type conversions if obs/actions don't match space * Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-03 16:24:00 +01:00
Edward Oakes	e1e0cb5eaa	[serve] Rename backend tag -> deployment name (#19997)	2021-11-03 09:49:52 -05:00
Edward Oakes	b2ddea255d	[job submission] Add job submission ID + status to /api/snapshot (#19994)	2021-11-03 09:49:28 -05:00
Yi Cheng	99034f5af5	Revert "Revert "[core] Fix wrong local resource view in raylet (#1991… (#19996) This reverts commit `f1eedb15b6`. ## Why are these changes needed? The self node should avoid reading any node resource updates from GCS, since it maintains its local view by itself. ## Related issue number #19438	2021-11-03 00:11:40 -07:00
Eric Liang	398d4cbf34	[data] Skip tests locally if moto server is not installed	2021-11-02 23:56:32 -07:00
Eric Liang	9e448db731	[RFC] Add tsan build mode (#19971)	2021-11-02 22:29:51 -07:00
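
The collision figures quoted in commit e84391d1d3 (#19320) follow the usual birthday-problem approximation, under which roughly `sqrt(pi * N / 2)` uniformly random IDs can be drawn from a space of size `N` before the first collision is expected. The sketch below only reproduces that arithmetic. The function name is hypothetical, and the byte counts (12 and 8) are read off the two formulas in the commit message, not taken from Ray's source.

```python
import math


def expected_ids_before_collision(random_bytes: int) -> float:
    """Birthday-problem estimate of how many uniformly random IDs can be
    drawn before the first collision, for IDs with `random_bytes` random bytes.

    Hypothetical helper for illustration; not part of Ray.
    """
    space = 256 ** random_bytes  # number of distinct random values
    return math.sqrt(math.pi * space / 2)


# Byte counts assumed from the formulas in the commit message.
before = expected_ids_before_collision(12)  # sqrt(pi * 256^12 / 2)
after = expected_ids_before_collision(8)    # sqrt(pi * 256^8 / 2)

print(f"without the PR: about {before / 1e12:.0f} trillion tasks")
print(f"with the PR:    about {after / 1e9:.1f} billion tasks")
```

Both results land in the ranges the commit message quotes (hundreds of trillions versus a few billion), which is the basis for its claim that the reduced space is still acceptable.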

5512 commits
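
The two path problems described in commit e189d8d4bc (#20115) can be reproduced with the standard library alone. The sketch below is not the actual Ray fix: `s3_key`, the bucket name, and the path parts are all hypothetical, and it only illustrates why `urlparse` yields a leading slash and how an empty `step_id` produces a double slash when parts are joined naively.

```python
from urllib.parse import urlparse


def s3_key(url: str, *parts: str) -> str:
    """Hypothetical helper: build an object key from a storage URL and path parts."""
    parsed = urlparse(url)
    # urlparse keeps the leading "/" in .path ("s3://bucket/root" -> "/root"),
    # so strip it before using the path as a key prefix.
    segments = [parsed.path.strip("/"), *parts]
    # Dropping empty segments (for example step_id == "") avoids "//" in the key.
    return "/".join(segment for segment in segments if segment)


# Naive join: shows both issues from the commit message.
naive = "/".join([urlparse("s3://bucket/workflows").path, "wf1", "", "output"])

# Join that strips the leading slash and skips empty parts.
fixed = s3_key("s3://bucket/workflows", "wf1", "", "output")

print(naive)
print(fixed)
```

Filtering empty segments is one simple way to avoid the double slash; whether the real change takes this approach is not stated in the commit message.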