## Why are these changes needed?
Currently, `WorkerContext::GetCurrentTaskID()` returns a random task ID in user-created threads, and the returned task ID doesn't encode the job ID. As a result, subsequent non-actor tasks, their return values, and objects created by `ray.put()` don't include the job ID either. This makes it hard to find the correct job ID from a task or object ID.
This PR updates the task ID generation code to always encode the job ID.
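A minimal sketch of the idea, in Python for illustration only (Ray's actual ID generation lives in C++ and the exact byte layout differs; the sizes here follow the numbers quoted below):

```python
# Hedged illustration: a task ID whose trailing bytes always hold the
# 4-byte job ID, so the job can be recovered from any task or object ID.
import os

JOB_ID_SIZE = 4        # Ray job IDs are 4 bytes
UNIQUE_PART_SIZE = 12  # assumed size of the previously fully-random portion

def task_id_for_user_thread(job_id: bytes) -> bytes:
    assert len(job_id) == JOB_ID_SIZE
    # Before this PR: all 12 bytes were random, so the job ID was lost.
    # After this PR: 8 random bytes followed by the 4-byte job ID.
    random_part = os.urandom(UNIQUE_PART_SIZE - JOB_ID_SIZE)
    return random_part + job_id
```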
A side-effect of this PR is that it changes the probability of task ID collisions in user-created threads, because the job ID part is now fixed. Without this PR, a collision becomes likely after roughly `sqrt(pi * 256 ^ 12 / 2)` ≈ 352 trillion tasks; with this PR, after roughly `sqrt(pi * 256 ^ 8 / 2)` ≈ 5 billion tasks. This should be OK because the job ID part of task IDs in non-user-created threads is already always fixed, so user-created threads won't be worse off than non-user-created threads.
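For reference, the numbers above come from the standard birthday approximation; a quick check (assuming 12 random bytes before this PR and 8 after):

```python
# Birthday approximation: ~sqrt(pi * N / 2) draws for a ~50% chance of at
# least one collision among N possible IDs.
import math

def tasks_until_likely_collision(random_bytes: int) -> float:
    n = 256 ** random_bytes
    return math.sqrt(math.pi * n / 2)

print(f"{tasks_until_likely_collision(12):.3e}")  # ~3.5e14 (~352 trillion)
print(f"{tasks_until_likely_collision(8):.3e}")   # ~5.4e9  (~5 billion)
```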
## Related issue number
## Checks
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
* increase the port range
* Update doc/source/configure.rst
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Why are these changes needed?
This is part of the Redis removal project. This PR focuses on using the GCS KV store for internal KV:
- a GCS client is introduced
- internal KV is updated to use the GCS RPC client based KV
- related code is updated accordingly

A follow-up PR will update the components that currently use Redis to use internal KV instead.
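For context, a minimal sketch of how callers exercise the internal KV from Python; after this PR these calls are served by the GCS-backed KV rather than Redis (the module path and signatures below reflect Ray around this version and may differ in others):

```python
# Hedged example: an internal KV round-trip from a driver. The storage
# backend behind these calls is what this PR switches from Redis to GCS.
import ray
from ray.experimental import internal_kv

ray.init()

internal_kv._internal_kv_put(b"my_key", b"my_value", overwrite=True)
assert internal_kv._internal_kv_get(b"my_key") == b"my_value"
```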
## Related issue number
https://github.com/ray-project/ray/issues/19443
## Why are these changes needed?
In this test case, the following could happen:
1. Actor creation first uses up all the resources on the local node, which is a GPU node.
2. The actor that needs a GPU can then never be scheduled, since we only have one GPU node.

This is just a short-term fix: the test only tries to connect to the head node, which has CPU resources.
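A hedged sketch of the failure mode, purely for illustration (the actor classes and counts here are made up, not taken from the test):

```python
# Suppose the cluster has a CPU-only head node and a single GPU node.
import ray

@ray.remote(num_cpus=1)
class CpuActor:
    pass

@ray.remote(num_cpus=1, num_gpus=1)
class GpuActor:
    pass

ray.init(address="auto")

# Step 1: CPU-only actors happen to land on the GPU node and exhaust its CPUs.
cpu_actors = [CpuActor.remote() for _ in range(16)]

# Step 2: the GPU actor needs a CPU and a GPU on the same node, but the only
# GPU node has no free CPUs left, so it can never be scheduled.
gpu_actor = GpuActor.remote()
```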
## Related issue number
#19438
* Add an RLlib Tune experiment to UserTest suite.
* Add ray.init()
* Move example script to example/tune/, so it can be imported as module.
* add __init__.py so our new module will get included in python wheel.
* Add block device to RLlib test instances.
* Reduce disk size a little bit.
* Add metrics reporting
* Allow max of 5 workers to accommodate all the worker tasks.
* revert disk size change.
* Minor updates
* Trigger build
* set max num workers
* Add a compute cfg for autoscaled cpu and gpu nodes.
* use 1gpu instance.
* install tblib for debugging worker crashes.
* Manually upgrade to pytorch 1.9.0
* -y
* torch=1.9.0
* install torch on driver
* bump timeout
* Write a more informative result dict.
* Revert changes to compute config files that are not used.
* add smoke test
* update
* reduce timeout
* Reduce the # of envs per worker to 1.
* Small fix for getting trial_states
* Trigger build
* simplify result dict
* lint
* more lint
* fix smoke test
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>