hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Guyang Song	ab55b808c5	[runtime env] move worker env to runtime env in Java (#19060 )	2021-10-11 17:25:09 +08:00
Shantanu	0c4603f836	[core] nicer error message for unpickleable exceptions (#17936 )	2021-10-11 01:19:19 -07:00

* [core] nicer error message for unpickleable exceptions

I ran into a case where we had an exception that was unpickleable:

```
pickle.loads(pickle.dumps(filelock.Timeout()))
```

When a filelock.Timeout is raised on a worker, it gets surfaced in a way that makes ray look like it was responsible:

```
ray.exceptions.RaySystemError: System error: __init__() missing 1 required positional argument: 'lock_file'
```

This PR turns the following stacktrace:

```
    return ray.get(refs, timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get
    raise value
ray.exceptions.RaySystemError: System error: __init__() missing 1 required positional argument: 'lock_file'
traceback: Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 254, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 213, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 28, in from_bytes
    return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'lock_file'
```

into this:

```
...
    return ray.get(refs, timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1623, in get
    raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 29, in from_bytes
    return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'lock_file'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 254, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/opt/conda/lib/python3.7/site-packages/ray/serialization.py", line 213, in _deserialize_object
    return RayError.from_bytes(obj)
  File "/opt/conda/lib/python3.7/site-packages/ray/exceptions.py", line 31, in from_bytes
    raise RuntimeError("Failed to unpickle serialized exception") from e
RuntimeError: Failed to unpickle serialized exception
```

lint * test_unpickleable_stacktrace * lint * . * . Co-authored-by: hauntsaninja <>
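
The failure mode described in this commit can be reproduced without Ray or filelock. The following is a minimal sketch, not Ray's actual code: `LockTimeout` is a hypothetical stand-in for `filelock.Timeout`, and `load_exception` and `demo` are illustrative names. It assumes the exception class calls the base `__init__` with no arguments, which leaves `args` empty so that unpickling calls the constructor without its required argument. The wrapper mirrors the `raise ... from e` pattern visible in the new stacktrace.

```python
import pickle


class LockTimeout(Exception):
    """Hypothetical stand-in for filelock.Timeout (not the real class)."""

    def __init__(self, lock_file):
        # Calling the base __init__ with no arguments leaves self.args == (),
        # so pickle later tries to rebuild the object as LockTimeout().
        super().__init__()
        self.lock_file = lock_file


def load_exception(payload: bytes) -> BaseException:
    """Unpickle an exception, chaining a clearer error if that fails."""
    try:
        return pickle.loads(payload)
    except Exception as e:
        # Keep the original error as the direct cause of the new one.
        raise RuntimeError("Failed to unpickle serialized exception") from e


def demo() -> str:
    # Pickling succeeds; only unpickling fails.
    payload = pickle.dumps(LockTimeout("/tmp/example.lock"))
    try:
        load_exception(payload)
    except RuntimeError as err:
        return f"{err} (caused by {type(err.__cause__).__name__})"
    return "unpickled without error"


if __name__ == "__main__":
    print(demo())
```

Because the `TypeError` is attached as `__cause__`, the traceback shows both the clearer message and the underlying constructor error.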
SangBin Cho	3b865b463a	[Core] Fix GPU first scheduling that is not working with placement group (#19141 ) * done * Revert "done" This reverts commit 56b18f0a7d14c5466d726c3ed1264f3e1506771e. * ip * Revert "Revert "done"" This reverts commit a34c90b0920893f4efbf171b8159f0d08a10dca0. * Done * Remove unnecessary log message * skip test on windows * Handle the code review.	2021-10-11 00:12:25 -07:00
Sasha Sobol	e8d1fc36cb	Make binpacking prioritize nodes better (#19212 ) * Make binpacking prioritize nodes better Make binpacking prefer nodes that match multiple resource types. * spelling * order demands when binpacking, starting from complex ones * add stability to resource demand ordering * lint * logging * add comments * +comment * use set	2021-10-10 14:56:47 -04:00
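
To illustrate why ordering demands "starting from complex ones" can matter, here is a toy first-fit sketch. It is not the Ray autoscaler's implementation: the names `order_demands`, `fits`, and `bin_pack` are hypothetical, and "complex" is assumed here to mean "requests more resource types". It relies on Python's `sorted` being stable, so demands of equal complexity keep their original order.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, float]


def order_demands(demands: List[Resources]) -> List[Resources]:
    # Assumption: a demand is "more complex" if it names more resource types.
    # sorted() is stable (also with reverse=True), so ties keep input order.
    return sorted(demands, key=len, reverse=True)


def fits(node: Resources, demand: Resources) -> bool:
    # A node fits if it has enough of every requested resource.
    return all(node.get(name, 0) >= amount for name, amount in demand.items())


def bin_pack(nodes: List[Resources],
             demands: List[Resources]) -> List[Tuple[Resources, int]]:
    """First-fit placement; returns (demand, node index), or -1 if unplaced."""
    remaining = [dict(node) for node in nodes]  # copy so inputs stay untouched
    placement = []
    for demand in order_demands(demands):
        for index, node in enumerate(remaining):
            if fits(node, demand):
                for name, amount in demand.items():
                    node[name] -= amount
                placement.append((demand, index))
                break
        else:
            placement.append((demand, -1))
    return placement


if __name__ == "__main__":
    nodes = [{"CPU": 2, "GPU": 1}, {"CPU": 2}]
    demands = [{"CPU": 2}, {"CPU": 2, "GPU": 1}]
    print(bin_pack(nodes, demands))
```

In this example, placing the CPU-only demand first would occupy the only GPU node's CPUs and leave the GPU demand unplaceable; handling the two-resource demand first lets both fit.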
Guyang Song	bae543c956	[runtime env] support eager_install in runtime env (#17949 )	2021-10-09 17:59:57 +08:00
Eric Liang	a92f1fedf4	Revert "[tune/wip] Exclude trial checkpoints in experiment sync (#19185 )" (#19245 ) This reverts commit `44b0b6eb20`.	2021-10-08 19:47:12 -07:00
Eric Liang	b59317520d	Revert "[Workflow] workflow.delete (#19178 )" (#19247 ) This reverts commit `7ea512f343`.	2021-10-08 19:12:55 -07:00
Alex Wu	7ea512f343	[Workflow] workflow.delete (#19178 ) Why are these changes needed? This PR implements workflow.delete which allows users to delete the information in storage related to a workflow. (This assumes the workflow isn't currently running). Related issue number Closes #18848	2021-10-08 16:11:59 -07:00
Jiajun Yao	c31f0e17e6	Replace ray.__commit__ with the actual commit SHA when we build the windows wheel (#19213 )	2021-10-08 16:06:52 -07:00
Sven Mika	d439fd7f17	[RLlib] TF2/eager memory leak fixes. (#19198 )	2021-10-09 00:11:53 +02:00
Edward Oakes	47447c71e0	[serve] Remove excessive backend_state.update() calls in unit tests (#19225 ) These extra update cycles are no longer needed now that we removed the SHOULD_START and SHOULD_STOP states.	2021-10-08 16:36:44 -05:00
mwtian	b066627539	[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 (#19216 )	2021-10-08 12:58:18 -07:00
Patrick Ames	fa047c050b	[data] Make directory creation in dataset output path optional. (#19202 )	2021-10-08 12:36:10 -07:00
Edward Oakes	9cf19b67cc	[serve] Remove long poll client from replicas (#19145 ) In general, broadcasting changes to the replicas via the LongPollClient is hard to reason about (it circumvents our versioning semantics as there's no rolling update). Ideally we would only be using the LongPollClient to broadcast replica membership and nothing else.	2021-10-08 12:32:42 -05:00
Edward Oakes	86d1a5bfc6	[serve] Catch ConnectionError during shutdown in LongPollClient (#19224 )	2021-10-08 12:31:35 -05:00
Edward Oakes	93bcea7bdd	[serve] Clean up kv store file, skip on windows (#19194 )	2021-10-08 12:30:48 -05:00
Kai Fricke	44b0b6eb20	[tune/wip] Exclude trial checkpoints in experiment sync (#19185 )	2021-10-08 18:26:03 +01:00
Kai Fricke	e5e1ba93d9	[tune] Use queue to display JupyterNotebookReporter updates in Ray client (#19137 )	2021-10-08 18:23:20 +01:00
Antoni Baum	c7d6f838f6	[tune] Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once (#19144 )	2021-10-08 18:16:26 +01:00
Eric Liang	8beabb283b	Force disable placement_group for all dataset tasks (#19208 )	2021-10-08 10:16:09 -07:00
Kai Fricke	f1606acc2b	[tune] Fix durable(str) name for class trainables, preventing trial recovery (#19223 )	2021-10-08 17:32:05 +01:00
architkulkarni	1aab892623	[Runtime Env] add excludes to known fields for runtime env (#19206 )	2021-10-07 22:47:49 -07:00
Eric Liang	8dded14798	Refactor LazyBlockList to simplify union of lists (#19214 )	2021-10-07 22:07:52 -07:00
SangBin Cho	afaee05e1e	[Placement Group] Fix placement group removal leak (#19138 )	2021-10-07 22:04:12 -07:00
Simon Mo	46e80348ad	[Serve] Make long poll wait for non-existent keys (#19205 )	2021-10-07 19:10:22 -07:00
Kai Fricke	8d89e2d546	[tune] Prevent errors with retained trainables in global registry (#19184 ) This PR fixes #19183 by introducing three improvements: * String trainables are prefixed with Durable, e.g. DurablePPO * Durable trainables cannot be wrapped twice with tune.durable() * MRO resolution in _WrappedDurableTrainables indicates we already have a DurableTrainable - thus we catch this with a try/except block	2021-10-07 17:17:01 -07:00
Edward Oakes	454163912f	Revert "[serve] Delete kv store local path after unit tests (#19165 )" (#19188 ) This reverts commit `b90af4dae5`.	2021-10-07 14:26:18 -05:00
Edward Oakes	1fa81673bd	[runtime_env] Clean up validation logic (#18984 ) Splits the runtime_env parsing/validation and overriding into two separate codepaths. Adds unit testing for both.	2021-10-07 14:24:41 -05:00
Kai Fricke	45aad4ee9a	[tune] Add resume="AUTO" and enhance resume error messages (#19181 )	2021-10-07 19:00:56 +01:00
Stephanie Wang	940f84cedb	[core] Remove unused plasma promotion path (#19122 ) * remove unused * lint * lint * lint	2021-10-07 10:55:50 -07:00
SangBin Cho	0ef0d9a77d	Revert "[core] Assign tasks to the first available worker (#18167 )" (#19180 ) This reverts commit `545db13800`.	2021-10-07 10:38:37 -07:00
Antoni Baum	f1587c06fd	[tune] Ensure loc in progress reporter is filled (#19182 )	2021-10-07 15:43:49 +01:00
Edward Oakes	b90af4dae5	[serve] Delete kv store local path after unit tests (#19165 )	2021-10-07 08:55:22 -05:00
Kai Fricke	a8cf8c648c	[tune] track and print elapsed time in reporters (#19139 )	2021-10-07 10:56:17 +01:00
Avnish Narayan	bbc64a7c3d	[RLlib] Pin Gym to 0.19 (#19170 ) Gym appears to have cut a release, 0.21. It isn't clear what changes were made between 0.19/0.20 and 0.21, as there is no change log available for the 0.21 release, so for now we'll pin gym to 0.19 until we can fully understand the breaking changes in gym 0.21. I suspect some things have just been removed from the regular gym installation that rllib has previously relied on. Will address later.	2021-10-07 07:59:02 +02:00
mwtian	fe413c3c5e	[Client] disable auto init for get_runtime_context() (#19127 )	2021-10-06 20:20:47 -07:00
Eric Liang	86cbe3e833	[data] Add support for repeating and re-windowing a DatasetPipeline (#19091 )	2021-10-06 20:13:43 -07:00
Edward Oakes	0f915820e1	[serve] Rename backend_worker -> replica (#19150 )	2021-10-06 16:39:17 -05:00
Chris K. W	d1517c33ab	[client] deflake test_object_ref_cleanup (#19153 )	2021-10-06 14:06:43 -07:00
Kai Fricke	9f77cd8d28	[tune] Deflake PBT Async test (#19135 )	2021-10-06 12:24:22 -07:00
Edward Oakes	9316a9977f	[serve] Support kwargs to deployment constructor (#19023 )	2021-10-06 14:16:23 -05:00
Frank Luan	77d0a08c38	[docker] Fix missing space in docker.py warning (#19128 )	2021-10-06 12:09:26 -07:00
Ian Rodney	8cab8d3ae9	[Datasets] Clean Up docs around pipelining -> windowing rename (#19142 )	2021-10-06 11:07:55 -07:00
Chris K. W	db1105fa83	[client] Skip test_valid_actor_state tests on windows (#19114 ) * skip test_wrapped_actor_creation on windows * rerun windows ci * mark test_valid_actor_state_2 as flaky * mark test_valid_actor_state * rerun	2021-10-06 09:17:59 -07:00
Amog Kamsetty	db0483a29a	[SGD] SGD Namespace Consistency (#19048 ) * wip * update * add callbacks * fix * fix * update * add * address comments	2021-10-05 15:56:42 -07:00
Matti Picus	63dd22c7c2	add msvcp140.dll to the wheel on windows (#19062 ) * add msvcp140.dll to the wheel on windows * fixes from review * be more verbose * Update setup.py Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2021-10-05 15:12:46 -07:00
Stephanie Wang	545db13800	[core] Assign tasks to the first available worker (#18167 ) * Convert worker pool to queue * Start up to backlog size more workers * fixes * Prestart workers according to num available CPUs * lint * x * Update src/ray/raylet/worker_pool.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/raylet/worker_pool.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * dedicated workers * Fix tests * x * fix * asan * asan * Workers can only exec tasks with same job ID * size_t for runtime env hash, fix unit tests * include job ID in runtime env hash, remove from worker registration msg * x * conflict * debug * Schedule and dispatch periodically, skip if no new tasks * Update src/ray/common/task/task_spec.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/raylet/scheduling/cluster_task_manager.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/raylet/worker_pool.h Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2021-10-05 13:45:50 -07:00
Yi Cheng	ecf7b86585	[workflow] Avoid running workflow step multiple times. (#19090 ) When a workflow recovers, it tries to reconstruct the DAG. However, reconstruction is step-scoped, which means that if a workflow is passed to multiple steps, it is executed multiple times, which breaks the exactly-once semantics. For ObjectRef this is fine, since it is cached in the serialization context, but Workflow inputs need a similar mechanism. This logic lives in the workflow layer rather than the serialization layer because the deduplication happens at the application layer. Issue #18997 has race conditions and is related to this one: multiple steps try to issue writes to virtual actors at the same time, which is not allowed right now and can lead to a race condition.	2021-10-05 13:43:27 -07:00
Kai Fricke	957f9e9d99	[client] Undo PySpark's monkey patching of namedtuples for PickleStub (#19034 )	2021-10-05 10:43:50 -07:00
SangBin Cho	83cb992d5b	Revert pull retry (#19068 ) * Revert "[Object manager] fix comments" This reverts commit `56debfc063`. * Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955)" This reverts commit `d12e35ce53`. * Fix a lint issue	2021-10-04 11:20:43 -07:00

5239 commits