Commit graph

6095 commits

Author SHA1 Message Date
Guyang Song
5783cdb254
[runtime env] runtime env inheritance refactor (#22244)
Runtime Environments have been GA since Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). We already support an [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior as follows (copied from the doc):
- The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment.
- Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent.

We think this runtime env merging logic is too complex and confusing for users, because they can't know the final runtime env before the job runs.

This PR refactors and changes the behavior of Runtime Environments inheritance. Here is the new behavior:
- **If no runtime env option is given when we create an actor, inherit the parent runtime env.**
- **Otherwise, use the specified runtime env directly and don't do any merging.**

A new API named `ray.runtime_env.get_current_runtime_env()` returns the parent runtime env so that you can modify the dict yourself, e.g.:
```python
parent_env = ray.runtime_env.get_current_runtime_env()
parent_env.update({"X": "Y"})
Actor.options(runtime_env=parent_env)
```
This new API can also be used in the Ray client.
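A minimal sketch of the new inheritance behavior (illustrative only; the actor and env var names are made up):
```python
import ray

ray.init(runtime_env={"env_vars": {"A": "1"}, "pip": ["requests"]})

@ray.remote
class Child:
    def env(self):
        import os
        return os.environ.get("A")

# No runtime_env option: the actor inherits the parent's runtime env as-is.
inherits = Child.remote()

# Explicit runtime_env option: it is used directly, with no merging; the
# parent's "pip" field is not carried over here.
overrides = Child.options(runtime_env={"env_vars": {"A": "2"}}).remote()
```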
2022-02-21 18:13:22 +08:00
Gagandeep Singh
3cb85859cd
Unskipped tests for Windows (#21702)
This set of tests passes without issues on Windows for me, so unskipping them here.
2022-02-20 11:48:59 -08:00
Clark Zinzow
76e8247d4d
[Datasets] Force local metadata resolution when unserializable Partitioning object provided. (#22477) 2022-02-18 21:21:34 -08:00
Amog Kamsetty
04feea4afe
[rllib] Upper bound gym version (#22510)
gym had a 0.22 release today which is breaking a lot of the rllib tests and examples. This temporarily pins the gym version for now.
2022-02-18 17:39:22 -08:00
Jiajun Yao
6a17653ba7
API stability annotations for ray commands (#22420)
Annotate ray commands that are intended to be public.
2022-02-18 17:13:36 -08:00
Guyang Song
57a94aae12
[runtime env][bugfix] Fix runtime env retry (#22495)
- Bug: `error_message` is not cleared when the retry succeeds. This bug led to runtime env creation failing.
- Add test case for this.
2022-02-18 17:09:06 -08:00
Jiajun Yao
baa14d695a
Round robin during spread scheduling (#21303)
- Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling (see the sketch after this list): this is just a starting implementation and can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
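A minimal sketch of the round-robin idea for spread scheduling (illustrative Python, not Ray's actual C++ scheduling policy; class and method names are made up):
```python
import itertools

class RoundRobinSpreadPolicy:
    """Cycle over candidate nodes so spread tasks round-robin across them."""

    def __init__(self, node_ids):
        self._cycle = itertools.cycle(node_ids)

    def select_node(self):
        # Unlike hybrid scheduling, ignore utilization and just take the next node.
        return next(self._cycle)

policy = RoundRobinSpreadPolicy(["node-1", "node-2", "node-3"])
print([policy.select_node() for _ in range(5)])  # node-1, node-2, node-3, node-1, node-2
```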
2022-02-18 15:05:35 -08:00
mwtian
5a4c6d2e88
[Core] release GIL when running parallel_memcopy() / memcpy() during serializations (#22492)
While investigating #22161, it was observed that the GIL is held for an extended amount of time (up to 1000s) with stack trace [1]. It is possible that either there are many iterations within `Pickle5Writer.write_to()` calling `ray::parallel_memcopy()`, or a few `ray::parallel_memcopy()` calls take a long time (less likely). Either way, `ray::parallel_memcopy()` / `std::memcpy()` should not hold the GIL.
2022-02-18 14:11:12 -08:00
Stephanie Wang
03a5589591
[core] Enable lineage reconstruction in CI (#21519)
Enables lineage reconstruction in all CI and release tests.
2022-02-18 11:04:20 -08:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint, which can be any shell command. As such, a Job can have zero or more Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Archit Kulkarni
1f160114a0
[serve] [CI] change serve:test_runtime_env from medium to large (#22474)
This test was timing out occasionally.
2022-02-18 08:50:47 -06:00
Archit Kulkarni
df85d31095
[Serve] Make handle serializable (#22473) 2022-02-17 17:29:44 -08:00
Ian Rodney
c9a4b17f99
[YAMLs] Fix comments about autoscaler round-robining (#22002) 2022-02-17 13:59:05 -08:00
SangBin Cho
4ecb2afc2c
[State] Add pid to the actor table data. (#22434)
Users have requested the ability to get the pid of actors using ray.state.actors. This PR addresses that.
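A hypothetical usage sketch (the exact key name in the returned actor table is an assumption):
```python
import ray

ray.init()

@ray.remote
class Worker:
    def ready(self):
        return True

w = Worker.remote()
ray.get(w.ready.remote())

# Each actor entry should now carry the process id of the actor's worker.
for actor_id, info in ray.state.actors().items():
    print(actor_id, info.get("Pid"))  # "Pid" key name is an assumption
```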
2022-02-17 06:22:29 -08:00
Eric Liang
786c5759de
[data] Stage fusion optimizations, off by default (#22373)
This PR adds the following stage fusion optimizations (off by default). In a later PR, I plan to enable this by default for DatasetPipelines.
- Stage fusion: Whether to fuse compatible OneToOne stages.
- Read stage fusion: Whether to fuse read stages into downstream OneToOne stages. This is accomplished by rewriting the read stage (LazyBlockList) into a transformation over a collection of read tasks (BlockList -> MapBatches(do_read)).
- Shuffle stage fusion: Whether to fuse compatible OneToOne stages into shuffle stages that support specifying a map-side block UDF.

Stages are considered compatible if their compute strategy is the same ("tasks" vs "actors"), and they have the same Ray remote args. Currently, the PR is ignoring the remote args of read tasks, but this will be fixed as a followup (I didn't want to change the read tasks default here).
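A rough sketch of the compatibility rule described above (illustrative only; `Stage` and its fields are simplified stand-ins, not Ray's actual classes):
```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Stage:
    name: str
    fn: Callable          # block -> block transformation (OneToOne)
    compute: str          # "tasks" or "actors"
    ray_remote_args: Dict

def can_fuse(a: Stage, b: Stage) -> bool:
    # Compatible if the compute strategy and Ray remote args match.
    return a.compute == b.compute and a.ray_remote_args == b.ray_remote_args

def fuse(a: Stage, b: Stage) -> Stage:
    # Apply a's transformation, then b's, in a single fused stage.
    return Stage(
        name=f"{a.name}->{b.name}",
        fn=lambda block: b.fn(a.fn(block)),
        compute=a.compute,
        ray_remote_args=a.ray_remote_args,
    )
```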
2022-02-16 21:08:27 -08:00
Yi Cheng
e10a2fbcf9
[workflow] Move test_basic_workflows_2.py to large test (#22416)
test_basic_workflows_2.py times out. Move it to the large test suite.
2022-02-16 17:05:02 -08:00
Yi Cheng
83257a4193
Revert "[Client] chunked get requests" (#22455)
Reverts ray-project/ray#22100

linux://python/ray/tests:test_runtime_env_working_dir_remote_uri becomes very flaky after this PR.
2022-02-16 16:43:43 -08:00
Chen Shen
30ec0df9cc
[placement group] fix pg benchmark regression (#22441)
We added a warmup time in timeit which affects the pg benchmark time accounting. Add an option to skip the warmup.
2022-02-16 16:24:51 -08:00
Archit Kulkarni
606e2b2cde
Update license for MLflow's conda utils and virtualenv-clone (#22402)
When we vendor third-party code, we should update the LICENSE file.  Previously we vendored two pieces of code:
- conda utilities from MLflow
- virtualenv-clone
But we only included the attribution in the relevant source files, not in our LICENSE file.  This PR adds the necessary info to our LICENSE file.
2022-02-16 10:00:23 -06:00
Hao Chen
f2bbcf5adc
Fix test_traceback incompatibility with pytest 6.x (#22375)
Co-authored-by: Kai Yang <kfstorm@outlook.com>
2022-02-16 18:04:19 +08:00
Archit Kulkarni
63a5eb492d
Revert "[serve] Add basic REST API to dashboard (#22257)" (#22414)
This reverts commit f37f35c5da.
2022-02-15 21:47:50 -06:00
Eric Liang
2158df3a73
[data] Pre-reqs for implementing stage fusion (#22374) 2022-02-15 14:59:07 -08:00
Chris K. W
9a7979d9a2
[Client] chunked get requests (#22100)
Why are these changes needed?
Switches GetObject from unary-unary to unary-streaming so that large objects can be streamed across multiple messages (currently hardcoded to 64MiB chunks). This will allow users to retrieve objects larger than 2GiB from a remote cluster. If the transfer is interrupted by a recoverable gRPC error (i.e. temporary disconnect), then the request will be retried starting from the first chunk that hasn't been received yet.

Proto changes
GetRequests now have the field start_chunk_id, to indicate which chunk to start from (useful if we have to retry a request after already receiving some chunks). GetResponses now have a chunk_id (0-indexed chunk of the serialized object), total_chunks (total number of chunks, used in async transfers to determine when all chunks have been received), and total_size (the total size of the object in bytes, used to raise user warnings if the object being retrieved is very large).

Server changes
Mainly just updates the GetObject logic to yield chunks instead of returning a single response.

Client changes
At the moment, objects can be retrieved directly from the raylet servicer (ray.get) or asynchronously over the datapath (await some_remote_func.remote()). In both cases, the request will error if the chunk isn't valid (server side error) or if a chunk is received out of order (shouldn't happen in practice, since gRPC guarantees that messages in a stream either arrive in order or not at all).

ray.get is fairly straightforward, and changes are mainly to accommodate yielding from the stub instead of taking the value directly.

await some_remote_func.remote() is similar, but to keep things consistent with other async handling, collecting the chunks is handled by a ChunkCollector, which wraps the original callback.
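An illustrative sketch of the retry-from-last-chunk idea (not the actual Ray client code; the stub, response fields, and error class are stand-ins based on the description above):
```python
class RecoverableGrpcError(Exception):
    """Placeholder for a recoverable gRPC error (e.g. temporary disconnect)."""

def get_object_chunked(stub, request, max_retries=3):
    chunks = []
    for _ in range(max_retries):
        request.start_chunk_id = len(chunks)  # resume from the first missing chunk
        try:
            for resp in stub.GetObject(request):      # unary-streaming RPC
                assert resp.chunk_id == len(chunks)   # stream messages arrive in order
                chunks.append(resp.data)
                if resp.chunk_id + 1 == resp.total_chunks:
                    return b"".join(chunks)
        except RecoverableGrpcError:
            continue  # retry, keeping the chunks already received
    raise ConnectionError("GetObject failed after retries")
```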
2022-02-16 00:07:16 +02:00
Edward Oakes
f37f35c5da
[serve] Add basic REST API to dashboard (#22257) 2022-02-15 15:36:58 -06:00
Edward Oakes
9c07eabab9
[serve] Remove unused filter_tag and errant str redefinition (#22400) 2022-02-15 15:33:10 -06:00
Eric Liang
df4b56d32e
[minor] Fix dataset shuffle bug on empty blocks. (#22367)
There's an edge case where we can crash if empty blocks end up in shuffle (type gets inferred as Arrow, then fails when we add list-type blocks).
2022-02-15 13:18:54 -08:00
SangBin Cho
6eace8a305
[Test] Change the default encoding to utf-8 (#22286)
Follow up - https://github.com/ray-project/ray/pull/22248#pullrequestreview-878073629
2022-02-15 11:35:48 -08:00
Jialing He
4c73560b31
[runtime env] Support clone virtualenv from an existing virtualenv (#22309)
Before this PR, we couldn't run Ray in a virtualenv, because `runtime_env` did not support creating a new virtualenv from an existing one.

More details: https://github.com/ray-project/ray/pull/21801#discussion_r796848499
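A hypothetical usage sketch of what this enables (package name and path are illustrative): Ray itself runs from inside a virtualenv, and a task's `pip` runtime_env is built by cloning that existing virtualenv rather than from scratch.
```python
import ray

# Ray is running from inside a virtualenv; the per-task environment below is
# created by cloning that virtualenv.
ray.init()

@ray.remote(runtime_env={"pip": ["requests==2.26.0"]})
def fetch_version():
    import requests
    return requests.__version__

print(ray.get(fetch_version.remote()))
```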

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-02-15 12:51:01 -06:00
Matti Picus
199bf558e2
move slow test from small (timeout 60s) to medium (timeout 300s) (#22167) 2022-02-15 09:55:30 -08:00
Gagandeep Singh
7dc097a947
Unskipped tests for Windows (#21809)
These tests are passing without issues on my Windows machine, so unskipping them to check on CI.
I will push the linting changes separately so the test suite executes twice, to confirm that the flakiness is removed.

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-02-15 09:04:47 -08:00
Gagandeep Singh
a8341dfc29
Replace queue.Queue with multiprocessing.JoinableQueue (#21860)
The reason for not using `queue.Queue` for multiprocessing purposes on Windows is at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300.
The reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is https://stackoverflow.com/a/30725121.

AFAIK, this is because on Windows each process gets its own `Queue`, and hence nothing is shared among those processes. When `multiprocessing.Queue` is used, changes in it are shared via pipes internally, along with proper locks.
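A minimal sketch of the `JoinableQueue` pattern referenced above (standard-library only; not the Ray test code): the producer can block on `join()` until every item has been processed by the worker process.
```python
import multiprocessing

def worker(q):
    while True:
        item = q.get()
        try:
            if item is None:  # sentinel: stop the worker
                return
            print("processed", item)
        finally:
            q.task_done()

if __name__ == "__main__":
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    for i in range(5):
        q.put(i)
    q.put(None)
    q.join()   # blocks until task_done() has been called for every item
    p.join()
```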
2022-02-15 09:01:17 -08:00
Kai Fricke
c866131cc0
[tune] Retry cloud sync up/down/delete on fail (#22029) 2022-02-15 12:27:29 +00:00
dependabot[bot]
35ae459434
[tune](deps): Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (#22071)
* [tune](deps): Bump flaml from 0.6.7 to 0.9.6 in /python/requirements/ml

Bumps [flaml](https://github.com/microsoft/FLAML) from 0.6.7 to 0.9.6.
- [Release notes](https://github.com/microsoft/FLAML/releases)
- [Commits](https://github.com/microsoft/FLAML/compare/v0.6.7...v0.9.6)

---
updated-dependencies:
- dependency-name: flaml
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-02-15 09:53:58 +00:00
Jun Gong
6f5afcbce9
[RLlib] Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239) 2022-02-15 09:09:24 +01:00
Yi Cheng
2fbbd21351
[workflow] Fix event loop can't find in thread (#22363)
By default, an event loop is only set in the main thread, so workflow can't work when it's called from a thread other than the main thread, which can happen when it's called from a library (for example, Ray Serve).
This PR fixes it.
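An illustrative sketch of the underlying issue (standard-library only, not the workflow code): `asyncio.get_event_loop()` only auto-creates a loop in the main thread, so code running in a worker thread must create and set one explicitly.
```python
import asyncio
import threading

def run_in_thread():
    try:
        asyncio.get_event_loop()
    except RuntimeError:
        # No loop exists in this worker thread; create and register one.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
    asyncio.get_event_loop().run_until_complete(asyncio.sleep(0))
    print("loop available in", threading.current_thread().name)

threading.Thread(target=run_in_thread).start()
```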
2022-02-14 23:31:32 -08:00
matthewdeng
8f9e0d7f6b
[train] add TorchTensorboardProfilerCallback (#22345)
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.

```
  | File "ray_sgd_training.py", line 18, in <module>
  | from ray import train
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
  | from ray.train.callbacks import TrainingCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
  | from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
  | from torch.profiler import profile
  | ModuleNotFoundError: No module named 'torch.profiler'
```

A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:

```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
    "Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
2022-02-14 16:16:55 -08:00
Eric Liang
35a157948e
Lay the groundwork for lazy dataset optimization (no behavior changes) (#22233)
This PR refactors Dataset execution to enable lazy mode in the future, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also punted for future work.
2022-02-14 15:03:58 -08:00
Jialing He
192f9de421
[runtime env] Introduce async Manager.create (#22311) 2022-02-14 16:26:47 -06:00
Matti Picus
845861fdc1
[runtime env] use pytest tmp_path, os.path.sep, and unskip most tests for windows (#22342) 2022-02-14 16:04:10 -06:00
Archit Kulkarni
0e350c0074
[runtime env] [Doc] Add two ways of installing dependencies: cluster launcher, and runtime env (#20780)
We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments. 

This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two.

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-02-14 16:03:48 -06:00
Clark Zinzow
53c4c7b1be
[Datasets] Expose TableRow as public API; minimize copies/type conversions on row-based ops. (#22305)
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
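A sketch of the `Mapping`-based design mentioned in point 2 (illustrative, not Ray's actual class): implementing only `__getitem__`, `__iter__`, and `__len__` gives `keys()`, `values()`, `items()`, `get()`, etc. for free from `collections.abc.Mapping`.
```python
from collections.abc import Mapping

class TableRowSketch(Mapping):
    def __init__(self, row: dict):
        self._row = row

    def __getitem__(self, key):
        return self._row[key]

    def __iter__(self):
        return iter(self._row)

    def __len__(self):
        return len(self._row)

row = TableRowSketch({"a": 1, "b": 2})
print(row["a"], list(row.items()), row.get("c", 0))
```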
2022-02-14 12:56:17 -08:00
dependabot[bot]
767b349b99
[data](deps): Bump dask[complete] (#22334)
Bumps [dask[complete]](https://github.com/dask/dask) from 2022.1.0 to 2022.2.0.
- [Release notes](https://github.com/dask/dask/releases)
- [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md)
- [Commits](https://github.com/dask/dask/compare/2022.01.0...2022.02.0)

---
updated-dependencies:
- dependency-name: dask[complete]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-14 12:44:20 -08:00
Edward Oakes
610930ae6a
[serve] Improve health check failure semantics (#22297) 2022-02-14 14:04:03 -06:00
Clark Zinzow
443416907e
[Datasets] Fix boolean tensor column representation and slicing. (#22323)
This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
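A minimal illustration of the packing mismatch described above (NumPy only, not the Datasets code): NumPy uses one byte per boolean, while Arrow packs 8 booleans per byte, so conversion must pack and unpack bits.
```python
import numpy as np

bools = np.array([True, False, True, True, False, False, True, False])
packed = np.packbits(bools)                     # 1 byte holding 8 bools (Arrow-style)
unpacked = np.unpackbits(packed).astype(bool)   # back to 1 byte per bool (NumPy-style)
assert (unpacked == bools).all()
```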
2022-02-14 10:36:35 -08:00
Max Pumperla
d594b668bb
[docs] [tune] hyperopt notebook (#22315) 2022-02-12 02:46:03 -08:00
Eric Liang
85d6946c95
Split test_dataset.py into two (#22303) 2022-02-12 00:21:25 -08:00
Amog Kamsetty
4cbbc81f4c
[Train] Add support for trainer.best_checkpoint and Trainer.load_checkpoint_path (#22306)
Closes #22226
2022-02-11 22:29:37 -08:00
Kaushik B
8515fdd6db
[tune] Update Lightning examples to support PTL 1.5 (#20562)
To help resolve the issues users are facing when running Lightning examples with Ray Tune: PyTorchLightning/pytorch-lightning#10407

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-02-11 17:45:06 -08:00
Amog Kamsetty
e8e35169c6
[Train] Allow train methods to be called outside of the session (#21969)
Updates to address @worldveil's feedback:

- Include `import train.torch` in the docs.
- Allow methods in `session.py` to be called outside of the session with sensible defaults; these will no longer raise an error.

Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
2022-02-11 17:42:55 -08:00
jialin
851b853352
add optional empty lines filter in read_text (#22298)
ray.data.read_text() currently doesn't handle empty lines; this PR adds a flag to enable an empty-line filter.
With this change, read_text will only return non-empty lines by default, unless drop_empty_line is set to False.
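A hypothetical usage sketch, assuming the flag name given above (`drop_empty_line`) and an illustrative file path:
```python
import ray

# Default: empty lines are dropped.
ds = ray.data.read_text("/tmp/lines.txt")

# Keep empty lines by disabling the filter.
ds_with_blanks = ray.data.read_text("/tmp/lines.txt", drop_empty_line=False)
```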

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jialin Liu <jialin.liu@bytedance.com>
2022-02-11 14:49:45 -08:00