
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Hao Chen	f2bbcf5adc	Fix test_traceback incompatibility with pytest 6.x (#22375) Co-authored-by: Kai Yang <kfstorm@outlook.com>	2022-02-16 18:04:19 +08:00
Archit Kulkarni	63a5eb492d	Revert "[serve] Add basic REST API to dashboard (#22257 )" (#22414 ) This reverts commit `f37f35c5da`.	2022-02-15 21:47:50 -06:00
Eric Liang	2158df3a73	[data] Pre-reqs for implementing stage fusion (#22374 )	2022-02-15 14:59:07 -08:00
Chris K. W	9a7979d9a2	[Client] chunked get requests (#22100) Why are these changes needed? Switches GetObject from unary-unary to unary-streaming so that large objects can be streamed across multiple messages (currently hardcoded to 64MiB chunks). This will allow users to retrieve objects larger than 2GiB from a remote cluster. If the transfer is interrupted by a recoverable gRPC error (e.g. a temporary disconnect), then the request will be retried starting from the first chunk that hasn't been received yet. Proto changes: GetRequests now have the field start_chunk_id, to indicate which chunk to start from (useful if we have to retry a request after already receiving some chunks). GetResponses now have a chunk_id (0-indexed chunk of the serialized object), total_chunks (total number of chunks, used in async transfers to determine when all chunks have been received), and total_size (the total size of the object in bytes, used to raise user warnings if the object being retrieved is very large). Server changes: Mainly just updating GetObject logic to yield chunks instead of returning. Client changes: At the moment, objects can be retrieved directly from the raylet servicer (ray.get) or asynchronously over the datapath (await some_remote_func.remote()). In both cases, the request will error if the chunk isn't valid (server side error) or if a chunk is received out of order (shouldn't happen in practice, since gRPC guarantees that messages in a stream either arrive in order or not at all). ray.get is fairly straightforward, and changes are mainly to accommodate yielding from the stub instead of taking the value directly. await some_remote_func.remote() is similar, but to keep things consistent with other async handling, collecting the chunks is handled by a ChunkCollector, which wraps around the original callback.	2022-02-16 00:07:16 +02:00
Edward Oakes	f37f35c5da	[serve] Add basic REST API to dashboard (#22257 )	2022-02-15 15:36:58 -06:00
Edward Oakes	9c07eabab9	[serve] Remove unused `filter_tag` and errant `str` redefinition (#22400 )	2022-02-15 15:33:10 -06:00
Eric Liang	df4b56d32e	[minor] Fix dataset shuffle bug on empty blocks. (#22367 ) There's an edge case where we can crash if empty blocks end up in shuffle (type gets inferred as Arrow, then fails when we add list-type blocks).	2022-02-15 13:18:54 -08:00
SangBin Cho	6eace8a305	[Test] Change the default encoding to utf-8 (#22286 ) Follow up - https://github.com/ray-project/ray/pull/22248#pullrequestreview-878073629	2022-02-15 11:35:48 -08:00
Jialing He	4c73560b31	[runtime env] Support clone `virtualenv` from an existing `virtualenv` (#22309) Before this PR, Ray could not run inside a virtualenv, because `runtime_env` did not support creating a new virtualenv from an existing virtualenv. More details: https://github.com/ray-project/ray/pull/21801#discussion_r796848499 Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>	2022-02-15 12:51:01 -06:00
Matti Picus	199bf558e2	move slow test from small (timeout 60s) to medium (timeout 300s) (#22167 )	2022-02-15 09:55:30 -08:00
Gagandeep Singh	7dc097a947	Unskipped tests for Windows (#21809) These tests are passing without issues on my Windows machine, so unskipping them to check on CI. I will push the linting changes separately to run the test suite twice and confirm that the flakiness is gone. Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2022-02-15 09:04:47 -08:00
Gagandeep Singh	a8341dfc29	Replace `queue.Queue` with `multiprocessing.JoinableQueue` (#21860) The reason for not using `queue.Queue` for multiprocessing purposes on Windows is at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300 The reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is https://stackoverflow.com/a/30725121 AFAIK, this is because on Windows each process gets its own `Queue` and hence nothing is shared among those processes. When `multiprocessing.Queue` is used, changes in it are shared via pipes internally along with proper locks.	2022-02-15 09:01:17 -08:00
Kai Fricke	c866131cc0	[tune] Retry cloud sync up/down/delete on fail (#22029 )	2022-02-15 12:27:29 +00:00
dependabot[bot]	35ae459434	[tune](deps): Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (#22071 ) * [tune](deps): Bump flaml from 0.6.7 to 0.9.6 in /python/requirements/ml Bumps [flaml](https://github.com/microsoft/FLAML) from 0.6.7 to 0.9.6. - [Release notes](https://github.com/microsoft/FLAML/releases) - [Commits](https://github.com/microsoft/FLAML/compare/v0.6.7...v0.9.6) --- updated-dependencies: - dependency-name: flaml dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-02-15 09:53:58 +00:00
Jun Gong	6f5afcbce9	[RLlib] Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239 )	2022-02-15 09:09:24 +01:00
Yi Cheng	2fbbd21351	[workflow] Fix event loop can't find in thread (#22363) By default, an event loop is only set in the main thread, so workflow fails when it is called from any other thread, which can happen when it is called from a library (for example Ray Serve). This PR fixes it.	2022-02-14 23:31:32 -08:00
matthewdeng	8f9e0d7f6b	[train] add TorchTensorboardProfilerCallback (#22345 ) The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`. ``` \| File "ray_sgd_training.py", line 18, in <module> \| from ray import train \| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module> \| from ray.train.callbacks import TrainingCallback \| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module> \| from ray.train.callbacks.profile import TorchTensorboardProfilerCallback \| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module> \| from torch.profiler import profile \| ModuleNotFoundError: No module named 'torch.profiler' ``` A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes: 1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized. 2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed: ``` >>> import ray >>> import ray.train >>> import ray.train.torch >>> from ray.train.torch import TorchWorkerProfiler >>> twp = TorchWorkerProfiler() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__ "Torch Profiler requires torch>=1.8.1. " ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler. ```	2022-02-14 16:16:55 -08:00
Eric Liang	35a157948e	Lay the groundwork for lazy dataset optimization (no behavior changes) (#22233 ) This PR refactors Dataset execution to enable lazy mode in the future, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also punted for future work.	2022-02-14 15:03:58 -08:00
Jialing He	192f9de421	[runtime env] Introduce async Manager.create (#22311 )	2022-02-14 16:26:47 -06:00
Matti Picus	845861fdc1	[runtime env] use pytest tmp_path, os.path.sep, and unskip most tests for windows (#22342 )	2022-02-14 16:04:10 -06:00
Archit Kulkarni	0e350c0074	[runtime env] [Doc] Add two ways of installing dependencies: cluster launcher, and runtime env (#20780 ) We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments. This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two. Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2022-02-14 16:03:48 -06:00
Clark Zinzow	53c4c7b1be	[Datasets] Expose `TableRow` as public API; minimize copies/type conversions on row-based ops. (#22305 ) This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made: 1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions. 2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.	2022-02-14 12:56:17 -08:00
dependabot[bot]	767b349b99	[data](deps): Bump dask[complete] (#22334 ) Bumps [dask[complete]](https://github.com/dask/dask) from 2022.1.0 to 2022.2.0. - [Release notes](https://github.com/dask/dask/releases) - [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md) - [Commits](https://github.com/dask/dask/compare/2022.01.0...2022.02.0) --- updated-dependencies: - dependency-name: dask[complete] dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-02-14 12:44:20 -08:00
Edward Oakes	610930ae6a	[serve] Improve health check failure semantics (#22297 )	2022-02-14 14:04:03 -06:00
Clark Zinzow	443416907e	[Datasets] Fix boolean tensor column representation and slicing. (#22323 ) This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.	2022-02-14 10:36:35 -08:00
Max Pumperla	d594b668bb	[docs] [tune] hyperopt notebook (#22315 )	2022-02-12 02:46:03 -08:00
Eric Liang	85d6946c95	Split test_dataset.py into two (#22303 )	2022-02-12 00:21:25 -08:00
Amog Kamsetty	4cbbc81f4c	[Train] Add support for `trainer.best_checkpoint` and `Trainer.load_checkpoint_path` (#22306 ) Closes #22226	2022-02-11 22:29:37 -08:00
Kaushik B	8515fdd6db	[tune] Update Lightning examples to support PTL 1.5 (#20562) Helps resolve the issues users are facing when running Lightning examples with Ray Tune: PyTorchLightning/pytorch-lightning#10407 Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2022-02-11 17:45:06 -08:00
Amog Kamsetty	e8e35169c6	[Train] Allow `train` methods to be called outside of the session (#21969 ) Updates to address @worldveil's feedback: Include import train.torch in the docs Allow methods in session.py to be called outside of the session with sensible defaults. These will no longer raise an error. Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>	2022-02-11 17:42:55 -08:00
jialin	851b853352	add optional empty lines filter in read_text (#22298) ray.data.read_text() currently doesn't take care of empty lines; this PR adds a flag to enable the empty line filter. With this change, read_text will only return non-empty lines by default, unless drop_empty_line is set to False. Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Jialin Liu <jialin.liu@bytedance.com>	2022-02-11 14:49:45 -08:00
Edward Oakes	49b3e6c53c	[serve] Support user-provided health check via `def check_health(self)` method (#22178 )	2022-02-11 12:53:37 -06:00
matthewdeng	2c204a755b	[train] add minimal installation test suite (#22300 ) Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path. Example: If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.	2022-02-11 10:09:00 -08:00
Archit Kulkarni	1c0c2aaba2	[runtime env] Add test for scheduling task after failed job env (#22224 ) Adds a test to make sure a failed job runtime env creation doesn't hang the cluster (i.e. tasks can still be scheduled on the job, as long as the tasks' runtime env can be created.). Test requested by @rkooo567, good idea!	2022-02-11 11:01:16 -06:00
Chen Shen	bb6cb0898b	[Dataset] avoid pyarrow 7.0.0 for dataset (#22253 )	2022-02-11 00:32:47 -08:00
Eric Liang	02add259ca	Add more details to the internal error for "worker cannot find registered function" (#22302 ) This adds some more debug information for this internal error that shouldn't happen.	2022-02-10 23:20:17 -08:00
Edward Oakes	dd097b7a9b	[serve] Fix HTTP proxy controller namespace bug (#22287 ) Closes https://github.com/ray-project/ray/issues/22265 This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.	2022-02-10 21:05:35 -06:00
Clark Zinzow	13c8e10b3b	[Datasets] Unrevert NaN handling. (#22291) Reverts #22258, unreverting #20787. The fix is in the ["Fix tests" commit](`b559da2407`), where we switch to using the test utility DataFrame equality comparison which properly handles NaN comparisons. The underlying cause of this test break is explained [here](https://github.com/ray-project/ray/pull/22258#issuecomment-1035404700).	2022-02-10 16:19:53 -08:00
Amog Kamsetty	09e46066eb	[Train] Fix accuracy calculation for CIFAR example (#22292 ) Same as #21689 except for cifar	2022-02-10 15:06:31 -08:00
Archit Kulkarni	94f73de23c	Revert "[Serve] [Windows] Unskip all but `test_redeploy_single_replica` in `test_deploy.py` (#21391 )" (#22299 ) This reverts commit `000c56f764`.	2022-02-10 14:49:34 -08:00
Edward Oakes	48adb6f7bb	[serve] Introduce DeploymentStatus, poll for statuses instead of using async goals (#22121 )	2022-02-10 12:33:04 -08:00
mwtian	9bc6f13515	[Autoscaler] make `--redis-address` not required (#22083 ) `--redis-address` should not be required, since starting autoscaler with `--gcs-address` is supported too.	2022-02-10 11:20:31 -08:00
Liu Bao	824453dd17	[runtime env] Create virtualenv for pip runtime env. (#21801 )	2022-02-10 12:25:18 -06:00
mwtian	c9fed9dec2	Revert "[Client] avoid locking in async send" (#22283 ) Reverts ray-project/ray#22193, which makes `windows://python/ray/serve:test_ray_client` very flaky (timeouts).	2022-02-10 10:17:07 -08:00
shrekris-anyscale	cc9018c29a	Obtain deployment definitions via import (#22272 ) Currently, Serve deployments must store their class or function definitions in the `Deployment` object's `func_or_class` attribute. However, the declarative API must be able to initiate deployments using only their import path. This allows users to separately define their functions or classes, and pull these functions and classes into their clusters via [remote URIs](https://docs.ray.io/en/releases-1.9.2/handling-dependencies.html#remote-uris). With this change, `Deployment` objects can store an import path string as their `func_or_class`. This import path is then used to import the deployment's code definition when the `Deployment`'s replica is created.	2022-02-10 10:20:45 -06:00
mwtian	2cee219250	[Core] avoid warning when receiving too much logs from a different job (#22102) When logs are not intended for the current driver, skip the warning about too many logs being generated, and clear the counters for log rates. Ideally the log subscriber should only subscribe to logs from the current job, and system logs. But that change carries risk, so it can be done in another PR.	2022-02-10 15:17:26 +09:00
Balaji Veeramani	abad268549	Comment `fmt: off` annotations (#21984 ) Code formatting is disabled in several modules with the explanation > [The module] ignores yapf because yapf doesn't allow comments right after code blocks, but we put comments right after code blocks to prevent large white spaces in the documentation. Since we no longer use YAPF, it may be possible to re-enable code formatting on these modules. I've added "FIXME" comments requesting developers to check whether code formatter appeasements are still necessary.	2022-02-09 22:12:11 -08:00
Jiajun Yao	07a1ba8e34	Update local object store usage (#22157 ) * Update local object store usage * fix * test	2022-02-09 22:08:25 -08:00
Dmitri Gekhtman	f51566e622	Prep K8s operator for the Ray 1.11.0 release. (#22264 ) For consistency and safety, we fix an explicit 6379 port for all default and example configs for Ray on K8s. Documentation is updated to recommend matching Ray versions in operator and Ray cluster.	2022-02-09 18:59:50 -08:00
Stephanie Wang	495eb14179	[core] Recover spilled objects that are lost during node failure (#21485 ) * Failing test * trigger recovery from ref counter * x * update * lint * stress test * update * format * x	2022-02-09 18:22:16 -08:00
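
The chunked retrieval described in commit 9a7979d9a2 (resume from the first chunk not yet received, reject out-of-order chunks) can be illustrated with a small sketch. This is not the Ray client code: `FlakyServer`, `fetch`, and the 4-byte `CHUNK_SIZE` are hypothetical stand-ins, and only the field names `start_chunk_id`, `chunk_id`, `total_chunks`, and `total_size` come from the commit message.

```python
from typing import Iterator, List, NamedTuple

CHUNK_SIZE = 4  # tiny for illustration; the commit message mentions 64MiB chunks


class GetResponse(NamedTuple):
    # Field names follow the commit message; the class itself is hypothetical.
    chunk_id: int
    total_chunks: int
    total_size: int
    data: bytes


class FlakyServer:
    """Hypothetical in-memory stand-in for a streaming GetObject endpoint."""

    def __init__(self, payload: bytes, fail_at: int):
        self.payload = payload
        self.fail_at = fail_at  # chunk index at which the stream drops once
        self.failed = False

    def get_object(self, start_chunk_id: int = 0) -> Iterator[GetResponse]:
        total_size = len(self.payload)
        total_chunks = max(1, -(-total_size // CHUNK_SIZE))  # ceiling division
        for chunk_id in range(start_chunk_id, total_chunks):
            if not self.failed and chunk_id == self.fail_at:
                self.failed = True
                raise ConnectionError("simulated temporary disconnect")
            start = chunk_id * CHUNK_SIZE
            yield GetResponse(chunk_id, total_chunks, total_size,
                              self.payload[start:start + CHUNK_SIZE])


def fetch(server: FlakyServer, max_retries: int = 3) -> bytes:
    """Collect all chunks, resuming from the first missing chunk on error."""
    chunks: List[bytes] = []
    retries = 0
    while True:
        try:
            for resp in server.get_object(start_chunk_id=len(chunks)):
                if resp.chunk_id != len(chunks):
                    raise RuntimeError("chunk received out of order")
                chunks.append(resp.data)
                if len(chunks) == resp.total_chunks:
                    return b"".join(chunks)
            raise RuntimeError("stream ended before all chunks arrived")
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise


payload = b"hello chunked world"
server = FlakyServer(payload, fail_at=2)
result = fetch(server)
```

Because the retry passes `len(chunks)` as `start_chunk_id`, chunks already received before the disconnect are not transferred again.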

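Commit 53c4c7b1be notes that deriving from `collections.abc.Mapping` means only `__getitem__`, `__iter__`, and `__len__` need to be implemented. A minimal sketch of that pattern follows; the `Row` class is a hypothetical illustration and not Ray's `TableRow`.

```python
from collections.abc import Mapping


class Row(Mapping):
    """Hypothetical read-only row view (illustration only, not Ray's TableRow)."""

    def __init__(self, columns, values):
        self._data = dict(zip(columns, values))

    # The three abstract methods required by Mapping:
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


row = Row(["id", "name"], [1, "a"])

# keys(), items(), values(), get(), __contains__ and __eq__ are mixin
# methods supplied by Mapping; none of them is defined on Row itself.
as_dict = dict(row)
has_id = "id" in row
missing = row.get("missing")
```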

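The situation behind commit 2fbbd21351 (no event loop is set in threads other than the main thread) can be reproduced with plain `asyncio`. The sketch below shows one common workaround, creating and setting a loop on demand; the helper `get_or_create_event_loop` is a hypothetical name and this is an assumption about the general technique, not the actual change made in the PR.

```python
import asyncio
import threading


def get_or_create_event_loop() -> asyncio.AbstractEventLoop:
    """Return this thread's event loop, creating and setting one if needed."""
    try:
        return asyncio.get_event_loop()
    except RuntimeError:
        # Raised in non-main threads, where no loop is set by default.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop


async def compute() -> int:
    await asyncio.sleep(0)
    return 42


def worker(results: list) -> None:
    loop = get_or_create_event_loop()
    try:
        results.append(loop.run_until_complete(compute()))
    finally:
        loop.close()


results = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()
thread.join()
```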
6076 commits
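
Commit cc9018c29a describes storing an import path string and resolving it to the class or function when a replica is created. A minimal sketch of resolving such a path with the standard library is shown below; `import_attr` is a hypothetical helper written for illustration, and the dotted `module.attribute` path format is an assumption rather than something the commit message specifies.

```python
import collections
import importlib
import os.path


def import_attr(path: str):
    """Resolve a dotted path such as 'package.module.attr' to the object it names."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


ordered_dict_cls = import_attr("collections.OrderedDict")
join_func = import_attr("os.path.join")
```

Resolving the path lazily means the definition only has to be importable in the process that actually needs it.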