When we vendor third-party code, we should update the LICENSE file. Previously we vendored two pieces of code:
- conda utilities from MLflow
- virtualenv-clone
But we only included the attribution in the relevant source files, not in our LICENSE file. This PR adds the necessary info to our LICENSE file.
Why are these changes needed?
Switches GetObject from unary-unary to unary-streaming so that large objects can be streamed across multiple messages (currently hardcoded to 64 MiB chunks). This will allow users to retrieve objects larger than 2 GiB from a remote cluster. If the transfer is interrupted by a recoverable gRPC error (e.g., a temporary disconnect), the request will be retried starting from the first chunk that hasn't been received yet.
Proto changes
GetRequests now have a start_chunk_id field to indicate which chunk to start from (useful if we have to retry a request after already receiving some chunks). GetResponses now have a chunk_id (the 0-indexed chunk of the serialized object), total_chunks (the total number of chunks, used in async transfers to determine when all chunks have been received), and total_size (the total size of the object in bytes, used to raise user warnings if the object being retrieved is very large).
Server changes
Mainly updating the GetObject logic to yield chunks instead of returning a single response.
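For illustration, a minimal sketch of the server-side chunking under the assumptions above; the dict stands in for the actual GetResponse proto message and is not the real implementation:
```
# Illustrative sketch only: the dict is a stand-in for the GetResponse proto,
# and the chunk size matches the 64 MiB value described above.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB

def yield_object_chunks(serialized: bytes, start_chunk_id: int = 0):
    total_size = len(serialized)
    total_chunks = max(1, -(-total_size // CHUNK_SIZE))  # ceiling division
    for chunk_id in range(start_chunk_id, total_chunks):
        offset = chunk_id * CHUNK_SIZE
        yield {
            "chunk_id": chunk_id,  # 0-indexed position of this chunk
            "total_chunks": total_chunks,
            "total_size": total_size,
            "data": serialized[offset:offset + CHUNK_SIZE],
        }
```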
Client changes
At the moment, objects can be retrieved directly from the raylet servicer (ray.get) or asynchronously over the datapath (await some_remote_func.remote()). In both cases, the request will error if a chunk isn't valid (a server-side error) or if a chunk is received out of order (which shouldn't happen in practice, since gRPC guarantees that messages in a stream either arrive in order or not at all).
ray.get is fairly straightforward; the changes mainly accommodate yielding from the stub instead of taking the value directly.
await some_remote_func.remote() is similar, but to keep things consistent with other async handling, collecting the chunks is handled by a ChunkCollector, which wraps around the original callback, as sketched below.
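A hedged sketch of the collector pattern (the method and attribute names here are illustrative, not the actual client code): it buffers in-order chunks and only invokes the wrapped callback once the final chunk arrives.
```
class ChunkCollector:
    """Buffers streamed chunks and fires the wrapped callback when complete."""

    def __init__(self, callback):
        self._callback = callback
        self._buffer = bytearray()
        self._next_chunk_id = 0  # doubles as start_chunk_id for a retry

    def on_chunk(self, chunk_id: int, total_chunks: int, data: bytes):
        if chunk_id != self._next_chunk_id:
            # gRPC streams arrive in order or not at all, so an out-of-order
            # chunk indicates a server-side bug, not network reordering.
            raise RuntimeError(
                f"Expected chunk {self._next_chunk_id}, got {chunk_id}")
        self._buffer.extend(data)
        self._next_chunk_id += 1
        if self._next_chunk_id == total_chunks:
            self._callback(bytes(self._buffer))
```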
These tests pass without issues on my Windows machine, so I'm unskipping them to check on CI.
I will push the linting changes separately so that the test suite executes twice, confirming that the flakiness is gone.
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
The reason for not using `queue.Queue` for multiprocessing purposes on Windows is at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300
And the reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is https://stackoverflow.com/a/30725121
AFAIK, this is because on Windows each process gets its own `Queue`, and hence nothing is shared among those processes. When `multiprocessing.Queue` is used, changes to it are shared internally via pipes, along with proper locks.
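For illustration, a minimal `multiprocessing.JoinableQueue` example (not code from this PR) showing the `task_done()`/`join()` handshake that a plain `multiprocessing.Queue` lacks:
```
import multiprocessing

def worker(queue):
    while True:
        item = queue.get()
        if item is None:  # sentinel: no more work
            queue.task_done()
            break
        print(f"processed {item}")
        queue.task_done()

if __name__ == "__main__":  # required for multiprocessing on Windows
    q = multiprocessing.JoinableQueue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    for i in range(3):
        q.put(i)
    q.put(None)
    q.join()  # blocks until task_done() was called for every item
    p.join()
```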
1. In scheduling optimization, we should encapsulate `SchedulingResources`, `GcsNodeInfo`, and other node-related information into a `NodeContext`, which requires that `SchedulingResources` be shareable. This PR does not involve the `NodeContext` transformation logic; it only makes `SchedulingResources` shareable.
2. `cluster_scheduling_resources_` holds `SchedulingResources` objects by value, which brings some overhead when the map rehashes (even though std::move is used during rehashing).
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
Add a patch for newer setuptools; it can be removed once grpc 1.44 is released.
Why are these changes needed?
With grpc updated to 1.43, one of the patches is no longer needed.
A patch is still needed when building locally with a newer setuptools version. See grpc/grpc#28392 for more details.
Also needed as a prereq to #21221
This reverts commit 6235b6d7e9.
Looks like windows://python/ray/tests:test_dataclient_disconnect has a similar level of flakiness as before the revert. This seems unrelated, and the test needs to be fixed in another way.
By default, the asyncio event loop is only set in the main thread. This makes workflows unable to run when called from a thread other than the main thread, which can happen when the call comes from a library (for example, Ray Serve).
This PR fixes that.
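A minimal sketch of the underlying issue and the general shape of the fix (illustrative code, not the actual patch): `asyncio.get_event_loop()` only auto-creates a loop in the main thread, so a worker thread must create and set its own loop explicitly.
```
import asyncio
import threading

def run_in_thread():
    try:
        loop = asyncio.get_event_loop()
    except RuntimeError:
        # No loop is set in this (non-main) thread; create one explicitly.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
    print(loop.run_until_complete(asyncio.sleep(0, result="ok")))

t = threading.Thread(target=run_in_thread)
t.start()
t.join()
```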
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.
```
| File "ray_sgd_training.py", line 18, in <module>
| from ray import train
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
| from ray.train.callbacks import TrainingCallback
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
| from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
| from torch.profiler import profile
| ModuleNotFoundError: No module named 'torch.profiler'
```
A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:
```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
"Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
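For reference, a minimal sketch of this deferred-validation pattern (not the exact Ray implementation): the module imports cheaply, and the ImportError only surfaces when the profiler is actually constructed.
```
class TorchWorkerProfiler:
    def __init__(self):
        try:
            # Validate the torch installation only at construction time.
            from torch.profiler import profile  # noqa: F401
        except ImportError:
            raise ImportError(
                "Torch Profiler requires torch>=1.8.1. "
                "Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.")
```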
This PR refactors Dataset execution to enable a future lazy mode, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also deferred to future work.
We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments.
This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two.
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
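For illustration, a minimal sketch of the `Mapping`-based design (not Ray's actual `TableRow` implementation): implementing only the three abstract methods yields `keys()`, `values()`, `items()`, `get()`, and `__contains__` for free.
```
from collections.abc import Mapping

class TableRow(Mapping):
    def __init__(self, row: dict):
        self._row = row

    def __getitem__(self, key):
        return self._row[key]

    def __iter__(self):
        return iter(self._row)

    def __len__(self):
        return len(self._row)

row = TableRow({"a": 1, "b": 2})
print("a" in row, list(row.items()))  # True [('a', 1), ('b', 2)]
```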
This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
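A small NumPy sketch of the packing mismatch and the conversion in both directions (illustrative, not the actual patch); Arrow's bitmap layout is LSB-first, hence the `bitorder="little"` assumption:
```
import numpy as np

bools = np.array([True, False, True, True, False, True, False, False, True])

# ndarray -> Arrow-style buffer: pack 8 booleans per byte, LSB-first.
packed = np.packbits(bools, bitorder="little")

# Arrow-style buffer -> ndarray: unpack and trim the padding bits.
unpacked = np.unpackbits(packed, bitorder="little")[:len(bools)].astype(bool)

assert np.array_equal(bools, unpacked)
```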
It seems like the S3 read sometimes fails (#22214). I found that the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.
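As a sketch of the retry shape (the `read_file` callable and the exception type are placeholders, not Ray's actual API):
```
import time

def read_with_retry(read_file, path, max_attempts=3, backoff_s=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return read_file(path)
        except OSError:
            if attempt == max_attempts:
                raise  # transient error didn't clear; surface it
            time.sleep(backoff_s * attempt)  # linear backoff between retries
```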