hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	8affbc7be6	[tune/train] Consolidate checkpoint manager 3: Ray Tune (#24430 ) Update: This PR is now part 3 of a three PR group to consolidate the checkpoints. 1. Part 1 adds the common checkpoint management class #24771 2. Part 2 adds the integration for Ray Train #24772 3. This PR builds on #24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class. Old PR description: This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core. - This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core - This PR prepares Tune to move to a CheckpointStrategy object - In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather on runtime for Ray Train) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-06-08 12:05:34 +01:00
kourosh hakhamaneshi	4cdd508f70	[RLlib] Added CRR implementation. (#25499 )	2022-06-08 11:42:02 +02:00
Lixin Wei	00dbff507f	[Build] Upgrade the tool for generating `compile_commands.json` (#25430 )	2022-06-08 15:00:49 +08:00
shrekris-anyscale	d75fd5d9f3	Make shrekris-anyscale a codeowner of Serve (docs) (#25576 )	2022-06-07 18:25:46 -07:00
xwjiang2010	29a063afdf	[air] add feast example (#25417 )	2022-06-07 14:55:42 -07:00
Amog Kamsetty	e0a63f770f	[Data/AIR] Move `TensorExtension` to `ray.air` for use in other packages (#25517 ) Moves Tensor extensions to ray.air to facilitate their use in other Ray libraries (AIR, Serve).	2022-06-07 14:53:22 -07:00
xwjiang2010	76b34d4a03	[air] add to_air_checkpoint method for inference only workload. (#25444 ) Follow up on our last discussion for supporting piecemeal fashion air users. Only did for tensorflow for now, want to collect some feedback on API naming, package structure etc and I will add others.	2022-06-07 14:50:39 -07:00
Sebastián Ramírez	3257994e80	♻️ Refactor types to detect invalid extra arguments (#25541 ) Currently, each function decorated with `@ray.remote` is marked with type annotations as a `RemoteFunction` class (only used for type annotations, autocompletion, inline errors, etc). The current class takes several type parameters. And then it uses those parameters in the extended `func.remote()` method. But with the current type annotations, it marks any of the unused type parameters as `None`. This means that calling the `.remote()` method would check the first (actual) arguments and the rest are marked as `None`, but that means that for type annotations it considers "correct" to pass extra `None` arguments, while actually, that would not be valid. So, this doesn't show an error, but it should: ![Screenshot 2022-06-07 at 05 38 48](https://user-images.githubusercontent.com/1326112/172360355-9b344220-7824-4b5c-87da-038f5b53fe04.png) ...those 2 extra `None` values should be marked as invalid. After this PR, those invalid extra arguments would be marked as invalid: ![Screenshot 2022-06-07 at 05 42 10](https://user-images.githubusercontent.com/1326112/172360956-424b40d4-8197-4663-8298-617a1df37658.png) And: ![Screenshot 2022-06-07 at 05 42 50](https://user-images.githubusercontent.com/1326112/172361140-eb93c675-f5d6-4e0c-b9b2-83c4801bb450.png) ## More context I also tried the new `TypeVarTuple`, it might simplify these type annotations in the future, but it's not currently supported by mypy yet, it's a very recent addition to the language (and `typing_extensions`) so it's probably too early to adopt it.	2022-06-07 14:34:34 -07:00
Antoni Baum	3876fcdbe8	[CI] Add bazel py_test checking for Serve (#25509 )	2022-06-07 10:54:10 -07:00
Jun Gong	9b65d5535d	[RLlib] Introduce basic connectors library. (#25311 )	2022-06-07 19:18:14 +02:00
Amog Kamsetty	4e887fe776	[Tune] Remove docstring for private _StatusReporter (#25520 ) Remove outdated docstrings for _StatusReporter. In response to https://discuss.ray.io/t/how-to-use-ray-tune-function-runner-statusreporter-with-tune-with-parameters/6400/2	2022-06-07 10:11:29 -07:00
Simon Mo	7471b1fa41	[Serve] [AIR] ModelWrapper improvements and docs (#25003 ) * batching collation code and tests * wip notebook for np and dataframe * finish content * reset ray-more-libs changes * add comments * run through * Apply suggestions from code review Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com> * rename package * lint * richard's comment Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>	2022-06-07 08:53:10 -07:00
John B Nelson	e913352bdc	[Doc] Remove trailing period symbol install instruction (#25543 )	2022-06-07 08:08:04 -07:00
Kai Fricke	45bf925ef0	[train/serve] Fix torch tune serve test (#25547 ) #24772 broke the smoke test as it was not run on CI - this PR hotfixes this	2022-06-07 15:54:37 +01:00
Kai Fricke	984b9a5e6c	[tune/train] Consolidate checkpoint manager 2: Ray Train (#24772 ) This is a follow-up from #24771 which moves the Ray Train implementation to use the new common checkpoint manager class.	2022-06-07 13:51:42 +01:00
Rohan Potdar	a9d8da0100	[RLlib]: Doubly Robust Off-Policy Evaluation. (#25056 )	2022-06-07 12:52:19 +02:00
Artur Niederfahrenhorst	429d0f0eee	[RLlib] Fix multi agent environment checks for observations that contain only some agents' obs each step. (#25506 )	2022-06-07 10:33:35 +02:00
Artur Niederfahrenhorst	35bd397181	[RLlib] Better default values for `training_intensity` and `target_network_update_freq` for R2D2. (#25510 )	2022-06-07 10:29:56 +02:00
Zhe Zhang	6793426a9d	[Docs; RLlib] Remove `$` from rllib pip install instructions (#25358 )	2022-06-07 08:57:17 +02:00
Philipp Moritz	ec02e78b01	[docs] Use better method to mock ObjectRef (#25535 ) Actually fix #25498	2022-06-06 23:50:52 -07:00
Eric Liang	c1afbcb6f4	[air] Enforce API stability annotations for AIR module (#25485 )	2022-06-06 22:52:21 -07:00
Eric Liang	78688a0903	Enable streaming ingest in AIR (#25428 ) This adds the following options to DatasetConfig, which can be used to enable streaming ingest. ``` # Whether the dataset should be streamed into memory using pipelined reads. # When enabled, get_dataset_shard() returns DatasetPipeline instead of Dataset. # The amount of memory to use is controlled by `stream_window_size`. # False by default for all datasets. use_stream_api: Optional[bool] = None # Configure the streaming window size in bytes. A typical value is something like # 20% of object store memory. If set to -1, then an infinite window size will be # used (similar to bulk ingest). This only has an effect if use_stream_api is set. # Set to 1.0 GiB by default. stream_window_size: Optional[float] = None # Whether to enable global shuffle (per pipeline window in streaming mode). Note # that this is an expensive all-to-all operation, and most likely you want to use # local shuffle instead. # False by default for all datasets. global_shuffle: Optional[bool] = None ```	2022-06-06 17:42:15 -07:00
Yi Cheng	aabe9e73ef	Revert "[Serve] Depend on uvicorn[standard] instead of uvicorn so that it pulls in uvloop (#25027 )" (#25530 ) This reverts commit `9a510f92cf`.	2022-06-06 16:41:42 -07:00
Richard Liaw	86837fa637	[docs/air] update order of documentation in toc (#25527 ) Signed-off-by: Richard Liaw <rliaw@berkeley.edu>	2022-06-06 16:23:30 -07:00
Amog Kamsetty	365fc44754	[AIR] Update to new Predictor interface (#25425 ) Updates the Predictor interface to have Pandas as a narrow waist.	2022-06-06 15:41:38 -07:00
G Goswami	7ddc23a8f5	Fixing example (#25524 ) Remove quotes from K8s job submission example in docs.	2022-06-06 18:21:19 -04:00
Richard Liaw	36aee6a1c4	[air/docs] Update documentation structure (#25475 ) Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-06-06 15:15:11 -07:00
Philipp Moritz	406c2c5778	[docs] Fix mock objects in Ray Core docs (#25498 ) Our API references are currently showing mock objects for some of our APIs -- this PR fixes them for the Ray Core API reference.	2022-06-06 15:09:01 -07:00
Kai Fricke	a0c8db1b5e	[release] Update download_wheels.sh to include Python 3.10 (#25508 ) Currently the download script does not contain python 3.10	2022-06-06 22:42:50 +01:00
simonsays1980	2a5d322e70	[tune] Relative logdir paths in trials for ExperimentAnalysis in remote buckets (#25063 ) When running an experiment for example in the cloud and syncing to a bucket the logdir path in the trials will be changed when working with the checkpoints in the bucket. There are some workarounds, but the easier solution is to also add a rel_logdir containing the relative path to the trials/checkpoints that can handle any changes in the location of experiment results. As discussed with @Yard1 and @krfricke Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-06-06 22:41:41 +01:00
Vince Jankovics	68444cd390	[tune] Custom resources per worker added to default_resource_request (#24463 ) This resolves the `TODO(ekl): add custom resources here once tune supports them` item. Also, related to the discussion [here](https://discuss.ray.io/t/reserve-workers-on-gpu-node-for-trainer-workers-only/5972/5). Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-06-06 22:41:02 +01:00
Florian Boucault	9a510f92cf	[Serve] Depend on uvicorn[standard] instead of uvicorn so that it pulls in uvloop (#25027 )	2022-06-06 14:23:00 -07:00
Zhe Zhang	2d74ecc2ec	[Docs] [Clusters] Fix issues in the overview part of Cluster Deployment Guide, and fix a typo (#25473 ) * Fix issues in the overview part, and fix a typo * Addressing comment Co-authored-by: Alex Wu <alex@anyscale.com>	2022-06-06 14:11:41 -07:00
Philipp Moritz	8aff562c2f	[docs] Cleanup ray init docs (#25492 )	2022-06-06 13:16:32 -07:00
Balaji Veeramani	5e06baa77e	[AIR] Remove `/Users/balaji` from Torch example (#25515 )	2022-06-06 13:13:54 -07:00
Sihan Wang	0441834021	[Serve] Fix test_standalone flacky (#25513 )	2022-06-06 13:13:32 -07:00
kimikuri	60f59bd804	[Serve] Fix misspell in Serve Doc User Guides. (#25494 )	2022-06-06 13:00:20 -07:00
shrekris-anyscale	e433424796	[Serve] Checkpoint the `DeploymentState`'s `_deleting` attribute (#25478 )	2022-06-06 12:06:51 -07:00
Eric Liang	94dec83a60	[data] Rename data.impl to data._internal (#25486 )	2022-06-06 11:39:53 -07:00
shrekris-anyscale	ce3faed897	[Serve] Avoid deserializing `ReplicaConfig` properties in the Serve controller (#25213 )	2022-06-06 11:08:06 -07:00
Jiao	aa965ba0a9	[Deployment Graph] Add visualization cookbook (#25112 )	2022-06-06 11:05:58 -07:00
mwtian	1ce0ab7b7c	[Core] Export additional metrics for workers and Raylet memory (#25418 ) Add visibility into the following to help Ray users and developers debug performance and OOM issues: Raylet memory usage broken down by USS vs remaining RSS. Total workers' count, CPU percentage usage, and memory usage.	2022-06-06 10:58:14 -07:00
Andrew Li	3853186472	Exposed upscaling_speed and idle_timeout_minutes to values.yaml, #25312 (#25495 ) Exposed upscaling_speed and idle_timeout_minutes to values.yaml.	2022-06-06 13:26:06 -04:00
Alex Wu	a9bf8d455f	[github] Update code owners for cluster docs (#25507 ) In the same spirit of #25479 adding myself and @DmitriGekhtman as code owners of the autoscaler/cluster launcher docs since we are also the code owners for the code.	2022-06-06 09:36:39 -07:00
Balaji Veeramani	c4898ed7df	[AIR] [Datasets] Add `convert_pandas_to_tf_tensor` (#25133 ) Dataset.to_tf and TensorflowPredictor attempt to convert Pandas dataframes to NumPy arrays by calling DataFrame.values. However, DataFrame.values fails if the dataframe contains multidimensional arrays. This PR solves this problem by introducing a function convert_pandas_to_tf_tensor. The implementation of the function is based on the implementation of convert_pandas_to_torch_tensor.	2022-06-06 08:29:51 -07:00
Artur Niederfahrenhorst	5133978adc	[RLlib] PG policy subclassing conversion. (#25288 )	2022-06-06 13:07:47 +02:00
Artur Niederfahrenhorst	243038d00a	[RLlib] Issue 25401: Faulty usage of get_filter_config in ComplexInputNetworks (#25493 )	2022-06-06 13:04:17 +02:00
kourosh hakhamaneshi	d49d0efbaf	[RLlib] Bug fix: when on GPU, sample_batch.to_device() only converts the device and does not convert float64 to float32. (#25460 )	2022-06-06 12:43:11 +02:00
Artur Niederfahrenhorst	c4a0e9d0f2	[RLlib] Disambiguate timestep fragment storage unit in replay buffers. (#25242 )	2022-06-06 11:35:49 +02:00
Sebastián Ramírez	298742d724	♻️ Refactor type annotations for `.remote()` to avoid incorrect autocompletion and checks (#25480 ) With the current type annotations for the `.remote()` method generated in decorated functions, editors understand that there are some keyword arguments `arg0`, `arg1`, etc. Which are incorrect as the actual function will probably have different names for its arguments. For example, this shouldn't autocomplete `arg0`, `arg1`, etc: ![Screenshot 2022-06-04 at 06 13 46](https://user-images.githubusercontent.com/1326112/171996654-12248369-cf10-4fce-9ea2-5deb4ca8e2bd.png) If anything, it should autocomplete `x` and `y` (although that's currently [not perfectly doable](https://github.com/python/typing/discussions/1163)). By updating the type annotations to use [arguments prefixed with double underscores](https://mypy.readthedocs.io/en/stable/protocols.html?highlight=double%20underscore#callback-protocols) at least it tells tooling to not provide autocompletion for those args (which would be incorrect). While still providing inline errors for invalid types. ![Screenshot 2022-06-04 at 06 20 26](https://user-images.githubusercontent.com/1326112/171996806-560c0fa8-0ee3-477c-9906-71e880c84e56.png)	2022-06-05 16:21:53 -07:00
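
The double-underscore convention described in commit 298742d724 can be illustrated with a minimal sketch. It is not Ray's actual `RemoteFunction` class: `SketchRemoteFunction` and `add` are hypothetical names, the wrapper is fixed at two arguments, and `remote()` here calls the function synchronously rather than scheduling a task. The point is only that type checkers such as mypy treat parameters named with a leading double underscore as positional-only, so tooling has no reason to suggest them as keyword arguments while argument types are still checked.

```python
from typing import Callable, Generic, TypeVar

T0 = TypeVar("T0")
T1 = TypeVar("T1")
R = TypeVar("R")


class SketchRemoteFunction(Generic[T0, T1, R]):
    """Hypothetical stand-in for a typed wrapper around a decorated function."""

    def __init__(self, func: Callable[[T0, T1], R]) -> None:
        self._func = func

    # Parameters prefixed with a double underscore are treated by type
    # checkers as positional-only, so editors should not offer them as
    # keyword names; the argument types T0 and T1 are still enforced.
    def remote(self, __arg0: T0, __arg1: T1) -> R:
        # Assumption for this sketch: call directly instead of scheduling.
        return self._func(__arg0, __arg1)


def add(x: int, y: int) -> int:
    return x + y


wrapped = SketchRemoteFunction(add)
result = wrapped.remote(1, 2)  # positional call, as the annotations intend
```

Under these annotations a type checker would be expected to flag `wrapped.remote("a", 2)` as a type error, since `T0` is bound to `int` by `add`.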

13032 commits