hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 12:56:46 -04:00

Author	SHA1	Message	Date
SangBin Cho	856bea31fb	[State Observability] Ray log CLI / API (#25481 ) This PR implements the basic log APIs. For the better APIs (like higher level APIs like ray logs actors), it will be implemented after the internal API review is done. # If there's only 1 match, print a file content. Otherwise, print all files that match glob. ray logs [glob_filter] --node-id=[head node by default] Args: --tail: Tail the last X lines --follow: Follow the new logs --actor-id: The actor id --pid --node-ip: For worker logs --node-id: The node id of the log --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)	2022-06-13 05:52:57 -07:00
Jiao	f8b0ab7e78	[Ray DAG] Add documentation in `more options` section (#25528 )	2022-06-12 09:47:20 -07:00
Philipp Moritz	d8ec5929b6	Exclude Bazel build files from Ray wheels (#25679 ) Including the Bazel build files in the wheel leads to problems if the Ray wheels are brought in as a dependency from another bazel workspace, since that workspace will not recurse into the directories of the wheel that contain BUILD files -- this can lead to dropped files. This only happens for macOS wheels, on linux wheels the BUILD files were already excluded.	2022-06-11 16:05:59 -07:00
Sven Mika	130b7eeaba	[RLlib] `Trainer` to `Algorithm` renaming. (#25539 )	2022-06-11 15:10:39 +02:00
Yi Cheng	0c527b4502	[1/2][serve] Use GcsClient to replace the kv client to use timeout. (#25633 ) Timeout is only introduced in GcsClient due to the reason that ray client is not defining the timeout well for their API and it's a lot of effort to make it work e2e. For built-in component, we should use GcsClient directly. This PR use GcsClient to replace the old one to integrate GCS HA with Ray Serve.	2022-06-10 23:41:49 -07:00
Eric Liang	d36fd77548	[air] Allow fusing task and actor stages if they have compatible resource types (#25683 )	2022-06-10 19:04:27 -07:00
Clark Zinzow	4fb92dd2f1	[Datasets] Fix `__array__` protocol on `TensorArrayElement` and `TensorArray`. (#25647 ) This PR fixes two issues with the __array__ protocol on the tensor extension: 1. The __array__ protocol on TensorArrayElement was missing the dtype parameter, causing np.asarray(tae, dtype=some_dtype) calls to fail. This PR adds support for the dtype argument. 2. TensorArray and TensorArrayElement didn't support NumPy's scalar casting semantics for single-element tensors. This PR adds support for these scalar casting semantics.	2022-06-10 16:42:16 -07:00
Richard Liaw	1dd714e0fa	[rfc][doc] Add clarity to stability guidelines (#25611 )	2022-06-10 15:19:21 -07:00
Jiao	6b9b1f135b	[Deployment Graph] Move files out of `pipeline` folder (#25630 )	2022-06-10 10:39:03 -07:00
Sihan Wang	2546fbf99d	[Serve] Autoscaling for deployment graph (#25424 )	2022-06-10 10:21:49 -07:00
Simon Mo	271c7d73ac	[AIR][Serve] Add support for multi-modal array input (#25609 )	2022-06-10 09:19:42 -07:00
Jian Xiao	67b2eca6a2	Fix a few type annotations that may confuse people (#25645 )	2022-06-09 23:15:21 -07:00
shrekris-anyscale	5586b89b1c	[Serve] Improve logs for new Serve REST API (#25610 )	2022-06-09 17:04:09 -07:00
Amog Kamsetty	2614c24e47	[AIR] Add `predict_pandas` implementation (#25534 ) Implements conversion utilities and a default predict implementation for Predictor. Depends on #25517	2022-06-09 16:55:58 -07:00
matthewdeng	88524d8b57	[air] add `CustomStatefulPreprocessor` (#25497 )	2022-06-09 16:54:46 -07:00
Archit Kulkarni	6f3de2af86	[Serve] Fix outdated Serve warning message for sync handle (#25453 )	2022-06-09 14:50:48 -07:00
Simon Mo	ef1b565699	[CI] Pin starlette and fastapi version (#25604 )	2022-06-09 13:55:18 -07:00
Jimmy Yao	2511e66d7e	[Datasets] [AIR] Fixes label tensor squeezing in to_tf() (#25553 )	2022-06-09 12:32:13 -07:00
Kai Fricke	f17ced04dd	[air/tune] Exclude in remote storage upload (#25544 ) This adds an exclude option to upload_to_uri() which will be needed for refactoring the Tune syncing/sync client structure.	2022-06-09 20:12:53 +02:00
Robert	a92a06860f	[Datasets] Allow for len(Dataset) (#25152 ) Small QOL change that allows for len(Dataset) to be used rather than calling Dataset.count()	2022-06-09 10:36:41 -07:00
matthewdeng	eff72f9a72	[train] fix transformers example for multi-gpu (#24832 ) Accelerate depends on this environment variable to set for proper GPU device placement. Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-06-09 09:17:35 -07:00
mwtian	65d7a610ab	[Core] Push message to driver when a Raylet dies (#25516 ) Currently when Raylets die, it is hard to figure out: if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well. reason of Raylet's death. With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.	2022-06-09 05:54:34 -07:00
Jian Xiao	ce103b4ffa	Eagerly clears object memory before Python GC kicks in when consuming DatasetPipeline (#25461 )	2022-06-09 00:37:56 -07:00
Amog Kamsetty	1316a2d05e	[AIR/Train] Move `ray.air.train` to `ray.train` (#25570 )	2022-06-08 21:34:18 -07:00
Dmitri Gekhtman	836b08597f	[kuberay][autoscaler] Use new autoscaling fields from the KubeRay operator (#25386 ) This PR incorporates recent autoscaler config changes from KubeRay.	2022-06-08 20:09:43 -07:00
matthewdeng	ba0a2a022a	[datasets] add `Dataset.randomize_block_order` (#25568 ) This exposes a low-cost way to perform a pseudo global shuffle. For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during iteration of the dataset in which single nodes (1) must send a lot of data over the network, and (2) perform lots of disk reads if the dataset is spilled to disk. This allows the workload to be spread across the nodes on which the dataset blocks are on.	2022-06-08 18:39:15 -07:00
Clark Zinzow	6987ab5966	[Datasets] [Hotfix] Fix stats construction for from_* APIs. (#25601 ) Stats construction on the from_arrow and from_numpy (and from_pandas with Pandas block support disabled) is currently broken since we weren't resolving the block metadata before passing it to the stats, causing future ds.stats() calls to fail. This PR fixes this and adds some test coverage. Drivebys: - Adds stats for from_pandas() zero-copy path (metadata fetch only). - Changes "from_numpy" stats stage name to "from_numpy_refs", to be consistent with stats for other from_*() APIs.	2022-06-08 18:04:40 -07:00
shrekris-anyscale	f3c2bd6718	[Serve] Make REST API deployments inherit top-level runtime_env (#25502 )	2022-06-08 15:58:00 -07:00
Antoni Baum	16733c2271	[AIR] Delayed type checking for Preprocessors (#25587 ) Breaks the hard dependency on Preprocessor imports for type hints in AIR. Preparation for move of Preprocessors to `ray.data`. Trainer still has a hard dependency due to an `isinstance` check.	2022-06-08 13:15:54 -07:00
Hanming Lu	d3e5bf97b5	more informative GCPNodeProvider create_node return (#25416 ) More informative return value for GCPNodeProvider create_node	2022-06-08 12:34:09 -07:00
Amog Kamsetty	3a728c4e35	[Train] Mark Trainer interfaces as Deprecated (#25573 ) Marks Trainer interfaces as Deprecated. This PR does not make any changes to the docs. Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-06-08 12:30:32 -07:00
Stephanie Wang	6274bb354c	[tests] Deflake test_reconstruction.py::test_basic_reconstruction_actor_task[False] (#25456 ) This test was flaky because actor tasks can fail if submitted when the actor process is failed or restarting. This PR changes the test to be more stressful so that the error is easier to reproduce and changes the max_retries parameter to -1 so that the actor task will succeed. Related issue number Closes #24942.	2022-06-08 11:21:57 -07:00
Sihan Wang	a9e7836e8c	[Serve] Skip flaky test_autoscaling_policy on windows (#25526 )	2022-06-08 10:33:40 -07:00
Clark Zinzow	9dc0bb3d5e	[Datasets] Unrevert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#25031 )" (#25531 ) Unreverts #24812, skipping the memory releasing tests that are already flaky. We have a separate issue tracking the unskipping of these memory releasing tests, once we find a more reliable way to test them. * Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)" This reverts commit `fb2933a78f`. * Skip shuffle memory release test.	2022-06-08 10:33:25 -07:00
Amog Kamsetty	1be32e5977	[AIR] Add `_predict_arrow` interface for Predictor (#25579 ) * add interface * update docstring	2022-06-08 10:27:29 -07:00
Pamphile Roy	0bbc3379bd	Fix SciPy pinning (#25148 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2022-06-08 10:26:59 -07:00
Amog Kamsetty	80ae651f25	[Train] Clean up `ray.train` package (#25566 )	2022-06-08 10:22:36 -07:00
Kai Fricke	8affbc7be6	[tune/train] Consolidate checkpoint manager 3: Ray Tune (#24430 ) Update: This PR is now part 3 of a three PR group to consolidate the checkpoints. 1. Part 1 adds the common checkpoint management class #24771 2. Part 2 adds the integration for Ray Train #24772 3. This PR builds on #24772 and includes all changes. It moves the Ray Tune integration to use the new common checkpoint manager class. Old PR description: This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did something very similar but in different modules. To simplify maintenance in the future, we've consolidated the common core. - This PR keeps full compatibility with the previous interfaces and implementations. This means that for now, Train and Tune will have separate CheckpointManagers that both extend the common core - This PR prepares Tune to move to a CheckpointStrategy object - In follow-up PRs, we can further unify interfacing with the common core, possibly removing any train- or tune-specific adjustments (e.g. moving to setup on init rather on runtime for Ray Train) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-06-08 12:05:34 +01:00
Amog Kamsetty	e0a63f770f	[Data/AIR] Move `TensorExtension` to `ray.air` for use in other packages (#25517 ) Moves Tensor extensions to ray.air to facilitate their use in other Ray libraries (AIR, Serve).	2022-06-07 14:53:22 -07:00
xwjiang2010	76b34d4a03	[air] add to_air_checkpoint method for inference only workload. (#25444 ) Follow up on our last discussion for supporting piecemeal fashion air users. Only did for tensorflow for now, want to collect some feedback on API naming, package structure etc and I will add others.	2022-06-07 14:50:39 -07:00
Sebastián Ramírez	3257994e80	♻️ Refactor types to detect invalid extra arguments (#25541 ) Currently, each function decorated with `@ray.remote` is marked with type annotations as a `RemoteFunction` class (only used for type annotations, autocompletion, inline errors, etc). The current class takes several type parameters. And then it uses those parameters in the extended `func.remote()` method. But with the current type annotations, it marks any of the unused type parameters as `None`. This means that calling the `.remote()` method would check the first (actual) arguments and the rest are marked as `None`, but that means that for type annotations it considers "correct" to pass extra `None` arguments, while actually, that would not be valid. So, this doesn't show an error, but it should: <img width="371" alt="Screenshot 2022-06-07 at 05 38 48" src="https://user-images.githubusercontent.com/1326112/172360355-9b344220-7824-4b5c-87da-038f5b53fe04.png"> ...those 2 extra `None` values should be marked as invalid. After this PR, those invalid extra arguments would be marked as invalid: <img width="588" alt="Screenshot 2022-06-07 at 05 42 10" src="https://user-images.githubusercontent.com/1326112/172360956-424b40d4-8197-4663-8298-617a1df37658.png"> And: <img width="687" alt="Screenshot 2022-06-07 at 05 42 50" src="https://user-images.githubusercontent.com/1326112/172361140-eb93c675-f5d6-4e0c-b9b2-83c4801bb450.png"> ## More context I also tried the new `TypeVarTuple`, it might simplify these type annotations in the future, but it's not currently supported by mypy yet, it's a very recent addition to the language (and `typing_extensions`) so it's probably too early to adopt it.	2022-06-07 14:34:34 -07:00
Antoni Baum	3876fcdbe8	[CI] Add bazel py_test checking for Serve (#25509 )	2022-06-07 10:54:10 -07:00
Jun Gong	9b65d5535d	[RLlib] Introduce basic connectors library. (#25311 )	2022-06-07 19:18:14 +02:00
Amog Kamsetty	4e887fe776	[Tune] Remove docstring for private _StatusReporter (#25520 ) Remove outdated docstrings for _StatusReporter. In response to https://discuss.ray.io/t/how-to-use-ray-tune-function-runner-statusreporter-with-tune-with-parameters/6400/2	2022-06-07 10:11:29 -07:00
Simon Mo	7471b1fa41	[Serve] [AIR] ModelWrapper improvements and docs (#25003 ) * batching collation code and tests * wip notebook for np and dataframe * finish content * reset ray-more-libs changes * add comments * run through * Apply suggestions from code review Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com> * rename package * lint * richard's comment Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>	2022-06-07 08:53:10 -07:00
Kai Fricke	984b9a5e6c	[tune/train] Consolidate checkpoint manager 2: Ray Train (#24772 ) This is a follow-up from #24771 which moves the Ray Train implementation to use the new common checkpoint manager class.	2022-06-07 13:51:42 +01:00
Rohan Potdar	a9d8da0100	[RLlib]: Doubly Robust Off-Policy Evaluation. (#25056 )	2022-06-07 12:52:19 +02:00
Eric Liang	c1afbcb6f4	[air] Enforce API stability annotations for AIR module (#25485 )	2022-06-06 22:52:21 -07:00
Eric Liang	78688a0903	Enable streaming ingest in AIR (#25428 ) This adds the following options to DatasetConfig, which can be used to enable streaming ingest. ``` # Whether the dataset should be streamed into memory using pipelined reads. # When enabled, get_dataset_shard() returns DatasetPipeline instead of Dataset. # The amount of memory to use is controlled by `stream_window_size`. # False by default for all datasets. use_stream_api: Optional[bool] = None # Configure the streaming window size in bytes. A typical value is something like # 20% of object store memory. If set to -1, then an infinite window size will be # used (similar to bulk ingest). This only has an effect if use_stream_api is set. # Set to 1.0 GiB by default. stream_window_size: Optional[float] = None # Whether to enable global shuffle (per pipeline window in streaming mode). Note # that this is an expensive all-to-all operation, and most likely you want to use # local shuffle instead. # False by default for all datasets. global_shuffle: Optional[bool] = None ```	2022-06-06 17:42:15 -07:00
Yi Cheng	aabe9e73ef	Revert "[Serve] Depend on uvicorn[standard] instead of uvicorn so that it pulls in uvloop (#25027 )" (#25530 ) This reverts commit `9a510f92cf`.	2022-06-06 16:41:42 -07:00

... 3 4 5 6 7 ...

7152 commits