m5.16xlarge instances have 64 CPUs and 256 GB of memory, which is overkill for scheduling tests that are not compute-heavy. Use the smaller m5.4xlarge instance type to save cost and make instance allocation easier.
Currently, when Raylets die, it is hard to figure out:
- whether a Raylet died at all in the cluster. Usually we have to check nodes where a number of workers died and see if the Raylet died as well.
- the reason for the Raylet's death.
With this PR, if a Raylet dies for a reason other than SIGTERM, the dashboard agent will report the failure along with the last 20 lines of the Raylet log.
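For illustration only (not Ray's actual implementation), the reporting amounts to tailing the Raylet log when a non-SIGTERM exit is observed; the helper name and default log path below are assumptions:

```python
import signal

# Hypothetical agent-side sketch: report the last N lines of the Raylet log
# when the Raylet exits for any reason other than SIGTERM.
def maybe_report_raylet_failure(exit_signal,
                                log_path="/tmp/ray/session_latest/logs/raylet.out",
                                n_lines=20):
    if exit_signal == signal.SIGTERM:
        return  # graceful shutdown; nothing to report
    with open(log_path, "r", errors="replace") as f:
        tail = f.readlines()[-n_lines:]
    print(f"Raylet died unexpectedly. Last {n_lines} log lines:\n" + "".join(tail))
```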
This exposes a low-cost way to perform a pseudo-global shuffle.
For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during dataset iteration, in which individual nodes (1) must send a lot of data over the network and (2) perform many disk reads if the dataset is spilled to disk.
This allows the workload to be spread across the nodes that hold the dataset blocks.
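A minimal sketch of the intended usage, assuming the pseudo-global shuffle is achieved by randomizing block order before iteration; the API names used here (`randomize_block_order`, `local_shuffle_buffer_size`) are assumptions about the relevant surface, not necessarily what this PR adds:

```python
import ray

ds = ray.data.range(100_000)

# Randomizing the block order decorrelates block locality from iteration order,
# spreading network reads and disk reads across the nodes that hold the blocks.
ds = ds.randomize_block_order()

# Combined with a per-consumer local shuffle buffer, this approximates a global
# shuffle at a fraction of the cost of a full random_shuffle().
for batch in ds.iter_batches(batch_size=1024, local_shuffle_buffer_size=10_000):
    pass
```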
Stats construction for from_arrow() and from_numpy() (and from_pandas() with Pandas block support disabled) is currently broken: we weren't resolving the block metadata before passing it to the stats, causing subsequent ds.stats() calls to fail. This PR fixes that and adds test coverage (a minimal repro sketch follows the drivebys below).
Drivebys:
- Adds stats for the from_pandas() zero-copy path (metadata fetch only).
- Changes the "from_numpy" stats stage name to "from_numpy_refs", for consistency with the stats of the other from_*() APIs.
Currently, team:ml spans all ML (Tune, Train, AIR) tests as well as RLlib tests. RLlib tests are much flakier, and it would be good to track them separately in the flaky test tracker. This PR moves RLlib tests from team:ml to team:rllib to enable this separation.
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
The current inheritance behavior for runtime_envs enables the following workflow for Jobs: a working_dir can be set in the Jobs API, and then, inside the driver script, if a new per-task runtime_env is defined, it automatically inherits the driver's working_dir.
There is an ongoing discussion about the best approach for runtime_env inheritance going forward (https://github.com/ray-project/ray/issues/25484), in which we noted that there were no tests covering this behavior.
This PR adds integration tests for the above behavior (a sketch of the workflow follows the list below). If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:
- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the Ray Jobs documentation to tell users to use `ray.get_runtime_context().runtime_env.update`, and update this test
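To make the behavior under test concrete, here is a hedged sketch of a driver script submitted through the Jobs API with a working_dir; the pip requirement, paths, and task name are illustrative assumptions:

```python
import ray

# Suppose this driver was submitted via the Jobs API with
# runtime_env={"working_dir": "./my_project"}  (path is illustrative).
ray.init()

@ray.remote(runtime_env={"pip": ["requests"]})
def child_task():
    # Under the current inheritance behavior, the per-task runtime_env is merged
    # with the driver's, so the driver's working_dir is still in effect here and
    # files shipped with it remain readable from the task's working directory.
    import os
    return sorted(os.listdir("."))

print(ray.get(child_task.remote()))
print(ray.get_runtime_context().runtime_env)  # reflects the driver's runtime_env
```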
Breaks the hard dependency on Preprocessor imports for type hints in AIR, in preparation for moving Preprocessors to `ray.data`.
Trainer still has a hard dependency due to an `isinstance` check.
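The usual way to keep a type hint without a hard import is a `typing.TYPE_CHECKING` guard plus a string annotation; a minimal sketch, where the module path and function signature are illustrative rather than the exact AIR code:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Imported only for static type checkers; no runtime dependency is created.
    from ray.data.preprocessor import Preprocessor

def fit_dataset(preprocessor: Optional["Preprocessor"] = None) -> None:
    # The string annotation means Preprocessor never has to be imported at runtime.
    ...
```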
This test was flaky because actor tasks can fail if they are submitted while the actor process is dead or restarting. This PR makes the test more stressful so that the error is easier to reproduce, and sets the max_retries parameter to -1 so that the actor task will eventually succeed.
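For reference, a hedged sketch of the retry configuration described above, assuming "max_retries" refers to the actor's `max_task_retries` option (the actor class is illustrative):

```python
import ray

# max_restarts=-1: the actor is restarted indefinitely after crashes.
# max_task_retries=-1: actor tasks are retried until a restarted actor can serve them.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class FlakyActor:
    def work(self):
        return "ok"

actor = FlakyActor.remote()
# Even if the actor process dies while this task is queued or running,
# the task is retried instead of surfacing a RayActorError.
assert ray.get(actor.work.remote()) == "ok"
```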
Related issue number
Closes #24942.
Unreverts #24812, skipping the memory-releasing tests that are already flaky. We have a separate issue tracking the unskipping of these memory-releasing tests once we find a more reliable way to test them.
* Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)"
This reverts commit fb2933a78f.
* Skip shuffle memory release test.
**Update**: This PR is now part 3 of a three-PR group to consolidate the checkpoint managers.
1. Part 1 adds the common checkpoint management class #24771
2. Part 2 adds the integration for Ray Train #24772
3. This PR builds on #24772 and includes all of its changes. It moves the Ray Tune integration to use the new common checkpoint manager class.
Old PR description:
This PR consolidates the Ray Train and Tune checkpoint managers. These concepts previously did very similar things in separate modules. To simplify future maintenance, we've consolidated the common core (a rough sketch follows the list below).
- This PR keeps full compatibility with the previous interfaces and implementations. This means that, for now, Train and Tune will have separate CheckpointManagers that both extend the common core.
- This PR prepares Tune to move to a CheckpointStrategy object.
- In follow-up PRs, we can further unify the interfacing with the common core, possibly removing any Train- or Tune-specific adjustments (e.g. moving to setup on init rather than at runtime for Ray Train).
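A rough, purely illustrative sketch of the resulting structure; the class names and the keep policy are placeholders, not the actual Ray classes:

```python
class _CommonCheckpointManager:
    """Shared bookkeeping core extended by both Train and Tune (placeholder name)."""

    def __init__(self, checkpoint_strategy):
        self._checkpoint_strategy = checkpoint_strategy
        self._checkpoints = []

    def register_checkpoint(self, checkpoint):
        self._checkpoints.append(checkpoint)
        self._enforce_keep_policy()

    def _enforce_keep_policy(self):
        # Keep only the checkpoints allowed by the strategy (e.g. a num_to_keep limit).
        keep = getattr(self._checkpoint_strategy, "num_to_keep", None)
        if keep is not None:
            self._checkpoints = self._checkpoints[-keep:]


class TrainCheckpointManager(_CommonCheckpointManager):
    # Train-specific persistence and reporting hooks would live here.
    pass


class TuneCheckpointManager(_CommonCheckpointManager):
    # Tune-specific trial-directory handling would live here.
    pass
```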
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>