Adds the following to the install instructions:
Tip
If you are only editing Python files, follow the instructions for Building Ray (Python Only) to avoid long build times.
If you already followed the instructions in Building Ray (Python Only) and want to switch to the Full build in this section, you will first need to delete the symlinks and uninstall Ray.
Why are these changes needed?
The Ray-level OOM killer preemptively kills a worker process when the node is under memory pressure. This PR leverages the memory monitor from #27017 and supersedes #26962 to kill worker processes when the system is running low on memory. The node manager implements the callback in the memory monitor and kills the worker process running the newest task. It evicts only one worker at a time, enforcing this by tracking the last evicted worker: if that eviction is still in progress, it will not evict another worker even if memory usage is above the threshold.
This PR is a no-op since the monitor is disabled by default.
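A minimal sketch of the eviction policy described above. All names here (`Worker`, `WorkerEvictionPolicy`, `on_memory_pressure`) are invented for illustration; the real logic lives in the C++ node manager.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    pid: int
    task_start_time: float
    alive: bool = True

    def kill(self):
        self.alive = False  # stand-in for killing the worker process


class WorkerEvictionPolicy:
    def __init__(self, workers):
        self.workers = workers
        self.last_evicted = None  # track the eviction currently in flight

    def on_memory_pressure(self):
        """Callback invoked by the memory monitor when usage exceeds the threshold."""
        # Evict at most one worker at a time: if the previous eviction has
        # not finished, do nothing even though memory is still high.
        if self.last_evicted is not None and self.last_evicted.alive:
            return
        alive = [w for w in self.workers if w.alive]
        if not alive:
            return
        # Kill the worker running the newest task.
        newest = max(alive, key=lambda w: w.task_start_time)
        newest.kill()
        self.last_evicted = newest
```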
In the previously merged PR (https://github.com/ray-project/ray/pull/22726/commits), Java Serve's support for Python deployments was not implemented. This PR implements that feature.
Co-authored-by: nanqi.yxf <nanqi.yxf@antgroup.com>
This change makes us report placement resources for actor creation tasks. Essentially, the resource model here treats an actor creation task, for placement purposes, as a task that runs very quickly.
Closes #26806
Co-authored-by: Alex <alex@anyscale.com>
This PR fixes a reference counting bug for borrowed objects sent to an actor creation task that is then cancelled.
Before this PR, when actor creation was cancelled before the creation task had been scheduled, the GCS-based actor manager would destroy the actor without replying to the task submission RPC from the actor-creating worker, so the reference counts on that worker never got cleaned up. This caused us to leak borrowed objects whenever such cancellation-before-scheduling happened for actors.
This PR fixes this by ensuring that the task submission RPC receives a reply indicating that the actor creation task has been cancelled, at which point the submitting worker will run through the same reference counting cleanup as is done for normal task cancellation.
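An illustrative sketch of the fix, with invented names; the real code is in the C++ GCS actor manager and actor task submitter.

```python
def destroy_actor(actor_id, pending_replies, on_reply):
    # Before this PR the actor was destroyed without replying, so the
    # submitting worker's borrowed references were never released.
    reply = pending_replies.pop(actor_id, None)
    if reply is not None:
        on_reply(actor_id, status="SCHEDULING_CANCELLED")


def on_creation_task_reply(actor_id, status, borrowed_refs, reference_counter):
    if status == "SCHEDULING_CANCELLED":
        # Same cleanup path as normal task cancellation: release the
        # references borrowed for the creation task's arguments.
        for ref in borrowed_refs[actor_id]:
            reference_counter.remove_borrower(ref)
```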
- Updates the Jobs API
- Updates the snapshot API
- Updates the state API
- Increases the Jobs API version to 2

Signed-off-by: Alan Guo <aguo@anyscale.com>
Why are these changes needed?
Follow-up for #25902 (comment).
Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>
Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through an environment variable. This can be helpful if the Ray dashboard is hosted externally.
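A hypothetical usage sketch; the exact environment-variable name is defined in the PR, and `RAY_OVERRIDE_DASHBOARD_URL` below is an assumption.

```python
import os

# Hypothetical variable name; see the PR for the actual one.
os.environ["RAY_OVERRIDE_DASHBOARD_URL"] = "https://dashboard.example.com/my-cluster"

import ray

ray.init()  # the startup message would now print the overridden dashboard URL
```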
When reopening a file due to an inode change, we weren't seeking back to the right location. Now we are (with a unit test).
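A minimal sketch of the fix described above (illustrative only; the real change is in Ray's log monitor): when the file's inode changes, reopen it and seek back to the previously recorded offset instead of starting over.

```python
import os


def reopen_if_rotated(path, handle, offset):
    """Return a handle positioned at `offset`, reopening on an inode change."""
    if os.stat(path).st_ino != os.fstat(handle.fileno()).st_ino:
        handle.close()
        handle = open(path, "rb")
        handle.seek(offset)  # the fix: seek back to the last-read offset
    return handle
```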
Closes (but not really until it's cherry-picked) #27507
Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Update autoscaler configuration docs for the VM stack.
Removed the video; after looking at it, it fits better in the overview and is possibly outdated.
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Default the value to 1000 actors.
Signed-off-by: Alan Guo <aguo@anyscale.com>
Why are these changes needed?
Reduces the latency of the api/snapshot endpoint, especially in cases where there are a large number of actors.
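An illustrative sketch with invented names: cap how many actors the snapshot endpoint returns so its latency stays bounded on actor-heavy clusters.

```python
DEFAULT_MAX_ACTORS_IN_SNAPSHOT = 1000


def snapshot_actors(all_actors, limit=DEFAULT_MAX_ACTORS_IN_SNAPSHOT):
    return all_actors[:limit]
```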
This reverts commit cf7305a and unreverts #25896.
This was reverted due to a failing Windows test: #26287
We can merge once the failing Windows test (and all other relevant tests) pass.
Datasets currently kicks off all read tasks eagerly when truncating a dataset immediately after a read via `ray.data.read_*().limit()`; this results in a lot of wasted computation and unnecessary object store bloat, especially when trying to poke at a very small subset of the data.
This PR avoids these unnecessary reads by truncating the blocklist to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, thereby avoiding materialization of unnecessary read tasks in the common splitting path.
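A minimal sketch of the block-list truncation described above (illustrative; the real logic lives in Datasets' lazy block-list splitting path).

```python
def truncate_blocks(block_row_counts, row_limit):
    """Return the minimal prefix of blocks whose rows cover `row_limit`."""
    kept, rows = [], 0
    for i, num_rows in enumerate(block_row_counts):
        kept.append(i)
        rows += num_rows
        if rows >= row_limit:
            # Later blocks are never materialized, so their read tasks
            # are never executed.
            break
    return kept


# e.g. with 5 blocks of 100 rows each and limit(120), only the first two
# blocks' read tasks run:
assert truncate_blocks([100] * 5, 120) == [0, 1]
```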
This PR adds a guide on RayCluster configuration and a page of discussion about autoscaling.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
CODEOWNERS only respects the last matching entry for a file. This PR hopefully adds the top-level docs group to all subdirs.
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
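A hypothetical usage sketch based on the description above; the exact import path (`ray.serve.gradio_integrations` is assumed) and `GradioIngress` constructor are defined in the PR, and `gradio` must be installed.

```python
import gradio as gr
from ray import serve
from ray.serve.gradio_integrations import GradioIngress  # import path assumed


def build_app():
    # An ordinary Gradio app; nothing Ray-specific here.
    return gr.Interface(fn=lambda name: f"Hello {name}!", inputs="text", outputs="text")


# Scale the app by raising num_replicas or per-replica CPU/GPU resources.
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 2})
class MyGradioServer(GradioIngress):
    def __init__(self):
        super().__init__(build_app)


app = MyGradioServer.bind()  # then deploy with `serve run`
```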
Previous PR that added lightweight config updates: https://github.com/ray-project/ray/pull/27000. It only tracks the config options for `deployments` (bumping the version if certain deployment options change, but otherwise keeping versions the same). However, we should bump the versions of all deployments if `import_path` or `runtime_env` is changed.
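A minimal sketch of the version-bump rule described above (names invented; the real logic lives in Serve's config-diffing code).

```python
import uuid


def compute_versions(old_config, new_config, old_versions):
    # If import_path or runtime_env changed, bump every deployment's version;
    # otherwise bump only deployments whose own options changed.
    app_level_change = (
        old_config["import_path"] != new_config["import_path"]
        or old_config["runtime_env"] != new_config["runtime_env"]
    )
    versions = {}
    for name, options in new_config["deployments"].items():
        unchanged = options == old_config["deployments"].get(name)
        if app_level_change or not unchanged:
            versions[name] = uuid.uuid4().hex  # new version -> redeploy
        else:
            versions[name] = old_versions[name]  # lightweight update
    return versions
```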
- Add a calculating-pi example to the getting started page (a sketch follows below).
- Move installing Ray C++ to the installation page.
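The sketch below shows the kind of pi-calculation example referred to above (illustrative, not necessarily the exact docs example): Monte Carlo estimation parallelized with Ray tasks.

```python
import random

import ray

ray.init()


@ray.remote
def sample(num_samples: int) -> int:
    # Count points that fall inside the unit quarter-circle.
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1:
            inside += 1
    return inside


num_tasks, samples_per_task = 10, 1_000_000
counts = ray.get([sample.remote(samples_per_task) for _ in range(num_tasks)])
pi = 4 * sum(counts) / (num_tasks * samples_per_task)
print(pi)  # ~3.1415
```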
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
A failed node launch can lead to an extra, unexpected error in the node launcher due to the definition of a mock Prometheus metric method.
This failure leads to a permanently hanging autoscaler, with "launching nodes" never cleared out and the autoscaler unable to proceed with launching nodes.
This PR fixes the method signature that led to the unexpected failure.
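A minimal sketch of the failure mode (names invented; the real code is the autoscaler's mock/disabled Prometheus metrics):

```python
# Before: the mock metric's method signature didn't match the real
# Prometheus counter, so recording a failed launch raised a second,
# unexpected error and left "launching nodes" stuck forever.
class MockCounter:
    def inc(self, *args, **kwargs):  # accept the same call shape as the real metric
        pass
```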
Signed-off-by: Alex Wu <alex@anyscale.io>
This is a minor QoL improvement to bump the hardcoded limit on the number of AWS keys per account. The limit is arbitrary and has been bumped before. AFAICT the fundamental AWS limit is 5000 keys per region, which we are not close to.
Root cause:
https://www.shell-tips.com/bash/source-dot-command/#gsc.tab=0
Using `.` executes a command in the current shell of a bash script. Removing the `.` from `. ci.sh init` means that we lose the `set -eo` options that `ci.sh init` applies to subsequent test commands: `set -eo` is then called within a child process rather than the current shell, so later commands won't have `set -eo` configured.
* Save work
* Update
* consistency
* update
* fixes
* simplify
* update
* fix
* update
* wording
* update

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
This PR adds a user guide to AIR for using Ray Train. It provides a high-level overview of the trainers and removes redundant sections.
The main file to review is here: doc/source/ray-air/trainer.rst.
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Currently, the C++ worker doesn't support `ActorHandle` type parameters.
When we pass an `ActorHandle` object to a task, it fails with an error.
The reason is that the caller just deserializes the actor handle but doesn't register it with the core worker, so if we call tasks on the actor, it will not be found locally.
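For reference, the Python worker already supports this pattern; the sketch below shows the behavior this PR brings to the C++ worker.

```python
import ray

ray.init()


@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


@ray.remote
def use_counter(counter):  # `counter` is an ActorHandle passed as a parameter
    # The worker must register the deserialized handle with the core worker;
    # otherwise calls like this fail to find the actor locally.
    return ray.get(counter.increment.remote())


counter = Counter.remote()
print(ray.get(use_counter.remote(counter)))  # 1
```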