hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

Author	SHA1	Message	Date
Simon Mo	b9a2fb79b6	[AIR][Docs] Remove the excessive printing from Torch examples (#27903 )	2022-08-16 09:09:54 -07:00
Peyton Murray	4d19c0222b	[AIR] Add rich notebook repr for DataParallelTrainer (#26335 )	2022-08-16 08:51:14 -07:00
Dmitri Gekhtman	bceef503b2	[Kubernetes][docs] Restore legacy Ray operator migration discussion (#27841 ) This PR restores notes for migration from the legacy Ray operator to the new KubeRay operator. To avoid disrupting the flow of the Ray documentation, these notes are placed in a README accompanying the old operator's code. These notes are linked from the new docs. Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>	2022-08-16 08:46:31 -07:00
xwjiang2010	91f506304d	[air] [checkpoint manager] handle nested metrics properly as scoring attribute. (#27715 ) handle nested metrics properly as scoring attribute Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-16 17:43:58 +02:00
Sven Mika	436c89ba1a	[RLlib] Eval workers use async req manager. (#27390 )	2022-08-16 12:05:55 +02:00
liuyang-my	1c4b3879a1	[Serve]Fix classloader bug in Java Deployment (#27899 ) We have encountered `java.lang.ClassNotFoundException` when deploying Java Ray Serve deployments. The property `ray.job.code-search-path` which specifies the search path of user's classes is not working. The reason is that `ray.job.code-search-path` is loaded in an independent classloader in Ray context, but Serve Replica initialized user class with `AppClassLoader`. We need to change the classloader used to construct user classes to the one in Ray context.	2022-08-16 15:22:00 +08:00
Alex Wu	c2abfdb2f7	[autoscaler][observability] Observability into when/why nodes fail to launch (#27697 ) This change adds launch failures to the recent failures section of ray status when a node provider provides structured error information. For node providers which don't provide this optional information, there is now change in behavior. For reference, when trying to launch a node type with a quota issue, it looks like the following. InsufficientInstanceCapacity is the standard term for this issue.. ``` ======== Autoscaler status: 2022-08-11 22:22:10.735647 ======== Node status --------------------------------------------------------------- Healthy: 1 cpu_4_ondemand Pending: quota, 1 launching Recent failures: quota: InsufficientInstanceCapacity (last_attempt: 22:22:00) Resources --------------------------------------------------------------- Usage: 0.0/4.0 CPU 0.00/9.079 GiB memory 0.00/4.539 GiB object_store_memory Demands: (no resource demands) ``` ``` available_node_types: cpu_4_ondemand: node_config: InstanceType: m4.xlarge ImageId: latest_dlami resources: {} min_workers: 0 max_workers: 0 quota: node_config: InstanceType: p4d.24xlarge ImageId: latest_dlami resources: {} min_workers: 1 max_workers: 1 ``` Co-authored-by: Alex <alex@anyscale.com>	2022-08-15 18:14:29 -07:00
Chen Shen	f05c744a65	[Doc] minor fix on accessing AWS/S3 update the doc.	2022-08-15 16:53:31 -07:00
xwjiang2010	a3236b6225	[air] fix ptl release test (#27773 ) Signed-off-by: xwjiang2010 xwjiang2010@gmail.com	2022-08-15 14:47:33 -07:00
Yuan-Chi Chang	34c494260f	[workflow] Documentation of http events (#27166 ) Documentation updates for the newly introduced HTTPEventProvider and HTTPListener in Ray 2.0.	2022-08-15 14:23:04 -07:00
Jiajun Yao	06ef4ab94e	Fix broken links in the code (#27873 ) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-15 13:11:42 -07:00
xwjiang2010	68cc544da6	[release test] increase air tf gpu benchmark non smoke test timeout from 3600 to 4800. (#27869 ) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 19:03:40 +02:00
Archit Kulkarni	058c239cf1	[runtime env] Test common failure scenarios (#25977 ) Tests the following failure scenarios: - Fail to upload data in `ray.init()` (`working_dir`, `py_modules`) - Eager install fails in `ray.init()` for some other reason (bad `pip` package) - Fail to download data from GCS (`working_dir`) Improves the following error message cases: - Return RuntimeEnvSetupError on failure to upload working_dir or py_modules - Return RuntimeEnvSetupError on failure to download files from GCS during runtime env setup Not covered in this PR: - RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.) - Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.) The approach is to use environment variables to induce failures in various places. The alternative would be to refactor the packaging code to use dependency injection for the Internal KV client so that we can pass in a fake. I'm not sure how much of an improvement this would be. I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.	2022-08-15 11:35:56 -05:00
Jiajun Yao	eb37bb857c	Revamp ray core design patterns doc [1/n]: generators (#27823 ) - Move the code snippet to doc_code folder - Move patterns to an upper level. Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-15 09:24:34 -07:00
Myeongju Kim	52440f1489	[Docs] Fix a typo in index.md (#27859 ) Signed-off-by: myeongjukim <ming3772@gmail.com> Signed-off-by: myeongjukim <ming3772@gmail.com>	2022-08-15 08:26:40 -07:00
xwjiang2010	f77ec350fa	[release test] remove dask/modin_xgboost test completely. (#27865 ) The original script was removed in https://github.com/ray-project/ray/pull/27816 This is just to clean up some remainings. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 16:55:33 +02:00
SangBin Cho	d654636bfc	[Test] Fix broken test_base_trainer (#27855 ) The test was written incorrectly. This root cause was that the trainer & worker both requires 1 CPU, meaning pg requires {CPU: 1} * 2 resources. And when the max fraction is 0.001, we only allow up to 1 CPU for pg, so we cannot schedule the requested pgs in any case.	2022-08-15 07:50:18 -07:00
Cheng Su	a2c168cd6d	[Datasets][docs] Minor fix for nyc_taxi_basic_processing.ipynb (#27828 ) Went through https://docs.ray.io/en/master/data/examples/nyc_taxi_basic_processing.html, and doing some minor fix here. Fix the size_bytes() result (before this PR it was using Parquet sampling, but we disasble it later) Change one size_bytes() call to count() call as it was meant to use count() with followed wording That’s a lot of rows in doc. Changed places are as followed in screenshots:	2022-08-14 12:34:33 -07:00
SangBin Cho	9ece110d27	[State Observability] Promote the API to alpha (#27788 ) # Why are these changes needed? - Promote APIs to PublicAPI(alpha) - Change pre-alpha -> alpha - Fix a bug ray_logs is displayed to ray --help Release test result: #26610 Some APIs are subject to change at the beta stage (e.g., ray list jobs or ray logs).	2022-08-13 23:43:01 -07:00
Eric Liang	1bae3c905d	Improve text around ecosystem map (#27839 ) Signed-off-by: Eric Liang <ekhliang@gmail.com> Signed-off-by: Eric Liang <ekhliang@gmail.com>	2022-08-12 18:59:09 -07:00
SangBin Cho	999715ebec	[Core][Placement Group] Handling edge cases of max_cpu_fraction argument (#27035 ) Why are these changes needed? This PR fixes the edge cases when the max_cpu_fraction argument is used by the placement group. There was specifically an edge case where the placement group cannot be scheduled when a task or actor is scheduled and occupies the resource. The original logic to decide if the bundle scheduling exceed CPU fraction was as follow. Calculate max_reservable_cpus of the node. Calculate currently_used_cpus + bundle_cpu_request (per bundle) == total_allocation of the node. Don't schedule if total_allocation > max_reservable_cpus for the node. However, the following scenarios caused issues because currently_used_cpus can include resources that are not allocated by placement groups (e.g., actors). As a result, when the actor was already occupying the resource, the total_allocation was incorrect. For example, 4 CPUs 0.999 max fraction (so it can reserve up to 3 CPUs) 1 Actor already created (1 CPU) PG with CPU: 3 Now pg cannot be scheduled because total_allocation == 1 actor (1 CPU) + 3 bundles (3 CPUs) == 4 CPUs > 3 CPUs (max frac CPUs) However, this should work because the pg can use up to 3 CPUs, and we have enough resources. The root cause is that when we calculate the max_fraction, we should only take into account of resources allocated by bundles. To fix this, I change the logic as follow. Calculate max_reservable_cpus of the node. Calculate currently_used_cpus_by_pg_bundles + bundle_cpu_request (sum of all bundles) == total_allocation_from_pgs_and_bundles of the node. Don't schedule if total_allocation_from_pgs_and_bundles > max_reservable_cpus for the node.	2022-08-12 17:40:11 -07:00
Sihan Wang	7e7c93f6ba	[Serve] Fix memory leak issue in serve inference (#27815 )	2022-08-12 17:11:37 -07:00
shrekris-anyscale	f9d5d6df12	[Serve] [Docs] Revise Java API documentation (#27831 )	2022-08-12 17:09:40 -07:00
shrekris-anyscale	0a3c1de08b	[Serve] [Docs] Replace references to `dag.execute()` with `handle.predict.remote()` (#27784 )	2022-08-12 17:09:28 -07:00
zcin	8cb09a9fc5	Revert "Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications"" (#27662 )	2022-08-12 15:12:20 -07:00
Balaji Veeramani	55e57d4f92	[AIR] [Docs] Revise "Which preprocessor should you use?" (#27835 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-08-12 14:43:36 -07:00
zcin	fa37ddc584	[Serve][docs] Add type annotations to code samples (#27795 )	2022-08-12 13:41:08 -07:00
Simon Mo	192d92bb77	[Serve] [Doc] Update intro page (#27735 )	2022-08-12 11:37:18 -07:00
Archit Kulkarni	ba36365c32	[Doc] [Serve] Fix LINT by fixing outdated Ray Client doc link (#27826 )	2022-08-12 11:35:15 -07:00
Clark Zinzow	f0404e00cd	[Core] [Hotfix] Change "task failed with unretryable exception" log statement to debug-level. (#27714 ) Serve relies on being able to do quiet application-level retries, and this info-level logging is resulting in log spam hitting users. This PR demotes this log statement to debug-level to prevent this log spam. Co-authored-by: simon-mo <simon.mo@hey.com>	2022-08-12 11:28:49 -07:00
Cheng Su	7c7828f818	[Datasets] Improve size estimation of image folder data source (#27219 ) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in https://github.com/ray-project/ray/pull/27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ```	2022-08-12 11:26:03 -07:00
shrekris-anyscale	6946cb38b3	[Serve] [Docs] Revise "Serving Ray AIR Checkpoints" Header (#27824 )	2022-08-12 11:18:33 -07:00
matthewdeng	58495fe594	[data][docs] fix broken links (#27818 )	2022-08-12 11:17:34 -07:00
Alan Guo	be92dd72d5	[Dashboard] Fix edge cases for log file names in the dashboard log viewer (#27772 )	2022-08-12 09:39:54 -07:00
Archit Kulkarni	6c45625d6d	[runtime env] [CI] Skip flaky test_runtime_env_working_dir_2 tests on mac (#27799 )	2022-08-12 09:39:19 -07:00
Archit Kulkarni	518c74020c	[Serve] [Doc] Serve add API ref for Deployment.bind() and serve.build (#27811 )	2022-08-12 09:38:58 -07:00
Simon Mo	bf9f0621b9	[Serve] Minor fix to replica shutdown (#27778 )	2022-08-12 09:33:08 -07:00
shrekris-anyscale	cdf25908f7	[Serve] [Docs] Document Serve benchmarks (#27711 )	2022-08-12 11:14:05 -05:00
liuyang-my	6b886d394c	[Serve] Java documentation (#26321 )	2022-08-12 09:07:12 -07:00
Simon Mo	0badbb8b1e	[Serve][docs] Refresh http-guide (#27779 ) - Moved most code snippet to doc_code - Added section about DAGDriver - Added section discussing when should you use each abstraction layer.	2022-08-12 11:06:36 -05:00
Archit Kulkarni	92e315f970	[serve][docs] Add dev workflow page (#27746 ) Adds a page describing a development workflow for Serve applications. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com> Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>	2022-08-12 11:06:13 -05:00
shrekris-anyscale	e15960ed7e	[Serve] [Docs] Update the "Monitoring Ray Serve" Page (#27777 ) The "Monitoring Ray Serve" page explains how to inspect your Ray Serve applications. This change updates the page to remove outdated metrics that Serve no longer exposes and to upgrade code samples to use 2.0 APIs. It also improves the content's readability and organization. Link to updated "Monitoring Ray Serve" page: https://ray--27777.org.readthedocs.build/en/27777/serve/monitoring.html	2022-08-12 11:05:31 -05:00
Simon Mo	4be232e413	[Serve][Doc] Rewrite the ServeHandle page (#27775 )	2022-08-12 09:05:09 -07:00
matthewdeng	75d13faa50	[serve] fix grammar check in test (#27819 )	2022-08-12 09:02:31 -07:00
Eric Liang	52f7b89865	[docs] Editing pass on clusters docs, removing legacy material and fixing style issues (#27816 )	2022-08-12 00:15:03 -07:00
matthewdeng	9a0c1f5e0a	[data] update datasets API structure (#27592 ) Refactor Datasets API docs for easier navigation: [Ray Datasets API](https://ray--27592.org.readthedocs.build/en/27592/data/api/api.html) ### Changes 1. Create a new Datasets API base page. 2. Split existing APIs into separate pages. 3. Split `Dataset` and `DatasetPipeline` methods into separate sections. 1. Used `autosummary` to generate overview tables at the top of each of these pages. Open to other suggestions e.g. moving the summary to the top of each section instead. 2. Note: Every time we add a new method we need to explicitly add it here as well. 4. Add Input/Output APIs. 1. I chose to split these primarily by data format rather than type, since it's easier to navigate, and the existing [Creating Datasets](https://docs.ray.io/en/master/data/creating-datasets.html) User Guide already does the latter. 6. Add `Block` and `DataBatch` (should we add these aliases?) 7. Remove existing `package-ref`.	2022-08-11 23:10:10 -07:00
Nikita Vemuri	87dd078e1e	fix external dashboard url if connecting to existing cluster (#27807 ) Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>	2022-08-11 17:56:24 -07:00
Jian Xiao	b1cad0a112	[Datasets] Use detached lifetime for stats actor (#25271 ) The actor handle held at Ray client will become dangling if the Ray cluster is shutdown, and in such case if the user tries to get the actor again it will result in crash. This happened in a real user and blocked them from making progress. This change makes the stats actor detached, and instead of keeping a handle, we access it via its name. This way we can make sure re-create this actor if the cluster gets restarted. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>	2022-08-11 17:47:13 -07:00
Cade Daniel	b7a6a1294a	Fix linkcheck introduced by Ray Clusters doc changes (#27804 ) Broken links introduced by #27756 Will defer to @ericl if he wants to merge this or fix it himself. Signed-off-by: Cade Daniel <cade@anyscale.com>	2022-08-11 16:55:20 -07:00
Chris K. W	74f28f9270	[client] Fix ignore_reinit_error behavior in ray client (#26165 ) Ray client currently errors on reinit even if ignore_reinit_error is set.	2022-08-11 14:56:54 -07:00

1 2 3 4 5 ...

13994 commits