Commit graph

14073 commits

Author SHA1 Message Date
SangBin Cho
be64df6f5d
Fix an uncaught exception upon deallocation for actors (#27637)
As described in https://joekuan.wordpress.com/2015/06/30/python-3-__del__-method-and-imported-modules/, when the __del__ method runs there is no guarantee that modules or function definitions are still referenced and have not been GC'ed. Any module, function, or global variable accessed from __del__ may therefore already have been garbage collected.

This means we should not access any modules, functions, or global variables inside the __del__ method. This should be handled more holistically in the near future; this PR fixes the issue in the short term.

The problem was that all Ray actors are decorated by trace_helper.py to make them compatible with OpenTelemetry (maybe we should make this optional). Currently the __del__ method is also decorated. When __del__ is invoked, some of the functions used within the tracing decorator may already have been deallocated (in this case, _is_tracing_enabled was deallocated). This PR fixes the issue by not applying the tracing decorator to __del__.
2022-08-08 11:51:25 -07:00
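A minimal sketch of the idea behind #27637, assuming a class decorator that wraps methods for tracing. The names `_traced` and `trace_methods` are hypothetical and are not taken from trace_helper.py; the only point illustrated is that `__del__` is left unwrapped.

```python
import functools
import inspect


def _traced(fn):
    """Hypothetical stand-in for a tracing wrapper."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # A real tracing wrapper would consult module-level state here.
        # That state may already be gone when __del__ runs at shutdown.
        return fn(*args, **kwargs)

    wrapper.is_traced = True
    return wrapper


def trace_methods(cls):
    """Wrap every plain method of cls except __del__."""
    for name, member in inspect.getmembers(cls, predicate=inspect.isfunction):
        if name == "__del__":
            continue  # leave the finalizer untouched
        setattr(cls, name, _traced(member))
    return cls


@trace_methods
class Counter:
    def __init__(self):
        self.value = 0

    def incr(self):
        self.value += 1
        return self.value

    def __del__(self):
        pass
```

With this sketch, `Counter.incr` carries the `is_traced` marker while `Counter.__del__` does not, so the finalizer never touches the wrapper's dependencies.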
Eric Liang
f21ca925ac
[docs] Remove spam banner from master docs (#27599) 2022-08-08 11:47:39 -07:00
Zyiqin-Miranda
b3f06d97b2
[autoscaler] Consolidate CloudWatch agent/dashboard/alarm support; Add unit tests for AWS autoscaler CloudWatch integration (#22070)
This PR mainly adds two improvements:

- Previous PRs introduced support for three CloudWatch config types: Agent, Dashboard, and Alarm. This PR generalizes the logic for all three config types by using the enum CloudwatchConfigType.
- It adds unit tests to ensure the correctness of the Ray autoscaler's CloudWatch integration behavior.
2022-08-08 11:45:07 -07:00
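A rough illustration of how one enum can drive the handling of all three config types described in #22070. Only the enum name `CloudwatchConfigType` comes from the commit message; the member values, the `load_configs` helper, and the shape of the config section are assumptions made for this sketch.

```python
from enum import Enum


class CloudwatchConfigType(Enum):
    # Member values are assumed for illustration.
    AGENT = "agent"
    DASHBOARD = "dashboard"
    ALARM = "alarm"


def load_configs(cloudwatch_section):
    """Return {config type: config} for every type present in the section.

    One loop replaces three near-identical code paths, one per config type.
    """
    return {
        config_type: cloudwatch_section[config_type.value]
        for config_type in CloudwatchConfigType
        if config_type.value in cloudwatch_section
    }
```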
Balaji Veeramani
5087511c46
[AIR] Change FeatureHasher input schema to expect token counts (#27523)
This makes FeatureHasher work more like sklearn's FeatureHasher.
2022-08-08 11:41:57 -07:00
Archit Kulkarni
f6328f46a3
[CI] [runtime env] Fix test_working_dir_2 timeout on Mac (#27563)
One GC test has unnecessary sleeps, which are quite expensive due to the parametrization (2 x 2 x 2 = 8 iterations). They are unnecessary because they check that garbage collection of runtime env URIs doesn't occur after a certain time, but garbage collection isn't time-based. This PR removes the sleeps.

This PR is just to fix CI; a followup PR will make the test more effective by attempting to trigger GC in a more targeted way, by starting multiple tasks with different runtime_env resources. (GC is only triggered upon *creation* of a new resource that causes the cache size to be exceeded.)

It's still not clear what exactly caused the test suite to start taking longer recently, but it might be due to some change elsewhere in Ray, since there were no runtime_env related commits in that time period.
2022-08-08 11:31:21 -05:00
Richard Liaw
f15ed3836d
[air] Render trainer docstring signatures (#27590)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-08 09:29:21 -07:00
Artur Niederfahrenhorst
4fe47d069f
[RLlib] Require ApeX LR schedule test to produce learner info. (#27557) 2022-08-08 18:19:02 +02:00
kourosh hakhamaneshi
3b2a8427af
[RLlib] Fix SampleBatch to_device(). (#27572) 2022-08-08 18:18:33 +02:00
SangBin Cho
8c190e2d09
Revert "[serve][xlang]Support deploying Python deployment from Java. (#26877)" (#27626)
This reverts commit 9f8b596aaa.
2022-08-08 06:54:27 -07:00
SangBin Cho
6084ee5a63
Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308)" (#27613)
This reverts commit ccf411604e.
2022-08-08 06:38:19 -07:00
SangBin Cho
b7ab0555c4
[Link Check] Fix the broken link check from the AIR doc (#27632)
The original link, https://docs.ray.io/en/master/_images/air-ecosystem.svg, doesn't exist.

I fixed it by linking to the raw GitHub file. This should behave exactly the same as before. I tried to find a link to this image file, but I couldn't. I also couldn't find an easy way to add only a link (without embedding an image). Please let me know if you prefer another option.
2022-08-08 06:36:04 -07:00
SangBin Cho
f00d17ea21
[Core/Test] Break down test_ray_init.py into 2 files (#27635)
It seems this test started taking longer than 300 seconds because of newly added tests. This PR breaks the tests down into 2 files.
2022-08-08 06:35:16 -07:00
SangBin Cho
ad0aec1ca7
Move test placement group 3 (#27614)
PR ccf4116 makes cluster_utils.add_node take about 1 second longer because of the raylet start path refactoring.

As a result, test_placement_group_3 seems to time out occasionally. I changed the test size to large. Let's see if this fixes the issue.
2022-08-08 00:59:12 -07:00
Balaji Veeramani
2a14bff99d
[Docs] Fix table width (#27611) 2022-08-07 20:19:19 -07:00
Cheng Su
aeb2346804
[AIR] Replace references of to_torch with iter_torch_batches (#27574) 2022-08-07 20:14:12 -07:00
Jun Gong
a61095a480
[RLlib] fix bandit pre-merge tests (#27554) 2022-08-07 17:48:29 -07:00
Jun Gong
5f07987ab1
[RLlib] Fix connector examples (#27583) 2022-08-07 17:48:09 -07:00
Jun Gong
89b2f616fd
[RLlib] doc typo (#27542) 2022-08-07 17:47:42 -07:00
Jun Gong
f8b2128f16
[RLlib] async_request_test needs to run exclusively. (#27603) 2022-08-07 17:47:29 -07:00
Simon Mo
efee158cec
[Serve] Use Async Handle for DAG Execution (#27411) 2022-08-06 22:23:44 -07:00
zcin
64c550a2b1
Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403)" (#27587)
This reverts commit 8a9d994dd0.
2022-08-06 21:38:55 -07:00
clarng
b404175635
Add tip to first uninstall the Python-only build before installing the Full build (#25754)
Adds the following to install instructions:

Tip

If you are only editing Python files, follow instructions for Building Ray (Python Only) to avoid long build times.

If you already followed the instructions in Building Ray (Python Only) and want to switch to the Full build in this section, you will need to first delete the symlinks and uninstall Ray.
2022-08-06 21:38:13 -07:00
se4ml
0a489c0c7c
[CI] Update ci/pipeline/py_dep_analysis_test.py to properly use with statement (#27600) 2022-08-06 12:17:35 -07:00
clarng
20f01da5bd
[Core] Install memory monitor callback to kill worker when memory usage is above threshold (#27384)
Why are these changes needed?

The Ray-level OOM killer preemptively kills a worker process when the node is under memory pressure. This PR leverages the memory monitor from #27017 and supersedes #26962 to kill worker processes when the system is running low on memory. The node manager implements the callback in the memory monitor and kills the worker process with the newest task. It evicts only one worker at a time and enforces that by tracking the last evicted worker. If the eviction is still in progress it will not evict another worker even if the memory usage is above the threshold.

This PR is a no-op since the monitor is disabled by default.
2022-08-06 04:17:19 -07:00
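The eviction policy described in #27384 can be sketched as follows. This is a simplified model, not the node manager's actual code: the `Worker` and `MemoryPressureHandler` types, their fields, and the callback signature are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    pid: int
    task_start_time: float  # larger means the task started more recently
    alive: bool = True


class MemoryPressureHandler:
    """Hypothetical callback target for a memory monitor."""

    def __init__(self, usage_threshold: float):
        self.usage_threshold = usage_threshold
        self.last_evicted: Optional[Worker] = None

    def on_memory_usage(
        self, usage_fraction: float, workers: List[Worker]
    ) -> Optional[Worker]:
        """Return the worker to kill, or None if nothing should be evicted."""
        if usage_fraction <= self.usage_threshold:
            return None
        # Evict one worker at a time: wait until the previous victim is gone.
        if self.last_evicted is not None and self.last_evicted.alive:
            return None
        candidates = [w for w in workers if w.alive]
        if not candidates:
            return None
        # Pick the worker running the newest task.
        victim = max(candidates, key=lambda w: w.task_start_time)
        self.last_evicted = victim
        return victim
```

In this sketch the caller is expected to kill the returned worker and set `alive = False` once the process has exited; until then, further callbacks return `None` even if usage stays above the threshold.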
xiaofeng
9f8b596aaa
[serve][xlang]Support deploying Python deployment from Java. (#26877)
In the previously merged PR (https://github.com/ray-project/ray/pull/22726/commits), Java Serve's support for Python deployments was not implemented. This PR implements that feature.

Co-authored-by: nanqi.yxf <nanqi.yxf@antgroup.com>
2022-08-06 14:35:49 +08:00
Alex Wu
50e278f58b
[scheduler][autoscaler] Report placement resources for actor creation tasks (#26813)
This change makes us report placement resources for actor creation tasks. Essentially, the resource model here is that a placement resource/actor creation task is a task that runs very quickly.

Closes #26806

Co-authored-by: Alex <alex@anyscale.com>
2022-08-05 22:02:44 -07:00
Clark Zinzow
f017fcd826
[AIR - Datasets] Fix column assignment in Concatenator for Pandas 1.2. (#27531)
Heterogeneous tensor column assignment with a list-of-tensors fails for Pandas 1.2, but succeeds with a manually constructed Pandas Series.
2022-08-05 19:44:12 -07:00
Clark Zinzow
dcc1da4ce3
[Core] Fix reference counting bug on objects borrowed for a cancelled actor creation. (#27298)
This PR fixes a reference counting bug for borrowed objects sent to an actor creation task that is then cancelled.

Before this PR, when actor creation was cancelled before the creation task had been scheduled, the GCS-based actor manager would destroy the actor without replying to the task submission RPC from the actor-creating worker, so the reference counts on that worker never got cleaned up. This caused us to leak borrowed objects when such cancellation-before-scheduling happened for actors.

This PR fixes this by ensuring that the task submission RPC receives a reply indicating that the actor creation task has been cancelled, at which point the submitting worker will run through the same reference counting cleanup as is done for normal task cancellation.
2022-08-05 19:42:34 -07:00
Alan Guo
326b5bd1ac
Convert job_manager to be async (#27123)
- Updates jobs API
- Updates snapshot API
- Updates state API

Increases jobs API version to 2

Signed-off-by: Alan Guo <aguo@anyscale.com>

Why are these changes needed?
follow-up for #25902 (comment)
2022-08-05 19:33:49 -07:00
Nikita Vemuri
a82af8602c
[core] Support external ray dashboard URL (#27396)
Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>

Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through an environment variable. This can be helpful if the Ray dashboard is hosted externally.
2022-08-05 19:33:10 -07:00
Eric Liang
9b467e3954
[docs] Improve the "Why Ray" and "Why AIR" sections of the docs (#27480) 2022-08-05 18:42:45 -07:00
Alex Wu
a6b9019d38
[log_monitor] Seek when reopening a file due to inode change (#27508)
When reopening a file due to an inode change, we weren't seeking back to the right location. Now we are (with a unit test).

Closes (but not really until it's cherry-picked) #27507

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-08-05 18:27:43 -07:00
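A small sketch of the behavior described in #27508: when the file at a path is replaced by one with a different inode, reopen it and seek back to the last read position rather than starting from offset 0. The `TailedFile` class is hypothetical and is not the log monitor's actual implementation.

```python
import os


class TailedFile:
    """Hypothetical tail-style reader that survives the file being replaced."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")
        self.inode = os.fstat(self.handle.fileno()).st_ino
        self.position = 0  # byte offset of the next unread data

    def read_new_lines(self):
        current_inode = os.stat(self.path).st_ino
        if current_inode != self.inode:
            # The path now points at a different file: reopen it and
            # seek to where we left off, so old lines are not re-read.
            self.handle.close()
            self.handle = open(self.path, "rb")
            self.inode = current_inode
            self.handle.seek(self.position)
        lines = self.handle.readlines()
        self.position = self.handle.tell()
        return lines
```

The sketch assumes the replacement file contains the earlier content as a prefix, which is the case where seeking to the saved offset is meaningful.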
clarng
098628d9bf
[doc] update autoscaler config (VM) page (#27539)
Update autoscaler configuration docs for the VM stack.
Removed the video; after looking at it, it fits better in the overview and is possibly outdated.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-08-05 17:03:06 -07:00
Alan Guo
05fca09f2d
Add query param to limit number of actors in api/snapshot (#27489)
Default the value to 1000 actors

Signed-off-by: Alan Guo <aguo@anyscale.com>

Why are these changes needed?
Reduces the latency of api/snapshot, especially in cases where there are a large number of actors.
2022-08-05 16:48:46 -07:00
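The effect of the query parameter from #27489 can be illustrated as below. The parameter name `actor_limit` and both helper functions are assumptions for this sketch; only the default of 1000 actors comes from the commit message.

```python
from urllib.parse import parse_qs

DEFAULT_ACTOR_LIMIT = 1000  # default stated in the commit message


def actor_limit_from_query(query_string, param="actor_limit"):
    """Read the limit from a query string, falling back to the default.

    The parameter name is hypothetical.
    """
    values = parse_qs(query_string).get(param)
    return int(values[0]) if values else DEFAULT_ACTOR_LIMIT


def snapshot_actors(actors, query_string=""):
    """Return at most `limit` actors so the response stays small."""
    return actors[: actor_limit_from_query(query_string)]
```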
Clark Zinzow
293452dcba
[Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449)
This reverts commit cf7305a, and unreverts #25896.

This was reverted due to a failing Windows test: #26287

We can merge once the failing Windows test (and all other relevant tests) pass.
2022-08-05 16:07:13 -07:00
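To illustrate what an exception allowlist for retries means in #26449, here is a generic sketch in plain Python. `call_with_retries` is a hypothetical helper, not Ray's API: only exceptions whose type is in the allowlist trigger a retry, and everything else propagates immediately.

```python
def call_with_retries(fn, max_retries, retry_exceptions):
    """Call fn(), retrying only on exception types in the allowlist."""
    allowlist = tuple(retry_exceptions)
    attempts = 0
    while True:
        try:
            return fn()
        except allowlist:
            # Allowlisted error: retry until the budget is exhausted.
            if attempts >= max_retries:
                raise
            attempts += 1
```

An error type outside the allowlist is never caught by the `except` clause, so it surfaces on the first failure without consuming any retries.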
Simon Mo
f6d19ac7c0
[Serve] Gate the deprecation warnings behind envvar (#27479) 2022-08-05 13:38:44 -07:00
Clark Zinzow
313d553cfc
[Datasets] Avoid unnecessary reads when truncating a dataset with ds.limit() (#27343)
Datasets currently eagerly kicks off all read tasks when truncating a dataset immediately after a read via ray.data.read_*().limit(); this results in a lot of wasted computation and unnecessary object store bloat, especially when trying to poke at a very small subset of the data.

This PR avoids these unnecessary reads by truncating the blocklist to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, thereby avoiding materialization of unnecessary read tasks in the common splitting path.
2022-08-05 13:35:40 -07:00
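The truncation step described in #27343 can be sketched as picking the shortest prefix of blocks that covers the row limit. The function below is a simplified model that assumes per-block row counts are already known; it is not the Datasets implementation.

```python
def truncate_blocklist(block_row_counts, limit):
    """Return the shortest prefix of blocks whose rows cover `limit`.

    Blocks after the prefix never need to be read. The last kept block
    may still have to be split to hit the limit exactly.
    """
    kept = []
    rows = 0
    for count in block_row_counts:
        if rows >= limit:
            break
        kept.append(count)
        rows += count
    return kept
```

For example, with three blocks of 10 rows each and a limit of 15, only the first two blocks are kept, so the third read task is never launched.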
Dmitri Gekhtman
06f7f33a4e
[docs] KubeRay config guide and autoscaling discussion (#27504)
This PR adds a guide on RayCluster configuration and a page of discussion about autoscaling.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2022-08-05 13:11:28 -07:00
Jian Xiao
30cf449807
Add data ingest benchmark (#27533)
Make sure Dataset/DatasetPipeline work performantly for data ingestion.
2022-08-05 12:31:06 -07:00
Sihan Wang
5fe586b881
[Serve/Doc] Add deployment migration guide (#27408)
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
2022-08-05 14:28:48 -05:00
Clark Zinzow
bfc38de009
[Datasets] [Docs] Improve .limit() and .take() docstrings (#27367)
Improve docstrings for .limit() and .take(), making the distinction between them clearer.

Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-08-05 12:17:24 -07:00
Stephanie Wang
4d448e0b3e
[docs] Add codeowners for subdirectories (#27569)
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

CODEOWNERS only respects the last matching entry for a file. This PR hopefully adds the top-level docs group to all subdirs.
2022-08-05 11:37:15 -07:00
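The "last matching entry wins" behavior mentioned in #27569 can be modeled with a short sketch. The matching here uses `fnmatch` as a simplification of real CODEOWNERS pattern rules, and the paths and team names are hypothetical.

```python
import fnmatch


def owners_for(path, rules):
    """Return the owners from the last rule whose pattern matches `path`.

    `rules` is an ordered list of (pattern, owners) pairs. fnmatch is a
    simplification; real CODEOWNERS patterns follow gitignore-style rules.
    """
    owners = []
    for pattern, rule_owners in rules:
        if fnmatch.fnmatchcase(path, pattern):
            owners = rule_owners  # a later match replaces earlier ones
    return owners


# Hypothetical rules: the subdirectory entry overrides the top-level one.
rules_before = [
    ("doc/*", ["@docs-team"]),
    ("doc/source/serve/*", ["@serve-team"]),
]
# Listing the top-level group again on the subdirectory keeps it as an owner.
rules_after = [
    ("doc/*", ["@docs-team"]),
    ("doc/source/serve/*", ["@serve-team", "@docs-team"]),
]
```

Under `rules_before`, a file in the subdirectory is owned only by the subdirectory's team; under `rules_after`, both groups own it.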
Richard Liaw
4629a3a649
[air/docs] Update Trainer documentation (#27481)
Co-authored-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-08-05 11:21:19 -07:00
Cade Daniel
f94a2fe166
[docs][Ray Clusters] New Ray Clusters getting started page. (#27391)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-05 10:21:56 -07:00
zcin
22db41c21a
[Serve][doc] Modify and Combine Tensorflow, Pytorch, Sklearn Tutorials (#26817) 2022-08-05 11:55:31 -05:00
zcin
04c7ccacf1
[Serve][Doc] Moves Serve REST API and Serve CLI API into separate subpages (#26914) 2022-08-05 11:51:53 -05:00
zcin
8a9d994dd0
[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403)
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
2022-08-05 11:31:00 -05:00
zcin
b5927caaae
[serve] Update version if import_path or runtime_env in config is changed (#27498)
A previous PR, https://github.com/ray-project/ray/pull/27000, added lightweight config updates. It only tracks the config options for `deployments` (it bumps the version if certain deployment options are changed, but otherwise keeps versions the same). However, we should bump the versions of all deployments if `import_path` or `runtime_env` is changed.
2022-08-05 11:30:22 -05:00
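One way to picture the rule in #27498 is to derive each deployment's version from both its own options and the app-level `import_path` and `runtime_env`. The hash-based scheme and the `deployment_version` function below are assumptions made for illustration, not how Serve computes versions.

```python
import hashlib
import json


def deployment_version(deployment_options, import_path, runtime_env):
    """Derive a version string from everything that should trigger a bump.

    Because import_path and runtime_env are part of the hashed payload,
    changing either one changes the version of every deployment.
    """
    payload = json.dumps(
        {
            "options": deployment_options,
            "import_path": import_path,
            "runtime_env": runtime_env,
        },
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```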
Jialing He
ccf411604e
Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308) 2022-08-05 16:32:48 +08:00
Jiajun Yao
b11d3061d8
[Doc] Core getting started page revamp (#27303)
- Add a calculating pi example to getting started page.
- Move installing ray c++ to the installation page.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-04 23:36:16 -07:00