7508 commits

Author SHA1 Message Date
Simon Mo
0badbb8b1e
[Serve][docs] Refresh http-guide (#27779)
- Moved most code snippets to doc_code
- Added section about DAGDriver
- Added section discussing when you should use each abstraction layer.
2022-08-12 11:06:36 -05:00
shrekris-anyscale
e15960ed7e
[Serve] [Docs] Update the "Monitoring Ray Serve" Page (#27777)
The "Monitoring Ray Serve" page explains how to inspect your Ray Serve applications. This change updates the page to remove outdated metrics that Serve no longer exposes and to upgrade code samples to use 2.0 APIs. It also improves the content's readability and organization.

Link to updated "Monitoring Ray Serve" page: https://ray--27777.org.readthedocs.build/en/27777/serve/monitoring.html
2022-08-12 11:05:31 -05:00
matthewdeng
75d13faa50
[serve] fix grammar check in test (#27819)
2022-08-12 09:02:31 -07:00
Eric Liang
52f7b89865
[docs] Editing pass on clusters docs, removing legacy material and fixing style issues (#27816)
2022-08-12 00:15:03 -07:00
Nikita Vemuri
87dd078e1e
fix external dashboard url if connecting to existing cluster (#27807)
Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>
2022-08-11 17:56:24 -07:00
Jian Xiao
b1cad0a112
[Datasets] Use detached lifetime for stats actor (#25271)
The actor handle held by the Ray client becomes dangling if the Ray cluster is shut down, and in that case, if the user tries to get the actor again, it results in a crash. This happened to a real user and blocked them from making progress.

This change makes the stats actor detached, and instead of keeping a handle, we access it via its name. This way we can make sure to re-create this actor if the cluster gets restarted.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-08-11 17:47:13 -07:00
Cade Daniel
b7a6a1294a
Fix linkcheck introduced by Ray Clusters doc changes (#27804)
Broken links introduced by #27756

Will defer to @ericl if he wants to merge this or fix it himself.

Signed-off-by: Cade Daniel <cade@anyscale.com>
2022-08-11 16:55:20 -07:00
Chris K. W
74f28f9270
[client] Fix ignore_reinit_error behavior in ray client (#26165)
Ray client currently errors on reinit even if ignore_reinit_error is set.
2022-08-11 14:56:54 -07:00
shrekris-anyscale
8a6d2db1d3
[Serve] Fix grammar in deployment logs (#27780)
2022-08-11 13:51:42 -07:00
Ricky Xu
5ea4747448
[Core][State Observability] Nightly release test for state API (#26610)
* Initial

* Correctness test skeleton

* Added limit for listing

* Updated grpc config

* no more waiting

* metrics

* Updated constant and add test

* renamed

* actors

* actors

* actors

* dada

* actor dead?

* Script

* correct test name

* limit

* Added timeout

* release test /2

* Merged

* format+doc

* wip

Signed-off-by: rickyyx <ricky@anyscale.com>

* revert packag-lock

Signed-off-by: rickyyx <rickyx@anyscale.com>

* wip

* results

Signed-off-by: rickyx <rickyx@anyscale.com>

Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <ricky@anyscale.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Co-authored-by: rickyyx <ricky@anyscale.com>
2022-08-11 07:01:01 -07:00
Artur Niederfahrenhorst
c855469845
[RLlib] pin gym-minigrid @ 1.0.3 (#27761)
2022-08-11 12:27:44 +02:00
matthewdeng
178b1e8a25
[data] enable test_split.py tests (#27150)
Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-08-10 22:15:34 -07:00
Yi Cheng
c5952f2163
[serve] Add an internal os env to turn the head node pin off (#27763)
When the node hosting the controller dies, GCS will try to reschedule the controller to the same node. But GCS will only mark the node as failed after 120s when GCS restarts (or 30s if only the raylet died).

This PR fixes it by unpinning the controller from the head node. So as long as GCS is alive, it'll reschedule the controller immediately. But we can't turn this on by default, so we introduce an internal flag for it.
2022-08-10 18:13:54 -07:00
Jiajun Yao
27e38f81bd
Pin _StatsActor to the driver node (#27765)
Similar to what's done in #23397

This allows the actor to fate-share with the driver and tolerate worker node failures.
2022-08-10 17:55:06 -07:00
Balaji Veeramani
7da7dbe3fd
[AIR] Improve preprocessor documentation (#27215)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-10 17:13:22 -07:00
Cheng Su
853c859037
[Datasets] Better error message for partition filtering if no file found (#27353)
A user raised an issue in #26605, where the error message was quite non-actionable when partition filtering input files and no files with the required extension were found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
2022-08-09 22:42:20 -07:00
zcin
ea2a11080f
[serve][doc] Update Serve API in tutorials code (#27579)
2022-08-09 19:59:14 -07:00
Cheng Su
bc5d8d9176
[AIR] Replace references of to_tf with iter_tf_batches (#27672)
2022-08-09 16:00:02 -07:00
Jiajun Yao
f084546d41
Fix out-of-band deserialization of actor handle (#27700)
When we deserialize an actor handle via pickle, we register it with an outer object ref equal to itself, which is wrong. For out-of-band deserialization, there should be no outer object ref.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-09 14:25:14 -07:00
Stephanie Wang
7d0fcd7ec6
[core] Allow reuse of cluster address if Ray is not running (#27666)
Signed-off-by: Stephanie Wang swang@cs.berkeley.edu

Cluster address is now written to a temp file. Previously we raised an error if ray start --head tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows ray start --head to continue if it can't find any GCS process associated with the recorded cluster address.
Related issue number

Closes #27021.
2022-08-09 13:48:48 -07:00
Sihan Wang
22d1be5823
[Serve] Make serve.run to start serve with http on EveryNode mode (#27668)
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
2022-08-09 09:29:38 -07:00
Nikita Vemuri
0e74bc20b5
[core] Fix how protocol is removed for external ray dashboard URL (#27652)
* fix how protocol is removed for external dashboard url
2022-08-08 18:23:12 -07:00
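The commit above concerns removing the protocol from an external dashboard URL. The following is a minimal sketch of one way to do that in Python, not the code from the PR; `strip_protocol` is a hypothetical helper name. It also illustrates a common pitfall with `str.lstrip`, which may or may not be the bug that was fixed.

```python
def strip_protocol(url: str) -> str:
    """Return `url` without a leading "<scheme>://" prefix, if it has one.

    Hypothetical helper for illustration only.
    """
    scheme, separator, rest = url.partition("://")
    # If there is no "://" in the string, there is nothing to remove.
    return rest if separator else url


# Pitfall: str.lstrip takes a *set of characters*, not a prefix, so it can
# eat the first letters of the host name as well.
naive = "http://people.example.com".lstrip("http://")
correct = strip_protocol("http://people.example.com")
```

Here `naive` loses the leading `p` of the host name because `p` is one of the characters passed to `lstrip`, while `correct` keeps the host intact.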
matthewdeng
fbdec1add0
[air] remove rllib dependency from tensorflow_predictor (#27671)
2022-08-08 18:05:48 -07:00
Alan Guo
3a819fafb7
Force grpcio to be >= 1.42.0 for python 3.10 (#27269)
2022-08-08 17:37:18 -07:00
Clark Zinzow
3b151c581e
[Datasets] Delay expensive tensor extension type import until Parquet reading. (#27653)
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.

I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
2022-08-08 17:06:25 -07:00
Yi Cheng
dac7bf17d9
[serve] Make serve agent not blocking when GCS is down. (#27526)
This PR fixes several issues that block the Serve agent when GCS is down. The Serve agent must always stay alive so that external requests can reach it and check the status.

- The internal KV used in the dashboard/agent blocks the agent. We use the async one instead.
- The Serve controller uses ray.nodes, which is a blocking call that can block forever. Changed to use the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.

To enable Serve HA, we also need to set up:

- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5

which we should set in KubeRay.
2022-08-08 16:29:42 -07:00
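The general pattern behind the fixes above, replacing a call that can block forever with an async call bounded by a timeout, can be sketched with plain `asyncio`. This is not Ray's code: `fetch_status`, `check_status`, and the 0.05s timeout are hypothetical and chosen only to keep the example fast.

```python
import asyncio


async def fetch_status(backend_up: bool) -> str:
    """Hypothetical stand-in for a request to a backend such as GCS."""
    if not backend_up:
        # Simulate a backend that never answers.
        await asyncio.sleep(3600)
    return "ok"


async def check_status(backend_up: bool, timeout_s: float = 0.05) -> str:
    """Answer within `timeout_s` even if the backend is unreachable."""
    try:
        return await asyncio.wait_for(fetch_status(backend_up), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The caller stays responsive and can report the outage.
        return "unavailable"


healthy = asyncio.run(check_status(True))
degraded = asyncio.run(check_status(False))
```

With the timeout in place, the caller gets an answer either way instead of hanging while the backend is down.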
Balaji Veeramani
87ff765647
[AIR] Make Concatenator deterministic (#27575)
2022-08-08 15:49:46 -07:00
Yi Cheng
cadeccd9b7
[core] Fix job counter not working with storage namespace (#27627)
JobCounter does not work with storage namespaces right now because the key is the same across namespaces.

This PR fixes it by adding the namespace to the key, since that is the minimal and therefore safer change.

A follow-up PR is needed to clean up the Redis storage in C++.
2022-08-08 14:24:32 -07:00
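To illustrate why a counter key shared across namespaces is a problem, here is a toy sketch with an in-memory store. The key format `"<namespace>:JobCounter"` and all names are assumptions for illustration and are not taken from the Ray source.

```python
class InMemoryStore:
    """Toy key-value store with an atomic-style increment (illustration only)."""

    def __init__(self):
        self._data = {}

    def incr(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


def next_job_id(store: InMemoryStore, namespace: str) -> int:
    # Prefixing the key with the namespace gives each namespace its own counter.
    key = f"{namespace}:JobCounter"
    return store.incr(key)


store = InMemoryStore()
ids_a = [next_job_id(store, "cluster-a"), next_job_id(store, "cluster-a")]
ids_b = [next_job_id(store, "cluster-b")]
```

With a single shared key, `cluster-b` would have continued from the count of `cluster-a` instead of starting at 1.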
Stephanie Wang
ccbae3325c
[core] Reconstruct manually freed objects (#27567)
Objects freed by the manual and internal free calls previously would not get reconstructed. This PR introduces the following semantics after a free call:

- If no failure occurs and the object is needed by a downstream task, an ObjectFreedError will be thrown.
- If a failure occurs, causing a downstream task to be re-executed, the freed object will get reconstructed as usual.

Also fixes some incidental bugs:

- Don't crash on failure to contact the local raylet during object recovery. This will produce a nicer error message because we will instead throw an application-level error when someone tries to get an object.
- Fix a circular lock dependency between task failure <> task dependency resolution.

Related issue number

Closes #27265.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-08 13:40:51 -07:00
Yi Cheng
1533976b82
[deflakey] test_error_handling.py in workflow (#27630)
Signed-off-by: Yi Cheng <chengyidna@gmail.com>

## Why are these changes needed?
This test times out. Move it to large.
```
WARNING: //python/ray/workflow:tests/test_error_handling: Test execution time (288.7s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="long" or size="large".
```
2022-08-08 13:38:37 -07:00
SangBin Cho
be64df6f5d
Fix an uncaught exception upon deallocation for actors (#27637)
As specified here, https://joekuan.wordpress.com/2015/06/30/python-3-__del__-method-and-imported-modules/, the __del__ method doesn't guarantee that modules or function definitions are still referenced and not GC'ed. That means if you access any "modules", "functions", or "global variables", they may have been garbage collected.

This means we should not access any modules, functions, or global variables inside the __del__ method. While it's something we should handle more holistically in the near future, this PR fixes the issue in the short term.

The problem was that all Ray actors are decorated by trace_helper.py to make them compatible with OpenTelemetry (maybe we should make it optional). At this time, the __del__ method is also decorated. When __del__ is invoked, some of the functions used within this tracing decorator may already have been deallocated (in this case, _is_tracing_enabled was deallocated). This PR fixes the issue by not decorating the __del__ method for tracing.
2022-08-08 11:51:25 -07:00
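The approach described in the commit above, wrapping every method of a class for tracing except `__del__`, can be sketched as follows. This is a simplified illustration and not the code in trace_helper.py; `trace_methods` and `_traced` are hypothetical names, and the "tracing" here just appends method names to a list.

```python
import functools
import inspect


def _traced(fn, log):
    """Wrap `fn` so that each call is recorded in `log`."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.append(fn.__name__)
        return fn(*args, **kwargs)

    return wrapper


def trace_methods(cls, log):
    """Wrap the Python-level methods of `cls`, but leave __del__ untouched."""
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if name == "__del__":
            # __del__ may run during interpreter shutdown, when the helpers
            # the wrapper relies on could already be gone, so skip it.
            continue
        setattr(cls, name, _traced(member, log))
    return cls


class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

    def __del__(self):
        pass


calls = []
trace_methods(Counter, calls)
counter = Counter()
counter.incr()
```

After this runs, `calls` records `__init__` and `incr`, while `Counter.__del__` remains the original, undecorated function.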
Zyiqin-Miranda
b3f06d97b2
[autoscaler] Consolidate CloudWatch agent/dashboard/alarm support; Add unit tests for AWS autoscaler CloudWatch integration (#22070)
This PR mainly adds two improvements:

- We introduced support for three CloudWatch config types in previous PRs: Agent, Dashboard, and Alarm. In this PR, we generalize the logic of all three config types by using the enum CloudwatchConfigType.
- Adds unit tests to ensure the correctness of the Ray autoscaler's CloudWatch integration behavior.
2022-08-08 11:45:07 -07:00
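Generalizing per-type logic through an enum, as the commit above describes, might look roughly like this. Only the enum name `CloudwatchConfigType` and the three config kinds come from the commit message; the member values, the `load_config` helper, and the dictionary layout are assumptions for illustration.

```python
from enum import Enum


class CloudwatchConfigType(Enum):
    # Member values are assumed; the commit only names the three kinds.
    AGENT = "agent"
    DASHBOARD = "dashboard"
    ALARM = "alarm"


def load_config(cloudwatch_section: dict, config_type: CloudwatchConfigType):
    """Look up the config entry for one type with a single shared code path.

    Hypothetical helper: returns None when the type is not configured.
    """
    entry = cloudwatch_section.get(config_type.value)
    if entry is None:
        return None
    return entry["config"]


section = {
    "agent": {"config": "agent.json"},
    "alarm": {"config": "alarm.json"},
}
found = {t.name: load_config(section, t) for t in CloudwatchConfigType}
```

One function parameterized by the enum replaces three near-identical functions, which is also what makes the behavior straightforward to unit test.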
Balaji Veeramani
5087511c46
[AIR] Change FeatureHasher input schema to expect token counts (#27523)
This makes FeatureHasher work more like sklearn's FeatureHasher.
2022-08-08 11:41:57 -07:00
Archit Kulkarni
f6328f46a3
[CI] [runtime env] Fix test_working_dir_2 timeout on Mac (#27563)
One GC test has unnecessary sleeps which are quite expensive due to the parametrization (2 x 2 x 2 = 8 iterations). They are unnecessary because they check that garbage collection of runtime env URIs doesn't occur after a certain time, but garbage collection isn't time-based.  This PR removes the sleeps.

This PR is just to fix CI; a followup PR will make the test more effective by attempting to trigger GC in a more targeted way (by starting multiple tasks with different runtime_env resources.  GC is only triggered upon *creation* of a new resource that causes the cache size to be exceeded.)

It's still not clear what exactly caused the test suite to start taking longer recently, but it might be due to some change elsewhere in Ray, since there were no runtime_env related commits in that time period.
2022-08-08 11:31:21 -05:00
Richard Liaw
f15ed3836d
[air] Render trainer docstring signatures (#27590)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-08 09:29:21 -07:00
SangBin Cho
8c190e2d09
Revert "[serve][xlang]Support deploying Python deployment from Java. (#26877)" (#27626)
This reverts commit 9f8b596aaa.
2022-08-08 06:54:27 -07:00
SangBin Cho
6084ee5a63
Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308)" (#27613)
This reverts commit ccf411604e.
2022-08-08 06:38:19 -07:00
SangBin Cho
f00d17ea21
[Core/Test] Break down test_ray_init.py by 2 files (#27635)
It seems this test started taking longer than 300 seconds because of newly added tests. This PR breaks the tests down into 2 files.
2022-08-08 06:35:16 -07:00
SangBin Cho
ad0aec1ca7
Move test placement group 3 (#27614)
This PR ccf4116 makes cluster_utils.add_node take about 1 more second because of the raylet start path refactoring.

It seems like as a result, test_placement_group_3 has an occasional timeout. I extended the test to be large. Let's see if this fixes the issue.
2022-08-08 00:59:12 -07:00
Cheng Su
aeb2346804
[AIR] Replace references of to_torch with iter_torch_batches (#27574)
2022-08-07 20:14:12 -07:00
Simon Mo
efee158cec
[Serve] Use Async Handle for DAG Execution (#27411)
2022-08-06 22:23:44 -07:00
zcin
64c550a2b1
Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403)" (#27587)
This reverts commit 8a9d994dd0.
2022-08-06 21:38:55 -07:00
clarng
20f01da5bd
[Core] Install memory monitor callback to kill worker when memory usage is above threshold (#27384)
Why are these changes needed?

The Ray-level OOM killer preemptively kills a worker process when the node is under memory pressure. This PR leverages the memory monitor from #27017 and supersedes #26962 to kill worker processes when the system is running low on memory. The node manager implements the callback in the memory monitor and kills the worker process with the newest task. It evicts only one worker at a time and enforces that by tracking the last evicted worker. If the eviction is still in progress it will not evict another worker even if the memory usage is above the threshold.

This PR is a no-op since the monitor is disabled by default.
2022-08-06 04:17:19 -07:00
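The eviction policy described in the commit above (kill the worker with the newest task, one worker at a time, and wait for that eviction to finish) can be modeled with a small simulation. This is a sketch and not the node manager's implementation: the class names, the `alive` flag, and the 0.9 threshold are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    worker_id: str
    task_start_time: float  # larger means the task started more recently
    alive: bool = True


class EvictionPolicy:
    """Toy model: evict at most one worker at a time, newest task first."""

    def __init__(self, usage_threshold: float):
        self.usage_threshold = usage_threshold
        self.last_evicted: Optional[Worker] = None

    def on_memory_report(self, usage: float, workers: List[Worker]) -> Optional[Worker]:
        if usage <= self.usage_threshold:
            return None
        # If the previous eviction has not completed, do not pick another victim.
        if self.last_evicted is not None and self.last_evicted.alive:
            return None
        candidates = [w for w in workers if w.alive]
        if not candidates:
            return None
        victim = max(candidates, key=lambda w: w.task_start_time)
        self.last_evicted = victim
        return victim


workers = [Worker("a", 1.0), Worker("b", 5.0), Worker("c", 3.0)]
policy = EvictionPolicy(usage_threshold=0.9)  # threshold value is an assumption
below = policy.on_memory_report(0.5, workers)
first = policy.on_memory_report(0.95, workers)
in_progress = policy.on_memory_report(0.95, workers)
first.alive = False  # the evicted worker process has exited
second = policy.on_memory_report(0.95, workers)
```

Tracking the last evicted worker is what prevents a burst of memory reports from killing several workers before the first kill has had any effect.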
xiaofeng
9f8b596aaa
[serve][xlang]Support deploying Python deployment from Java. (#26877)
The previously merged PR (https://github.com/ray-project/ray/pull/22726/commits) did not implement Java Serve's support for Python deployments. This PR implements that feature.

Co-authored-by: nanqi.yxf <nanqi.yxf@antgroup.com>
2022-08-06 14:35:49 +08:00
Alex Wu
50e278f58b
[scheduler][autoscaler] Report placement resources for actor creation tasks (#26813)
This change makes us report placement resources for actor creation tasks. Essentially, the resource model here is that a placement resource/actor creation task is a task that runs very quickly.

Closes #26806

Co-authored-by: Alex <alex@anyscale.com>
2022-08-05 22:02:44 -07:00
Clark Zinzow
f017fcd826
[AIR - Datasets] Fix column assignment in Concatenator for Pandas 1.2. (#27531)
Heterogeneous tensor column assignment with a list-of-tensors fails for Pandas 1.2, but succeeds with a manually constructed Pandas Series.
2022-08-05 19:44:12 -07:00
Clark Zinzow
dcc1da4ce3
[Core] Fix reference counting bug on objects borrowed for a cancelled actor creation. (#27298)
This PR fixes a reference counting bug for borrowed objects sent to an actor creation task that is then cancelled.

Before this PR, when actor creation was cancelled before the creation task had been scheduled, the GCS-based actor manager would destroy the actor without replying to the task submission RPC from the actor-creating worker, so the reference counts on that worker never got cleaned up. This caused us to leak borrowed objects when such cancellation-before-scheduling happened for actors.

This PR fixes this by ensuring that the task submission RPC receives a reply indicating that the actor creation task has been cancelled, at which point the submitting worker will run through the same reference counting cleanup as is done for normal task cancellation.
2022-08-05 19:42:34 -07:00
Alan Guo
326b5bd1ac
Convert job_manager to be async (#27123)
Updates jobs api
Updates snapshot api
Updates state api

Increases jobs api version to 2

Signed-off-by: Alan Guo aguo@anyscale.com

Why are these changes needed?
follow-up for #25902 (comment)
2022-08-05 19:33:49 -07:00
Nikita Vemuri
a82af8602c
[core] Support external ray dashboard URL (#27396)
Signed-off-by: Nikita Vemuri nikitavemuri@gmail.com

Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through an environment variable. This can be helpful if the Ray dashboard is hosted externally.
2022-08-05 19:33:10 -07:00
Alex Wu
a6b9019d38
[log_monitor] Seek when reopening a file due to inode change (#27508)
When reopening a file due to an inode change, we weren't seeking back to the right location. Now we are (with a unit test).

Closes (but not really until it's cherry-picked) #27507

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-08-05 18:27:43 -07:00
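The behavior described in the commit above, seeking back to the last read position after reopening a file whose inode changed, can be sketched as follows. This is not the log_monitor code: `TailedFile` is a hypothetical class, it assumes a POSIX filesystem, and it assumes the replacement file keeps the old content as a prefix.

```python
import os
import tempfile


class TailedFile:
    """Hypothetical helper that tails one log file across inode changes."""

    def __init__(self, path):
        self.path = path
        self.handle = None
        self.inode = None
        self.position = 0  # byte offset of the next unread data

    def _reopen_if_needed(self):
        inode = os.stat(self.path).st_ino
        if self.handle is None or inode != self.inode:
            if self.handle is not None:
                self.handle.close()
            self.handle = open(self.path, "rb")
            # Seek back to where we stopped instead of rereading from byte 0.
            self.handle.seek(self.position)
            self.inode = inode

    def read_new_lines(self):
        self._reopen_if_needed()
        lines = [raw.decode().rstrip("\n") for raw in self.handle.readlines()]
        self.position = self.handle.tell()
        return lines


def demo():
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "worker.log")
        with open(path, "w") as f:
            f.write("a\nb\n")
        tail = TailedFile(path)
        first = tail.read_new_lines()
        # Swap in a new file (new inode) holding the old content plus one line.
        replacement = path + ".new"
        with open(replacement, "w") as f:
            f.write("a\nb\nc\n")
        os.replace(replacement, path)
        second = tail.read_new_lines()
        tail.handle.close()
        return first, second
```

Without the `seek` call, the second read would return all three lines again, which is the kind of duplicated output a missing seek produces.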