This PR is an edit pass on the Performance Tuning page after reading it with fresh eyes. None of the content was out of date, so it's mostly nits and rewording of parts that were slightly confusing.
This PR:
- Adds notes and an example on logging for Ray/K8s.
- Adds an API Reference page pointing to the configuration guide and the RayCluster CR definition.
- Takes managed K8s services out of the tabbed structure, to make that page look less sad.
- Adds a comparison of the KubeRay operator and the legacy K8s operator.
- Adds an architecture diagram for the autoscaling sections.
- Fixes some other minor items.
- Adds some info about networking to the configuration guide and removes the previously planned networking page.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.
I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
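A minimal sketch of the lazy-import pattern this describes; the module path below is an assumption and may differ across Ray versions:
```python
_TENSOR_EXTENSION_REGISTERED = False

def _ensure_tensor_extension_registered() -> None:
    # Importing the module runs Arrow's and Pandas' extension type
    # registration, which is the expensive part, so do it at most once
    # and only on the Parquet path that actually needs it.
    global _TENSOR_EXTENSION_REGISTERED
    if not _TENSOR_EXTENSION_REGISTERED:
        import ray.data.extensions.tensor_extension  # noqa: F401  (assumed path)
        _TENSOR_EXTENSION_REGISTERED = True

def read_parquet(path: str):
    _ensure_tensor_extension_registered()
    # ... proceed with the actual Parquet read ...
```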
This PR fixes several issues that block the Serve agent when the GCS is down. We need to make sure the Serve agent is always alive so that external requests can still reach it and check its status. A sketch of the shared non-blocking pattern follows the list below.
- The internal KV used in dashboard/agent blocks the agent. We use the async one instead.
- The Serve controller uses ray.nodes(), which is a blocking call that can block forever. Changed it to use the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.
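The common theme of these fixes: replace calls that can block forever with async calls bounded by a timeout. The `gcs_client` handle and its method name below are placeholders, not Ray's exact API:
```python
import asyncio

async def fetch_nodes(gcs_client, timeout_s: float = 5.0):
    # Unlike ray.nodes(), which can block forever while the GCS is down,
    # a bounded call fails fast so the agent's event loop stays responsive.
    try:
        return await asyncio.wait_for(gcs_client.get_all_node_info(), timeout_s)
    except asyncio.TimeoutError:
        return None  # GCS unreachable; the agent stays alive and can report status
```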
To enable Serve HA, we also need to set up:
- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5
which we should set in KubeRay.
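For illustration only, a sketch of exporting these variables; in a KubeRay deployment they belong in the container env of the RayCluster pod spec, and they must be in place before the Ray processes start:
```python
import os

# Must be visible to the Ray processes themselves; in KubeRay, set these
# in the RayCluster pod spec's container env rather than in driver code.
os.environ["RAY_gcs_server_request_timeout_seconds"] = "5"
os.environ["RAY_SERVE_KV_TIMEOUT_S"] = "5"
```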
JobCounter is not working with a storage namespace right now because the key is the same across namespaces.
This PR fixes it by adding the namespace to the key, since that is the minimal change and therefore safer.
A follow-up PR is needed to clean up the Redis storage in C++.
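A minimal sketch of the idea (hypothetical names, not the actual code):
```python
def job_counter_key(storage_namespace: str) -> str:
    # Prefixing the key with the storage namespace keeps JobCounter values
    # distinct when multiple namespaces share the same Redis storage.
    return f"{storage_namespace}:JobCounter" if storage_namespace else "JobCounter"
```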
Objects freed by the manual and internal free call previously would not get reconstructed. This PR introduces the following semantics after a free call (see the sketch after this list):
- If no failure occurs and the object is needed by a downstream task, an ObjectFreedError will be thrown.
- If a failure occurs, causing a downstream task to be re-executed, the freed object will get reconstructed as usual.
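A sketch of the first case. The free call is an internal API whose import path varies across Ray versions, so treat the import below as an assumption:
```python
import ray

@ray.remote
def produce():
    return [0] * 100_000

@ray.remote
def consume(x):
    return len(x)

ray.init()
ref = produce.remote()
ray.wait([ref])

from ray._private.internal_api import free  # internal API; path may vary
free([ref])  # manually free the object

try:
    ray.get(consume.remote(ref))  # a downstream task needs the freed object
except Exception as e:
    print(type(e).__name__, e)  # expected: an ObjectFreedError surfaced here
```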
Also fixes some incidental bugs:
- Don't crash on failure to contact the local raylet during object recovery. This produces a nicer error message because we instead throw an application-level error when someone tries to get the object.
- Fix a circular lock dependency between task failure <> task dependency resolution.
Related issue number
Closes #27265.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Yi Cheng <chengyidna@gmail.com>
## Why are these changes needed?
This test times out. Move it to size `large`.
```
WARNING: //python/ray/workflow:tests/test_error_handling: Test execution time (288.7s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="long" or size="large".
```
As described here, https://joekuan.wordpress.com/2015/06/30/python-3-__del__-method-and-imported-modules/, the __del__ method doesn't guarantee that modules or function definitions are still referenced and not GC'ed. That means if you access any modules, functions, or global variables from __del__, they may already have been garbage collected.
This means we should not access any modules, functions, or global variables inside a __del__ method. While it's something we should handle more holistically in the near future, this PR fixes the issue in the short term.
The problem was that all Ray actor methods are decorated by trace_helper.py to make them compatible with OpenTelemetry (maybe we should make it optional). The __del__ method was also decorated. When __del__ is invoked, some of the functions used within the tracing decorator may already have been deallocated (in this case, _is_tracing_enabled was deallocated). This PR fixes the issue by not applying the tracing decoration to the __del__ method. A sketch of the hazard and the fix follows below.
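A minimal sketch of both the hazard and the fix (names are illustrative, not Ray's exact code):
```python
def _is_tracing_enabled():
    # Stand-in for the module-level helper that got deallocated; at
    # interpreter shutdown this global may already be None or missing.
    return False

def _traced(method):
    def wrapper(self, *args, **kwargs):
        if _is_tracing_enabled():  # unsafe if reached from __del__ at shutdown
            pass  # a real decorator would start a tracing span here
        return method(self, *args, **kwargs)
    return wrapper

def maybe_trace(name, method):
    # The fix: never wrap __del__, so finalizers don't touch module
    # globals that may have been garbage collected already.
    if name == "__del__":
        return method
    return _traced(method)
```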
This PR mainly adds two improvements:
- Previous PRs introduced support for three CloudWatch config types: Agent, Dashboard, and Alarm. This PR generalizes the logic for all three config types using an enum, CloudwatchConfigType (sketched below).
- Adds unit tests to ensure the correctness of the Ray autoscaler CloudWatch integration behavior.
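A minimal sketch of the enum-based generalization; member names and the config layout are assumptions based on this description:
```python
from enum import Enum

class CloudwatchConfigType(Enum):
    AGENT = "agent"
    DASHBOARD = "dashboard"
    ALARM = "alarm"

def get_cloudwatch_config(provider_config: dict, config_type: CloudwatchConfigType):
    # One generic lookup replaces three near-duplicate branches for the
    # Agent, Dashboard, and Alarm config handling.
    return provider_config.get("cloudwatch", {}).get(config_type.value)
```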
One GC test has unnecessary sleeps which are quite expensive due to the parametrization (2 x 2 x 2 = 8 iterations). They are unnecessary because they check that garbage collection of runtime env URIs doesn't occur after a certain time, but garbage collection isn't time-based. This PR removes the sleeps.
This PR is just to fix CI; a follow-up PR will make the test more effective by attempting to trigger GC in a more targeted way (by starting multiple tasks with different runtime_env resources; GC is only triggered upon *creation* of a new resource that causes the cache size to be exceeded).
It's still not clear what exactly caused the test suite to start taking longer recently, but it might be due to some change elsewhere in Ray, since there were no runtime_env related commits in that time period.
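For context, a rough sketch of that size-triggered behavior (hypothetical structure, not the actual URI cache code):
```python
class URICache:
    """Evicts URIs only when adding one exceeds max_size; never on a timer."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._uris: list = []

    def add(self, uri: str) -> None:
        self._uris.append(uri)
        # GC is triggered here, upon creation of a new resource, and only
        # if the cache size is now exceeded.
        while len(self._uris) > self.max_size:
            self._uris.pop(0)
```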
The original link doesn't exist: https://docs.ray.io/en/master/_images/air-ecosystem.svg
I fixed it by linking to the raw GitHub file. This should have exactly the same flow as before. I tried finding a link to this image file, but I couldn't. I also couldn't find an easy way to add only a link (without embedding an image). Please let me know if you prefer another option.
PR ccf4116 makes cluster_utils.add_node take about 1 second longer because of the raylet start path refactoring.
It seems that, as a result, test_placement_group_3 has an occasional timeout. I changed the test size to large. Let's see if this fixes the issue.
Adds the following to install instructions:
Tip
If you are only editing Python files, follow the instructions for Building Ray (Python Only) to avoid long build times.
If you already followed the instructions in Building Ray (Python Only) and want to switch to the full build in this section, you will need to first delete the symlinks and uninstall Ray.