Commit graph

13914 commits

Author SHA1 Message Date
Cade Daniel
03d835e4e2
[Ray Clusters][docs] Create new Running Apps on Ray Clusters section (#27723)
This adds the structure described here, namely adding a new section under Ray Clusters which is focused on running applications on Ray clusters.

Signed-off-by: Cade Daniel <cade@anyscale.com>

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-09 21:01:47 -07:00
zcin
ea2a11080f
[serve][doc] Update Serve API in tutorials code (#27579) 2022-08-09 19:59:14 -07:00
Huaiwei Sun
e33edcb0b7
[doc] css improvements for rate the doc components (#27717)
- improve the css for the rate the doc component
### Before
<img width="260" alt="Screen Shot 2022-08-09 at 2 15 19 PM" src="https://user-images.githubusercontent.com/9677264/183762845-26a7f6a8-909d-4c66-b030-7d28c7f2c65b.png">
<img width="379" alt="Screen Shot 2022-08-09 at 2 15 11 PM" src="https://user-images.githubusercontent.com/9677264/183762884-4618e4bb-4a54-401e-97f6-f363fcd086a5.png">

### After
<img width="488" alt="Screen Shot 2022-08-09 at 1 55 16 PM" src="https://user-images.githubusercontent.com/9677264/183762916-4f803ff6-801f-4b7c-a4a4-1adad8e07ff7.png">
<img width="473" alt="Screen Shot 2022-08-09 at 1 55 22 PM" src="https://user-images.githubusercontent.com/9677264/183762928-081da7be-9721-4066-8e96-ba7e0f01c59c.png">
<img width="423" alt="Screen Shot 2022-08-09 at 1 55 37 PM" src="https://user-images.githubusercontent.com/9677264/183762940-94c94361-72b2-4f2b-91af-c88c57d6886c.png">
2022-08-09 18:48:22 -07:00
Cade Daniel
2246ea7fe4
Fixing doc linter: broken links in ray-tracing.rst and ray-dag.rst (#27721)
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
2022-08-09 18:09:29 -07:00
matthewdeng
1b19f3c593
[docs] add dask compatibility for 1.13.0 and 2.0.0 (#27699)
Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-08-09 16:13:02 -07:00
kourosh hakhamaneshi
4607e788c1
[RLlib] Fix test_ope flakiness (#27676) 2022-08-09 16:12:30 -07:00
Cheng Su
bc5d8d9176
[AIR] Replace references of to_tf with iter_tf_batches (#27672) 2022-08-09 16:00:02 -07:00
kourosh hakhamaneshi
3b3c20209b
[RLlib] Fix dqn reproducibility (#27459) 2022-08-09 15:56:44 -07:00
Cade Daniel
8826646303
[Ray Clusters][docs] Restructuring Clusters API reference (#27679)
This PR:

- Copies the existing clusters API reference to the new structure. The reference docs are split into Ray Clusters (common to VMs and k8s) and Ray Clusters on VMs (specific to VMs). Notably, there is also a reference section for k8s, but not in this PR.
- Moves the three job submission user guides back into a single one. Jules had suggested that we break them out into rest/sdk/cli, but that's not P0 right now.
- Fixes some bugs in the left navigation bar. There should be less duplication of TOC entries. I'll keep working on related fixes in a different PR.

Signed-off-by: Cade Daniel <cade@anyscale.com>
2022-08-09 15:33:09 -07:00
shrekris-anyscale
d809d748cf
[Serve] [Docs] Add consolidated Model Composition user guide (#26860)
This change adds introductory deployment graph documentation.

Links to updated documentation:
* [Model Composition](https://ray--26860.org.readthedocs.build/en/26860/serve/model_composition.html)
* [Examples Overview](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/index.html)
* [Deployment Graph Pattern Overview](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns.html)
  * [Pattern: Linear Pipeline](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns/linear_pipeline.html)
  * [Pattern: Branching Input](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns/branching_input.html)
  * [Pattern: Conditional](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns/conditional.html)

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
2022-08-09 17:06:23 -05:00
Jiajun Yao
f084546d41
Fix out-of-band deserialization of actor handle (#27700)
When we deserialize an actor handle via pickle, we register it with an outer object ref equal to itself, which is wrong. For out-of-band deserialization, there should be no outer object ref.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-09 14:25:14 -07:00
Stephanie Wang
7d0fcd7ec6
[core] Allow reuse of cluster address if Ray is not running (#27666)
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>

Cluster address is now written to a temp file. Previously we raised an error if ray start --head tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows ray start --head to continue if it can't find any GCS process associated with the recorded cluster address.
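
A minimal sketch of the check described above, not the PR's actual code. The file path argument and the `gcs_is_running` probe are hypothetical stand-ins for however Ray records the address and looks for a GCS process.

```python
import os
from typing import Callable, Optional


def find_conflicting_address(
    address_file: str, gcs_is_running: Callable[[str], bool]
) -> Optional[str]:
    """Return the recorded address only if a live GCS still uses it.

    Both arguments are hypothetical: `address_file` is the temp file holding
    the last cluster address, `gcs_is_running` probes for a GCS process.
    """
    if not os.path.exists(address_file):
        return None  # nothing recorded, safe to start
    with open(address_file) as f:
        recorded = f.read().strip()
    if recorded and gcs_is_running(recorded):
        return recorded  # a live cluster owns this address: report a conflict
    return None  # stale record: `ray start --head` may continue
```

Under these assumptions, the old behavior corresponds to treating any non-empty record as a conflict, without consulting the probe.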
Related issue number

Closes #27021.
2022-08-09 13:48:48 -07:00
Richard Liaw
93a3cc222b
[docs/air] remove xgboost/lightgbm references and move AIR toc (#27687) 2022-08-09 12:49:44 -07:00
Eric Liang
92928fe86c
[docs] Minor polish on AIR getting started page (#27696) 2022-08-09 11:24:18 -07:00
Cade Daniel
13f43b939a
[docs][Ray Clusters] Key Concepts page (#27510) 2022-08-09 10:01:05 -07:00
Sihan Wang
2881d3e9f1
[Serve/Doc] Update http with serve user guide (#27536)
- Merge http user guides and http adapter
- Update the code to use bind()
- Remove some unsupported content
- minor wording improvement
2022-08-09 11:42:34 -05:00
Archit Kulkarni
dec8a660c5
[Doc] [Serve] Nits/Edits on Performance Tuning page (#27651)
This PR is an edit pass on the Performance Tuning page after reading it with fresh eyes. None of the content was out of date so it's mostly nits and rewording some parts that were slightly confusing.
2022-08-09 11:36:21 -05:00
Edward Oakes
db64717269
[serve][docs] Update key concepts page for Ray 2.0 (#27565)
Closes https://github.com/ray-project/ray/issues/27438
2022-08-09 11:34:11 -05:00
Sihan Wang
22d1be5823
[Serve] Make serve.run to start serve with http on EveryNode mode (#27668)
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
2022-08-09 09:29:38 -07:00
Richard Liaw
bb5e8c3536
fix-link-check (#27703) 2022-08-09 08:57:49 -07:00
Huaiwei Sun
e82d7ef750
[docs] css improvements (#27698) 2022-08-09 08:45:12 -07:00
Charles Sun
c358305ca6
[RLlib] DatasetReader action normalization. (#27356) 2022-08-09 16:54:03 +02:00
Sven Mika
537f7c65c1
[RLlib] CRR framework torch by default. (#27161) 2022-08-09 16:53:00 +02:00
kourosh hakhamaneshi
b84dd38f01
[RLlib] Add __getitem__ to MultiAgentBatch to access policy_batches. (#27619) 2022-08-09 16:51:26 +02:00
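
For readers unfamiliar with the pattern in #27619, a toy sketch of forwarding `__getitem__` to an inner mapping. `ToyMultiAgentBatch` is a hypothetical stand-in, not RLlib's class; only the attribute name `policy_batches` comes from the commit title.

```python
from typing import Any, Dict


class ToyMultiAgentBatch:
    """Hypothetical container keyed by policy ID."""

    def __init__(self, policy_batches: Dict[str, Any]) -> None:
        self.policy_batches = policy_batches

    def __getitem__(self, policy_id: str) -> Any:
        # batch["p0"] becomes shorthand for batch.policy_batches["p0"].
        return self.policy_batches[policy_id]
```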
Dmitri Gekhtman
3293317c40
[kubernetes][docs] Logging guide, networking info, migration guide, fixes. (#27607)
This PR:

- Adds notes and an example on logging for Ray/K8s.
- Implements an API Reference page pointing to the configuration guide and the RayCluster CR definition.
- Takes managed K8s services out of the tabbed structure, to make that page look less sad.
- Adds a comparison of the KubeRay operator and the legacy K8s operator.
- Adds an architecture diagram for the autoscaling sections.
- Fixes some other minor items.
- Adds some info about networking to the configuration guide and removes the previously planned networking page.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2022-08-09 00:38:05 -07:00
Alan Guo
c3a8ba0f8a
Add maximum number of characters in logs output for jobs status message (#27581)
We've seen the API server go down from trying to return 500 MB of log output.
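
A minimal sketch of capping log output by character count. The default limit and the choice to keep the tail of the logs are assumptions for illustration, not values taken from the PR.

```python
def truncate_logs(logs: str, max_chars: int = 20_000) -> str:
    """Return at most `max_chars` characters, keeping the most recent output.

    `max_chars` and the keep-the-tail policy are hypothetical choices.
    """
    if max_chars <= 0:
        return ""
    if len(logs) <= max_chars:
        return logs
    return logs[-max_chars:]
```

Keeping the tail favors the most recent lines, which are usually the ones that explain a job's final status.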
2022-08-08 20:24:51 -07:00
Nikita Vemuri
0e74bc20b5
[core] Fix how protocol is removed for external ray dashboard URL (#27652)
* fix how protocol is removed for external dashboard url
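
One way to remove a protocol prefix safely, sketched with a hypothetical helper (this is not the code from the PR). A common pitfall in this kind of code is `str.lstrip`, which strips a set of characters rather than a prefix.

```python
def strip_protocol(url: str) -> str:
    """Drop a leading `scheme://` if present; otherwise return the URL as is."""
    scheme, separator, rest = url.partition("://")
    return rest if separator else url


# Pitfall: lstrip treats its argument as a character set, not a prefix.
mangled = "http://ptth.example.com".lstrip("http://")  # ".example.com"
```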
2022-08-08 18:23:12 -07:00
matthewdeng
fbdec1add0
[air] remove rllib dependency from tensorflow_predictor (#27671) 2022-08-08 18:05:48 -07:00
Alan Guo
3a819fafb7
Force grpcio to be >= 1.42.0 for python 3.10 (#27269) 2022-08-08 17:37:18 -07:00
Jian Xiao
e5c3f1cf3a
Fix a few stale Datasets documentation in AIR (#27623)
The descriptions of Datasets are currently out of date.
2022-08-08 17:33:23 -07:00
Clark Zinzow
3b151c581e
[Datasets] Delay expensive tensor extension type import until Parquet reading. (#27653)
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.

I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
2022-08-08 17:06:25 -07:00
Eric Liang
ffe3716c9a
[docs] Trainer user guide should come before configuring datasets for trainer guide (#27661) 2022-08-08 16:43:59 -07:00
xwjiang2010
9c7fc5ccdd
[tune/doc] fix emphasized line number. (#27648) 2022-08-08 16:37:47 -07:00
Yi Cheng
dac7bf17d9
[serve] Make serve agent not blocking when GCS is down. (#27526)
This PR fixes several issues that block the Serve agent when GCS is down. We need to make sure the Serve agent is always alive so that external requests can be sent to the agent to check the status.

- The internal KV used in the dashboard/agent blocks the agent. We use the async one instead.
- The Serve controller uses ray.nodes, which is a blocking call that blocks forever. Change it to use the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.

To enable Serve HA, we also need to set:

- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5

which we should set in KubeRay.
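
A rough sketch of the general pattern of replacing a call that can block forever with an awaitable bounded by a timeout. The helper is hypothetical, and reading `RAY_SERVE_KV_TIMEOUT_S` inside it is an assumption made only for illustration.

```python
import asyncio
import os
from typing import Any, Awaitable, Callable, Optional


async def call_with_timeout(
    make_call: Callable[[], Awaitable[Any]], timeout_s: Optional[float] = None
) -> Optional[Any]:
    """Await `make_call()` but give up after `timeout_s` seconds.

    Hypothetical helper: returns None on timeout so the agent stays responsive.
    """
    if timeout_s is None:
        # Assumption: fall back to the env var named above, defaulting to 5.
        timeout_s = float(os.environ.get("RAY_SERVE_KV_TIMEOUT_S", "5"))
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None
```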
2022-08-08 16:29:42 -07:00
Balaji Veeramani
87ff765647
[AIR] Make Concatenator deterministic (#27575) 2022-08-08 15:49:46 -07:00
Richard Liaw
fb43bd5baf
[air/docs] Update train gettingstarted (#27655)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-08-08 15:45:00 -07:00
kourosh hakhamaneshi
98b9fa6944
[RLlib] Hotfix for connector tests (#27654)
hot fix for rllib connector tests

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
2022-08-08 15:12:47 -07:00
Yi Cheng
cadeccd9b7
[core] Fix job counter not working with storage namespace (#27627)
JobCounter is not working with storage namespace right now because the key is the same across namespaces.

This PR fixes it by just adding it there, because this is the minimal change, which is safer.

A follow-up PR is needed to clean up the Redis storage in C++.
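
To illustrate why a key shared across namespaces breaks the counter, here is a toy sketch. The key format and the dict-backed store are assumptions, not Ray's storage layout.

```python
from typing import Dict


def job_counter_key(namespace: str) -> str:
    # Hypothetical key format: prefix the counter key with the namespace.
    return f"{namespace}:JobCounter" if namespace else "JobCounter"


def next_job_id(store: Dict[str, int], namespace: str) -> int:
    """Increment and return the counter that belongs to `namespace`."""
    key = job_counter_key(namespace)
    store[key] = store.get(key, 0) + 1
    return store[key]
```

With a single un-prefixed key, two clusters sharing one store would advance the same counter.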
2022-08-08 14:24:32 -07:00
Stephanie Wang
ccbae3325c
[core] Reconstruct manually freed objects (#27567)
Objects freed by the manual and internal free call previously would not get reconstructed. This PR introduces the following semantics after a free call:

    If no failure occurs, and the object is needed by a downstream task, an ObjectFreedError will be thrown.
    If a failure occurs, causing a downstream task to be re-executed, the freed object will get reconstructed as usual.

Also fixes some incidental bugs:

    Don't crash on failure to contact local raylet during object recovery. This will produce a nicer error message because we will instead throw an application-level error when someone tries to get an object.
    Fix a circular lock dependency between task failure <> task dependency resolution.
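
The two post-free semantics listed earlier in this message can be modeled with a toy store. This is a simplified illustration, not Ray's object recovery code; `ToyObjectStore` and the `recovering` flag are hypothetical, and only the error name comes from the text.

```python
from typing import Any, Callable, Dict, Set


class ObjectFreedError(Exception):
    """Toy stand-in for the error named in the commit message."""


class ToyObjectStore:
    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._lineage: Dict[str, Callable[[], Any]] = {}
        self._freed: Set[str] = set()

    def put_from_task(self, object_id: str, task: Callable[[], Any]) -> None:
        # Remember the creating task so the value can be recomputed later.
        self._lineage[object_id] = task
        self._values[object_id] = task()

    def free(self, object_id: str) -> None:
        self._values.pop(object_id, None)
        self._freed.add(object_id)

    def get(self, object_id: str, recovering: bool = False) -> Any:
        if object_id in self._values:
            return self._values[object_id]
        if object_id in self._freed and not recovering:
            # No failure happened: the caller asked for a freed object.
            raise ObjectFreedError(object_id)
        # Failure-recovery path: re-execute the creating task.
        value = self._lineage[object_id]()
        self._values[object_id] = value
        self._freed.discard(object_id)
        return value
```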

Related issue number

Closes #27265.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-08 13:40:51 -07:00
Yi Cheng
1533976b82
[deflakey] test_error_handling.py in workflow (#27630)
Signed-off-by: Yi Cheng <chengyidna@gmail.com>

## Why are these changes needed?
This test times out. Move it to large.
```
WARNING: //python/ray/workflow:tests/test_error_handling: Test execution time (288.7s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="long" or size="large".
```
2022-08-08 13:38:37 -07:00
Avnish Narayan
aee008ab49
[RLlib] PPO release tests tuned and re-enabled. (#27564) 2022-08-08 21:04:19 +02:00
SangBin Cho
be64df6f5d
Fix an uncaught exception upon deallocation for actors (#27637)
As specified here, https://joekuan.wordpress.com/2015/06/30/python-3-__del__-method-and-imported-modules/, the __del__ method doesn't guarantee that modules or function definitions are still referenced and not GC'ed. That means if you access any "modules", "functions", or "global variables", they may have been garbage collected.

This means we should not access any modules, functions, or global variables inside the __del__ method. While it's something we should handle more holistically in the near future, this PR fixes the issue in the short term.

The problem was that all Ray actors are decorated by trace_helper.py to make them compatible with OpenTelemetry (maybe we should make it optional). Currently, the __del__ method is also decorated. When __del__ is invoked, some of the functions used within this tracing decorator may be accessed after they have been deallocated (in this case, _is_tracing_enabled was deallocated). This fixes the issue by not decorating the __del__ method for tracing.
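
A simplified sketch of a class decorator that wraps methods but leaves `__del__` untouched. `trace` and `trace_methods` are hypothetical names; the real logic lives in trace_helper.py and is not reproduced here.

```python
import functools
import inspect


def trace(fn):
    """Hypothetical tracing wrapper that marks itself with `_traced`."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    wrapper._traced = True
    return wrapper


def trace_methods(cls):
    # list(...) because setattr mutates the class dict during iteration.
    for name, attr in list(vars(cls).items()):
        if name == "__del__":
            continue  # module globals may already be gone at finalization
        if inspect.isfunction(attr):
            setattr(cls, name, trace(attr))
    return cls
```

Leaving `__del__` undecorated means the finalizer never depends on helper functions that the interpreter may have already torn down.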
2022-08-08 11:51:25 -07:00
Eric Liang
f21ca925ac
[docs] Remove spam banner from master docs (#27599) 2022-08-08 11:47:39 -07:00
Zyiqin-Miranda
b3f06d97b2
[autoscaler] Consolidate CloudWatch agent/dashboard/alarm support; Add unit tests for AWS autoscaler CloudWatch integration (#22070)
This PR mainly adds two improvements:

- We introduced support for three CloudWatch config types in previous PRs: Agent, Dashboard, and Alarm. In this PR, we generalize the logic of all three config types by using the enum CloudwatchConfigType.
- Adds unit tests to ensure the correctness of the Ray autoscaler's CloudWatch integration behavior.
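
A sketch of how one enum can replace three parallel code paths. Only the enum name comes from the message above; the member values and the `config_key` helper are assumptions for illustration.

```python
from enum import Enum


class CloudwatchConfigType(Enum):
    # Member values are hypothetical, not taken from the PR.
    AGENT = "agent"
    DASHBOARD = "dashboard"
    ALARM = "alarm"


def config_key(config_type: CloudwatchConfigType) -> str:
    """Build a per-type lookup key from a single shared code path."""
    return f"cloudwatch_{config_type.value}_config"
```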
2022-08-08 11:45:07 -07:00
Balaji Veeramani
5087511c46
[AIR] Change FeatureHasher input schema to expect token counts (#27523)
This makes FeatureHasher work more like sklearn's FeatureHasher.
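
As a rough illustration of feature hashing over token counts, here is a pure-Python sketch. It uses MD5 for simplicity and is not the hashing scheme used by AIR or sklearn; the function name is hypothetical.

```python
import hashlib
from typing import Dict, List


def hash_token_counts(counts: Dict[str, int], num_features: int) -> List[int]:
    """Fold a token -> count mapping into a fixed-length vector."""
    vector = [0] * num_features
    for token, count in counts.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_features
        vector[index] += count  # colliding tokens add up
    return vector
```

Because the input is already a count per token, the hasher only has to pick a bucket for each token and add its count.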
2022-08-08 11:41:57 -07:00
Archit Kulkarni
f6328f46a3
[CI] [runtime env] Fix test_working_dir_2 timeout on Mac (#27563)
One GC test has unnecessary sleeps which are quite expensive due to the parametrization (2 x 2 x 2 = 8 iterations). They are unnecessary because they check that garbage collection of runtime env URIs doesn't occur after a certain time, but garbage collection isn't time-based.  This PR removes the sleeps.

This PR is just to fix CI; a followup PR will make the test more effective by attempting to trigger GC in a more targeted way (by starting multiple tasks with different runtime_env resources.  GC is only triggered upon *creation* of a new resource that causes the cache size to be exceeded.)
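
The size-triggered (not time-based) collection described above can be modeled with a toy cache. `ToyURICache` is a hypothetical simplification, not the runtime env implementation.

```python
from collections import OrderedDict
from typing import Dict


class ToyURICache:
    def __init__(self, max_total_size: int) -> None:
        self.max_total_size = max_total_size
        self._used: Dict[str, int] = {}
        self._unused: "OrderedDict[str, int]" = OrderedDict()

    def total_size(self) -> int:
        return sum(self._used.values()) + sum(self._unused.values())

    def add(self, uri: str, size: int) -> None:
        # Creation is the only event that can trigger eviction.
        self._used[uri] = size
        while self._unused and self.total_size() > self.max_total_size:
            self._unused.popitem(last=False)  # evict the oldest unused URI

    def mark_unused(self, uri: str) -> None:
        # No eviction here, however long the URI then sits unused.
        self._unused[uri] = self._used.pop(uri)

    def __contains__(self, uri: str) -> bool:
        return uri in self._used or uri in self._unused
```

In this model, sleeping after `mark_unused` changes nothing, which is why the sleeps in the test were unnecessary.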

It's still not clear what exactly caused the test suite to start taking longer recently, but it might be due to some change elsewhere in Ray, since there were no runtime_env related commits in that time period.
2022-08-08 11:31:21 -05:00
Richard Liaw
f15ed3836d
[air] Render trainer docstring signatures (#27590)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-08 09:29:21 -07:00
Artur Niederfahrenhorst
4fe47d069f
[RLlib] Require ApeX LR schedule test to produce learner info. (#27557) 2022-08-08 18:19:02 +02:00
kourosh hakhamaneshi
3b2a8427af
[RLlib] Fix SampleBatch to_device(). (#27572) 2022-08-08 18:18:33 +02:00
SangBin Cho
8c190e2d09
Revert "[serve][xlang]Support deploying Python deployment from Java. (#26877)" (#27626)
This reverts commit 9f8b596aaa.
2022-08-08 06:54:27 -07:00