Commit graph

14133 commits

Author SHA1 Message Date
Artur Niederfahrenhorst
c855469845
[RLlib] pin gym-minigrid @ 1.0.3 (#27761) 2022-08-11 12:27:44 +02:00
Rohan Potdar
600b8d4729
[RLlib]: Fix OPE docs. (#27460) 2022-08-11 09:14:22 +02:00
Artur Niederfahrenhorst
894e19f791
[RLlib] Dreamer's Episodic buffer should abide by ReplayBuffer API. (#27424) 2022-08-11 09:13:55 +02:00
matthewdeng
178b1e8a25
[data] enable test_split.py tests (#27150)
Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-08-10 22:15:34 -07:00
Stephanie Wang
043eac06ac
[docs] Revamp clusters section on job submission (#27756)
Page structure changes:

    Deploying a Ray Cluster on Kubernetes
        Getting Started -> links to jobs
    Deploying a Ray Cluster on VMs
        Getting started -> links to jobs
        User Guides
            Autoscaling (moved more content here in favor of the Getting started page)
    Running Applications on Ray Clusters
        Ray Jobs
            Quickstart Using the Ray Jobs CLI
            Python SDK
            REST API
            Ray Job Submission API Reference
            Ray Client

Content changes:

    modified "Deploying a Ray Cluster ..." quickstart pages to briefly summarize ad-hoc command execution, then link to jobs
    modified Ray Jobs example to be more incremental - start with a simple example, then show long-running script, then show example with a runtime env, instead of all of them at once
    center Ray Jobs quickstart around using the CLI. Made some minor changes to the Python SDK page to match it
    remove "Ray Jobs Architecture"
    moved "Autoscaling" content away from Kubernetes "Getting started" page into its own user guide. I think it's too complicated for "Getting Started". No content cuts.
    Cut "Viewing the dashboard" and "Ray Client" from Kubernetes "Getting started" page.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-10 20:15:55 -07:00
zcin
6776ebe5d6
[serve][docs] Document lightweight config updates (#27706)
A new feature was recently added, where Serve replicas are not restarted if only `num_replicas`, `autoscaling_config`, and/or `user_config` is updated in the config file that's redeployed. Updating docs to talk about this feature.

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
2022-08-10 21:01:16 -05:00
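The rule described in the commit above (replicas are kept when only `num_replicas`, `autoscaling_config`, and/or `user_config` change) can be sketched as a plain dictionary comparison. This is a minimal illustration and not Serve's implementation: `LIGHTWEIGHT_FIELDS`, `changed_fields`, and `needs_replica_restart` are hypothetical names, and the config is modeled as a flat dict.

```python
# Hypothetical sketch: decide whether a redeployed config would require
# restarting replicas. Field names come from the commit message; the
# function and constant names are made up for illustration.
LIGHTWEIGHT_FIELDS = {"num_replicas", "autoscaling_config", "user_config"}


def changed_fields(old, new):
    """Return the set of top-level keys whose values differ."""
    keys = old.keys() | new.keys()
    return {key for key in keys if old.get(key) != new.get(key)}


def needs_replica_restart(old, new):
    """True if any changed field falls outside the lightweight set."""
    return bool(changed_fields(old, new) - LIGHTWEIGHT_FIELDS)


if __name__ == "__main__":
    old = {"num_replicas": 1, "ray_actor_options": {"num_cpus": 1}}
    scaled = dict(old, num_replicas=3)
    resized = dict(old, ray_actor_options={"num_cpus": 2})
    print(needs_replica_restart(old, scaled))   # only num_replicas changed
    print(needs_replica_restart(old, resized))  # a non-lightweight field changed
```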
Yi Cheng
c5952f2163
[serve] Add an internal os env to turn the head node pin off (#27763)
When the node hosting the controller dies, GCS tries to reschedule the controller onto the same node. But GCS only marks the node as failed after 120s when GCS restarts (or 30s if only the raylet died).

This PR fixes that by unpinning the controller from the head node, so as long as GCS is alive, it reschedules the controller immediately. But we can't turn this on by default, so we introduce an internal flag for it.
2022-08-10 18:13:54 -07:00
matthewdeng
8eca6ae852
[rllib][release] mark long_running_many_ppo as unstable (#26874)
Per #26718 (comment)
2022-08-10 17:58:33 -07:00
Jiajun Yao
27e38f81bd
Pin _StatsActor to the driver node (#27765)
Similar to what's done in #23397

This allows the actor to fate-share with the driver and tolerate worker node failures.
2022-08-10 17:55:06 -07:00
Chen Shen
ddca52d2ca
[cluster doc] Promote new doc and deprecate the old (#27759)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-10 17:41:56 -07:00
Balaji Veeramani
7da7dbe3fd
[AIR] Improve preprocessor documentation (#27215)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-10 17:13:22 -07:00
Stephanie Wang
54a9b1d2d0
[docs] Revamp docs on observability for ray cluster apps (#27724)
Signed-off-by: Stephanie Wang swang@cs.berkeley.edu

Various cleanups around docs on Ray cluster "Monitoring and observability". After #27723, we will move these to a common page outside of VMs/k8s subsections:

    Add links to the more comprehensive observability section.
    Move and clean up cluster-specific content from Prometheus metrics to the new Ray Cluster page. I also modified a bunch of text here because previously we were not very clear about what the recommended approach was.
    Include more specific instructions about setting up observability tools for VMs vs k8s.
2022-08-10 15:06:28 -07:00
Jiajun Yao
fe4f2b5b07
[Doc] Add a cluster xgboost example for vm stack (#27732)
This is adapted from the same example of the k8s stack.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-10 12:36:16 -07:00
Eric Liang
ebfb76ff22
[docs] Minor tweaks to AIR intro icons
Signed-off-by: Eric Liang <ekhliang@gmail.com>
2022-08-10 10:32:11 -07:00
Richard Liaw
5bf6562f38
Remove Airflow integration (#27737)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-10 09:19:06 -07:00
Artur Niederfahrenhorst
04bc845360
[RLlib] Fix priority update for sequenced batches. (#27544) 2022-08-10 12:48:25 +02:00
Chen Shen
a1d80dc195
[Cluster-launcher doc] revamp the vm part (#27431) 2022-08-10 02:43:28 -07:00
Cheng Su
853c859037
[Datasets] Better error message for partition filtering if no file found (#27353)
A user raised #26605: the error message was not actionable when partition filtering was applied to the input files and no files with the required extension were found.

Signed-off-by: Cheng Su <scnju13@gmail.com>
2022-08-09 22:42:20 -07:00
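The kind of actionable error the commit above describes might look like the following. This is a hedged sketch, not the Datasets code: `filter_by_extension` is a hypothetical helper, and the message wording is an assumption.

```python
from pathlib import PurePosixPath


def filter_by_extension(paths, extension):
    """Keep paths with the given extension; fail loudly if none match.

    Hypothetical helper for illustration only.
    """
    wanted = "." + extension.lstrip(".").lower()
    kept = [p for p in paths if PurePosixPath(p).suffix.lower() == wanted]
    if not kept:
        # Tell the user what was looked for and what was actually seen.
        seen = sorted({PurePosixPath(p).suffix or "<none>" for p in paths})
        raise ValueError(
            f"No input files with extension '{wanted}' were found. "
            f"Extensions present in the {len(paths)} input path(s): {seen}. "
            "Check the input paths or change the extension filter."
        )
    return kept
```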
Alan Guo
3c068ae748
Add new docs for the new dashboard (#27684) 2022-08-09 22:40:32 -07:00
Cade Daniel
03d835e4e2
[Ray Clusters][docs] Create new Running Apps on Ray Clusters section (#27723)
This adds the structure described here, namely adding a new section under Ray Clusters which is focused on running applications on Ray clusters.

Signed-off-by: Cade Daniel <cade@anyscale.com>

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-09 21:01:47 -07:00
zcin
ea2a11080f
[serve][doc] Update Serve API in tutorials code (#27579) 2022-08-09 19:59:14 -07:00
Huaiwei Sun
e33edcb0b7
[doc] css improvements for rate the doc components (#27717)
- improve the css for the rate the doc component
[Before and after screenshots of the rate-the-doc component]
2022-08-09 18:48:22 -07:00
Cade Daniel
2246ea7fe4
Fixing doc linter: broken links in ray-tracing.rst and ray-dag.rst (#27721)
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
2022-08-09 18:09:29 -07:00
matthewdeng
1b19f3c593
[docs] add dask compatibility for 1.13.0 and 2.0.0 (#27699)
Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-08-09 16:13:02 -07:00
kourosh hakhamaneshi
4607e788c1
[RLlib] Fix test_ope flakiness (#27676) 2022-08-09 16:12:30 -07:00
Cheng Su
bc5d8d9176
[AIR] Replace references of to_tf with iter_tf_batches (#27672) 2022-08-09 16:00:02 -07:00
kourosh hakhamaneshi
3b3c20209b
[RLlib] Fix dqn reproducibility (#27459) 2022-08-09 15:56:44 -07:00
Cade Daniel
8826646303
[Ray Clusters][docs] Restructuring Clusters API reference (#27679)
This PR:

    Copies the existing clusters API reference to the new structure. The reference docs are split out into Ray Clusters (common between vms and k8s) and Ray Clusters on VMs (specific to vms). Notably, there is also a reference section for k8s, but not in this PR.
    Moves the three job submission user guides back into a single one. Jules had suggested that we break them out into rest/sdk/cli, but that's not P0 right now.
    Fixes some bugs in the left navigation bar. There should be less duplication of TOC entries. I'll keep working on related fixes in a different PR.

Signed-off-by: Cade Daniel <cade@anyscale.com>
2022-08-09 15:33:09 -07:00
shrekris-anyscale
d809d748cf
[Serve] [Docs] Add consolidated Model Composition user guide (#26860)
This change adds introductory deployment graph documentation.

Links to updated documentation:
* [Model Composition](https://ray--26860.org.readthedocs.build/en/26860/serve/model_composition.html)
* [Examples Overview](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/index.html)
* [Deployment Graph Pattern Overview](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns.html)
  * [Pattern: Linear Pipeline](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns/linear_pipeline.html)
  * [Pattern: Branching Input](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns/branching_input.html)
  * [Pattern: Conditional](https://ray--26860.org.readthedocs.build/en/26860/serve/tutorials/deployment-graph-patterns/conditional.html)

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
2022-08-09 17:06:23 -05:00
Jiajun Yao
f084546d41
Fix out-of-band deserialization of actor handle (#27700)
When we deserialize an actor handle via pickle, we register it with an outer object ref equal to itself, which is wrong. For out-of-band deserialization, there should be no outer object ref.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-09 14:25:14 -07:00
Stephanie Wang
7d0fcd7ec6
[core] Allow reuse of cluster address if Ray is not running (#27666)
Signed-off-by: Stephanie Wang swang@cs.berkeley.edu

Cluster address is now written to a temp file. Previously we raised an error if ray start --head tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows ray start --head to continue if it can't find any GCS process associated with the recorded cluster address.
Closes #27021.
2022-08-09 13:48:48 -07:00
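The check described in the commit above (let `ray start --head` continue when nothing is running at the recorded cluster address) can be approximated as below. Note the swapped-in technique: the commit looks for a GCS process, while this sketch only tests whether anything accepts TCP connections at the recorded `host:port`. The helper names and the address file format are assumptions.

```python
import socket
from pathlib import Path


def is_listening(address, timeout=0.5):
    """Return True if something accepts TCP connections at 'host:port'."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def can_start_head(address_file):
    """Allow startup if no address is recorded or the recorded one is dead.

    address_file is a hypothetical path to a file holding 'host:port'.
    """
    path = Path(address_file)
    if not path.exists():
        return True
    recorded = path.read_text().strip()
    if not recorded:
        return True
    return not is_listening(recorded)
```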
Richard Liaw
93a3cc222b
[docs/air] remove xgboost/lightgbm references and move AIR toc (#27687) 2022-08-09 12:49:44 -07:00
Eric Liang
92928fe86c
[docs] Minor polish on AIR getting started page (#27696) 2022-08-09 11:24:18 -07:00
Cade Daniel
13f43b939a
[docs][Ray Clusters] Key Concepts page (#27510) 2022-08-09 10:01:05 -07:00
Sihan Wang
2881d3e9f1
[Serve/Doc] Update http with serve user guide (#27536)
- Merge http user guides and http adapter
- Update the code to use bind()
- Remove some unsupported content
- minor wording improvement
2022-08-09 11:42:34 -05:00
Archit Kulkarni
dec8a660c5
[Doc] [Serve] Nits/Edits on Performance Tuning page (#27651)
This PR is an edit pass on the Performance Tuning page after reading it with fresh eyes. None of the content was out of date so it's mostly nits and rewording some parts that were slightly confusing.
2022-08-09 11:36:21 -05:00
Edward Oakes
db64717269
[serve][docs] Update key concepts page for Ray 2.0 (#27565)
Closes https://github.com/ray-project/ray/issues/27438
2022-08-09 11:34:11 -05:00
Sihan Wang
22d1be5823
[Serve] Make serve.run to start serve with http on EveryNode mode (#27668)
Signed-off-by: Sihan Wang <sihanwang41@gmail.com>
2022-08-09 09:29:38 -07:00
Richard Liaw
bb5e8c3536
fix-link-check (#27703) 2022-08-09 08:57:49 -07:00
Huaiwei Sun
e82d7ef750
[docs] css improvements (#27698) 2022-08-09 08:45:12 -07:00
Charles Sun
c358305ca6
[RLlib] DatasetReader action normalization. (#27356) 2022-08-09 16:54:03 +02:00
Sven Mika
537f7c65c1
[RLlib] CRR framework torch by default. (#27161) 2022-08-09 16:53:00 +02:00
kourosh hakhamaneshi
b84dd38f01
[RLlib] Add __getitem__ to MultiAgentBatch to access policy_batches. (#27619) 2022-08-09 16:51:26 +02:00
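What the commit above adds can be pictured with a toy class. This is not RLlib's `MultiAgentBatch`: `MultiAgentBatchSketch` is a hypothetical stand-in that only shows `__getitem__` forwarding to a `policy_batches` mapping.

```python
class MultiAgentBatchSketch:
    """Toy stand-in: holds one batch per policy ID."""

    def __init__(self, policy_batches):
        # Mapping from policy ID to that policy's batch (any object here).
        self.policy_batches = policy_batches

    def __getitem__(self, policy_id):
        # batch["p0"] is shorthand for batch.policy_batches["p0"].
        return self.policy_batches[policy_id]
```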
Dmitri Gekhtman
3293317c40
[kubernetes][docs] Logging guide, networking info, migration guide, fixes. (#27607)
This PR

Adds notes and example on logging for Ray/K8s.
Implements an API Reference page pointing to the configuration guide and the RayCluster CR definition.
Takes managed K8s services out of the tabbed structure, to make that page look less sad.
Adds a comparison of the KubeRay operator and legacy K8s operator
Adds an architecture diagram for the autoscaling sections
Fixes some other minor items
Adds some info about networking to the configuration guide, removes the previously planned networking page

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2022-08-09 00:38:05 -07:00
Alan Guo
c3a8ba0f8a
Add maximum number of characters in logs output for jobs status message (#27581)
We've seen the API server go down from trying to return 500 MB of log output.
2022-08-08 20:24:51 -07:00
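Capping log output by character count, as the commit above does for the job status message, can be sketched as follows. Keeping the tail of the log (rather than the head) and the name `truncate_logs` are assumptions made for illustration; the commit does not state which end is kept or what the limit is.

```python
def truncate_logs(text, max_chars):
    """Return at most max_chars characters, keeping the end of the log."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    # Keep the most recent output, which sits at the end of the log.
    return text[-max_chars:]
```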
Nikita Vemuri
0e74bc20b5
[core] Fix how protocol is removed for external ray dashboard URL (#27652)
* fix how protocol is removed for external dashboard url
2022-08-08 18:23:12 -07:00
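Removing the protocol from a URL, the operation the commit above fixes, can be done by splitting on the first `://`. This is a sketch and not the code from the PR; `strip_protocol` is a hypothetical name.

```python
def strip_protocol(url):
    """Drop a leading 'scheme://' if present; otherwise return url unchanged."""
    scheme, separator, rest = url.partition("://")
    # Note: str.lstrip("http://") would be wrong here, because lstrip treats
    # its argument as a set of characters, not as a prefix.
    return rest if separator else url
```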
matthewdeng
fbdec1add0
[air] remove rllib dependency from tensorflow_predictor (#27671) 2022-08-08 18:05:48 -07:00
Alan Guo
3a819fafb7
Force grpcio to be >= 1.42.0 for python 3.10 (#27269) 2022-08-08 17:37:18 -07:00
Jian Xiao
e5c3f1cf3a
Fix a few stale Datasets documentation in AIR (#27623)
The descriptions of Datasets are not up-to-date now.
2022-08-08 17:33:23 -07:00
Clark Zinzow
3b151c581e
[Datasets] Delay expensive tensor extension type import until Parquet reading. (#27653)
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.

I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
2022-08-08 17:06:25 -07:00
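The deferral pattern described in the commit above (import an expensive module only on the code path that needs it) can be sketched as below. The stdlib `decimal` module stands in for the tensor extension, and all function names are hypothetical.

```python
import importlib

_extension_module = None  # cached after the first load


def load_tensor_extension():
    """Import the (stand-in) expensive module on first use only."""
    global _extension_module
    if _extension_module is None:
        # "decimal" is a stdlib placeholder for the costly import.
        _extension_module = importlib.import_module("decimal")
    return _extension_module


def extension_loaded():
    """Report whether the deferred import has happened yet."""
    return _extension_module is not None


def read_parquet_sketch(values):
    """Only this code path pays the import cost."""
    extension = load_tensor_extension()
    return [extension.Decimal(value) for value in values]
```

Code paths that never call `read_parquet_sketch` never trigger the import, which is the effect the commit message describes.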