[Ray clusters] [docs] Copying all Ray Clusters doc content to new structure (#27062)

This commit is contained in:
Cade Daniel 2022-07-27 14:22:44 -07:00 committed by GitHub
parent 4f0fb3a5da
commit db26c779a0
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
52 changed files with 5745 additions and 375 deletions

View file

@ -297,7 +297,6 @@ parts:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/launching-clusters/add-your-own-cloud-provider
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/running-ray-cluster-on-prem
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/monitoring-and-observing-ray-cluster
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/manual-cluster-setup
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/large-cluster-best-practices
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/multi-tenancy-best-practices
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/configuring-autoscaling
@ -333,8 +332,9 @@ parts:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/index
sections:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/job-submission-apis
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/ray-cluster-cli
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/ray-cluster-configuration
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/ray-job-submission
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/autoscaler-sdk-api
- caption: References
chapters:

View file

@ -1,264 +1,97 @@
.. include:: /_includes/clusters/announcement.rst
.. include:: we_are_hiring.rst
.. _ref-cluster-getting-started-under-construction:
.. warning::
This page is under construction!
TODO(cade)
Direct users, based on what they are trying to accomplish, to the
correct page between "Managing Ray Clusters on Kubernetes",
"Managing Ray Clusters via `ray up`", and "Using Ray Clusters".
There should be some discussion on Kubernetes vs. `ray up` for
those looking to create new Ray clusters for the first time.
Getting Started with Ray Clusters
=================================
This page demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <cluster-private-setup>` guide.
About the demo
--------------
This demo will walk through an end-to-end flow:
1. Create a (basic) Python application.
2. Launch a cluster on a cloud provider.
3. Run the application in the cloud.
Requirements
~~~~~~~~~~~~
To run this demo, you will need:
* Python installed on your development machine (typically your laptop), and
* an account at your preferred cloud provider (AWS, Azure or GCP).
Setup
~~~~~
Before we start, you will need to install some Python dependencies as follows:
.. tabbed:: AWS
.. code-block:: shell
$ pip install -U "ray[default]" boto3
.. tabbed:: Azure
.. code-block:: shell
$ pip install -U "ray[default]" azure-cli azure-core
.. tabbed:: GCP
.. code-block:: shell
$ pip install -U "ray[default]" google-api-python-client
Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:
.. tabbed:: AWS
Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`_.
.. tabbed:: Azure
Log in using ``az login``, then configure your credentials with ``az account set -s <subscription_id>``.
.. tabbed:: GCP
Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs <https://cloud.google.com/docs/authentication/getting-started>`_.
Create a (basic) Python application
-----------------------------------
We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    ip_addresses = [f() for _ in range(10000)]
    print(Counter(ip_addresses))
Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take 10 seconds to run and output something similar to ``Counter({'127.0.0.1': 10000})``.
With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough<core-walkthrough>`):
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)
    print(Counter(ip_addresses))
Finally, let's add some code to make the output more interesting:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    print('''This cluster consists of
        {} nodes in total
        {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)

    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Running ``python script.py`` should now output something like:
.. parsed-literal::
This cluster consists of
1 nodes in total
4.0 CPU resources in total
Tasks executed
10000 tasks on 127.0.0.1
Launch a cluster on a cloud provider
------------------------------------
To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes.
A minimal sample cluster configuration file looks as follows:
.. tabbed:: AWS

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: aws
            region: us-west-2

.. tabbed:: Azure

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: azure
            location: westus2
            resource_group: ray-cluster

        # How Ray will authenticate with newly launched nodes.
        auth:
            ssh_user: ubuntu
            # you must specify paths to matching private and public key pair files
            # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
            ssh_private_key: ~/.ssh/id_rsa
            # changes to this should match what is specified in file_mounts
            ssh_public_key: ~/.ssh/id_rsa.pub

.. tabbed:: GCP

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: gcp
            region: us-west1
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
.. code-block:: shell
$ ray up -y config.yaml
Run the application in the cloud
--------------------------------
We are now ready to execute the application across multiple machines on our Ray cloud cluster.
First, we need to edit the initialization command ``ray.init()`` in ``script.py``.
Change it to
.. code-block:: python
ray.init(address='auto')
This tells your script to connect to the Ray runtime on the remote cluster instead of initializing a new Ray runtime.
Next, run the following command:
.. code-block:: shell
$ ray submit config.yaml script.py
The output should now look similar to the following:
.. parsed-literal::
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3425 tasks on xxx.xxx.xxx.xxx
3834 tasks on xxx.xxx.xxx.xxx
2741 tasks on xxx.xxx.xxx.xxx
In this sample output, 3 nodes were started. If the output only shows 1 node, you may want to increase the ``secs`` in ``time.sleep(secs)`` to give Ray more time to start additional nodes.
The Ray CLI offers additional functionality. For example, you can monitor the Ray cluster status with ``ray monitor config.yaml``, and you can connect to the cluster (ssh into the head node) with ``ray attach config.yaml``. For a full reference on the Ray CLI, please refer to :ref:`the cluster commands reference <cluster-commands>`.
To finish, don't forget to shut down the cluster. Run the following command:
.. code-block:: shell
$ ray down -y config.yaml
.. include:: /_includes/clusters/announcement.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-index-under-construction:
..
TODO(cade)
Update this to accomplish the following:
Direct users, based on what they are trying to accomplish, to the
correct page between "Managing Ray Clusters on Kubernetes",
"Managing Ray Clusters via `ray up`", and "Using Ray Clusters".
There should be some discussion on Kubernetes vs. `ray up` for
those looking to create new Ray clusters for the first time.
Ray Clusters Overview
=====================
What is a Ray cluster?
----------------------
One of Ray's strengths is the ability to leverage multiple machines for
distributed execution. Ray can, of course, be run on a single machine (and is
done so often), but the real power is using Ray on a cluster of machines.
Ray can automatically interact with the cloud provider to request or release
instances. You can specify :ref:`a configuration <cluster-config>` to launch
clusters on :ref:`AWS, GCP, Azure (community-maintained), Aliyun (community-maintained), on-premise, or even on
your custom node provider <cluster-cloud>`. Ray can also be run on :ref:`Kubernetes <kuberay-index>` infrastructure.
Your cluster can have a fixed size
or :ref:`automatically scale up and down<cluster-autoscaler>` depending on the
demands of your application.
Where to go from here?
----------------------
.. panels::
:container: text-center
:column: col-lg-6 px-2 py-2
:card:
**Quick Start**
^^^
In this quick start tutorial you will take a sample application designed to
run on a laptop and scale it up in the cloud.
+++
.. link-button:: ref-cluster-quick-start-vms-under-construction
:type: ref
:text: Ray Clusters Quick Start
:classes: btn-outline-info btn-block
---
**Key Concepts**
^^^
Understand the key concepts behind Ray Clusters. Learn about the main
concepts and the different ways to interact with a cluster.
+++
.. link-button:: cluster-key-concepts
:type: ref
:text: Learn Key Concepts
:classes: btn-outline-info btn-block
---
**Deployment Guide**
^^^
Learn how to set up a distributed Ray cluster and run your workloads on it.
+++
.. link-button:: ref-deployment-guide
:type: ref
:text: Deploy on a Ray Cluster
:classes: btn-outline-info btn-block
---
**API**
^^^
Get more in-depth information about the various APIs to interact with Ray
Clusters, including the :ref:`Ray cluster config YAML and CLI<cluster-config>`,
the :ref:`Ray Client API<ray-client>` and the
:ref:`Ray job submission API<ray-job-submission-api-ref>`.
+++
.. link-button:: ref-cluster-api
:type: ref
:text: Read the API Reference
:classes: btn-outline-info btn-block
.. include:: /_includes/clusters/announcement_bottom.rst

View file

@ -1,26 +1,130 @@
.. include:: we_are_hiring.rst
.. warning::
This page is under construction!
Key Concepts
============
..
TODO(cade) Can we simplify this? From https://github.com/ray-project/ray/pull/26754#issuecomment-1192927645:
* Worker Nodes
* Head Node
* Autoscaler
* Clients and Jobs
Need to add the following sections + break out existing content into them.
See ray-core/user-guide.rst for a TOC example
overview
high-level-architecture
jobs
nodes-vs-workers
scheduling-and-autoscaling
configuration
Things-to-know
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-key-concepts-under-construction:
Key Concepts
============
TODO(cade) Can we simplify this? From https://github.com/ray-project/ray/pull/26754#issuecomment-1192927645:
* Worker Nodes
* Head Node
* Autoscaler
* Clients and Jobs
Cluster
-------
Need to add the following sections + break out existing content into them.
See ray-core/user-guide.rst for a TOC example
A Ray cluster is a set of one or more nodes that are running Ray and share the
same :ref:`head node<cluster-node-types>`.
overview
high-level-architecture
jobs
nodes-vs-workers
scheduling-and-autoscaling
configuration
Things-to-know
.. _cluster-node-types-under-construction:
Node types
----------
A Ray cluster consists of a :ref:`head node<cluster-head-node>` and a set of
:ref:`worker nodes<cluster-worker-node>`.
.. image:: ray-cluster.jpg
:align: center
:width: 600px
.. _cluster-head-node-under-construction:
Head node
~~~~~~~~~
The head node is the first node started by the
:ref:`Ray cluster launcher<cluster-launcher>` when trying to launch a Ray
cluster. Among other things, the head node holds the :ref:`Global Control Store
(GCS)<memory>` and runs the :ref:`autoscaler<cluster-autoscaler>`. Once the head
node is started, it will be responsible for launching any additional
:ref:`worker nodes<cluster-worker-node>`. The head node itself will also execute
tasks and actors to utilize its capacity.
.. _cluster-worker-node-under-construction:
Worker node
~~~~~~~~~~~
A worker node is any node in the Ray cluster that is not functioning as the head node.
Therefore, worker nodes are simply responsible for executing tasks and actors.
When a worker node is launched, it will be given the address of the head node to
form a cluster.
.. _cluster-launcher-under-construction:
Cluster launcher
----------------
The cluster launcher is a process responsible for bootstrapping the Ray cluster
by launching the :ref:`head node<cluster-head-node>`. For more information on how
to use the cluster launcher, refer to
:ref:`cluster launcher CLI commands documentation<cluster-commands>` and the
corresponding :ref:`documentation for the configuration file<cluster-config>`.
.. _cluster-autoscaler-under-construction:
Autoscaler
----------
The autoscaler is a process that runs on the :ref:`head node<cluster-head-node>`
and is responsible for adding or removing :ref:`worker nodes<cluster-worker-node>`
to meet the needs of the Ray workload while matching the specification in the
:ref:`cluster config file<cluster-config>`. In particular, if the resource
demands of the Ray workload exceed the current capacity of the cluster, the
autoscaler will try to add nodes. Conversely, if a node is idle for long enough,
the autoscaler will remove it from the cluster. To learn more about autoscaling,
refer to the :ref:`Ray cluster deployment guide<deployment-guide-autoscaler>`.
Ray Client
----------
The Ray Client is an API that connects a Python script to a remote Ray cluster.
To learn more about the Ray Client, you can refer to the :ref:`documentation<ray-client>`.
Job submission
--------------
Ray Job submission is a mechanism to submit locally developed and tested applications
to a remote Ray cluster. It simplifies the experience of packaging, deploying,
and managing a Ray application. To learn more about Ray jobs, refer to the
:ref:`documentation<ray-job-submission-api-ref>`.
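As a minimal sketch using the Python ``JobSubmissionClient`` (the address below is a placeholder for the head node's dashboard address; ``8265`` is the default dashboard port):

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Point the client at the remote cluster's dashboard address.
    client = JobSubmissionClient("http://<head_node_host>:8265")

    # Submit a locally developed script together with its working directory.
    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={"working_dir": "./"},
    )
    print(client.get_job_status(job_id))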
Cloud clusters
--------------
If you're using AWS, GCP, Azure (community-maintained) or Aliyun (community-maintained), you can use the
:ref:`Ray cluster launcher<cluster-launcher>` to launch cloud clusters, which
greatly simplifies the cluster setup process.
Cluster managers
----------------
You can simplify the process of managing Ray clusters using a number of popular
cluster managers including :ref:`Kubernetes<kuberay-index>`,
:ref:`YARN<ray-yarn-deploy>`, :ref:`Slurm<ray-slurm-deploy>` and :ref:`LSF<ray-LSF-deploy>`.
Kubernetes (K8s) operator
-------------------------
Deployments of Ray on Kubernetes are managed by the Ray Kubernetes Operator. The
Ray Operator makes it easy to deploy clusters of Ray pods within a Kubernetes
cluster. To learn more about the K8s operator, refer to
the :ref:`documentation<kuberay-index>`.

View file

@ -1,4 +0,0 @@
# Getting Started
:::{warning}
This page is under construction!
:::

View file

@ -0,0 +1,251 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/announcement.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-quick-start-vms-under-construction:
Ray Clusters Quick Start
========================
This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <cluster-private-setup>` guide.
About the demo
--------------
This demo will walk through an end-to-end flow:
1. Create a (basic) Python application.
2. Launch a cluster on a cloud provider.
3. Run the application in the cloud.
Requirements
~~~~~~~~~~~~
To run this demo, you will need:
* Python installed on your development machine (typically your laptop), and
* an account at your preferred cloud provider (AWS, Azure or GCP).
Setup
~~~~~
Before we start, you will need to install some Python dependencies as follows:
.. tabbed:: AWS
.. code-block:: shell
$ pip install -U "ray[default]" boto3
.. tabbed:: Azure
.. code-block:: shell
$ pip install -U "ray[default]" azure-cli azure-core
.. tabbed:: GCP
.. code-block:: shell
$ pip install -U "ray[default]" google-api-python-client
Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:
.. tabbed:: AWS
Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`_.
.. tabbed:: Azure
Log in using ``az login``, then configure your credentials with ``az account set -s <subscription_id>``.
.. tabbed:: GCP
Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs <https://cloud.google.com/docs/authentication/getting-started>`_.
Create a (basic) Python application
-----------------------------------
We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    ip_addresses = [f() for _ in range(10000)]
    print(Counter(ip_addresses))
Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take 10 seconds to run and output something similar to ``Counter({'127.0.0.1': 10000})``.
With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough<core-walkthrough>`):
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)
    print(Counter(ip_addresses))
Finally, let's add some code to make the output more interesting:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    print('''This cluster consists of
        {} nodes in total
        {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)

    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Running ``python script.py`` should now output something like:
.. parsed-literal::
This cluster consists of
1 nodes in total
4.0 CPU resources in total
Tasks executed
10000 tasks on 127.0.0.1
Launch a cluster on a cloud provider
------------------------------------
To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes.
A minimal sample cluster configuration file looks as follows:
.. tabbed:: AWS

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: aws
            region: us-west-2

.. tabbed:: Azure

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: azure
            location: westus2
            resource_group: ray-cluster

        # How Ray will authenticate with newly launched nodes.
        auth:
            ssh_user: ubuntu
            # you must specify paths to matching private and public key pair files
            # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
            ssh_private_key: ~/.ssh/id_rsa
            # changes to this should match what is specified in file_mounts
            ssh_public_key: ~/.ssh/id_rsa.pub

.. tabbed:: GCP

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: gcp
            region: us-west1
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
.. code-block:: shell
$ ray up -y config.yaml
Run the application in the cloud
--------------------------------
We are now ready to execute the application across multiple machines on our Ray cloud cluster.
``ray.init()`` will now automatically connect to the newly created cluster.
Next, run the following command:
.. code-block:: shell
$ ray submit config.yaml script.py
The output should now look similar to the following:
.. parsed-literal::
Connecting to existing Ray cluster at address: <IP address>...
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3425 tasks on xxx.xxx.xxx.xxx
3834 tasks on xxx.xxx.xxx.xxx
2741 tasks on xxx.xxx.xxx.xxx
In this sample output, 3 nodes were started. If the output only shows 1 node, you may want to increase the ``secs`` in ``time.sleep(secs)`` to give Ray more time to start additional nodes.
The Ray CLI offers additional functionality. For example, you can monitor the Ray cluster status with ``ray monitor config.yaml``, and you can connect to the cluster (ssh into the head node) with ``ray attach config.yaml``. For a full reference on the Ray CLI, please refer to :ref:`the cluster commands reference <cluster-commands>`.
To finish, don't forget to shut down the cluster. Run the following command:
.. code-block:: shell
$ ray down -y config.yaml

View file

@ -0,0 +1,18 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-autoscaler-sdk-under-construction:
Autoscaler SDK
==============
.. _ref-autoscaler-sdk-request-resources-under-construction:
ray.autoscaler.sdk.request_resources
------------------------------------
Within a Ray program, you can command the autoscaler to scale the cluster up to a desired size with the ``request_resources()`` call. The cluster will immediately attempt to scale to accommodate the requested resources, bypassing normal upscaling speed constraints.
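As a minimal sketch (the resource shapes below are only illustrative):

.. code-block:: python

    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto")

    # Ask the autoscaler to size the cluster so that 16 CPUs' worth of tasks
    # can run concurrently, bypassing normal upscaling speed constraints.
    request_resources(num_cpus=16)

    # Alternatively, request capacity for specific resource bundles.
    request_resources(bundles=[{"CPU": 4}, {"CPU": 4}])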
.. .. autofunction:: ray.autoscaler.sdk.request_resources

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Job submission API

View file

@ -0,0 +1,12 @@
.. _ray-job-submission-api-ref-under-construction:
Ray Job Submission API
======================
For an overview with examples see :ref:`Ray Job Submission<jobs-overview>`.
.. _ray-job-submission-cli-ref-under-construction:
Job Submission CLI
------------------

View file

@ -0,0 +1,237 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-commands-under-construction:
Cluster Launcher Commands
=========================
This document gives an overview of common commands for using the Ray Cluster Launcher.
See the :ref:`Cluster Configuration <cluster-config>` docs on how to customize the configuration file.
Launching a cluster (``ray up``)
--------------------------------
This will start up the machines in the cloud, install your dependencies and run
any setup commands that you have, configure the Ray cluster automatically, and
prepare you to scale your distributed system. See :ref:`the documentation
<ray-up-doc>` for ``ray up``.
.. tip:: The worker nodes will start only after the head node has finished
starting. To monitor the progress of the cluster setup, you can run
`ray monitor <cluster yaml>`.
.. code-block:: shell
# Replace '<your_backend>' with one of: 'aws', 'gcp', 'kubernetes', or 'local'.
$ BACKEND=<your_backend>
# Create or update the cluster.
$ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/$BACKEND/example-full.yaml
Updating an existing cluster (``ray up``)
-----------------------------------------
If you want to update your cluster configuration (add more files, change dependencies), run ``ray up`` again on the existing cluster.
This command checks if the local configuration differs from the applied
configuration of the cluster. This includes any changes to synced files
specified in the ``file_mounts`` section of the config. If so, the new files
and config will be uploaded to the cluster. Following that, Ray
services/processes will be restarted.
.. tip:: Don't do this for the cloud provider specifications (e.g., change from
AWS to GCP on a running cluster) or change the cluster name (as this
will just start a new cluster and orphan the original one).
You can also run ``ray up`` to restart a cluster if it seems to be in a bad
state (this will restart all Ray services even if there are no config changes).
Running ``ray up`` on an existing cluster will do all the following:
* If the head node matches the cluster specification, the filemounts will be
reapplied and the ``setup_commands`` and ``ray start`` commands will be run.
There may be some caching behavior here to skip setup/file mounts.
* If the head node is out of date from the specified YAML (e.g.,
``head_node_type`` has changed on the YAML), then the out of date node will
be terminated and a new node will be provisioned to replace it. Setup/file
mounts/``ray start`` will be applied.
* After the head node reaches a consistent state (after ``ray start`` commands
are finished), the same above procedure will be applied to all the worker
nodes. The ``ray start`` commands tend to run a ``ray stop`` + ``ray start``,
so this will kill currently working jobs.
If you don't want the update to restart services (e.g., because the changes
don't require a restart), pass ``--no-restart`` to the update call.
If you want to force re-generation of the config to pick up possible changes in
the cloud environment, pass ``--no-config-cache`` to the update call.
If you want to skip the setup commands and only run ``ray stop``/``ray start``
on all nodes, pass ``--restart-only`` to the update call.
See :ref:`the documentation <ray-up-doc>` for ``ray up``.
.. code-block:: shell
# Reconfigure autoscaling behavior without interrupting running jobs.
$ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml \
--max-workers=N --no-restart
Running shell commands on the cluster (``ray exec``)
----------------------------------------------------
You can use ``ray exec`` to conveniently run commands on clusters. See :ref:`the documentation <ray-exec-doc>` for ``ray exec``.
.. code-block:: shell
# Run a command on the cluster
$ ray exec cluster.yaml 'echo "hello world"'
# Run a command on the cluster, starting it if needed
$ ray exec cluster.yaml 'echo "hello world"' --start
# Run a command on the cluster, stopping the cluster after it finishes
$ ray exec cluster.yaml 'echo "hello world"' --stop
# Run a command on a new cluster called 'experiment-1', stopping it after
$ ray exec cluster.yaml 'echo "hello world"' \
--start --stop --cluster-name experiment-1
# Run a command in a detached tmux session
$ ray exec cluster.yaml 'echo "hello world"' --tmux
# Run a command in a screen (experimental)
$ ray exec cluster.yaml 'echo "hello world"' --screen
If you want to run applications on the cluster that are accessible from a web
browser (e.g., Jupyter notebook), you can use the ``--port-forward`` option. The local
port opened is the same as the remote port.
.. code-block:: shell
$ ray exec cluster.yaml --port-forward=8899 'source ~/anaconda3/bin/activate tensorflow_p36 && jupyter notebook --port=8899'
.. note:: For Kubernetes clusters, the ``port-forward`` option cannot be used
while executing a command. To port forward and run a command you need
to call ``ray exec`` twice separately.
Running Ray scripts on the cluster (``ray submit``)
---------------------------------------------------
You can also use ``ray submit`` to execute Python scripts on clusters. This
will ``rsync`` the designated file onto the head node cluster and execute it
with the given arguments. See :ref:`the documentation <ray-submit-doc>` for
``ray submit``.
.. code-block:: shell
# Run a Python script in a detached tmux session
$ ray submit cluster.yaml --tmux --start --stop tune_experiment.py
# Run a Python script with arguments.
# This executes script.py on the head node of the cluster, using
# the command: python ~/script.py --arg1 --arg2 --arg3
$ ray submit cluster.yaml script.py -- --arg1 --arg2 --arg3
Attaching to a running cluster (``ray attach``)
-----------------------------------------------
You can use ``ray attach`` to attach to an interactive screen session on the
cluster. See :ref:`the documentation <ray-attach-doc>` for ``ray attach`` or
run ``ray attach --help``.
.. code-block:: shell
# Open a screen on the cluster
$ ray attach cluster.yaml
# Open a screen on a new cluster called 'session-1'
$ ray attach cluster.yaml --start --cluster-name=session-1
# Attach to tmux session on cluster (creates a new one if none available)
$ ray attach cluster.yaml --tmux
.. _ray-rsync-under-construction:
Synchronizing files from the cluster (``ray rsync-up/down``)
------------------------------------------------------------
To download or upload files to the cluster head node, use ``ray rsync_down`` or
``ray rsync_up``:
.. code-block:: shell
$ ray rsync_down cluster.yaml '/path/on/cluster' '/local/path'
$ ray rsync_up cluster.yaml '/local/path' '/path/on/cluster'
.. _monitor-cluster-under-construction:
Monitoring cluster status (``ray dashboard/status``)
-----------------------------------------------------
Ray also comes with an online dashboard. The dashboard is accessible via
HTTP on the head node (by default it listens on ``localhost:8265``). You can
also use the built-in ``ray dashboard`` command to set up port forwarding
automatically, making the remote dashboard viewable in your local browser at
``localhost:8265``.
.. code-block:: shell
$ ray dashboard cluster.yaml
You can monitor cluster usage and auto-scaling status by running (on the head node):
.. code-block:: shell
$ ray status
To see live updates to the status:
.. code-block:: shell
$ watch -n 1 ray status
The Ray autoscaler also reports per-node status in the form of instance tags.
In your cloud provider console, you can click on a Node, go to the "Tags" pane,
and add the ``ray-node-status`` tag as a column. This lets you see per-node
statuses at a glance:
.. image:: /images/autoscaler-status.png
Common Workflow: Syncing git branches
-------------------------------------
A common use case is syncing a particular local git branch to all workers of
the cluster. However, if you just put a ``git checkout <branch>`` in the setup
commands, the autoscaler won't know when to rerun the command to pull in
updates. There is a nice workaround for this by including the git SHA in the
input (the hash of the file will change if the branch is updated):
.. code-block:: yaml

    file_mounts: {
        "/tmp/current_branch_sha": "/path/to/local/repo/.git/refs/heads/<YOUR_BRANCH_NAME>",
    }

    setup_commands:
        - test -e <REPO_NAME> || git clone https://github.com/<REPO_ORG>/<REPO_NAME>.git
        - cd <REPO_NAME> && git fetch && git checkout `cat /tmp/current_branch_sha`
This tells ``ray up`` to sync the current git branch SHA from your personal
computer to a temporary file on the cluster (assuming you've pushed the branch
head already). Then, the setup commands read that file to figure out which SHA
they should check out on the nodes. Note that each command runs in its own
session. The final workflow to update the cluster then becomes just this:
1. Make local changes to a git branch
2. Commit the changes with ``git commit`` and ``git push``
3. Update files on your Ray cluster with ``ray up``

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Ray cluster configuration file

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Ray Job submission

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Using a community-supported cluster manager

View file

@ -0,0 +1,21 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-setup-under-construction:
Community Supported Cluster Managers
====================================
.. note::
If you're using AWS, Azure or GCP you can use the :ref:`Ray Cluster Launcher <cluster-cloud>` to simplify the cluster setup process.
.. toctree::
:maxdepth: 2
yarn.rst
slurm.rst
lsf.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# LSF

View file

@ -0,0 +1,23 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-LSF-deploy-under-construction:
Deploying on LSF
================
This document describes the high-level steps to run a Ray cluster on LSF.
1) Obtain desired nodes from LSF scheduler using bsub directives.
2) Obtain free ports on the desired nodes to start ray services like dashboard, GCS etc.
3) Start ray head node on one of the available nodes.
4) Connect all the worker nodes to the head node.
5) Perform port forwarding to access ray dashboard.
Steps 1-4 have been automated and can be run as a script. Please refer to the GitHub repository below to access the script and run sample workloads:

- `ray_LSF`_ Ray with LSF. Users can start up a Ray cluster on LSF and run DL workloads through it in either batch or interactive mode.
.. _`ray_LSF`: https://github.com/IBMSpectrumComputing/ray-integration

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# SLURM

View file

@ -0,0 +1,288 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-slurm-deploy-under-construction:
Deploying on Slurm
==================
Slurm usage with Ray can be a little bit unintuitive.
* SLURM requires that multiple copies of the same program be submitted to the same cluster to do cluster programming. This is particularly well-suited for MPI-based workloads.
* Ray, on the other hand, expects a head-worker architecture with a single point of entry. That is, you'll need to start a Ray head node, multiple Ray worker nodes, and run your Ray script on the head node.
.. warning::
SLURM support is still a work in progress. SLURM users should be aware
of current limitations regarding networking.
See :ref:`here <slurm-network-ray>` for more explanations.
SLURM support is community-maintained. Maintainer GitHub handle: tupui.
This document aims to clarify how to run Ray on SLURM.
.. contents::
:local:
Walkthrough using Ray with SLURM
--------------------------------
Many SLURM deployments require you to interact with SLURM via ``sbatch``, which executes a batch script on SLURM.
To run a Ray job with ``sbatch``, you will want to start a Ray cluster in the sbatch job with multiple ``srun`` commands (tasks), and then execute your Python script that uses Ray. Each task will run on a separate node and start/connect to a Ray runtime.
The below walkthrough will do the following:
1. Set the proper headers for the ``sbatch`` script.
2. Load the proper environment/modules.
3. Fetch a list of available computing nodes and their IP addresses.
4. Launch a Ray head process on one of the nodes (called the head node).
5. Launch Ray processes on the (n-1) worker nodes and connect them to the head node by providing the head node address.
6. After the underlying Ray cluster is ready, submit the user-specified task.
See :ref:`slurm-basic.sh <slurm-basic>` for an end-to-end example.
.. _ray-slurm-headers-under-construction:
sbatch directives
~~~~~~~~~~~~~~~~~
In your sbatch script, you'll want to add `directives to provide context <https://slurm.schedmd.com/sbatch.html>`__ for your job to SLURM.
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=my-workload
You'll need to tell SLURM to allocate nodes specifically for Ray. Ray will then find and manage all resources on each node.
.. code-block:: bash
### Modify this according to your Ray workload.
#SBATCH --nodes=4
#SBATCH --exclusive
Important: To ensure that each Ray worker runtime will run on a separate node, set ``tasks-per-node``.
.. code-block:: bash
#SBATCH --tasks-per-node=1
Since we've set ``tasks-per-node=1``, this will be used to guarantee that each Ray worker runtime will obtain the
proper resources. In this example, we ask for at least 5 CPUs and 5 GB of memory per node.
.. code-block:: bash
### Modify this according to your Ray workload.
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1GB
### Similarly, you can also specify the number of GPUs per node.
### Modify this according to your Ray workload. Sometimes this
### should be 'gres' instead.
#SBATCH --gpus-per-task=1
You can also add other optional flags to your sbatch directives.
Loading your environment
~~~~~~~~~~~~~~~~~~~~~~~~
First, you'll often want to load modules or activate your own conda environment at the beginning of the script.
Note that this is an optional step, but it is often required for enabling the right set of dependencies.
.. code-block:: bash
# Example: module load pytorch/v1.4.0-gpu
# Example: conda activate my-env
conda activate my-env
Obtain the head IP address
~~~~~~~~~~~~~~~~~~~~~~~~~~
Next, we'll want to obtain a hostname and a node IP address for the head node. This way, when we start worker nodes, we'll be able to properly connect to the right head node.
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_head_address_start__
:end-before: __doc_head_address_end__
Starting the Ray head node
~~~~~~~~~~~~~~~~~~~~~~~~~~
After detecting the head node hostname and head node IP, we'll want to create
a Ray head node runtime. We'll do this by using ``srun`` as a background task
as a single task/node (recall that ``tasks-per-node=1``).
Below, you'll see that we explicitly specify the number of CPUs (``num-cpus``)
and number of GPUs (``num-gpus``) to Ray, as this will prevent Ray from using
more resources than allocated. We also need to explicitly
indicate the ``node-ip-address`` for the Ray head runtime:
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_head_ray_start__
:end-before: __doc_head_ray_end__
By backgrounding the above srun task, we can proceed to start the Ray worker runtimes.
Starting the Ray worker nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below, we do the same thing, but for each worker. Make sure the Ray head and Ray worker processes are not started on the same node.
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_worker_ray_start__
:end-before: __doc_worker_ray_end__
Submitting your script
~~~~~~~~~~~~~~~~~~~~~~
Finally, you can invoke your Python script:
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_script_start__
.. _slurm-network-ray-under-construction:
SLURM networking caveats
~~~~~~~~~~~~~~~~~~~~~~~~
There are two important networking aspects to keep in mind when working with
SLURM and Ray:
1. Ports binding.
2. IP binding.
One common use of a SLURM cluster is to have multiple users running concurrent
jobs on the same infrastructure. This can easily conflict with Ray due to the
way the head node communicates with its workers.
Consider two users: if they both schedule a SLURM job using Ray
at the same time, they will both create a head node. In the background, Ray will
assign some internal ports to a few services. The issue is that as soon as the
first head node is created, it will bind some ports and prevent them from being
used by another head node. To prevent any conflicts, users have to manually
specify non-overlapping ranges of ports. The following ports are to be
adjusted. For an explanation on ports, see :ref:`here <ray-ports>`::
# used for all ports
--node-manager-port
--object-manager-port
--min-worker-port
--max-worker-port
# used for the head node
--port
--ray-client-server-port
--redis-shard-ports
For instance, with the same two users, they would have to adapt the instructions
above as follows:
.. code-block:: bash
# user 1
# same as above
...
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=6379 \
--node-manager-port=6700 \
--object-manager-port=6701 \
--ray-client-server-port=10001 \
--redis-shard-ports=6702 \
--min-worker-port=10002 \
--max-worker-port=19999 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
# user 2
# same as above
...
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=6380 \
--node-manager-port=6800 \
--object-manager-port=6801 \
--ray-client-server-port=20001 \
--redis-shard-ports=6802 \
--min-worker-port=20002 \
--max-worker-port=29999 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
As for IP binding, on some cluster architectures the network interfaces
do not allow the use of external IPs between nodes. Instead, there are internal
network interfaces (``eth0``, ``eth1``, etc.). Currently, it is difficult to
set an internal IP
(see the open `issue <https://github.com/ray-project/ray/issues/22732>`_).
Python-interface SLURM scripts
------------------------------
[Contributed by @pengzhenghao] Below, we provide a helper utility (:ref:`slurm-launch.py <slurm-launch>`) to auto-generate SLURM scripts and launch them.
``slurm-launch.py`` uses an underlying template (:ref:`slurm-template.sh <slurm-template>`) and fills out placeholders given user input.
Feel free to copy both files into your cluster for use, and to open PRs with contributions that improve this script!
Usage example
~~~~~~~~~~~~~
If you want to utilize a multi-node cluster in slurm:
.. code-block:: bash
python slurm-launch.py --exp-name test --command "python your_file.py" --num-nodes 3
If you want to specify the computing node(s), just use the same node name(s) in the same format as the output of the ``sinfo`` command:
.. code-block:: bash
python slurm-launch.py --exp-name test --command "python your_file.py" --num-nodes 3 --node NODE_NAMES
There are other options you can use when calling ``python slurm-launch.py``:
* ``--exp-name``: The experiment name. Will generate ``{exp-name}_{date}-{time}.sh`` and ``{exp-name}_{date}-{time}.log``.
* ``--command``: The command you wish to run. For example: ``rllib train XXX`` or ``python XXX.py``.
* ``--num-gpus``: The number of GPUs you wish to use in each computing node. Default: 0.
* ``--node`` (``-w``): The specific nodes you wish to use, in the same form as the output of ``sinfo``. Nodes are automatically assigned if not specified.
* ``--num-nodes`` (``-n``): The number of nodes you wish to use. Default: 1.
* ``--partition`` (``-p``): The partition you wish to use. Default: "", will use user's default partition.
* ``--load-env``: The command to setup your environment. For example: ``module load cuda/10.1``. Default: "".
Note that :ref:`slurm-template.sh <slurm-template>` is compatible with both IPv4 and IPv6 IP addresses of the computing nodes.
Implementation
~~~~~~~~~~~~~~
Concretely, :ref:`slurm-launch.py <slurm-launch>` does the following things:

1. It automatically writes your requirements, e.g. the number of CPUs and GPUs per node, the number of nodes, and so on, to an sbatch script named ``{exp-name}_{date}-{time}.sh``. Your command (``--command``) to launch your own job is also written into the sbatch script.
2. It then submits the sbatch script to the SLURM manager via a new process.
3. Finally, the Python process terminates itself, leaving a log file named ``{exp-name}_{date}-{time}.log`` to record the progress of your submitted command. In the meantime, the Ray cluster and your job run in the SLURM cluster.
Examples and templates
----------------------
Here are some community-contributed templates for using SLURM with Ray:
- `Ray sbatch submission scripts`_ used at `NERSC <https://www.nersc.gov/>`_, a US national lab.
- `YASPI`_ (yet another slurm python interface) by @albanie. The goal of yaspi is to provide an interface to submitting slurm jobs, thereby obviating the joys of sbatch files. It does so through recipes - these are collections of templates and rules for generating sbatch scripts. Supports job submissions for Ray.
- `Convenient python interface`_ to launch ray cluster and submit task by @pengzhenghao
.. _`Ray sbatch submission scripts`: https://github.com/NERSC/slurm-ray-cluster
.. _`YASPI`: https://github.com/albanie/yaspi
.. _`Convenient python interface`: https://github.com/pengzhenghao/use-ray-with-slurm

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# YARN

View file

@ -0,0 +1,199 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-yarn-deploy-under-construction:
Deploying on YARN
=================
.. warning::
Running Ray on YARN is still a work in progress. If you have a
suggestion for how to improve this documentation or want to request
a missing feature, please feel free to create a pull request or get in touch
using one of the channels in the `Questions or Issues?`_ section below.
This document assumes that you have access to a YARN cluster and will walk
you through using `Skein`_ to deploy a YARN job that starts a Ray cluster and
runs an example script on it.
Skein uses a declarative specification (either written as a yaml file or using the Python API) and allows users to launch jobs and scale applications without the need to write Java code.
You will first need to install Skein: ``pip install skein``.
The Skein ``yaml`` file and example Ray program used here are provided in the
`Ray repository`_ to get you started. Refer to the provided ``yaml``
files to be sure that you maintain important configuration options for Ray to
function properly.
.. _`Ray repository`: https://github.com/ray-project/ray/tree/master/doc/yarn
Skein Configuration
-------------------
A Ray job is configured to run as two `Skein services`:
1. The ``ray-head`` service that starts the Ray head node and then runs the
application.
2. The ``ray-worker`` service that starts worker nodes that join the Ray cluster.
You can change the number of instances in this configuration or at runtime
using ``skein container scale`` to scale the cluster up/down.
The specification for each service consists of necessary files and commands that will be run to start the service.
.. code-block:: yaml

    services:
        ray-head:
            # There should only be one instance of the head node per cluster.
            instances: 1
            resources:
                # The resources for the head node.
                vcores: 1
                memory: 2048
            files:
                ...
            script:
                ...
        ray-worker:
            # Number of ray worker nodes to start initially.
            # This can be scaled using 'skein container scale'.
            instances: 3
            resources:
                # The resources for the worker node.
                vcores: 1
                memory: 2048
            files:
                ...
            script:
                ...
Packaging Dependencies
----------------------
Use the ``files`` option to specify files that will be copied into the YARN container for the application to use. See `the Skein file distribution page <https://jcrist.github.io/skein/distributing-files.html>`_ for more information.
.. code-block:: yaml

    services:
        ray-head:
            # There should only be one instance of the head node per cluster.
            instances: 1
            resources:
                # The resources for the head node.
                vcores: 1
                memory: 2048
            files:
                # ray/doc/yarn/example.py
                example.py: example.py
                # # A packaged python environment using `conda-pack`. Note that Skein
                # # doesn't require any specific way of distributing files, but this
                # # is a good one for python projects. This is optional.
                # # See https://jcrist.github.io/skein/distributing-files.html
                # environment: environment.tar.gz
Ray Setup in YARN
-----------------
Below is a walkthrough of the bash commands used to start the ``ray-head`` and ``ray-worker`` services. Note that this configuration will launch a new Ray cluster for each application, not reuse the same cluster.
Head node commands
~~~~~~~~~~~~~~~~~~
Start by activating a pre-existing environment for dependency management.
.. code-block:: bash
source environment/bin/activate
Register the Ray head address needed by the workers in the Skein key-value store.
.. code-block:: bash
skein kv put --key=RAY_HEAD_ADDRESS --value=$(hostname -i) current
Start all the processes needed on the ray head node. By default, we set object store memory
and heap memory to roughly 200 MB. This is conservative and should be set according to application needs.
.. code-block:: bash
ray start --head --port=6379 --object-store-memory=200000000 --memory 200000000 --num-cpus=1
Execute the user script containing the Ray program.
.. code-block:: bash
python example.py
Clean up all started processes even if the application fails or is killed.
.. code-block:: bash
ray stop
skein application shutdown current
Putting things together, we have:
.. literalinclude:: /../yarn/ray-skein.yaml
:language: yaml
:start-after: # Head service
:end-before: # Worker service
Worker node commands
~~~~~~~~~~~~~~~~~~~~
Fetch the address of the head node from the Skein key-value store.
.. code-block:: bash
RAY_HEAD_ADDRESS=$(skein kv get current --key=RAY_HEAD_ADDRESS)
Start all of the processes needed on a ray worker node, blocking until killed by Skein/YARN via SIGTERM. After receiving SIGTERM, all started processes should also die (ray stop).
.. code-block:: bash
ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:6379 --block; ray stop
Putting things together, we have:
.. literalinclude:: /../yarn/ray-skein.yaml
:language: yaml
:start-after: # Worker service
Running a Job
-------------
Within your Ray script, use the following to connect to the started Ray cluster:
.. literalinclude:: /../yarn/example.py
:language: python
:start-after: if __name__ == "__main__"
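A minimal sketch of what that connection logic typically looks like is shown below; the real script lives at ``doc/yarn/example.py`` in the Ray repository and may differ:

.. code-block:: python

    import ray

    if __name__ == "__main__":
        # The ray-head service already ran `ray start --head --port=6379`,
        # so connecting with address="auto" attaches to that running cluster.
        ray.init(address="auto")
        print("Cluster resources:", ray.cluster_resources())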
You can use the following command to launch the application as specified by the Skein YAML file.
.. code-block:: bash
skein application submit [TEST.YAML]
Once it has been submitted, you can see the job running on the YARN dashboard.
.. image:: /images/yarn-job.png
Cleaning Up
-----------
To clean up a running job, use the following (using the application ID):
.. code-block:: bash
skein application shutdown $appid
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst
.. _`Skein`: https://jcrist.github.io/skein/

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Configuring autoscaling

View file

@ -0,0 +1,56 @@
.. include:: /_includes/clusters/we_are_hiring.rst
.. _deployment-guide-autoscaler-under-construction:
Autoscaling with Ray
--------------------
Ray is designed to support highly elastic workloads which are most efficient on
an autoscaling cluster. At a high level, the autoscaler attempts to
launch/terminate nodes in order to ensure that workloads have sufficient
resources to run, while minimizing the idle resources.
It does this by taking into consideration:
* User specified hard limits (min/max workers).
* User specified node types (nodes in a Ray cluster do *not* have to be
homogeneous).
* Information from the Ray core's scheduling layer about the current resource
usage/demands of the cluster.
* Programmatic autoscaling hints.
Take a look at :ref:`the cluster reference <cluster-config>` to learn more
about configuring the autoscaler.
How does it work?
^^^^^^^^^^^^^^^^^
The Ray Cluster Launcher will automatically enable a load-based autoscaler. The
autoscaler resource demand scheduler will look at the pending task, actor,
and placement group resource demands from the cluster, and try to add the
minimum list of nodes that can fulfill these demands. The autoscaler uses a simple
bin-packing algorithm to pack the user demands into
the available cluster resources. The remaining unfulfilled demands are placed
on the smallest list of nodes that satisfies the demand while maximizing
utilization (starting from the smallest node).
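For intuition, resource demands typically come from tasks, actors, and placement groups that declare their requirements. The sketch below is illustrative only; the autoscaler reacts to whatever pending demands the scheduler reports:

.. code-block:: python

    import ray

    ray.init(address="auto")

    # Each task requests 2 CPUs. If the running nodes cannot hold all pending
    # tasks, the autoscaler sees the unfulfilled demand and adds nodes
    # (up to the configured max_workers), bin-packing the remaining work.
    @ray.remote(num_cpus=2)
    def crunch(x):
        return x * x

    results = ray.get([crunch.remote(i) for i in range(200)])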
**Downscaling**: When worker nodes are
idle (without active Tasks or Actors running on them)
for more than :ref:`idle_timeout_minutes
<cluster-configuration-idle-timeout-minutes>`, they are subject to
removal from the cluster. But there are two important additional conditions
to note:
* The head node is never removed unless the cluster is torn down.
* If the Ray Object Store is used, and a Worker node still holds objects (including spilled objects on disk), it won't be removed.
**Here is "A Glimpse into the Ray Autoscaler" and how to debug/monitor your cluster:**
2021-19-01 by Ameer Haj-Ali, Anyscale Inc.
.. youtube:: BJ06eJasdu4

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Best practices for deploying large clusters

View file

@ -0,0 +1,129 @@
.. include:: /_includes/clusters/we_are_hiring.rst
Best practices for deploying large clusters
-------------------------------------------
This section aims to document best practices for deploying Ray clusters at
large scale.
Networking configuration
^^^^^^^^^^^^^^^^^^^^^^^^
End users should only need to directly interact with the head node of the
cluster. In particular, there are 2 services which should be exposed to users:
1. The dashboard
2. The Ray client server
.. note::
While users only need 2 ports to connect to a cluster, the nodes within a
cluster require a much wider range of ports to communicate.
See :ref:`Ray port configuration <Ray-ports>` for a comprehensive list.
Applications (such as :ref:`Ray Serve <Rayserve>`) may also require
additional ports to work properly.
System configuration
^^^^^^^^^^^^^^^^^^^^
There are a few system level configurations that should be set when using Ray
at a large scale.
* Make sure ``ulimit -n`` is set to at least 65535. Ray opens many direct
connections between worker processes to avoid bottlenecks, so it can quickly
use a large number of file descriptors.
* Make sure ``/dev/shm`` is sufficiently large. Most ML/RL applications rely
heavily on the plasma store. By default, Ray will try to use ``/dev/shm`` for
the object store, but if it is not large enough (i.e. ``--object-store-memory``
> size of ``/dev/shm``), Ray will write the plasma store to disk instead, which
may cause significant performance problems.
* Use NVMe SSDs (or other high-performance storage) if possible. If
:ref:`object spilling <object-spilling>` is enabled, Ray will spill objects to
disk if necessary. This is most commonly needed for data processing
workloads.
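If you want to sanity-check these settings on a node, a short script using only
the Python standard library can report both values. This is just a sketch, not
something Ray requires you to run:

.. code-block:: python

    import resource
    import shutil

    # File descriptor limit: the soft limit should be at least 65535.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file limit: soft={soft}, hard={hard}")

    # /dev/shm size: should comfortably exceed --object-store-memory.
    shm = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm total size: {shm.total / 1e9:.1f} GB")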
Configuring the head node
^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to the above changes, when deploying a large cluster, Ray's
architecture means that the head node will be under extra stress due to the GCS.
* Make sure the head node has sufficient bandwidth. The most heavily stressed
resource on the head node is outbound bandwidth. For large clusters (see the
scalability envelope), we recommend using machines with networking characteristics
at least as good as an r5dn.16xlarge on AWS EC2.
* Set ``resources: {"CPU": 0}`` on the head node. (For Ray clusters deployed using Helm,
set ``rayResources: {"CPU": 0}``.) Due to the heavy networking
load (and the GCS and dashboard processes), we recommend setting the number of
CPUs to 0 on the head node to avoid scheduling additional tasks on it.
Configuring the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^^^
For large, long running clusters, there are a few parameters that can be tuned.
* Ensure your quotas for node types are set correctly.
* For long running clusters, set the ``AUTOSCALER_MAX_NUM_FAILURES`` environment
variable to a large number (or ``inf``) to avoid unexpected autoscaler
crashes. The variable can be set by prepending \ ``export AUTOSCALER_MAX_NUM_FAILURES=inf;``
to the head node's Ray start command.
(Note: you may want a separate mechanism to detect if the autoscaler
errors too often).
* For large clusters, consider tuning ``upscaling_speed`` for faster
autoscaling.
Picking nodes
^^^^^^^^^^^^^
Here are some tips for how to set your ``available_node_types`` for a cluster,
using AWS instance types as a concrete example.
General recommendations with AWS instance types:
**When to use GPUs**
* If you're using an RL/ML framework
* You're doing something with tensorflow/pytorch/jax (some framework that can
leverage GPUs well)
**What type of GPU?**
* The latest-gen GPU is almost always the best bang for your buck (p3 > p2, g4
> g3); for most well-designed applications the performance outweighs the
price (the instance price may be higher, but you'll use the instance for less
time).
* You may want to consider using older instances if you're doing dev work and
won't actually fully utilize the GPUs, though.
* If you're doing training (ML or RL), you should use a P instance. If you're
doing inference, you should use a G instance. The difference is the
processing:VRAM ratio (training requires more memory).
**What type of CPU?**
* Again, stick to the latest generation; they're typically cheaper and faster.
* When in doubt, use M instances; they typically have the highest
availability.
* If you know your application is memory intensive (memory utilization is full,
but CPU is not), go with an R instance.
* If you know your application is CPU intensive, go with a C instance.
* If you have a big cluster, make the head node an instance with an n (r5dn or
c5n).
**How many CPUs/GPUs?**
* Focus on your CPU:GPU ratio first and look at the utilization (the Ray
dashboard should help with this; see the sketch below). If your CPU utilization
is low, add GPUs, or vice versa.
* The exact ratio will be very dependent on your workload.
* Once you find a good ratio, you should be able to scale up and keep the
same ratio.
* You can't scale infinitely. Eventually, as you add more machines, your
performance improvements will become sub-linear/not worth it. There may not
be a good one-size-fits-all strategy at this point.
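For a rough, programmatic look at the CPU:GPU ratio and current utilization,
you can compare the cluster's total and available resources from a driver.
This is a sketch; the Ray dashboard remains the easier way to watch utilization
over time:

.. code-block:: python

    import ray

    ray.init(address="auto")

    total = ray.cluster_resources()
    avail = ray.available_resources()

    # Print how much of each resource is currently in use.
    for res in ("CPU", "GPU"):
        if res in total:
            used = total[res] - avail.get(res, 0)
            print(f"{res}: {used:.0f} / {total[res]:.0f} in use")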
.. note::
If you're using RLlib, check out :ref:`the RLlib scaling guide
<rllib-scaling-guide>` for RLlib specific recommendations.

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Adding your own cloud provider

View file

@ -0,0 +1,9 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
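For a rough sense of the shape of the interface, here is a hypothetical,
partial sketch of a custom provider. The method names follow the
``NodeProvider`` interface, but the bodies and the imaginary cloud client are
illustrative only:

.. code-block:: python

    from ray.autoscaler.node_provider import NodeProvider


    class MyCloudNodeProvider(NodeProvider):
        """A hypothetical provider backed by an imaginary cloud SDK."""

        def __init__(self, provider_config, cluster_name):
            super().__init__(provider_config, cluster_name)
            self.client = ...  # your cloud SDK client goes here (illustrative)

        def non_terminated_nodes(self, tag_filters):
            # Return IDs of running nodes whose tags match tag_filters.
            ...

        def create_node(self, node_config, tags, count):
            # Launch `count` nodes with the given config and attach `tags`.
            ...

        def terminate_node(self, node_id):
            # Terminate a single node.
            ...

        def node_tags(self, node_id):
            # Return the tags previously attached to this node.
            ...

        def internal_ip(self, node_id):
            # Return the IP that other cluster nodes can use to reach this node.
            ...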

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# AWS

View file

@ -0,0 +1,573 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-aws:
Launching Ray Clusters on AWS
=============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-aws:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-aws:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
.. _aws-cluster-under-construction:
AWS Configurations
==================
.. _aws-cluster-efs-under-construction:
Using Amazon EFS
----------------
To use Amazon EFS, install some utilities and mount the EFS in ``setup_commands``. Note that these instructions only work if you are using the AWS Autoscaler.
.. note::
You need to replace the ``{{FileSystemId}}`` to your own EFS ID before using the config. You may also need to set correct ``SecurityGroupIds`` for the instances in the config file.
.. code-block:: yaml

    setup_commands:
        - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
          sudo pkill -9 apt-get;
          sudo pkill -9 dpkg;
          sudo dpkg --configure -a;
          sudo apt-get -y install binutils;
          cd $HOME;
          git clone https://github.com/aws/efs-utils;
          cd $HOME/efs-utils;
          ./build-deb.sh;
          sudo apt-get -y install ./build/amazon-efs-utils*deb;
          cd $HOME;
          mkdir efs;
          sudo mount -t efs {{FileSystemId}}:/ efs;
          sudo chmod 777 efs;
.. _aws-cluster-s3-under-construction:
Configure worker nodes to access Amazon S3
------------------------------------------
In various scenarios, worker nodes may need write access to the S3 bucket.
E.g. Ray Tune has the option that worker nodes write distributed checkpoints to S3 instead of syncing back to the driver using rsync.
If you see errors like "Unable to locate credentials", make sure that the correct ``IamInstanceProfile`` is configured for worker nodes in ``cluster.yaml`` file.
This may look like:
.. code-block:: text

    worker_nodes:
        InstanceType: m5.xlarge
        ImageId: latest_dlami
        IamInstanceProfile:
            Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE
You can verify that the setup is correct by logging in to a worker node and running
.. code-block:: bash
aws configure list
You should see something like
.. code-block:: text
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************XXXX iam-role
secret_key ****************YYYY iam-role
region <not set> None None
Please refer to `this discussion <https://github.com/ray-project/ray/issues/9327>`__ for more details.
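Alternatively, you can check access programmatically from a Python shell on the
worker node. This is a sketch; it assumes ``boto3`` is installed and that
``your-bucket-name`` is replaced by a bucket the instance profile is expected
to reach:

.. code-block:: python

    import boto3

    s3 = boto3.client("s3")

    # With a correctly attached instance profile this call succeeds without
    # explicit credentials; otherwise it raises a NoCredentialsError or an
    # AccessDenied client error.
    response = s3.list_objects_v2(Bucket="your-bucket-name", MaxKeys=1)
    print(response["ResponseMetadata"]["HTTPStatusCode"])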
.. _aws-cluster-cloudwatch-under-construction:
Using Amazon CloudWatch
=======================
Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed.
AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions.
Please direct any questions, comments, or issues to the `Amazon Ray Team <https://github.com/amzn/amazon-ray/issues/new/choose>`_.
The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `amazon-ray README <https://github.com/amzn/amazon-ray>`_.
.. list-table:: All available unified CloudWatch agent images
* - Base AMI
- AMI ID
- Region
- Unified CloudWatch Agent Version
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-069f2811478f86c20
- us-east-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-058cc0932940c2b8b
- us-east-2
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-044f95c9ef12883ef
- us-west-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-0d88d9cbe28fac870
- us-west-2
- v1.247348.0b251302
.. note::
Using Amazon CloudWatch will incur charges, please refer to `CloudWatch pricing <https://aws.amazon.com/cloudwatch/pricing/>`_ for details.
Getting started
---------------
1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: yaml

    provider:
        type: aws
        region: us-west-2
        availability_zone: us-west-2a
        # Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster.
        cloudwatch:
            agent:
                # Path to Unified CloudWatch Agent config file
                config: "cloudwatch/example-cloudwatch-agent-config.json"
            dashboard:
                # CloudWatch Dashboard name
                name: "example-dashboard-name"
                # Path to the CloudWatch Dashboard config file
                config: "cloudwatch/example-cloudwatch-dashboard-config.json"

    auth:
        ssh_user: ubuntu

    available_node_types:
        ray.head.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870  # Unified CloudWatch agent pre-installed AMI, us-west-2
            resources: {}
        ray.worker.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870  # Unified CloudWatch agent pre-installed AMI, us-west-2
                IamInstanceProfile:
                    Name: ray-autoscaler-cloudwatch-v1
            resources: {}
            min_workers: 0
2. Download CloudWatch Agent and Dashboard config.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``.
Then, download the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_ and `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ config files to the ``cloudwatch`` directory.
.. code-block:: console
$ mkdir cloudwatch
$ cd cloudwatch
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json
3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This will launch your Ray cluster in ``us-west-2`` by default. When launching a cluster for a different region, you'll need to change your cluster config YAML file's ``region`` AND ``ImageId``.
See the "Unified CloudWatch Agent Images" table above for available AMIs by region.
4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console <https://console.aws.amazon.com/cloudwatch/>`_!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can tail all logs written to a CloudWatch log group by ensuring that you have the `AWS CLI V2+ installed <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_ and then running:
.. code-block:: bash
aws logs tail $log_group_name --follow
Advanced Setup
--------------
Refer to `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ for a complete example.
1. Choose an AMI with the Unified CloudWatch Agent pre-installed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ensure that you're launching your Ray EC2 cluster in the same region as the AMI,
then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file.
The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``:
.. code-block:: bash
aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1
.. code-block:: yaml

    available_node_types:
        ray.head.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870
        ray.worker.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870
To build your own AMI with the Unified CloudWatch Agent installed:
1. Follow the `CloudWatch Agent Installation <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html>`_ user guide to install the Unified CloudWatch Agent on an EC2 instance.
2. Follow the `EC2 AMI Creation <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami>`_ user guide to create an AMI from this EC2 instance.
2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can start by using the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_, `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ and `CloudWatch Alarm <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-alarm-config.json>`_ config files.
These example config files include the following features:
**Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group,
and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group.
Log streams are named after the EC2 instance ID that emitted their logs.
Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace.
**Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory.
Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages.
**Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set,
and must be manually provided in your alarm config file.
For more advanced options, see the `Agent <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_, `Dashboard <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html>`_ and `Alarm <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html>`_ config user guides.
CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables:
``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster.
``{region}``: Replaced with your Ray cluster's region.
``{cluster_name}``: Replaced with your Ray cluster name.
See CloudWatch Agent `Configuration File Details <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_ for additional variables supported natively by the Unified CloudWatch Agent.
.. note::
Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file!
.. code-block:: json

    "AlarmActions":[
        "TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"
    ]
3. Reference your CloudWatch JSON config files in your cluster config YAML.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from:
.. code-block:: yaml

    provider:
        cloudwatch:
            agent:
                config: "cloudwatch/example-cloudwatch-agent-config.json"
4. Set your IAM Role and EC2 Instance Profile.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile are created at Ray cluster launch time.
This role contains all additional permissions required to integrate CloudWatch with Ray, namely the ``CloudWatchAgentAdminPolicy``, ``AmazonSSMManagedInstanceCore``, ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` managed policies.
Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML:
.. code-block:: yaml

    ray.worker.default:
        node_config:
            InstanceType: c5a.large
            IamInstanceProfile:
                Name: ray-autoscaler-cloudwatch-v1
5. Export Ray system metrics to CloudWatch.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the
Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in the `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ file.
You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace.
.. code-block:: yaml
head_setup_commands:
# Make `ray_prometheus_waiter.sh` executable.
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
# Copy `prometheus.yml` to Unified CloudWatch Agent folder
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc
# First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store.
- >-
nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
`cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'`
>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" &
6. Update CloudWatch Agent, Dashboard and Alarm config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``.
The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied.
What's Next?
============
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
====================
.. include:: /_includes/_help.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Azure

View file

@ -0,0 +1,257 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-azure:
Launching Ray Clusters on Azure
===============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-azure:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-azure:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# GCP

View file

@ -0,0 +1,257 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-gcp:
Launching Ray Clusters on GCP
=============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-gcp:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-gcp:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Launching a Ray Cluster on Cloud VMs

View file

@ -0,0 +1,12 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. toctree::
:maxdepth: 2
aws.rst
gcp.rst
azure.rst
add-your-own-cloud-provider.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Manual cluster setup

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Monitoring and Observing a Ray Cluster

View file

@ -0,0 +1,56 @@
.. include:: /_includes/clusters/we_are_hiring.rst
Monitoring and observability
----------------------------
Ray comes with 3 main observability features:
1. :ref:`The dashboard <Ray-dashboard>`
2. :ref:`ray status <monitor-cluster>`
3. :ref:`Prometheus metrics <multi-node-metrics>`
Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:ref:`The dashboard provides detailed information about the state of the cluster <Ray-dashboard>`,
including the running jobs, actors, workers, nodes, etc.
By default, the cluster launcher and operator will launch the dashboard, but
not publicly expose it.
If you launch your application via the cluster launcher, you can securely
port-forward local traffic to the dashboard via the ``ray dashboard`` command
(which establishes an SSH tunnel). The dashboard will then be visible at
``http://localhost:8265``.
The Kubernetes Operator makes the dashboard available via a Service targeting the Ray head pod.
You can :ref:`access the dashboard <ray-k8s-dashboard>` using ``kubectl port-forward``.
Observing the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^
The autoscaler makes decisions based on scheduling information and programmatic
information from the cluster. This information, along with the status of
starting nodes, can be accessed via the ``ray status`` command.
To dump the current state of a cluster launched via the cluster launcher, you
can run ``ray exec cluster.yaml "ray status"``.
For a more "live" monitoring experience, it is recommended that you run ``ray
status`` in a watch loop: ``ray exec cluster.yaml "watch -n 1 ray status"``.
With the Kubernetes operator, you should replace ``ray exec cluster.yaml`` with
``kubectl exec <head node pod>``.
Prometheus metrics
^^^^^^^^^^^^^^^^^^
Ray is capable of producing Prometheus metrics. When enabled, Ray produces some
metrics about Ray core and some internal metrics by default. It also
supports custom, user-defined metrics.
These metrics can be consumed by any metrics infrastructure that can ingest
metrics from the Prometheus server on the head node of the cluster.
:ref:`Learn more about setting up prometheus here. <multi-node-metrics>`
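As a small illustration of the custom metrics mentioned above, the following sketch defines a counter inside an actor using the ``ray.util.metrics`` API (assuming it is available in your Ray version); once metrics export is enabled, the counter is scraped along with Ray's built-in metrics.

.. code-block:: python

    import ray
    from ray.util.metrics import Counter

    ray.init()

    @ray.remote
    class RequestHandler:
        def __init__(self):
            # A custom, user-defined metric exported alongside Ray's built-in metrics.
            self.num_requests = Counter(
                "num_requests",
                description="Number of requests handled by this actor.",
            )

        def handle(self):
            self.num_requests.inc()

    handler = RequestHandler.remote()
    ray.get(handler.handle.remote())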
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Running jobs
View file
@ -0,0 +1,25 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-deployment-guide-under-construction:
Deployment Guide
================
This section explains how to set up a distributed Ray cluster and run your workloads on it.
To set up your cluster, check out the :ref:`Ray Cluster Overview <cluster-index>`, or jump to the :ref:`Ray Cluster Quick Start <ref-cluster-quick-start>`.
To trigger a Ray workload from your local machine, a CI system, or a third-party job scheduler/orchestrator via a command line interface or API call, try :ref:`Ray Job Submission <jobs-overview>`.
To run an interactive Ray workload and see the output in real time in a client of your choice (e.g. your local machine, SageMaker Studio, or Google Colab), you can use :ref:`Ray Client <ray-client>`.
.. toctree::
:maxdepth: 2
job-submission-cli.rst
job-submission-rest.rst
job-submission-sdk.rst
ray-client.rst
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Submit jobs via the CLI
View file
@ -0,0 +1,385 @@
.. warning::
This page is under construction!
.. _jobs-overview-under-construction-cli:
==================
Ray Job Submission
==================
.. note::
This component is in **beta**. APIs may change before becoming stable. This feature requires a full installation of Ray using ``pip install "ray[default]"``.
Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
Jump to the :ref:`API Reference<ray-job-submission-api-ref>`, or continue reading for a quick overview.
Concepts
--------
- **Job**: A Ray application submitted to a Ray cluster for execution. Consists of (1) an entrypoint command and (2) a :ref:`runtime environment<runtime-environments>`, which may contain file and package dependencies.
- **Job Lifecycle**: When a job is submitted, it runs once to completion or failure. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated.
- **Job Manager**: An entity external to the Ray cluster that manages the lifecycle of a job (scheduling, killing, polling status, getting logs, and persisting inputs/outputs), and potentially also manages the lifecycle of Ray clusters. Can be any third-party framework with these abilities, such as Apache Airflow or Kubernetes Jobs.
Quick Start Example
-------------------
Let's start with a sample job that can be run locally. The following script uses Ray APIs to increment a counter and print its value, and print the version of the ``requests`` module it's using:
.. code-block:: python
# script.py
import ray
import requests
ray.init()
@ray.remote
class Counter:
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
def get_counter(self):
return self.counter
counter = Counter.remote()
for _ in range(5):
ray.get(counter.inc.remote())
print(ray.get(counter.get_counter.remote()))
print(requests.__version__)
Put this file in a local directory of your choice, with filename ``script.py``, so your working directory will look like:
.. code-block:: bash
| your_working_directory ("./")
| ├── script.py
Next, start a local Ray cluster:
.. code-block:: bash
ray start --head
Local node IP: 127.0.0.1
INFO services.py:1360 -- View the Ray dashboard at http://127.0.0.1:8265
Note the address and port returned in the terminal---this will be where we submit job requests to, as explained further in the examples below. If you do not see this, ensure the Ray Dashboard is installed by running :code:`pip install "ray[default]"`.
At this point, the job is ready to be submitted by one of the :ref:`Ray Job APIs<ray-job-apis>`.
Continue on to see examples of running and interacting with this sample job.
.. _ray-job-apis-under-construction-cli:
Ray Job Submission APIs
-----------------------
Ray provides three APIs for job submission:
* A :ref:`command line interface<ray-job-cli>`, the easiest way to get started.
* A :ref:`Python SDK<ray-job-sdk>`, the recommended way to submit jobs programmatically.
* An :ref:`HTTP REST API<ray-job-rest-api>`. Both the CLI and SDK call into the REST API under the hood.
All three APIs for job submission share the following key inputs:
* **Entrypoint**: The shell command to run the job.
* Example: :code:`python my_ray_script.py`
* Example: :code:`echo hello`
* **Runtime Environment**: Specifies files, packages, and other dependencies for your job. See :ref:`Runtime Environments<runtime-environments>` for details.
* Example: ``{working_dir="/data/my_files", pip=["requests", "pendulum==2.1.2"]}``
* Of special note: the field :code:`working_dir` specifies the files your job needs to run. The entrypoint command will be run in the remote cluster's copy of the `working_dir`, so for the entrypoint ``python my_ray_script.py``, the file ``my_ray_script.py`` must be in the directory specified by ``working_dir``.
* If :code:`working_dir` is a local directory: It will be automatically zipped and uploaded to the target Ray cluster, then unpacked to where your submitted application runs. This option has a size limit of 100 MB and is recommended for rapid iteration and experimentation.
* If :code:`working_dir` is a remote URI hosted on S3, GitHub or others: It will be downloaded and unpacked to where your submitted application runs. This option has no size limit and is recommended for production use. For details, see :ref:`remote-uris`.
.. _ray-job-cli-under-construction-cli:
CLI
^^^
The easiest way to get started with Ray job submission is to use the Job Submission CLI.
Jump to the :ref:`API Reference<ray-job-submission-cli-ref>`, or continue reading for a walkthrough.
Using the CLI on a local cluster
""""""""""""""""""""""""""""""""
First, start a local Ray cluster (e.g. with ``ray start --head``) and open a terminal (on the head node, which is your local machine).
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
This tells the jobs CLI how to find your Ray cluster. Here we are specifying port ``8265`` on the head node, the port that the Ray Dashboard listens on.
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you are ready to use the CLI.
Here are some examples of CLI commands from the Quick Start example and their output:
.. code-block::
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py
2021-12-01 23:04:52,672 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:04:52,809 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip.
2021-12-01 23:04:52,810 INFO packaging.py:352 -- Creating a file package for local directory './'.
2021-12-01 23:04:52,878 INFO cli.py:105 -- Job submitted successfully: raysubmit_RXhvSyEPbxhcXtm6.
2021-12-01 23:04:52,878 INFO cli.py:106 -- Query the status of the job using: `ray job status raysubmit_RXhvSyEPbxhcXtm6`.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:00,356 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:00,371 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': PENDING.
2021-12-01 23:05:00,371 INFO cli.py:129 -- Job has not started yet, likely waiting for the runtime_env to be set up.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:37,751 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:37,764 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': SUCCEEDED.
2021-12-01 23:05:37,764 INFO cli.py:129 -- Job finished successfully.
ray job logs raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:59,026 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:23,037 INFO worker.py:851 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
(pid=runtime_env) 2021-12-01 23:05:23,212 WARNING conda.py:54 -- Injecting /Users/jiaodong/Workspace/ray/python to environment /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5 because _inject_current_ray flag is on.
(pid=runtime_env) 2021-12-01 23:05:23,212 INFO conda.py:328 -- Finished setting up runtime environment at /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5
(pid=runtime_env) 2021-12-01 23:05:23,213 INFO working_dir.py:85 -- Setup working dir for gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip
1
2
3
4
5
2.26.0
ray job list
{'raysubmit_AYhLMgDJ6XBQFvFP': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908622, end_time=1645908623, metadata={}, runtime_env={}),
'raysubmit_su9UcdUviUZ86b1t': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908669, end_time=1645908670, metadata={}, runtime_env={})}
.. warning::
When using the CLI, do not wrap the entrypoint command in quotes. For example, use
``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``.
Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``.
.. tip::
If your job is stuck in `PENDING`, the runtime environment installation may be stuck.
(For example, the `pip` installation or `working_dir` download may be stalled due to internet issues.)
You can check the installation logs at `/tmp/ray/session_latest/logs/runtime_env_setup-*.log` for details.
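For example, you could follow those logs on the head node while the environment is being installed (the exact file name varies per job):

.. code-block:: bash

    tail -f /tmp/ray/session_latest/logs/runtime_env_setup-*.log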
Using the CLI on a remote cluster
"""""""""""""""""""""""""""""""""
Above, we ran the "Quick Start" example on a local Ray cluster. When connecting to a `remote` cluster via the CLI, you need to be able to access the Ray Dashboard port of the cluster over HTTP.
One way to do this is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node.
If you started your remote cluster with the :ref:`Ray Cluster Launcher <ref-cluster-quick-start>`, then the port forwarding can be set up automatically using the ``ray dashboard`` command (see :ref:`monitor-cluster` for details).
To use this, run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster:
.. code-block:: bash
ray dashboard cluster.yaml
Once this is running, check that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``.
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you will be able to use the Jobs CLI on your local machine as in the example above to interact with your remote Ray cluster.
Using the CLI on Kubernetes
"""""""""""""""""""""""""""
The instructions above still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/
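For example, if the Ray head node runs in a pod named ``ray-head`` (a hypothetical name; substitute your own head pod or service), the port forwarding could look like this:

.. code-block:: bash

    # Forward local port 8265 to the Ray Dashboard port on the head pod.
    kubectl port-forward ray-head 8265:8265

    # In another terminal, submit jobs as before.
    export RAY_ADDRESS="http://127.0.0.1:8265"
    ray job submit -- echo hello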
.. _ray-job-sdk-under-construction-cli:
Python SDK
^^^^^^^^^^
The Job Submission Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference<ray-job-submission-sdk-ref>`, or continue reading for a quick overview.
SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses.
.. code-block:: python
from ray.job_submission import JobSubmissionClient
# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
Then we can submit our application to the Ray cluster via the Job SDK.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python script.py",
# Runtime environment for the job, specifying a working directory and pip package
runtime_env={
"working_dir": "./",
"pip": ["requests==2.26.0"]
}
)
.. tip::
By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`.
In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster.
Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``), and gets the logs at the end.
We expect to see the numbers printed from our actor, as well as the correct version of the :code:`requests` module specified in the ``runtime_env``.
.. code-block:: python
from ray.job_submission import JobStatus
import time
def wait_until_finish(job_id):
start = time.time()
timeout = 5
while time.time() - start <= timeout:
status = client.get_job_status(job_id)
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
The output should be as follows:
.. code-block:: bash
status: JobStatus.PENDING
status: JobStatus.RUNNING
status: JobStatus.SUCCEEDED
1
2
3
4
5
2.26.0
.. tip::
Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details.
A submitted job can be stopped by the user before it finishes executing.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python -c 'import time; time.sleep(60)'",
runtime_env={}
)
wait_until_finish(job_id)
client.stop_job(job_id)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information.
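For example, using the return type described above, you could print the status of every job as follows:

.. code-block:: python

    jobs = client.list_jobs()
    for job_id, info in jobs.items():
        print(job_id, info.status)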
For full details, see the :ref:`API Reference<ray-job-submission-sdk-ref>`.
.. _ray-job-rest-api-under-construction-cli:
REST API
^^^^^^^^
Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed:
**Submit Job**
.. code-block:: python
import requests
import json
import time
from ray.job_submission import JobStatus  # used below when polling for a terminal job state
resp = requests.post(
"http://127.0.0.1:8265/api/jobs/",
json={
"entrypoint": "echo hello",
"runtime_env": {},
"job_id": None,
"metadata": {"job_submission_id": "123"}
}
)
rst = json.loads(resp.text)
job_id = rst["job_id"]
**Query and poll for Job status**
.. code-block:: python
start = time.time()
while time.time() - start <= 10:
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>"
)
rst = json.loads(resp.text)
status = rst["status"]
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
**Query for logs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>/logs"
)
rst = json.loads(resp.text)
logs = rst["logs"]
**List all jobs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/"
)
print(resp.json())
# {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...}
Job Submission Architecture
----------------------------
The following diagram shows the underlying structure and steps for each submitted job.
.. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/job/job_submission_arch_v2.png
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Submit jobs via the REST API
View file
@ -0,0 +1,385 @@
.. warning::
This page is under construction!
.. _jobs-overview-under-construction-rest:
==================
Ray Job Submission
==================
.. note::
This component is in **beta**. APIs may change before becoming stable. This feature requires a full installation of Ray using ``pip install "ray[default]"``.
Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
Jump to the :ref:`API Reference<ray-job-submission-api-ref>`, or continue reading for a quick overview.
Concepts
--------
- **Job**: A Ray application submitted to a Ray cluster for execution. Consists of (1) an entrypoint command and (2) a :ref:`runtime environment<runtime-environments>`, which may contain file and package dependencies.
- **Job Lifecycle**: When a job is submitted, it runs once to completion or failure. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated.
- **Job Manager**: An entity external to the Ray cluster that manages the lifecycle of a job (scheduling, killing, polling status, getting logs, and persisting inputs/outputs), and potentially also manages the lifecycle of Ray clusters. Can be any third-party framework with these abilities, such as Apache Airflow or Kubernetes Jobs.
Quick Start Example
-------------------
Let's start with a sample job that can be run locally. The following script uses Ray APIs to increment a counter and print its value, and print the version of the ``requests`` module it's using:
.. code-block:: python
# script.py
import ray
import requests
ray.init()
@ray.remote
class Counter:
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
def get_counter(self):
return self.counter
counter = Counter.remote()
for _ in range(5):
ray.get(counter.inc.remote())
print(ray.get(counter.get_counter.remote()))
print(requests.__version__)
Put this file in a local directory of your choice, with filename ``script.py``, so your working directory will look like:
.. code-block:: bash
| your_working_directory ("./")
| ├── script.py
Next, start a local Ray cluster:
.. code-block:: bash
ray start --head
Local node IP: 127.0.0.1
INFO services.py:1360 -- View the Ray dashboard at http://127.0.0.1:8265
Note the address and port returned in the terminal---this will be where we submit job requests to, as explained further in the examples below. If you do not see this, ensure the Ray Dashboard is installed by running :code:`pip install "ray[default]"`.
At this point, the job is ready to be submitted by one of the :ref:`Ray Job APIs<ray-job-apis>`.
Continue on to see examples of running and interacting with this sample job.
.. _ray-job-apis-under-construction-rest:
Ray Job Submission APIs
-----------------------
Ray provides three APIs for job submission:
* A :ref:`command line interface<ray-job-cli>`, the easiest way to get started.
* A :ref:`Python SDK<ray-job-sdk>`, the recommended way to submit jobs programmatically.
* An :ref:`HTTP REST API<ray-job-rest-api>`. Both the CLI and SDK call into the REST API under the hood.
All three APIs for job submission share the following key inputs:
* **Entrypoint**: The shell command to run the job.
* Example: :code:`python my_ray_script.py`
* Example: :code:`echo hello`
* **Runtime Environment**: Specifies files, packages, and other dependencies for your job. See :ref:`Runtime Environments<runtime-environments>` for details.
* Example: ``{working_dir="/data/my_files", pip=["requests", "pendulum==2.1.2"]}``
* Of special note: the field :code:`working_dir` specifies the files your job needs to run. The entrypoint command will be run in the remote cluster's copy of the `working_dir`, so for the entrypoint ``python my_ray_script.py``, the file ``my_ray_script.py`` must be in the directory specified by ``working_dir``.
* If :code:`working_dir` is a local directory: It will be automatically zipped and uploaded to the target Ray cluster, then unpacked to where your submitted application runs. This option has a size limit of 100 MB and is recommended for rapid iteration and experimentation.
* If :code:`working_dir` is a remote URI hosted on S3, GitHub or others: It will be downloaded and unpacked to where your submitted application runs. This option has no size limit and is recommended for production use. For details, see :ref:`remote-uris`.
.. _ray-job-cli-under-construction-rest:
CLI
^^^
The easiest way to get started with Ray job submission is to use the Job Submission CLI.
Jump to the :ref:`API Reference<ray-job-submission-cli-ref>`, or continue reading for a walkthrough.
Using the CLI on a local cluster
""""""""""""""""""""""""""""""""
First, start a local Ray cluster (e.g. with ``ray start --head``) and open a terminal (on the head node, which is your local machine).
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
This tells the jobs CLI how to find your Ray cluster. Here we are specifying port ``8265`` on the head node, the port that the Ray Dashboard listens on.
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you are ready to use the CLI.
Here are some examples of CLI commands from the Quick Start example and their output:
.. code-block::
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py
2021-12-01 23:04:52,672 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:04:52,809 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip.
2021-12-01 23:04:52,810 INFO packaging.py:352 -- Creating a file package for local directory './'.
2021-12-01 23:04:52,878 INFO cli.py:105 -- Job submitted successfully: raysubmit_RXhvSyEPbxhcXtm6.
2021-12-01 23:04:52,878 INFO cli.py:106 -- Query the status of the job using: `ray job status raysubmit_RXhvSyEPbxhcXtm6`.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:00,356 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:00,371 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': PENDING.
2021-12-01 23:05:00,371 INFO cli.py:129 -- Job has not started yet, likely waiting for the runtime_env to be set up.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:37,751 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:37,764 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': SUCCEEDED.
2021-12-01 23:05:37,764 INFO cli.py:129 -- Job finished successfully.
ray job logs raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:59,026 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:23,037 INFO worker.py:851 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
(pid=runtime_env) 2021-12-01 23:05:23,212 WARNING conda.py:54 -- Injecting /Users/jiaodong/Workspace/ray/python to environment /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5 because _inject_current_ray flag is on.
(pid=runtime_env) 2021-12-01 23:05:23,212 INFO conda.py:328 -- Finished setting up runtime environment at /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5
(pid=runtime_env) 2021-12-01 23:05:23,213 INFO working_dir.py:85 -- Setup working dir for gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip
1
2
3
4
5
2.26.0
ray job list
{'raysubmit_AYhLMgDJ6XBQFvFP': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908622, end_time=1645908623, metadata={}, runtime_env={}),
'raysubmit_su9UcdUviUZ86b1t': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908669, end_time=1645908670, metadata={}, runtime_env={})}
.. warning::
When using the CLI, do not wrap the entrypoint command in quotes. For example, use
``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``.
Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``.
.. tip::
If your job is stuck in `PENDING`, the runtime environment installation may be stuck.
(For example, the `pip` installation or `working_dir` download may be stalled due to internet issues.)
You can check the installation logs at `/tmp/ray/session_latest/logs/runtime_env_setup-*.log` for details.
Using the CLI on a remote cluster
"""""""""""""""""""""""""""""""""
Above, we ran the "Quick Start" example on a local Ray cluster. When connecting to a `remote` cluster via the CLI, you need to be able to access the Ray Dashboard port of the cluster over HTTP.
One way to do this is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node.
If you started your remote cluster with the :ref:`Ray Cluster Launcher <ref-cluster-quick-start>`, then the port forwarding can be set up automatically using the ``ray dashboard`` command (see :ref:`monitor-cluster` for details).
To use this, run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster:
.. code-block:: bash
ray dashboard cluster.yaml
Once this is running, check that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``.
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you will be able to use the Jobs CLI on your local machine as in the example above to interact with your remote Ray cluster.
Using the CLI on Kubernetes
"""""""""""""""""""""""""""
The instructions above still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/
.. _ray-job-sdk-under-construction-rest:
Python SDK
^^^^^^^^^^
The Job Submission Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference<ray-job-submission-sdk-ref>`, or continue reading for a quick overview.
SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses.
.. code-block:: python
from ray.job_submission import JobSubmissionClient
# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
Then we can submit our application to the Ray cluster via the Job SDK.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python script.py",
# Runtime environment for the job, specifying a working directory and pip package
runtime_env={
"working_dir": "./",
"pip": ["requests==2.26.0"]
}
)
.. tip::
By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`.
In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster.
Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``), and gets the logs at the end.
We expect to see the numbers printed from our actor, as well as the correct version of the :code:`requests` module specified in the ``runtime_env``.
.. code-block:: python
from ray.job_submission import JobStatus
import time
def wait_until_finish(job_id):
start = time.time()
timeout = 5
while time.time() - start <= timeout:
status = client.get_job_status(job_id)
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
The output should be as follows:
.. code-block:: bash
status: JobStatus.PENDING
status: JobStatus.RUNNING
status: JobStatus.SUCCEEDED
1
2
3
4
5
2.26.0
.. tip::
Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details.
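For instance, a working directory hosted on S3 could be passed as follows. The bucket and archive names here are hypothetical, and remote ``working_dir`` URIs are expected to point to a zip archive (see :ref:`remote-uris`):

.. code-block:: python

    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={
            # Hypothetical S3 URI; replace with an archive you control.
            "working_dir": "s3://my-bucket/my_job_files.zip",
        },
    )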
A submitted job can be stopped by the user before it finishes executing.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python -c 'import time; time.sleep(60)'",
runtime_env={}
)
wait_until_finish(job_id)
client.stop_job(job_id)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information.
For full details, see the :ref:`API Reference<ray-job-submission-sdk-ref>`.
.. _ray-job-rest-api-under-construction-rest:
REST API
^^^^^^^^
Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed:
**Submit Job**
.. code-block:: python
import requests
import json
import time
from ray.job_submission import JobStatus  # used below when polling for a terminal job state
resp = requests.post(
"http://127.0.0.1:8265/api/jobs/",
json={
"entrypoint": "echo hello",
"runtime_env": {},
"job_id": None,
"metadata": {"job_submission_id": "123"}
}
)
rst = json.loads(resp.text)
job_id = rst["job_id"]
**Query and poll for Job status**
.. code-block:: python
start = time.time()
while time.time() - start <= 10:
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>"
)
rst = json.loads(resp.text)
status = rst["status"]
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
**Query for logs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>/logs"
)
rst = json.loads(resp.text)
logs = rst["logs"]
**List all jobs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/"
)
print(resp.json())
# {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...}
Job Submission Architecture
----------------------------
The following diagram shows the underlying structure and steps for each submitted job.
.. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/job/job_submission_arch_v2.png
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Submit jobs via the SDK
View file
@ -0,0 +1,385 @@
.. warning::
This page is under construction!
.. _jobs-overview-under-construction-sdk:
==================
Ray Job Submission
==================
.. note::
This component is in **beta**. APIs may change before becoming stable. This feature requires a full installation of Ray using ``pip install "ray[default]"``.
Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
Jump to the :ref:`API Reference<ray-job-submission-api-ref>`, or continue reading for a quick overview.
Concepts
--------
- **Job**: A Ray application submitted to a Ray cluster for execution. Consists of (1) an entrypoint command and (2) a :ref:`runtime environment<runtime-environments>`, which may contain file and package dependencies.
- **Job Lifecycle**: When a job is submitted, it runs once to completion or failure. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated.
- **Job Manager**: An entity external to the Ray cluster that manages the lifecycle of a job (scheduling, killing, polling status, getting logs, and persisting inputs/outputs), and potentially also manages the lifecycle of Ray clusters. Can be any third-party framework with these abilities, such as Apache Airflow or Kubernetes Jobs.
Quick Start Example
-------------------
Let's start with a sample job that can be run locally. The following script uses Ray APIs to increment a counter and print its value, and print the version of the ``requests`` module it's using:
.. code-block:: python
# script.py
import ray
import requests
ray.init()
@ray.remote
class Counter:
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
def get_counter(self):
return self.counter
counter = Counter.remote()
for _ in range(5):
ray.get(counter.inc.remote())
print(ray.get(counter.get_counter.remote()))
print(requests.__version__)
Put this file in a local directory of your choice, with filename ``script.py``, so your working directory will look like:
.. code-block:: bash
| your_working_directory ("./")
| ├── script.py
Next, start a local Ray cluster:
.. code-block:: bash
ray start --head
Local node IP: 127.0.0.1
INFO services.py:1360 -- View the Ray dashboard at http://127.0.0.1:8265
Note the address and port returned in the terminal---this will be where we submit job requests to, as explained further in the examples below. If you do not see this, ensure the Ray Dashboard is installed by running :code:`pip install "ray[default]"`.
At this point, the job is ready to be submitted by one of the :ref:`Ray Job APIs<ray-job-apis>`.
Continue on to see examples of running and interacting with this sample job.
.. _ray-job-apis-under-construction-sdk:
Ray Job Submission APIs
-----------------------
Ray provides three APIs for job submission:
* A :ref:`command line interface<ray-job-cli>`, the easiest way to get started.
* A :ref:`Python SDK<ray-job-sdk>`, the recommended way to submit jobs programmatically.
* An :ref:`HTTP REST API<ray-job-rest-api>`. Both the CLI and SDK call into the REST API under the hood.
All three APIs for job submission share the following key inputs:
* **Entrypoint**: The shell command to run the job.
* Example: :code:`python my_ray_script.py`
* Example: :code:`echo hello`
* **Runtime Environment**: Specifies files, packages, and other dependencies for your job. See :ref:`Runtime Environments<runtime-environments>` for details.
* Example: ``{working_dir="/data/my_files", pip=["requests", "pendulum==2.1.2"]}``
* Of special note: the field :code:`working_dir` specifies the files your job needs to run. The entrypoint command will be run in the remote cluster's copy of the `working_dir`, so for the entrypoint ``python my_ray_script.py``, the file ``my_ray_script.py`` must be in the directory specified by ``working_dir``.
* If :code:`working_dir` is a local directory: It will be automatically zipped and uploaded to the target Ray cluster, then unpacked to where your submitted application runs. This option has a size limit of 100 MB and is recommended for rapid iteration and experimentation.
* If :code:`working_dir` is a remote URI hosted on S3, GitHub or others: It will be downloaded and unpacked to where your submitted application runs. This option has no size limit and is recommended for production use. For details, see :ref:`remote-uris`.
.. _ray-job-cli-under-construction-sdk:
CLI
^^^
The easiest way to get started with Ray job submission is to use the Job Submission CLI.
Jump to the :ref:`API Reference<ray-job-submission-cli-ref>`, or continue reading for a walkthrough.
Using the CLI on a local cluster
""""""""""""""""""""""""""""""""
First, start a local Ray cluster (e.g. with ``ray start --head``) and open a terminal (on the head node, which is your local machine).
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
This tells the jobs CLI how to find your Ray cluster. Here we are specifying port ``8265`` on the head node, the port that the Ray Dashboard listens on.
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you are ready to use the CLI.
Here are some examples of CLI commands from the Quick Start example and their output:
.. code-block::
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py
2021-12-01 23:04:52,672 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:04:52,809 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip.
2021-12-01 23:04:52,810 INFO packaging.py:352 -- Creating a file package for local directory './'.
2021-12-01 23:04:52,878 INFO cli.py:105 -- Job submitted successfully: raysubmit_RXhvSyEPbxhcXtm6.
2021-12-01 23:04:52,878 INFO cli.py:106 -- Query the status of the job using: `ray job status raysubmit_RXhvSyEPbxhcXtm6`.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:00,356 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:00,371 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': PENDING.
2021-12-01 23:05:00,371 INFO cli.py:129 -- Job has not started yet, likely waiting for the runtime_env to be set up.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:37,751 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:37,764 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': SUCCEEDED.
2021-12-01 23:05:37,764 INFO cli.py:129 -- Job finished successfully.
ray job logs raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:59,026 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:23,037 INFO worker.py:851 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
(pid=runtime_env) 2021-12-01 23:05:23,212 WARNING conda.py:54 -- Injecting /Users/jiaodong/Workspace/ray/python to environment /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5 because _inject_current_ray flag is on.
(pid=runtime_env) 2021-12-01 23:05:23,212 INFO conda.py:328 -- Finished setting up runtime environment at /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5
(pid=runtime_env) 2021-12-01 23:05:23,213 INFO working_dir.py:85 -- Setup working dir for gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip
1
2
3
4
5
2.26.0
ray job list
{'raysubmit_AYhLMgDJ6XBQFvFP': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908622, end_time=1645908623, metadata={}, runtime_env={}),
'raysubmit_su9UcdUviUZ86b1t': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908669, end_time=1645908670, metadata={}, runtime_env={})}
.. warning::
When using the CLI, do not wrap the entrypoint command in quotes. For example, use
``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``.
Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``.
.. tip::
If your job is stuck in `PENDING`, the runtime environment installation may be stuck.
(For example, the `pip` installation or `working_dir` download may be stalled due to internet issues.)
You can check the installation logs at `/tmp/ray/session_latest/logs/runtime_env_setup-*.log` for details.
Using the CLI on a remote cluster
"""""""""""""""""""""""""""""""""
Above, we ran the "Quick Start" example on a local Ray cluster. When connecting to a `remote` cluster via the CLI, you need to be able to access the Ray Dashboard port of the cluster over HTTP.
One way to do this is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node.
If you started your remote cluster with the :ref:`Ray Cluster Launcher <ref-cluster-quick-start>`, then the port forwarding can be set up automatically using the ``ray dashboard`` command (see :ref:`monitor-cluster` for details).
To use this, run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster:
.. code-block:: bash
ray dashboard cluster.yaml
Once this is running, check that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``.
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you will be able to use the Jobs CLI on your local machine as in the example above to interact with your remote Ray cluster.
Using the CLI on Kubernetes
"""""""""""""""""""""""""""
The instructions above still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/
.. _ray-job-sdk-under-construction-sdk:
Python SDK
^^^^^^^^^^
The Job Submission Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference<ray-job-submission-sdk-ref>`, or continue reading for a quick overview.
SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses.
.. code-block:: python
from ray.job_submission import JobSubmissionClient
# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
Then we can submit our application to the Ray cluster via the Job SDK.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python script.py",
# Runtime environment for the job, specifying a working directory and pip package
runtime_env={
"working_dir": "./",
"pip": ["requests==2.26.0"]
}
)
.. tip::
By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`.
In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster.
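A minimal sketch of passing your own id (the id string here is arbitrary):

.. code-block:: python

    job_id = client.submit_job(
        entrypoint="echo hello",
        # Submitting another job with the same id to this cluster raises an error.
        job_id="my_experiment_001",
    )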
Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``), and gets the logs at the end.
We expect to see the numbers printed from our actor, as well as the correct version of the :code:`requests` module specified in the ``runtime_env``.
.. code-block:: python
from ray.job_submission import JobStatus
import time
def wait_until_finish(job_id):
start = time.time()
timeout = 5
while time.time() - start <= timeout:
status = client.get_job_status(job_id)
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
The output should be as follows:
.. code-block:: bash
status: JobStatus.PENDING
status: JobStatus.RUNNING
status: JobStatus.SUCCEEDED
1
2
3
4
5
2.26.0
.. tip::
Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details.
A submitted job can be stopped by the user before it finishes executing.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python -c 'import time; time.sleep(60)'",
runtime_env={}
)
wait_until_finish(job_id)
client.stop_job(job_id)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information.
For full details, see the :ref:`API Reference<ray-job-submission-sdk-ref>`.
.. _ray-job-rest-api-under-construction-sdk:
REST API
^^^^^^^^
Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed:
**Submit Job**
.. code-block:: python
import requests
import json
import time
from ray.job_submission import JobStatus  # used below when polling for a terminal job state
resp = requests.post(
"http://127.0.0.1:8265/api/jobs/",
json={
"entrypoint": "echo hello",
"runtime_env": {},
"job_id": None,
"metadata": {"job_submission_id": "123"}
}
)
rst = json.loads(resp.text)
job_id = rst["job_id"]
**Query and poll for Job status**
.. code-block:: python
start = time.time()
while time.time() - start <= 10:
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>"
)
rst = json.loads(resp.text)
status = rst["status"]
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
**Query for logs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>/logs"
)
rst = json.loads(resp.text)
logs = rst["logs"]
**List all jobs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/"
)
print(resp.json())
# {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...}
Job Submission Architecture
----------------------------
The following diagram shows the underlying structure and steps for each submitted job.
.. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/job/job_submission_arch_v2.png
View file
@ -1,6 +0,0 @@
:::{warning}
This page is under construction!
:::
# Interacting with the cluster via the Ray Client
## When to use
## How to use
View file
@ -0,0 +1,283 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-client-under-construction:
Ray Client: Interactive Development
===================================
**What is the Ray Client?**
The Ray Client is an API that connects a Python script to a **remote** Ray cluster. Effectively, it allows you to leverage a remote Ray cluster just like you would with Ray running on your local machine.
By changing ``ray.init()`` to ``ray.init("ray://<head_node_host>:<port>")``, you can connect from your laptop (or anywhere) directly to a remote cluster and scale out your Ray code, while maintaining the ability to develop interactively in a Python shell. **This will only work with Ray 1.5+.** If you're using an older version of Ray, see the `1.4.1 docs <https://docs.ray.io/en/releases-1.4.1/cluster/ray-client.html>`_.
.. code-block:: python
# You can run this code outside of the Ray cluster!
import ray
# Starting the Ray client. This connects to a remote Ray cluster.
# If you're using a version of Ray prior to 1.5, use the 1.4.1 example
# instead: https://docs.ray.io/en/releases-1.4.1/cluster/ray-client.html
ray.init("ray://<head_node_host>:10001")
# Normal Ray code follows
@ray.remote
def do_work(x):
return x ** x
do_work.remote(2)
#....
Client arguments
----------------
Ray Client is used when the address passed into ``ray.init`` is prefixed with ``ray://``. Besides the address, Client mode currently accepts two other arguments:
- ``namespace`` (optional): Sets the namespace for the session.
- ``runtime_env`` (optional): Sets the `runtime environment <../ray-core/handling-dependencies.html#runtime-environments>`_ for the session, allowing you to dynamically specify environment variables, packages, local files, and more.
.. code-block:: python
# Connects to an existing cluster at 1.2.3.4 listening on port 10001, using
# the namespace "my_namespace". The Ray workers will run inside a cluster-side
# copy of the local directory "files/my_project", in a Python environment with
# `toolz` and `requests` installed.
ray.init(
"ray://1.2.3.4:10001",
namespace="my_namespace",
runtime_env={"working_dir": "files/my_project", "pip": ["toolz", "requests"]},
)
#....
When to use Ray Client
----------------------
Ray Client should be used when you want to connect a script or an interactive shell session to a **remote** cluster.
* Use ``ray.init("ray://<head_node_host>:10001")`` (Ray Client) if you've set up a remote cluster at ``<head_node_host>`` and you want to do interactive work. This will connect your local script or shell to the cluster. See the section on :ref:`using Ray Client<how-do-you-use-the-ray-client>` for more details on setting up your cluster.
* Use ``ray.init("localhost:<port>")`` (non-client connection, local address) if you're developing locally or on the head node of your cluster and you have already started the cluster (i.e., ``ray start --head`` has already been run).
* Use ``ray.init()`` (non-client connection, no address specified) if you're developing locally and want to automatically create a local cluster and attach directly to it, or if you are using Ray Job Submission.
.. _how-do-you-use-the-ray-client-under-construction:
How do you use the Ray Client?
------------------------------
Step 1: Set up your Ray cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have a running Ray cluster (version >= 1.5), Ray Client server is likely already running on port ``10001`` of the head node by default. Otherwise, you'll want to create a Ray cluster. To start a Ray cluster locally, you can run
.. code-block:: bash
ray start --head
To start a Ray cluster remotely, you can follow the directions in :ref:`ref-cluster-quick-start`.
If necessary, you can modify the Ray Client server port to be other than ``10001``, by specifying ``--ray-client-server-port=...`` to the ``ray start`` :ref:`command <ray-start-doc>`.
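For example, a hypothetical alternative port could be set like this when starting the head node:

.. code-block:: bash

    ray start --head --ray-client-server-port=20001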
Step 2: Check ports
~~~~~~~~~~~~~~~~~~~
Ensure that the Ray Client port on the head node is reachable from your local machine.
This means opening that port up by configuring security groups or other access controls (on `EC2 <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html>`_)
or proxying from your local machine to the cluster (on `K8s <https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod>`_).
.. tabbed:: AWS
With the Ray cluster launcher, you can configure the security group
to allow inbound access by defining :ref:`cluster-configuration-security-group`
in your `cluster.yaml`.
.. code-block:: yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal_security_group
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
security_group:
GroupName: ray_client_security_group
IpPermissions:
- FromPort: 10001
ToPort: 10001
IpProtocol: TCP
IpRanges:
# This will enable inbound access from ALL IPv4 addresses.
- CidrIp: 0.0.0.0/0
Step 3: Run Ray code
~~~~~~~~~~~~~~~~~~~~
Now, connect to the Ray Cluster with the following and then use Ray like you normally would:
..
.. code-block:: python
import ray
# replace with the appropriate host and port
ray.init("ray://<head_node_host>:10001")
# Normal Ray code follows
@ray.remote
def do_work(x):
return x ** x
do_work.remote(2)
#....
Alternative Approach: SSH Port Forwarding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As an alternative to configuring inbound traffic rules, you can also set up
Ray Client via port forwarding. While this approach does require an open SSH
connection, it can be useful in a test environment where the
``head_node_host`` often changes.
First, open up an SSH connection with your Ray cluster and forward the
listening port (``10001``).
.. code-block:: bash
$ ray up cluster.yaml
$ ray attach cluster.yaml -p 10001
Then, you can connect to the Ray cluster **from another terminal** using ``localhost`` as the
``head_node_host``.
.. code-block:: python
import ray
# This will connect to the cluster via the open SSH session.
ray.init("ray://localhost:10001")
# Normal Ray code follows
@ray.remote
def do_work(x):
return x ** x
do_work.remote(2)
#....
Connect to multiple Ray clusters (Experimental)
-----------------------------------------------
Ray Client allows connecting to multiple Ray clusters in one Python process. To do this, just pass ``allow_multiple=True`` to ``ray.init``:
.. code-block:: python
import ray
# Create a default client.
ray.init("ray://<head_node_host_cluster>:10001")
# Connect to other clusters.
cli1 = ray.init("ray://<head_node_host_cluster_1>:10001", allow_multiple=True)
cli2 = ray.init("ray://<head_node_host_cluster_2>:10001", allow_multiple=True)
# Data is put into the default cluster.
obj = ray.put("obj")
with cli1:
obj1 = ray.put("obj1")
with cli2:
obj2 = ray.put("obj2")
with cli1:
assert ray.get(obj1) == "obj1"
try:
ray.get(obj2) # Cross-cluster ops not allowed.
except:
print("Failed to get object which doesn't belong to this cluster")
with cli2:
assert ray.get(obj2) == "obj2"
try:
ray.get(obj1) # Cross-cluster ops not allowed.
except:
print("Failed to get object which doesn't belong to this cluster")
assert "obj" == ray.get(obj)
cli1.disconnect()
cli2.disconnect()
When using Ray multi-client, there are some different behaviors to pay attention to:
* The client won't be disconnected automatically. Call ``disconnect`` explicitly to close the connection.
* Object references can only be used by the client from which they were obtained.
* ``ray.init`` without ``allow_multiple`` will create a default global Ray client.
Things to know
--------------
Client disconnections
~~~~~~~~~~~~~~~~~~~~~
When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if directly disconnecting from the cluster.
Versioning requirements
~~~~~~~~~~~~~~~~~~~~~~~
Generally, the client Ray version must match the server Ray version. An error will be raised if an incompatible version is used.
Similarly, the minor Python version (e.g., 3.6 vs. 3.7) must match between the client and server. An error will be raised if this is not the case.
Starting a connection on older Ray versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you encounter ``socket.gaierror: [Errno -2] Name or service not known`` when using ``ray.init("ray://...")`` then you may be on a version of Ray prior to 1.5 that does not support starting client connections through ``ray.init``. If this is the case, see the `1.4.1 docs <https://docs.ray.io/en/releases-1.4.1/cluster/ray-client.html>`_ for Ray Client.
Connection through the Ingress
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you encounter the following error message when connecting to the Ray cluster through an Ingress, it may be caused by the Ingress's configuration.
..
.. code-block:: python
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = ""
debug_error_string = "{"created":"@1628668820.164591000","description":"Error received from peer ipv4:10.233.120.107:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"","grpc_status":3}"
>
Got Error from logger channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = ""
debug_error_string = "{"created":"@1628668820.164713000","description":"Error received from peer ipv4:10.233.120.107:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"","grpc_status":3}"
>
If you are using the ``nginx-ingress-controller``, you may be able to resolve the issue by adding the following Ingress configuration.
.. code-block:: yaml
metadata:
annotations:
nginx.ingress.kubernetes.io/server-snippet: |
underscores_in_headers on;
ignore_invalid_headers on;
Ray client logs
~~~~~~~~~~~~~~~
Ray client logs can be found at ``/tmp/ray/session_latest/logs`` on the head node.
Uploads
~~~~~~~
If a ``working_dir`` is specified in the runtime env, then when ``ray.init()`` is called the Ray client uploads the ``working_dir`` from the laptop to ``/tmp/ray/session_latest/runtime_resources/_ray_pkg_<hash of directory contents>`` on the cluster.
Ray workers are started in the ``/tmp/ray/session_latest/runtime_resources/_ray_pkg_<hash of directory contents>`` directory on the cluster. This means that relative paths used in the remote tasks and actors in the code will work both on your laptop and on the cluster without any code changes. For example, if the ``working_dir`` on your laptop contains ``data.txt`` and ``run.py``, the remote task definitions in ``run.py`` can simply use the relative path ``"data.txt"``, and ``python run.py`` will work on your laptop as well as on the cluster. Since relative paths can be used in the code, absolute paths are only useful for debugging purposes.
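As a rough sketch of the scenario described above (the cluster address is a placeholder), ``run.py`` might look like this:

.. code-block:: python

    import ray

    # Upload the current directory (containing run.py and data.txt) to the cluster.
    ray.init("ray://<head_node_host>:10001",
             runtime_env={"working_dir": "."})

    @ray.remote
    def read_data():
        # The relative path works because workers start in the uploaded working_dir.
        with open("data.txt") as f:
            return f.read()

    print(ray.get(read_data.remote()))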

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Running a Ray cluster on-prem

View file

@ -0,0 +1,447 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction:
Launching Cloud Clusters
========================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``), then log in with ``az login``.
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed, the deployment Outputs page provides the SSH command to connect and the link to the JupyterHub on the head node (username/password as specified in the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with an n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
.. _cluster-private-setup-under-construction:
Local On Premise Cluster (List of nodes)
----------------------------------------
Use this mode if you want to run distributed Ray applications on a set of on-premise nodes.
The preferred way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.
There are two ways of running private clusters:
- Manually managed, i.e., the user explicitly specifies the head and worker IPs.
- Automatically managed, i.e., the user only specifies the address of a coordinator server, which automatically manages the head and worker IPs.
.. tip:: To avoid password prompts when running private clusters, make sure to set up your SSH keys on the private cluster as follows:
.. code-block:: bash
$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
.. tabbed:: Manually Managed
You can get started by filling out the fields in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.
Be sure to specify the proper ``head_ip``, list of ``worker_ips``, and the ``ssh_user`` field.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. tabbed:: Automatically Managed
Start by launching the coordinator server that will manage all the on-prem clusters. This server also isolates resources between different users. The script for running the coordinator server is `ray/python/ray/autoscaler/local/coordinator_server.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/coordinator_server.py>`__. To launch the coordinator server, run:
.. code-block:: bash
$ python coordinator_server.py --ips <list_of_node_ips> --port <PORT>
where ``<list_of_node_ips>`` is a comma-separated list of all the available nodes on the private cluster (for example, ``160.24.42.48,160.24.42.49,...``) and ``<PORT>`` is the port that the coordinator server will listen on.
Once started, the coordinator server prints its address. For example:
.. code-block:: bash
>> INFO:ray.autoscaler.local.coordinator_server:Running on prem coordinator server
on address <Host:PORT>
Next, specify the ``<Host:PORT>`` printed above in the ``coordinator_address`` entry of the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__ instead of specific head/worker IPs.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. _manual-cluster-under-construction:
Manual Ray Cluster Setup
------------------------
The preferred way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand.
This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.
.. _`installation instructions`: http://docs.ray.io/en/master/installation.html
Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On the head node (just choose one node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.
.. code-block:: bash
$ ray start --head --port=6379
...
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379'
If connection fails, check your firewall settings and network configuration.
The command will print out the address of the Ray GCS server that was started
(the local node IP address plus the port number you specified).
.. note::
If you already have remote Redis instances, you can set the environment variable
``RAY_REDIS_ADDRESS=ip1:port1,ip2:port2...`` to use them. The first one is
the primary and the rest are shards.
**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).
Note that if your compute nodes are on their own subnetwork with Network
Address Translation, to connect from a regular machine outside that subnetwork,
the command printed by the head node will not work. You need to find the
address that will reach the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on DNS.
.. code-block:: bash
$ ray start --address=<address>
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
If you see ``Unable to connect to GCS at ...``,
this means the head node is inaccessible at the given ``--address`` (because, for
example, the head node is not actually running, a different version of Ray is
running at the specified address, the specified address is wrong, or there are
firewall settings preventing access).
If you see ``Ray runtime started.``, then the node successfully connected to
the head node at the ``--address``. You should now be able to connect to the
cluster with ``ray.init()``.
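For example, a quick sanity check run on a machine that is part of the cluster might look like the following:

.. code-block:: python

    import ray

    # Attach to the Ray runtime already running on this node.
    ray.init(address="auto")

    # Print the resources (CPUs, GPUs, memory) aggregated across the cluster.
    print(ray.cluster_resources())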
If the connection fails, check your firewall settings and network configuration.
To check whether each port can be reached from a node, you can use a tool such as
``nmap`` or ``nc``.
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp open redis? syn-ack
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
If the node cannot access that port at that IP address, you might see
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp closed redis reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
Stopping Ray
~~~~~~~~~~~~
When you want to stop the Ray processes, run ``ray stop`` on each node.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
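As a rough sketch (the class and module names here are hypothetical, and only a subset of the interface is shown), a custom provider subclasses ``NodeProvider`` and implements the node lifecycle methods, and is then referenced from the cluster config via ``type: external`` and ``module: <package>.<ProviderClass>``:

.. code-block:: python

    from ray.autoscaler.node_provider import NodeProvider


    class MyCloudNodeProvider(NodeProvider):
        """Hypothetical provider; see node_provider.py for the full interface."""

        def __init__(self, provider_config, cluster_name):
            super().__init__(provider_config, cluster_name)
            # Create a client for your infrastructure API here.

        def non_terminated_nodes(self, tag_filters):
            # Return the IDs of all live nodes matching the given tags.
            raise NotImplementedError

        def create_node(self, node_config, tags, count):
            # Launch `count` nodes and apply `tags` to them.
            raise NotImplementedError

        def terminate_node(self, node_id):
            # Shut down the given node.
            raise NotImplementedError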
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
import ray

ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
import time

import ray
@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()
# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,4 +1,4 @@
.. include:: we_are_hiring.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-setup:

View file

@ -1,6 +1,6 @@
.. include:: /_includes/clusters/announcement.rst
.. include:: we_are_hiring.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-quick-start: