[Ray clusters] [docs] Copying all Ray Clusters doc content to new structure (#27062)

This commit is contained in:
Cade Daniel 2022-07-27 14:22:44 -07:00 committed by GitHub
parent 4f0fb3a5da
commit db26c779a0
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
52 changed files with 5745 additions and 375 deletions

View file

@ -297,7 +297,6 @@ parts:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/launching-clusters/add-your-own-cloud-provider
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/running-ray-cluster-on-prem
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/monitoring-and-observing-ray-cluster
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/manual-cluster-setup
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/large-cluster-best-practices
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/multi-tenancy-best-practices
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/configuring-autoscaling
@ -333,8 +332,9 @@ parts:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/index
sections:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/job-submission-apis
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/ray-cluster-cli
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/ray-cluster-configuration
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/ray-job-submission
- file: cluster/cluster_under_construction/ray-clusters-on-vms/references/autoscaler-sdk-api
- caption: References
chapters:

View file

@ -1,264 +1,97 @@
.. include:: /_includes/clusters/announcement.rst
.. include:: we_are_hiring.rst
.. _ref-cluster-getting-started-under-construction:
.. warning::
This page is under construction!
TODO(cade)
Direct users, based on what they are trying to accomplish, to the
correct page between "Managing Ray Clusters on Kubernetes",
"Managing Ray Clusters via `ray up`", and "Using Ray Clusters".
There should be some discussion on Kubernetes vs. `ray up` for
those looking to create new Ray clusters for the first time.
Getting Started with Ray Clusters
=================================
This page demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <cluster-private-setup>` guide.
About the demo
--------------
This demo will walk through an end-to-end flow:
1. Create a (basic) Python application.
2. Launch a cluster on a cloud provider.
3. Run the application in the cloud.
Requirements
~~~~~~~~~~~~
To run this demo, you will need:
* Python installed on your development machine (typically your laptop), and
* an account at your preferred cloud provider (AWS, Azure or GCP).
Setup
~~~~~
Before we start, you will need to install some Python dependencies as follows:
.. tabbed:: AWS
.. code-block:: shell
$ pip install -U "ray[default]" boto3
.. tabbed:: Azure
.. code-block:: shell
$ pip install -U "ray[default]" azure-cli azure-core
.. tabbed:: GCP
.. code-block:: shell
$ pip install -U "ray[default]" google-api-python-client
Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:
.. tabbed:: AWS
Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`_.
.. tabbed:: Azure
Log in using ``az login``, then configure your credentials with ``az account set -s <subscription_id>``.
.. tabbed:: GCP
Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs <https://cloud.google.com/docs/authentication/getting-started>`_.
Create a (basic) Python application
-----------------------------------
We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    ip_addresses = [f() for _ in range(10000)]
    print(Counter(ip_addresses))
Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take 10 seconds to run and output something similar to ``Counter({'127.0.0.1': 10000})``.
With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough<core-walkthrough>`):
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)
    print(Counter(ip_addresses))
Finally, let's add some code to make the output more interesting:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    print('''This cluster consists of
        {} nodes in total
        {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)

    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Running ``python script.py`` should now output something like:
.. parsed-literal::
This cluster consists of
1 nodes in total
4.0 CPU resources in total
Tasks executed
10000 tasks on 127.0.0.1
Launch a cluster on a cloud provider
------------------------------------
To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes.
A minimal sample cluster configuration file looks as follows:
.. tabbed:: AWS

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: aws
            region: us-west-2

.. tabbed:: Azure

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: azure
            location: westus2
            resource_group: ray-cluster

        # How Ray will authenticate with newly launched nodes.
        auth:
            ssh_user: ubuntu
            # you must specify paths to matching private and public key pair files
            # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
            ssh_private_key: ~/.ssh/id_rsa
            # changes to this should match what is specified in file_mounts
            ssh_public_key: ~/.ssh/id_rsa.pub

.. tabbed:: GCP

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: gcp
            region: us-west1
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
.. code-block:: shell
$ ray up -y config.yaml
Run the application in the cloud
--------------------------------
We are now ready to execute the application across multiple machines on our Ray cloud cluster.
First, we need to edit the initialization command ``ray.init()`` in ``script.py``.
Change it to
.. code-block:: python
ray.init(address='auto')
This tells your script to connect to the Ray runtime on the remote cluster instead of initializing a new Ray runtime.
Next, run the following command:
.. code-block:: shell
$ ray submit config.yaml script.py
The output should now look similar to the following:
.. parsed-literal::
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3425 tasks on xxx.xxx.xxx.xxx
3834 tasks on xxx.xxx.xxx.xxx
2741 tasks on xxx.xxx.xxx.xxx
In this sample output, 3 nodes were started. If the output only shows 1 node, you may want to increase the ``secs`` in ``time.sleep(secs)`` to give Ray more time to start additional nodes.
The Ray CLI offers additional functionality. For example, you can monitor the Ray cluster status with ``ray monitor config.yaml``, and you can connect to the cluster (ssh into the head node) with ``ray attach config.yaml``. For a full reference on the Ray CLI, please refer to :ref:`the cluster commands reference <cluster-commands>`.
To finish, don't forget to shut down the cluster. Run the following command:
.. code-block:: shell
$ ray down -y config.yaml
.. include:: /_includes/clusters/announcement.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-index-under-construction:
..
TODO(cade)
Update this to accomplish the following:
Direct users, based on what they are trying to accomplish, to the
correct page between "Managing Ray Clusters on Kubernetes",
"Managing Ray Clusters via `ray up`", and "Using Ray Clusters".
There should be some discussion on Kubernetes vs. `ray up` for
those looking to create new Ray clusters for the first time.
Ray Clusters Overview
=====================
What is a Ray cluster?
----------------------
One of Ray's strengths is the ability to leverage multiple machines for
distributed execution. Ray can, of course, be run on a single machine (and is
done so often), but the real power is using Ray on a cluster of machines.
Ray can automatically interact with the cloud provider to request or release
instances. You can specify :ref:`a configuration <cluster-config>` to launch
clusters on :ref:`AWS, GCP, Azure (community-maintained), Aliyun (community-maintained), on-premise, or even on
your custom node provider <cluster-cloud>`. Ray can also be run on :ref:`Kubernetes <kuberay-index>` infrastructure.
Your cluster can have a fixed size
or :ref:`automatically scale up and down<cluster-autoscaler>` depending on the
demands of your application.
Where to go from here?
----------------------
.. panels::
:container: text-center
:column: col-lg-6 px-2 py-2
:card:
**Quick Start**
^^^
In this quick start tutorial you will take a sample application designed to
run on a laptop and scale it up in the cloud.
+++
.. link-button:: ref-cluster-quick-start-vms-under-construction
:type: ref
:text: Ray Clusters Quick Start
:classes: btn-outline-info btn-block
---
**Key Concepts**
^^^
Understand the key concepts behind Ray Clusters. Learn about the main
concepts and the different ways to interact with a cluster.
+++
.. link-button:: cluster-key-concepts
:type: ref
:text: Learn Key Concepts
:classes: btn-outline-info btn-block
---
**Deployment Guide**
^^^
Learn how to set up a distributed Ray cluster and run your workloads on it.
+++
.. link-button:: ref-deployment-guide
:type: ref
:text: Deploy on a Ray Cluster
:classes: btn-outline-info btn-block
---
**API**
^^^
Get more in-depth information about the various APIs to interact with Ray
Clusters, including the :ref:`Ray cluster config YAML and CLI<cluster-config>`,
the :ref:`Ray Client API<ray-client>` and the
:ref:`Ray job submission API<ray-job-submission-api-ref>`.
+++
.. link-button:: ref-cluster-api
:type: ref
:text: Read the API Reference
:classes: btn-outline-info btn-block
.. include:: /_includes/clusters/announcement_bottom.rst

View file

@ -1,26 +1,130 @@
.. include:: we_are_hiring.rst
.. warning::
This page is under construction!
Key Concepts
============
..
TODO(cade) Can we simplify this? From https://github.com/ray-project/ray/pull/26754#issuecomment-1192927645:
* Worker Nodes
* Head Node
* Autoscaler
* Clients and Jobs
Need to add the following sections + break out existing content into them.
See ray-core/user-guide.rst for a TOC example
overview
high-level-architecture
jobs
nodes-vs-workers
scheduling-and-autoscaling
configuration
Things-to-know
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-key-concepts-under-construction:
Key Concepts
============
TODO(cade) Can we simplify this? From https://github.com/ray-project/ray/pull/26754#issuecomment-1192927645:
* Worker Nodes
* Head Node
* Autoscaler
* Clients and Jobs
Cluster
-------
Need to add the following sections + break out existing content into them.
See ray-core/user-guide.rst for a TOC example
A Ray cluster is a set of one or more nodes that are running Ray and share the
same :ref:`head node<cluster-node-types>`.
overview
high-level-architecture
jobs
nodes-vs-workers
scheduling-and-autoscaling
configuration
Things-to-know
.. _cluster-node-types-under-construction:
Node types
----------
A Ray cluster consists of a :ref:`head node<cluster-head-node>` and a set of
:ref:`worker nodes<cluster-worker-node>`.
.. image:: ray-cluster.jpg
:align: center
:width: 600px
.. _cluster-head-node-under-construction:
Head node
~~~~~~~~~
The head node is the first node started by the
:ref:`Ray cluster launcher<cluster-launcher>` when trying to launch a Ray
cluster. Among other things, the head node holds the :ref:`Global Control Store
(GCS)<memory>` and runs the :ref:`autoscaler<cluster-autoscaler>`. Once the head
node is started, it will be responsible for launching any additional
:ref:`worker nodes<cluster-worker-node>`. The head node itself will also execute
tasks and actors to utilize its capacity.
.. _cluster-worker-node-under-construction:
Worker node
~~~~~~~~~~~
A worker node is any node in the Ray cluster that is not functioning as the head node.
Therefore, worker nodes are simply responsible for executing tasks and actors.
When a worker node is launched, it will be given the address of the head node to
form a cluster.
.. _cluster-launcher-under-construction:
Cluster launcher
----------------
The cluster launcher is a process responsible for bootstrapping the Ray cluster
by launching the :ref:`head node<cluster-head-node>`. For more information on how
to use the cluster launcher, refer to
:ref:`cluster launcher CLI commands documentation<cluster-commands>` and the
corresponding :ref:`documentation for the configuration file<cluster-config>`.
.. _cluster-autoscaler-under-construction:
Autoscaler
----------
The autoscaler is a process that runs on the :ref:`head node<cluster-head-node>`
and is responsible for adding or removing :ref:`worker nodes<cluster-worker-node>`
to meet the needs of the Ray workload while matching the specification in the
:ref:`cluster config file<cluster-config>`. In particular, if the resource
demands of the Ray workload exceed the current capacity of the cluster, the
autoscaler will try to add nodes. Conversely, if a node is idle for long enough,
the autoscaler will remove it from the cluster. To learn more about autoscaling,
refer to the :ref:`Ray cluster deployment guide<deployment-guide-autoscaler>`.
Ray Client
----------
The Ray Client is an API that connects a Python script to a remote Ray cluster.
To learn more about the Ray Client, you can refer to the :ref:`documentation<ray-client>`.
Job submission
--------------
Ray Job submission is a mechanism to submit locally developed and tested applications
to a remote Ray cluster. It simplifies the experience of packaging, deploying,
and managing a Ray application. To learn more about Ray jobs, refer to the
:ref:`documentation<ray-job-submission-api-ref>`.
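As a minimal sketch using the Python ``JobSubmissionClient`` (the address below is a placeholder for the head node's dashboard address; ``8265`` is the default dashboard port):

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Point the client at the remote cluster's dashboard address.
    client = JobSubmissionClient("http://<head_node_host>:8265")

    # Submit a locally developed script together with its working directory.
    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={"working_dir": "./"},
    )
    print(client.get_job_status(job_id))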
Cloud clusters
--------------
If you're using AWS, GCP, Azure (community-maintained) or Aliyun (community-maintained), you can use the
:ref:`Ray cluster launcher<cluster-launcher>` to launch cloud clusters, which
greatly simplifies the cluster setup process.
Cluster managers
----------------
You can simplify the process of managing Ray clusters using a number of popular
cluster managers including :ref:`Kubernetes<kuberay-index>`,
:ref:`YARN<ray-yarn-deploy>`, :ref:`Slurm<ray-slurm-deploy>` and :ref:`LSF<ray-LSF-deploy>`.
Kubernetes (K8s) operator
-------------------------
Deployments of Ray on Kubernetes are managed by the Ray Kubernetes Operator. The
Ray Operator makes it easy to deploy clusters of Ray pods within a Kubernetes
cluster. To learn more about the K8s operator, refer to
the :ref:`documentation<kuberay-index>`.

View file

@ -1,4 +0,0 @@
# Getting Started
:::{warning}
This page is under construction!
:::

View file

@ -0,0 +1,251 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/announcement.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-quick-start-vms-under-construction:
Ray Clusters Quick Start
========================
This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <cluster-private-setup>` guide.
About the demo
--------------
This demo will walk through an end-to-end flow:
1. Create a (basic) Python application.
2. Launch a cluster on a cloud provider.
3. Run the application in the cloud.
Requirements
~~~~~~~~~~~~
To run this demo, you will need:
* Python installed on your development machine (typically your laptop), and
* an account at your preferred cloud provider (AWS, Azure or GCP).
Setup
~~~~~
Before we start, you will need to install some Python dependencies as follows:
.. tabbed:: AWS
.. code-block:: shell
$ pip install -U "ray[default]" boto3
.. tabbed:: Azure
.. code-block:: shell
$ pip install -U "ray[default]" azure-cli azure-core
.. tabbed:: GCP
.. code-block:: shell
$ pip install -U "ray[default]" google-api-python-client
Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:
.. tabbed:: AWS
Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`_.
.. tabbed:: Azure
Log in using ``az login``, then configure your credentials with ``az account set -s <subscription_id>``.
.. tabbed:: GCP
Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs <https://cloud.google.com/docs/authentication/getting-started>`_.
Create a (basic) Python application
-----------------------------------
We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    ip_addresses = [f() for _ in range(10000)]
    print(Counter(ip_addresses))
Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take 10 seconds to run and output something similar to ``Counter({'127.0.0.1': 10000})``.
With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough<core-walkthrough>`):
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)
    print(Counter(ip_addresses))
Finally, let's add some code to make the output more interesting:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    print('''This cluster consists of
        {} nodes in total
        {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)

    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Running ``python script.py`` should now output something like:
.. parsed-literal::
This cluster consists of
1 nodes in total
4.0 CPU resources in total
Tasks executed
10000 tasks on 127.0.0.1
Launch a cluster on a cloud provider
------------------------------------
To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes.
A minimal sample cluster configuration file looks as follows:
.. tabbed:: AWS

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: aws
            region: us-west-2

.. tabbed:: Azure

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: azure
            location: westus2
            resource_group: ray-cluster

        # How Ray will authenticate with newly launched nodes.
        auth:
            ssh_user: ubuntu
            # you must specify paths to matching private and public key pair files
            # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
            ssh_private_key: ~/.ssh/id_rsa
            # changes to this should match what is specified in file_mounts
            ssh_public_key: ~/.ssh/id_rsa.pub

.. tabbed:: GCP

    .. code-block:: yaml

        # A unique identifier for the head node and workers of this cluster.
        cluster_name: minimal

        # Cloud-provider specific configuration.
        provider:
            type: gcp
            region: us-west1
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
.. code-block:: shell
$ ray up -y config.yaml
Run the application in the cloud
--------------------------------
We are now ready to execute the application across multiple machines on our Ray cloud cluster.
``ray.init()`` will now automatically connect to the newly created cluster.
Next, run the following command:
.. code-block:: shell
$ ray submit config.yaml script.py
The output should now look similar to the following:
.. parsed-literal::
Connecting to existing Ray cluster at address: <IP address>...
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3425 tasks on xxx.xxx.xxx.xxx
3834 tasks on xxx.xxx.xxx.xxx
2741 tasks on xxx.xxx.xxx.xxx
In this sample output, 3 nodes were started. If the output only shows 1 node, you may want to increase the ``secs`` in ``time.sleep(secs)`` to give Ray more time to start additional nodes.
The Ray CLI offers additional functionality. For example, you can monitor the Ray cluster status with ``ray monitor config.yaml``, and you can connect to the cluster (ssh into the head node) with ``ray attach config.yaml``. For a full reference on the Ray CLI, please refer to :ref:`the cluster commands reference <cluster-commands>`.
To finish, don't forget to shut down the cluster. Run the following command:
.. code-block:: shell
$ ray down -y config.yaml

View file

@ -0,0 +1,18 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-autoscaler-sdk-under-construction:
Autoscaler SDK
==============
.. _ref-autoscaler-sdk-request-resources-under-construction:
ray.autoscaler.sdk.request_resources
------------------------------------
Within a Ray program, you can command the autoscaler to scale the cluster up to a desired size with the ``request_resources()`` call. The cluster will immediately attempt to scale to accommodate the requested resources, bypassing normal upscaling speed constraints.
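As a minimal sketch (the resource shapes below are only illustrative):

.. code-block:: python

    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto")

    # Ask the autoscaler to size the cluster so that 16 CPUs' worth of tasks
    # can run concurrently, bypassing normal upscaling speed constraints.
    request_resources(num_cpus=16)

    # Alternatively, request capacity for specific resource bundles.
    request_resources(bundles=[{"CPU": 4}, {"CPU": 4}])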
.. .. autofunction:: ray.autoscaler.sdk.request_resources

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Job submission API

View file

@ -0,0 +1,12 @@
.. _ray-job-submission-api-ref-under-construction:
Ray Job Submission API
======================
For an overview with examples see :ref:`Ray Job Submission<jobs-overview>`.
.. _ray-job-submission-cli-ref-under-construction:
Job Submission CLI
------------------

View file

@ -0,0 +1,237 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-commands-under-construction:
Cluster Launcher Commands
=========================
This document gives an overview of common commands for using the Ray Cluster Launcher.
See the :ref:`Cluster Configuration <cluster-config>` docs on how to customize the configuration file.
Launching a cluster (``ray up``)
--------------------------------
This will start up the machines in the cloud, install your dependencies and run
any setup commands that you have, configure the Ray cluster automatically, and
prepare you to scale your distributed system. See :ref:`the documentation
<ray-up-doc>` for ``ray up``.
.. tip:: The worker nodes will start only after the head node has finished
starting. To monitor the progress of the cluster setup, you can run
`ray monitor <cluster yaml>`.
.. code-block:: shell
# Replace '<your_backend>' with one of: 'aws', 'gcp', 'kubernetes', or 'local'.
$ BACKEND=<your_backend>
# Create or update the cluster.
$ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/$BACKEND/example-full.yaml
Updating an existing cluster (``ray up``)
-----------------------------------------
If you want to update your cluster configuration (add more files, change dependencies), run ``ray up`` again on the existing cluster.
This command checks if the local configuration differs from the applied
configuration of the cluster. This includes any changes to synced files
specified in the ``file_mounts`` section of the config. If so, the new files
and config will be uploaded to the cluster. Following that, Ray
services/processes will be restarted.
.. tip:: Don't do this for the cloud provider specifications (e.g., change from
AWS to GCP on a running cluster) or change the cluster name (as this
will just start a new cluster and orphan the original one).
You can also run ``ray up`` to restart a cluster if it seems to be in a bad
state (this will restart all Ray services even if there are no config changes).
Running ``ray up`` on an existing cluster will do all the following:
* If the head node matches the cluster specification, the filemounts will be
reapplied and the ``setup_commands`` and ``ray start`` commands will be run.
There may be some caching behavior here to skip setup/file mounts.
* If the head node is out of date from the specified YAML (e.g.,
``head_node_type`` has changed on the YAML), then the out of date node will
be terminated and a new node will be provisioned to replace it. Setup/file
mounts/``ray start`` will be applied.
* After the head node reaches a consistent state (after ``ray start`` commands
are finished), the same above procedure will be applied to all the worker
nodes. The ``ray start`` commands tend to run a ``ray stop`` + ``ray start``,
so this will kill currently working jobs.
If you don't want the update to restart services (e.g., because the changes
don't require a restart), pass ``--no-restart`` to the update call.
If you want to force re-generation of the config to pick up possible changes in
the cloud environment, pass ``--no-config-cache`` to the update call.
If you want to skip the setup commands and only run ``ray stop``/``ray start``
on all nodes, pass ``--restart-only`` to the update call.
See :ref:`the documentation <ray-up-doc>` for ``ray up``.
.. code-block:: shell
# Reconfigure autoscaling behavior without interrupting running jobs.
$ ray up ray/python/ray/autoscaler/$BACKEND/example-full.yaml \
--max-workers=N --no-restart
Running shell commands on the cluster (``ray exec``)
----------------------------------------------------
You can use ``ray exec`` to conveniently run commands on clusters. See :ref:`the documentation <ray-exec-doc>` for ``ray exec``.
.. code-block:: shell
# Run a command on the cluster
$ ray exec cluster.yaml 'echo "hello world"'
# Run a command on the cluster, starting it if needed
$ ray exec cluster.yaml 'echo "hello world"' --start
# Run a command on the cluster, stopping the cluster after it finishes
$ ray exec cluster.yaml 'echo "hello world"' --stop
# Run a command on a new cluster called 'experiment-1', stopping it after
$ ray exec cluster.yaml 'echo "hello world"' \
--start --stop --cluster-name experiment-1
# Run a command in a detached tmux session
$ ray exec cluster.yaml 'echo "hello world"' --tmux
# Run a command in a screen (experimental)
$ ray exec cluster.yaml 'echo "hello world"' --screen
If you want to run applications on the cluster that are accessible from a web
browser (e.g., Jupyter notebook), you can use the ``--port-forward`` option. The local
port opened is the same as the remote port.
.. code-block:: shell
$ ray exec cluster.yaml --port-forward=8899 'source ~/anaconda3/bin/activate tensorflow_p36 && jupyter notebook --port=8899'
.. note:: For Kubernetes clusters, the ``port-forward`` option cannot be used
while executing a command. To port forward and run a command you need
to call ``ray exec`` twice separately.
Running Ray scripts on the cluster (``ray submit``)
---------------------------------------------------
You can also use ``ray submit`` to execute Python scripts on clusters. This
will ``rsync`` the designated file onto the head node cluster and execute it
with the given arguments. See :ref:`the documentation <ray-submit-doc>` for
``ray submit``.
.. code-block:: shell
# Run a Python script in a detached tmux session
$ ray submit cluster.yaml --tmux --start --stop tune_experiment.py
# Run a Python script with arguments.
# This executes script.py on the head node of the cluster, using
# the command: python ~/script.py --arg1 --arg2 --arg3
$ ray submit cluster.yaml script.py -- --arg1 --arg2 --arg3
Attaching to a running cluster (``ray attach``)
-----------------------------------------------
You can use ``ray attach`` to attach to an interactive screen session on the
cluster. See :ref:`the documentation <ray-attach-doc>` for ``ray attach`` or
run ``ray attach --help``.
.. code-block:: shell
# Open a screen on the cluster
$ ray attach cluster.yaml
# Open a screen on a new cluster called 'session-1'
$ ray attach cluster.yaml --start --cluster-name=session-1
# Attach to tmux session on cluster (creates a new one if none available)
$ ray attach cluster.yaml --tmux
.. _ray-rsync-under-construction:
Synchronizing files from the cluster (``ray rsync-up/down``)
------------------------------------------------------------
To download or upload files to the cluster head node, use ``ray rsync_down`` or
``ray rsync_up``:
.. code-block:: shell
$ ray rsync_down cluster.yaml '/path/on/cluster' '/local/path'
$ ray rsync_up cluster.yaml '/local/path' '/path/on/cluster'
.. _monitor-cluster-under-construction:
Monitoring cluster status (``ray dashboard/status``)
-----------------------------------------------------
Ray also comes with an online dashboard. The dashboard is accessible via
HTTP on the head node (by default it listens on ``localhost:8265``). You can
also use the built-in ``ray dashboard`` command to set up port forwarding
automatically, making the remote dashboard viewable in your local browser at
``localhost:8265``.
.. code-block:: shell
$ ray dashboard cluster.yaml
You can monitor cluster usage and auto-scaling status by running (on the head node):
.. code-block:: shell
$ ray status
To see live updates to the status:
.. code-block:: shell
$ watch -n 1 ray status
The Ray autoscaler also reports per-node status in the form of instance tags.
In your cloud provider console, you can click on a Node, go to the "Tags" pane,
and add the ``ray-node-status`` tag as a column. This lets you see per-node
statuses at a glance:
.. image:: /images/autoscaler-status.png
Common Workflow: Syncing git branches
-------------------------------------
A common use case is syncing a particular local git branch to all workers of
the cluster. However, if you just put a ``git checkout <branch>`` in the setup
commands, the autoscaler won't know when to rerun the command to pull in
updates. There is a nice workaround for this by including the git SHA in the
input (the hash of the file will change if the branch is updated):
.. code-block:: yaml

    file_mounts: {
        "/tmp/current_branch_sha": "/path/to/local/repo/.git/refs/heads/<YOUR_BRANCH_NAME>",
    }

    setup_commands:
        - test -e <REPO_NAME> || git clone https://github.com/<REPO_ORG>/<REPO_NAME>.git
        - cd <REPO_NAME> && git fetch && git checkout `cat /tmp/current_branch_sha`
This tells ``ray up`` to sync the current git branch SHA from your personal
computer to a temporary file on the cluster (assuming you've pushed the branch
head already). Then, the setup commands read that file to figure out which SHA
they should check out on the nodes. Note that each command runs in its own
session. The final workflow to update the cluster then becomes just this:
1. Make local changes to a git branch
2. Commit the changes with ``git commit`` and ``git push``
3. Update files on your Ray cluster with ``ray up``

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Ray cluster configuration file

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Ray Job submission

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Using a community-supported cluster manager

View file

@ -0,0 +1,21 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-setup-under-construction:
Community Supported Cluster Managers
====================================
.. note::
If you're using AWS, Azure or GCP you can use the :ref:`Ray Cluster Launcher <cluster-cloud>` to simplify the cluster setup process.
.. toctree::
:maxdepth: 2
yarn.rst
slurm.rst
lsf.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# LSF

View file

@ -0,0 +1,23 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-LSF-deploy-under-construction:
Deploying on LSF
================
This document describes the high-level steps to run a Ray cluster on LSF.
1) Obtain desired nodes from LSF scheduler using bsub directives.
2) Obtain free ports on the desired nodes to start ray services like dashboard, GCS etc.
3) Start ray head node on one of the available nodes.
4) Connect all the worker nodes to the head node.
5) Perform port forwarding to access ray dashboard.
Steps 1-4 have been automated and can be run as a script. Please refer to the GitHub repository below to access the script and run sample workloads:

- `ray_LSF`_ Ray with LSF. Users can start up a Ray cluster on LSF and run DL workloads through it in either batch or interactive mode.
.. _`ray_LSF`: https://github.com/IBMSpectrumComputing/ray-integration

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# SLURM

View file

@ -0,0 +1,288 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-slurm-deploy-under-construction:
Deploying on Slurm
==================
Slurm usage with Ray can be a little bit unintuitive.
* SLURM requires that multiple copies of the same program be submitted to the same cluster to do cluster programming. This is particularly well-suited for MPI-based workloads.
* Ray, on the other hand, expects a head-worker architecture with a single point of entry. That is, you'll need to start a Ray head node, multiple Ray worker nodes, and run your Ray script on the head node.
.. warning::
SLURM support is still a work in progress. SLURM users should be aware
of current limitations regarding networking.
See :ref:`here <slurm-network-ray>` for more explanations.
SLURM support is community-maintained. Maintainer GitHub handle: tupui.
This document aims to clarify how to run Ray on SLURM.
.. contents::
:local:
Walkthrough using Ray with SLURM
--------------------------------
Many SLURM deployments require you to interact with SLURM via ``sbatch``, which executes a batch script on SLURM.
To run a Ray job with ``sbatch``, you will want to start a Ray cluster in the sbatch job with multiple ``srun`` commands (tasks), and then execute your Python script that uses Ray. Each task will run on a separate node and start/connect to a Ray runtime.
The below walkthrough will do the following:
1. Set the proper headers for the ``sbatch`` script.
2. Load the proper environment/modules.
3. Fetch a list of available computing nodes and their IP addresses.
4. Launch a Ray head process on one of the nodes (called the head node).
5. Launch Ray processes on the (n-1) worker nodes and connect them to the head node by providing the head node address.
6. After the underlying Ray cluster is ready, submit the user-specified task.
See :ref:`slurm-basic.sh <slurm-basic>` for an end-to-end example.
.. _ray-slurm-headers-under-construction:
sbatch directives
~~~~~~~~~~~~~~~~~
In your sbatch script, you'll want to add `directives to provide context <https://slurm.schedmd.com/sbatch.html>`__ for your job to SLURM.
.. code-block:: bash
#!/bin/bash
#SBATCH --job-name=my-workload
You'll need to tell SLURM to allocate nodes specifically for Ray. Ray will then find and manage all resources on each node.
.. code-block:: bash
### Modify this according to your Ray workload.
#SBATCH --nodes=4
#SBATCH --exclusive
Important: To ensure that each Ray worker runtime will run on a separate node, set ``tasks-per-node``.
.. code-block:: bash
#SBATCH --tasks-per-node=1
Since we've set ``tasks-per-node=1``, this will be used to guarantee that each Ray worker runtime will obtain the
proper resources. In this example, we ask for at least 5 CPUs and 5 GB of memory per node.
.. code-block:: bash
### Modify this according to your Ray workload.
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1GB
### Similarly, you can also specify the number of GPUs per node.
### Modify this according to your Ray workload. Sometimes this
### should be 'gres' instead.
#SBATCH --gpus-per-task=1
You can also add other optional flags to your sbatch directives.
Loading your environment
~~~~~~~~~~~~~~~~~~~~~~~~
First, you'll often want to load modules or activate your own conda environment at the beginning of the script.
Note that this is an optional step, but it is often required for enabling the right set of dependencies.
.. code-block:: bash
# Example: module load pytorch/v1.4.0-gpu
# Example: conda activate my-env
conda activate my-env
Obtain the head IP address
~~~~~~~~~~~~~~~~~~~~~~~~~~
Next, we'll want to obtain a hostname and a node IP address for the head node. This way, when we start worker nodes, we'll be able to properly connect to the right head node.
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_head_address_start__
:end-before: __doc_head_address_end__
Starting the Ray head node
~~~~~~~~~~~~~~~~~~~~~~~~~~
After detecting the head node hostname and head node IP, we'll want to create
a Ray head node runtime. We'll do this by using ``srun`` as a background task
as a single task/node (recall that ``tasks-per-node=1``).
Below, you'll see that we explicitly specify the number of CPUs (``num-cpus``)
and number of GPUs (``num-gpus``) to Ray, as this will prevent Ray from using
more resources than allocated. We also need to explicitly
indicate the ``node-ip-address`` for the Ray head runtime:
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_head_ray_start__
:end-before: __doc_head_ray_end__
By backgrounding the above srun task, we can proceed to start the Ray worker runtimes.
Starting the Ray worker nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Below, we do the same thing, but for each worker. Make sure the Ray head and Ray worker processes are not started on the same node.
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_worker_ray_start__
:end-before: __doc_worker_ray_end__
Submitting your script
~~~~~~~~~~~~~~~~~~~~~~
Finally, you can invoke your Python script:
.. literalinclude:: /cluster/examples/slurm-basic.sh
:language: bash
:start-after: __doc_script_start__
.. _slurm-network-ray-under-construction:
SLURM networking caveats
~~~~~~~~~~~~~~~~~~~~~~~~
There are two important networking aspects to keep in mind when working with
SLURM and Ray:
1. Ports binding.
2. IP binding.
One common use of a SLURM cluster is to have multiple users running concurrent
jobs on the same infrastructure. This can easily conflict with Ray due to the
way the head node communicates with its workers.
Consider two users: if they both schedule a SLURM job using Ray
at the same time, they will both create a head node. In the background, Ray will
assign some internal ports to a few services. The issue is that as soon as the
first head node is created, it will bind some ports and prevent them from being
used by another head node. To prevent any conflicts, users have to manually
specify non-overlapping ranges of ports. The following ports are to be
adjusted. For an explanation on ports, see :ref:`here <ray-ports>`::
# used for all ports
--node-manager-port
--object-manager-port
--min-worker-port
--max-worker-port
# used for the head node
--port
--ray-client-server-port
--redis-shard-ports
For instance, with the same two users, they would have to adapt the instructions
above as follows:
.. code-block:: bash
# user 1
# same as above
...
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=6379 \
--node-manager-port=6700 \
--object-manager-port=6701 \
--ray-client-server-port=10001 \
--redis-shard-ports=6702 \
--min-worker-port=10002 \
--max-worker-port=19999 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
# user 2
# same as above
...
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=6380 \
--node-manager-port=6800 \
--object-manager-port=6801 \
--ray-client-server-port=20001 \
--redis-shard-ports=6802 \
--min-worker-port=20002 \
--max-worker-port=29999 \
--num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
As for IP binding, on some cluster architectures the network interfaces
do not allow the use of external IPs between nodes. Instead, there are internal
network interfaces (``eth0``, ``eth1``, etc.). Currently, it is difficult to
set an internal IP
(see the open `issue <https://github.com/ray-project/ray/issues/22732>`_).
Python-interface SLURM scripts
------------------------------
[Contributed by @pengzhenghao] Below, we provide a helper utility (:ref:`slurm-launch.py <slurm-launch>`) to auto-generate SLURM scripts and launch them.
``slurm-launch.py`` uses an underlying template (:ref:`slurm-template.sh <slurm-template>`) and fills out placeholders given user input.
Feel free to copy both files into your cluster for use, and to open PRs with contributions that improve this script!
Usage example
~~~~~~~~~~~~~
If you want to utilize a multi-node cluster in slurm:
.. code-block:: bash
python slurm-launch.py --exp-name test --command "python your_file.py" --num-nodes 3
If you want to specify the computing node(s), just use the same node name(s) in the same format as the output of the ``sinfo`` command:
.. code-block:: bash
python slurm-launch.py --exp-name test --command "python your_file.py" --num-nodes 3 --node NODE_NAMES
There are other options you can use when calling ``python slurm-launch.py``:
* ``--exp-name``: The experiment name. Will generate ``{exp-name}_{date}-{time}.sh`` and ``{exp-name}_{date}-{time}.log``.
* ``--command``: The command you wish to run. For example: ``rllib train XXX`` or ``python XXX.py``.
* ``--num-gpus``: The number of GPUs you wish to use in each computing node. Default: 0.
* ``--node`` (``-w``): The specific nodes you wish to use, in the same form as the output of ``sinfo``. Nodes are automatically assigned if not specified.
* ``--num-nodes`` (``-n``): The number of nodes you wish to use. Default: 1.
* ``--partition`` (``-p``): The partition you wish to use. Default: "", will use user's default partition.
* ``--load-env``: The command to setup your environment. For example: ``module load cuda/10.1``. Default: "".
Note that :ref:`slurm-template.sh <slurm-template>` is compatible with both IPv4 and IPv6 IP addresses of the computing nodes.
Implementation
~~~~~~~~~~~~~~
Concretely, :ref:`slurm-launch.py <slurm-launch>` does the following things:

1. It automatically writes your requirements, e.g. the number of CPUs and GPUs per node, the number of nodes, and so on, to an sbatch script named ``{exp-name}_{date}-{time}.sh``. Your command (``--command``) to launch your own job is also written into the sbatch script.
2. It then submits the sbatch script to the SLURM manager via a new process.
3. Finally, the Python process terminates itself, leaving a log file named ``{exp-name}_{date}-{time}.log`` to record the progress of your submitted command. In the meantime, the Ray cluster and your job run in the SLURM cluster.
Examples and templates
----------------------
Here are some community-contributed templates for using SLURM with Ray:
- `Ray sbatch submission scripts`_ used at `NERSC <https://www.nersc.gov/>`_, a US national lab.
- `YASPI`_ (yet another slurm python interface) by @albanie. The goal of yaspi is to provide an interface to submitting slurm jobs, thereby obviating the joys of sbatch files. It does so through recipes - these are collections of templates and rules for generating sbatch scripts. Supports job submissions for Ray.
- `Convenient python interface`_ to launch ray cluster and submit task by @pengzhenghao
.. _`Ray sbatch submission scripts`: https://github.com/NERSC/slurm-ray-cluster
.. _`YASPI`: https://github.com/albanie/yaspi
.. _`Convenient python interface`: https://github.com/pengzhenghao/use-ray-with-slurm

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# YARN

View file

@ -0,0 +1,199 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-yarn-deploy-under-construction:
Deploying on YARN
=================
.. warning::
Running Ray on YARN is still a work in progress. If you have a
suggestion for how to improve this documentation or want to request
a missing feature, please feel free to create a pull request or get in touch
using one of the channels in the `Questions or Issues?`_ section below.
This document assumes that you have access to a YARN cluster and will walk
you through using `Skein`_ to deploy a YARN job that starts a Ray cluster and
runs an example script on it.
Skein uses a declarative specification (either written as a yaml file or using the Python API) and allows users to launch jobs and scale applications without the need to write Java code.
You will first need to install Skein: ``pip install skein``.
The Skein ``yaml`` file and example Ray program used here are provided in the
`Ray repository`_ to get you started. Refer to the provided ``yaml``
files to be sure that you maintain important configuration options for Ray to
function properly.
.. _`Ray repository`: https://github.com/ray-project/ray/tree/master/doc/yarn
Skein Configuration
-------------------
A Ray job is configured to run as two `Skein services`:
1. The ``ray-head`` service that starts the Ray head node and then runs the
application.
2. The ``ray-worker`` service that starts worker nodes that join the Ray cluster.
You can change the number of instances in this configuration or at runtime
using ``skein container scale`` to scale the cluster up/down.
The specification for each service consists of necessary files and commands that will be run to start the service.
.. code-block:: yaml

    services:
        ray-head:
            # There should only be one instance of the head node per cluster.
            instances: 1
            resources:
                # The resources for the head node.
                vcores: 1
                memory: 2048
            files:
                ...
            script:
                ...
        ray-worker:
            # Number of ray worker nodes to start initially.
            # This can be scaled using 'skein container scale'.
            instances: 3
            resources:
                # The resources for the worker node.
                vcores: 1
                memory: 2048
            files:
                ...
            script:
                ...
Packaging Dependencies
----------------------
Use the ``files`` option to specify files that will be copied into the YARN container for the application to use. See `the Skein file distribution page <https://jcrist.github.io/skein/distributing-files.html>`_ for more information.
.. code-block:: yaml

    services:
        ray-head:
            # There should only be one instance of the head node per cluster.
            instances: 1
            resources:
                # The resources for the head node.
                vcores: 1
                memory: 2048
            files:
                # ray/doc/yarn/example.py
                example.py: example.py
                # # A packaged python environment using `conda-pack`. Note that Skein
                # # doesn't require any specific way of distributing files, but this
                # # is a good one for python projects. This is optional.
                # # See https://jcrist.github.io/skein/distributing-files.html
                # environment: environment.tar.gz
Ray Setup in YARN
-----------------
Below is a walkthrough of the bash commands used to start the ``ray-head`` and ``ray-worker`` services. Note that this configuration will launch a new Ray cluster for each application, not reuse the same cluster.
Head node commands
~~~~~~~~~~~~~~~~~~
Start by activating a pre-existing environment for dependency management.
.. code-block:: bash
source environment/bin/activate
Register the Ray head address needed by the workers in the Skein key-value store.
.. code-block:: bash
skein kv put --key=RAY_HEAD_ADDRESS --value=$(hostname -i) current
Start all the processes needed on the ray head node. By default, we set object store memory
and heap memory to roughly 200 MB. This is conservative and should be set according to application needs.
.. code-block:: bash
ray start --head --port=6379 --object-store-memory=200000000 --memory 200000000 --num-cpus=1
Execute the user script containing the Ray program.
.. code-block:: bash
python example.py
Clean up all started processes even if the application fails or is killed.
.. code-block:: bash
ray stop
skein application shutdown current
Putting things together, we have:
.. literalinclude:: /../yarn/ray-skein.yaml
:language: yaml
:start-after: # Head service
:end-before: # Worker service
Worker node commands
~~~~~~~~~~~~~~~~~~~~
Fetch the address of the head node from the Skein key-value store.
.. code-block:: bash
RAY_HEAD_ADDRESS=$(skein kv get current --key=RAY_HEAD_ADDRESS)
Start all of the processes needed on a ray worker node, blocking until killed by Skein/YARN via SIGTERM. After receiving SIGTERM, all started processes should also die (ray stop).
.. code-block:: bash
ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:6379 --block; ray stop
Putting things together, we have:
.. literalinclude:: /../yarn/ray-skein.yaml
:language: yaml
:start-after: # Worker service
Running a Job
-------------
Within your Ray script, use the following to connect to the started Ray cluster:
.. literalinclude:: /../yarn/example.py
:language: python
:start-after: if __name__ == "__main__"
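A minimal sketch of what that connection logic typically looks like is shown below; the real script lives at ``doc/yarn/example.py`` in the Ray repository and may differ:

.. code-block:: python

    import ray

    if __name__ == "__main__":
        # The ray-head service already ran `ray start --head --port=6379`,
        # so connecting with address="auto" attaches to that running cluster.
        ray.init(address="auto")
        print("Cluster resources:", ray.cluster_resources())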
You can use the following command to launch the application as specified by the Skein YAML file.
.. code-block:: bash
skein application submit [TEST.YAML]
Once it has been submitted, you can see the job running on the YARN dashboard.
.. image:: /images/yarn-job.png
Cleaning Up
-----------
To clean up a running job, use the following (using the application ID):
.. code-block:: bash
skein application shutdown $appid
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst
.. _`Skein`: https://jcrist.github.io/skein/

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Configuring autoscaling

View file

@ -0,0 +1,56 @@
.. include:: /_includes/clusters/we_are_hiring.rst
.. _deployment-guide-autoscaler-under-construction:
Autoscaling with Ray
--------------------
Ray is designed to support highly elastic workloads which are most efficient on
an autoscaling cluster. At a high level, the autoscaler attempts to
launch/terminate nodes in order to ensure that workloads have sufficient
resources to run, while minimizing the idle resources.
It does this by taking into consideration:
* User specified hard limits (min/max workers).
* User specified node types (nodes in a Ray cluster do *not* have to be
homogeneous).
* Information from the Ray core's scheduling layer about the current resource
usage/demands of the cluster.
* Programmatic autoscaling hints.
Take a look at :ref:`the cluster reference <cluster-config>` to learn more
about configuring the autoscaler.
How does it work?
^^^^^^^^^^^^^^^^^
The Ray Cluster Launcher will automatically enable a load-based autoscaler. The
autoscaler resource demand scheduler will look at the pending task, actor,
and placement group resource demands from the cluster, and try to add the
minimum list of nodes that can fulfill these demands. The autoscaler uses a simple
bin-packing algorithm to pack the user demands into
the available cluster resources. The remaining unfulfilled demands are placed
on the smallest list of nodes that satisfies the demand while maximizing
utilization (starting from the smallest node).
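For intuition, resource demands typically come from tasks, actors, and placement groups that declare their requirements. The sketch below is illustrative only; the autoscaler reacts to whatever pending demands the scheduler reports:

.. code-block:: python

    import ray

    ray.init(address="auto")

    # Each task requests 2 CPUs. If the running nodes cannot hold all pending
    # tasks, the autoscaler sees the unfulfilled demand and adds nodes
    # (up to the configured max_workers), bin-packing the remaining work.
    @ray.remote(num_cpus=2)
    def crunch(x):
        return x * x

    results = ray.get([crunch.remote(i) for i in range(200)])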
**Downscaling**: When worker nodes are
idle (without active Tasks or Actors running on them)
for more than :ref:`idle_timeout_minutes
<cluster-configuration-idle-timeout-minutes>`, they are subject to
removal from the cluster. But there are two important additional conditions
to note:
* The head node is never removed unless the cluster is torn down.
* If the Ray Object Store is used, and a Worker node still holds objects (including spilled objects on disk), it won't be removed.
**Here is "A Glimpse into the Ray Autoscaler" and how to debug/monitor your cluster:**
2021-19-01 by Ameer Haj-Ali, Anyscale Inc.
.. youtube:: BJ06eJasdu4

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Best practices for deploying large clusters

View file

@ -0,0 +1,129 @@
.. include:: /_includes/clusters/we_are_hiring.rst
Best practices for deploying large clusters
-------------------------------------------
This section aims to document best practices for deploying Ray clusters at
large scale.
Networking configuration
^^^^^^^^^^^^^^^^^^^^^^^^
End users should only need to directly interact with the head node of the
cluster. In particular, there are 2 services which should be exposed to users:
1. The dashboard
2. The Ray client server
.. note::
While users only need 2 ports to connect to a cluster, the nodes within a
cluster require a much wider range of ports to communicate.
See :ref:`Ray port configuration <Ray-ports>` for a comprehensive list.
Applications (such as :ref:`Ray Serve <Rayserve>`) may also require
additional ports to work properly.
System configuration
^^^^^^^^^^^^^^^^^^^^
There are a few system level configurations that should be set when using Ray
at a large scale.
* Make sure ``ulimit -n`` is set to at least 65535. Ray opens many direct
connections between worker processes to avoid bottlenecks, so it can quickly
use a large number of file descriptors.
* Make sure ``/dev/shm`` is sufficiently large. Most ML/RL applications rely
heavily on the plasma store. By default, Ray will try to use ``/dev/shm`` for
the object store, but if it is not large enough (i.e. ``--object-store-memory``
> size of ``/dev/shm``), Ray will write the plasma store to disk instead, which
may cause significant performance problems.
* Use NVMe SSDs (or other high-performance storage) if possible. If
:ref:`object spilling <object-spilling>` is enabled, Ray will spill objects to
disk if necessary. This is most commonly needed for data processing
workloads.
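If you want to sanity-check these settings on a node, a short script using only
the Python standard library can report both values. This is just a sketch, not
something Ray requires you to run:

.. code-block:: python

    import resource
    import shutil

    # File descriptor limit: the soft limit should be at least 65535.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file limit: soft={soft}, hard={hard}")

    # /dev/shm size: should comfortably exceed --object-store-memory.
    shm = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm total size: {shm.total / 1e9:.1f} GB")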
Configuring the head node
^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to the above changes, when deploying a large cluster, Ray's
architecture means that the head node will be under extra stress due to the GCS.
* Make sure the head node has sufficient bandwidth. The most heavily stressed
resource on the head node is outbound bandwidth. For large clusters (see the
scalability envelope), we recommend using machines with networking characteristics
at least as good as an r5dn.16xlarge on AWS EC2.
* Set ``resources: {"CPU": 0}`` on the head node. (For Ray clusters deployed using Helm,
set ``rayResources: {"CPU": 0}``.) Due to the heavy networking
load (and the GCS and dashboard processes), we recommend setting the number of
CPUs to 0 on the head node to avoid scheduling additional tasks on it.
Configuring the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^^^
For large, long running clusters, there are a few parameters that can be tuned.
* Ensure your quotas for node types are set correctly.
* For long running clusters, set the ``AUTOSCALER_MAX_NUM_FAILURES`` environment
variable to a large number (or ``inf``) to avoid unexpected autoscaler
crashes. The variable can be set by prepending \ ``export AUTOSCALER_MAX_NUM_FAILURES=inf;``
to the head node's Ray start command.
(Note: you may want a separate mechanism to detect if the autoscaler
errors too often).
* For large clusters, consider tuning ``upscaling_speed`` for faster
autoscaling.
Picking nodes
^^^^^^^^^^^^^
Here are some tips for how to set your ``available_node_types`` for a cluster,
using AWS instance types as a concrete example.
General recommendations with AWS instance types:
**When to use GPUs**
* If you're using an RL/ML framework
* You're doing something with tensorflow/pytorch/jax (some framework that can
leverage GPUs well)
**What type of GPU?**
* The latest-gen GPU is almost always the best bang for your buck (p3 > p2, g4
> g3); for most well-designed applications the performance outweighs the
price (the instance price may be higher, but you'll use the instance for less
time).
* You may want to consider using older instances if you're doing dev work and
won't actually fully utilize the GPUs, though.
* If you're doing training (ML or RL), you should use a P instance. If you're
doing inference, you should use a G instance. The difference is the
processing:VRAM ratio (training requires more memory).
**What type of CPU?**
* Again, stick to the latest generation; they're typically cheaper and faster.
* When in doubt, use M instances; they typically have the highest
availability.
* If you know your application is memory intensive (memory utilization is full,
but CPU is not), go with an R instance.
* If you know your application is CPU intensive, go with a C instance.
* If you have a big cluster, make the head node an instance with an n (r5dn or
c5n).
**How many CPUs/GPUs?**
* Focus on your CPU:GPU ratio first and look at the utilization (the Ray
dashboard should help with this; see the sketch below). If your CPU utilization
is low, add GPUs, or vice versa.
* The exact ratio will be very dependent on your workload.
* Once you find a good ratio, you should be able to scale up and keep the
same ratio.
* You can't scale infinitely. Eventually, as you add more machines, your
performance improvements will become sub-linear/not worth it. There may not
be a good one-size-fits-all strategy at this point.
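For a rough, programmatic look at the CPU:GPU ratio and current utilization,
you can compare the cluster's total and available resources from a driver.
This is a sketch; the Ray dashboard remains the easier way to watch utilization
over time:

.. code-block:: python

    import ray

    ray.init(address="auto")

    total = ray.cluster_resources()
    avail = ray.available_resources()

    # Print how much of each resource is currently in use.
    for res in ("CPU", "GPU"):
        if res in total:
            used = total[res] - avail.get(res, 0)
            print(f"{res}: {used:.0f} / {total[res]:.0f} in use")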
.. note::
If you're using RLlib, check out :ref:`the RLlib scaling guide
<rllib-scaling-guide>` for RLlib specific recommendations.

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Adding your own cloud provider

View file

@ -0,0 +1,9 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
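For a rough sense of the shape of the interface, here is a hypothetical,
partial sketch of a custom provider. The method names follow the
``NodeProvider`` interface, but the bodies and the imaginary cloud client are
illustrative only:

.. code-block:: python

    from ray.autoscaler.node_provider import NodeProvider


    class MyCloudNodeProvider(NodeProvider):
        """A hypothetical provider backed by an imaginary cloud SDK."""

        def __init__(self, provider_config, cluster_name):
            super().__init__(provider_config, cluster_name)
            self.client = ...  # your cloud SDK client goes here (illustrative)

        def non_terminated_nodes(self, tag_filters):
            # Return IDs of running nodes whose tags match tag_filters.
            ...

        def create_node(self, node_config, tags, count):
            # Launch `count` nodes with the given config and attach `tags`.
            ...

        def terminate_node(self, node_id):
            # Terminate a single node.
            ...

        def node_tags(self, node_id):
            # Return the tags previously attached to this node.
            ...

        def internal_ip(self, node_id):
            # Return the IP that other cluster nodes can use to reach this node.
            ...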

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# AWS

View file

@ -0,0 +1,573 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-aws:
Launching Ray Clusters on AWS
=============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-aws:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-aws:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
.. _aws-cluster-under-construction:
AWS Configurations
==================
.. _aws-cluster-efs-under-construction:
Using Amazon EFS
----------------
To use Amazon EFS, install some utilities and mount the EFS in ``setup_commands``. Note that these instructions only work if you are using the AWS Autoscaler.
.. note::
You need to replace the ``{{FileSystemId}}`` to your own EFS ID before using the config. You may also need to set correct ``SecurityGroupIds`` for the instances in the config file.
.. code-block:: yaml

    setup_commands:
        - sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
          sudo pkill -9 apt-get;
          sudo pkill -9 dpkg;
          sudo dpkg --configure -a;
          sudo apt-get -y install binutils;
          cd $HOME;
          git clone https://github.com/aws/efs-utils;
          cd $HOME/efs-utils;
          ./build-deb.sh;
          sudo apt-get -y install ./build/amazon-efs-utils*deb;
          cd $HOME;
          mkdir efs;
          sudo mount -t efs {{FileSystemId}}:/ efs;
          sudo chmod 777 efs;
.. _aws-cluster-s3-under-construction:
Configure worker nodes to access Amazon S3
------------------------------------------
In various scenarios, worker nodes may need write access to the S3 bucket.
E.g. Ray Tune has the option that worker nodes write distributed checkpoints to S3 instead of syncing back to the driver using rsync.
If you see errors like "Unable to locate credentials", make sure that the correct ``IamInstanceProfile`` is configured for worker nodes in ``cluster.yaml`` file.
This may look like:
.. code-block:: text

    worker_nodes:
        InstanceType: m5.xlarge
        ImageId: latest_dlami
        IamInstanceProfile:
            Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE
You can verify that the setup is correct by logging in to a worker node and running
.. code-block:: bash
aws configure list
You should see something like
.. code-block:: text
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************XXXX iam-role
secret_key ****************YYYY iam-role
region <not set> None None
Please refer to `this discussion <https://github.com/ray-project/ray/issues/9327>`__ for more details.
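Alternatively, you can check access programmatically from a Python shell on the
worker node. This is a sketch; it assumes ``boto3`` is installed and that
``your-bucket-name`` is replaced by a bucket the instance profile is expected
to reach:

.. code-block:: python

    import boto3

    s3 = boto3.client("s3")

    # With a correctly attached instance profile this call succeeds without
    # explicit credentials; otherwise it raises a NoCredentialsError or an
    # AccessDenied client error.
    response = s3.list_objects_v2(Bucket="your-bucket-name", MaxKeys=1)
    print(response["ResponseMetadata"]["HTTPStatusCode"])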
.. _aws-cluster-cloudwatch-under-construction:
Using Amazon CloudWatch
=======================
Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed.
AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions.
Please direct any questions, comments, or issues to the `Amazon Ray Team <https://github.com/amzn/amazon-ray/issues/new/choose>`_.
The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `amazon-ray README <https://github.com/amzn/amazon-ray>`_.
.. list-table:: All available unified CloudWatch agent images
* - Base AMI
- AMI ID
- Region
- Unified CloudWatch Agent Version
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-069f2811478f86c20
- us-east-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-058cc0932940c2b8b
- us-east-2
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-044f95c9ef12883ef
- us-west-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-0d88d9cbe28fac870
- us-west-2
- v1.247348.0b251302
.. note::
Using Amazon CloudWatch will incur charges, please refer to `CloudWatch pricing <https://aws.amazon.com/cloudwatch/pricing/>`_ for details.
Getting started
---------------
1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: yaml

    provider:
        type: aws
        region: us-west-2
        availability_zone: us-west-2a
        # Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster.
        cloudwatch:
            agent:
                # Path to Unified CloudWatch Agent config file
                config: "cloudwatch/example-cloudwatch-agent-config.json"
            dashboard:
                # CloudWatch Dashboard name
                name: "example-dashboard-name"
                # Path to the CloudWatch Dashboard config file
                config: "cloudwatch/example-cloudwatch-dashboard-config.json"

    auth:
        ssh_user: ubuntu

    available_node_types:
        ray.head.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870  # Unified CloudWatch agent pre-installed AMI, us-west-2
            resources: {}
        ray.worker.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870  # Unified CloudWatch agent pre-installed AMI, us-west-2
                IamInstanceProfile:
                    Name: ray-autoscaler-cloudwatch-v1
            resources: {}
            min_workers: 0
2. Download CloudWatch Agent and Dashboard config.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``.
Then, download the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_ and `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ config files to the ``cloudwatch`` directory.
.. code-block:: console
$ mkdir cloudwatch
$ cd cloudwatch
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json
3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This will launch your Ray cluster in ``us-west-2`` by default. When launching a cluster for a different region, you'll need to change your cluster config YAML file's ``region`` AND ``ImageId``.
See the "Unified CloudWatch Agent Images" table above for available AMIs by region.
4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console <https://console.aws.amazon.com/cloudwatch/>`_!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can tail all logs written to a CloudWatch log group by ensuring that you have the `AWS CLI V2+ installed <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_ and then running:
.. code-block:: bash
aws logs tail $log_group_name --follow
Advanced Setup
--------------
Refer to `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ for a complete example.
1. Choose an AMI with the Unified CloudWatch Agent pre-installed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ensure that you're launching your Ray EC2 cluster in the same region as the AMI,
then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file.
The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``:
.. code-block:: bash
aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1
.. code-block:: yaml

    available_node_types:
        ray.head.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870
        ray.worker.default:
            node_config:
                InstanceType: c5a.large
                ImageId: ami-0d88d9cbe28fac870
To build your own AMI with the Unified CloudWatch Agent installed:
1. Follow the `CloudWatch Agent Installation <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html>`_ user guide to install the Unified CloudWatch Agent on an EC2 instance.
2. Follow the `EC2 AMI Creation <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami>`_ user guide to create an AMI from this EC2 instance.
2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can start by using the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_, `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ and `CloudWatch Alarm <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-alarm-config.json>`_ config files.
These example config files include the following features:
**Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group,
and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group.
Log streams are named after the EC2 instance ID that emitted their logs.
Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace.
**Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory.
Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages.
**Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set,
and must be manually provided in your alarm config file.
For more advanced options, see the `Agent <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_, `Dashboard <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html>`_ and `Alarm <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html>`_ config user guides.
CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables:
``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster.
``{region}``: Replaced with your Ray cluster's region.
``{cluster_name}``: Replaced with your Ray cluster name.
See CloudWatch Agent `Configuration File Details <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_ for additional variables supported natively by the Unified CloudWatch Agent.
.. note::
Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file!
.. code-block:: json

    "AlarmActions":[
        "TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"
    ]
3. Reference your CloudWatch JSON config files in your cluster config YAML.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from:
.. code-block:: yaml

    provider:
        cloudwatch:
            agent:
                config: "cloudwatch/example-cloudwatch-agent-config.json"
4. Set your IAM Role and EC2 Instance Profile.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile are created at Ray cluster launch time.
This role contains all additional permissions required to integrate CloudWatch with Ray, namely the ``CloudWatchAgentAdminPolicy``, ``AmazonSSMManagedInstanceCore``, ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` managed policies.
Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML:
.. code-block:: yaml

    ray.worker.default:
        node_config:
            InstanceType: c5a.large
            IamInstanceProfile:
                Name: ray-autoscaler-cloudwatch-v1
5. Export Ray system metrics to CloudWatch.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the
Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in the `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ file.
You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace.
.. code-block:: yaml
head_setup_commands:
# Make `ray_prometheus_waiter.sh` executable.
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
# Copy `prometheus.yml` to Unified CloudWatch Agent folder
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc
# First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store.
- >-
nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
`cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'`
>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" &
6. Update CloudWatch Agent, Dashboard and Alarm config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``.
The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied.
What's Next?
============
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
====================
.. include:: /_includes/_help.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Azure

View file

@ -0,0 +1,257 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-azure:
Launching Ray Clusters on Azure
===============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-azure:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-azure:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# GCP

View file

@ -0,0 +1,257 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-gcp:
Launching Ray Clusters on GCP
=============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-gcp:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-gcp:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python

    import time

    import ray

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Launching a Ray Cluster on Cloud VMs

View file

@ -0,0 +1,12 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. toctree::
:maxdepth: 2
aws.rst
gcp.rst
azure.rst
add-your-own-cloud-provider.rst

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Manual cluster setup

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Monitoring and Observing a Ray Cluster

View file

@ -0,0 +1,56 @@
.. include:: /_includes/clusters/we_are_hiring.rst
Monitoring and observability
----------------------------
Ray comes with 3 main observability features:
1. :ref:`The dashboard <Ray-dashboard>`
2. :ref:`ray status <monitor-cluster>`
3. :ref:`Prometheus metrics <multi-node-metrics>`
Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:ref:`The dashboard provides detailed information about the state of the cluster <Ray-dashboard>`,
including the running jobs, actors, workers, nodes, etc.
By default, the cluster launcher and operator will launch the dashboard, but
not publicly expose it.
If you launch your application via the cluster launcher, you can securely
port-forward local traffic to the dashboard via the ``ray dashboard`` command
(which establishes an SSH tunnel). The dashboard will then be visible at
``http://localhost:8265``.
The Kubernetes Operator makes the dashboard available via a Service targeting the Ray head pod.
You can :ref:`access the dashboard <ray-k8s-dashboard>` using ``kubectl port-forward``.
Observing the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^
The autoscaler makes decisions based on scheduling information and programmatic
information from the cluster. This information, along with the status of
starting nodes, can be accessed via the ``ray status`` command.
To dump the current state of a cluster launched via the cluster launcher, you
can run ``ray exec cluster.yaml "ray status"``.
For a more "live" monitoring experience, it is recommended that you run ``ray
status`` in a watch loop: ``ray exec cluster.yaml "watch -n 1 ray status"``.
With the Kubernetes operator, you should replace ``ray exec cluster.yaml`` with
``kubectl exec <head node pod>``.
Prometheus metrics
^^^^^^^^^^^^^^^^^^
Ray is capable of producing Prometheus metrics. When enabled, Ray produces some
metrics about Ray core and some internal metrics by default. It also
supports custom, user-defined metrics.
These metrics can be consumed by any metrics infrastructure that can ingest
metrics from the Prometheus server on the head node of the cluster.
:ref:`Learn more about setting up prometheus here. <multi-node-metrics>`
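As a small illustration of the custom metrics mentioned above, the following sketch defines a counter inside an actor using the ``ray.util.metrics`` API (assuming it is available in your Ray version); once metrics export is enabled, the counter is scraped along with Ray's built-in metrics.

.. code-block:: python

    import ray
    from ray.util.metrics import Counter

    ray.init()

    @ray.remote
    class RequestHandler:
        def __init__(self):
            # A custom, user-defined metric exported alongside Ray's built-in metrics.
            self.num_requests = Counter(
                "num_requests",
                description="Number of requests handled by this actor.",
            )

        def handle(self):
            self.num_requests.inc()

    handler = RequestHandler.remote()
    ray.get(handler.handle.remote())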
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Running jobs
View file
@ -0,0 +1,25 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-deployment-guide-under-construction:
Deployment Guide
================
This section explains how to set up a distributed Ray cluster and run your workloads on it.
To set up your cluster, check out the :ref:`Ray Cluster Overview <cluster-index>`, or jump to the :ref:`Ray Cluster Quick Start <ref-cluster-quick-start>`.
To trigger a Ray workload from your local machine, a CI system, or a third-party job scheduler/orchestrator via a command line interface or API call, try :ref:`Ray Job Submission <jobs-overview>`.
To run an interactive Ray workload and see the output in real time in a client of your choice (e.g. your local machine, SageMaker Studio, or Google Colab), you can use :ref:`Ray Client <ray-client>`.
.. toctree::
:maxdepth: 2
job-submission-cli.rst
job-submission-rest.rst
job-submission-sdk.rst
ray-client.rst
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Submit jobs via the CLI
View file
@ -0,0 +1,385 @@
.. warning::
This page is under construction!
.. _jobs-overview-under-construction-cli:
==================
Ray Job Submission
==================
.. note::
This component is in **beta**. APIs may change before becoming stable. This feature requires a full installation of Ray using ``pip install "ray[default]"``.
Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
Jump to the :ref:`API Reference<ray-job-submission-api-ref>`, or continue reading for a quick overview.
Concepts
--------
- **Job**: A Ray application submitted to a Ray cluster for execution. Consists of (1) an entrypoint command and (2) a :ref:`runtime environment<runtime-environments>`, which may contain file and package dependencies.
- **Job Lifecycle**: When a job is submitted, it runs once to completion or failure. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated.
- **Job Manager**: An entity external to the Ray cluster that manages the lifecycle of a job (scheduling, killing, polling status, getting logs, and persisting inputs/outputs), and potentially also manages the lifecycle of Ray clusters. Can be any third-party framework with these abilities, such as Apache Airflow or Kubernetes Jobs.
Quick Start Example
-------------------
Let's start with a sample job that can be run locally. The following script uses Ray APIs to increment a counter and print its value, and print the version of the ``requests`` module it's using:
.. code-block:: python
# script.py
import ray
import requests
ray.init()
@ray.remote
class Counter:
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
def get_counter(self):
return self.counter
counter = Counter.remote()
for _ in range(5):
ray.get(counter.inc.remote())
print(ray.get(counter.get_counter.remote()))
print(requests.__version__)
Put this file in a local directory of your choice, with filename ``script.py``, so your working directory will look like:
.. code-block:: bash
| your_working_directory ("./")
| ├── script.py
Next, start a local Ray cluster:
.. code-block:: bash
ray start --head
Local node IP: 127.0.0.1
INFO services.py:1360 -- View the Ray dashboard at http://127.0.0.1:8265
Note the address and port returned in the terminal---this will be where we submit job requests to, as explained further in the examples below. If you do not see this, ensure the Ray Dashboard is installed by running :code:`pip install "ray[default]"`.
At this point, the job is ready to be submitted by one of the :ref:`Ray Job APIs<ray-job-apis>`.
Continue on to see examples of running and interacting with this sample job.
.. _ray-job-apis-under-construction-cli:
Ray Job Submission APIs
-----------------------
Ray provides three APIs for job submission:
* A :ref:`command line interface<ray-job-cli>`, the easiest way to get started.
* A :ref:`Python SDK<ray-job-sdk>`, the recommended way to submit jobs programmatically.
* An :ref:`HTTP REST API<ray-job-rest-api>`. Both the CLI and SDK call into the REST API under the hood.
All three APIs for job submission share the following key inputs:
* **Entrypoint**: The shell command to run the job.
* Example: :code:`python my_ray_script.py`
* Example: :code:`echo hello`
* **Runtime Environment**: Specifies files, packages, and other dependencies for your job. See :ref:`Runtime Environments<runtime-environments>` for details.
* Example: ``{working_dir="/data/my_files", pip=["requests", "pendulum==2.1.2"]}``
* Of special note: the field :code:`working_dir` specifies the files your job needs to run. The entrypoint command will be run in the remote cluster's copy of the `working_dir`, so for the entrypoint ``python my_ray_script.py``, the file ``my_ray_script.py`` must be in the directory specified by ``working_dir``.
* If :code:`working_dir` is a local directory: It will be automatically zipped and uploaded to the target Ray cluster, then unpacked to where your submitted application runs. This option has a size limit of 100 MB and is recommended for rapid iteration and experimentation.
* If :code:`working_dir` is a remote URI hosted on S3, GitHub or others: It will be downloaded and unpacked to where your submitted application runs. This option has no size limit and is recommended for production use. For details, see :ref:`remote-uris`.
.. _ray-job-cli-under-construction-cli:
CLI
^^^
The easiest way to get started with Ray job submission is to use the Job Submission CLI.
Jump to the :ref:`API Reference<ray-job-submission-cli-ref>`, or continue reading for a walkthrough.
Using the CLI on a local cluster
""""""""""""""""""""""""""""""""
First, start a local Ray cluster (e.g. with ``ray start --head``) and open a terminal (on the head node, which is your local machine).
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
This tells the jobs CLI how to find your Ray cluster. Here we are specifying port ``8265`` on the head node, the port that the Ray Dashboard listens on.
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you are ready to use the CLI.
Here are some examples of CLI commands from the Quick Start example and their output:
.. code-block::
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py
2021-12-01 23:04:52,672 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:04:52,809 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip.
2021-12-01 23:04:52,810 INFO packaging.py:352 -- Creating a file package for local directory './'.
2021-12-01 23:04:52,878 INFO cli.py:105 -- Job submitted successfully: raysubmit_RXhvSyEPbxhcXtm6.
2021-12-01 23:04:52,878 INFO cli.py:106 -- Query the status of the job using: `ray job status raysubmit_RXhvSyEPbxhcXtm6`.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:00,356 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:00,371 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': PENDING.
2021-12-01 23:05:00,371 INFO cli.py:129 -- Job has not started yet, likely waiting for the runtime_env to be set up.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:37,751 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:37,764 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': SUCCEEDED.
2021-12-01 23:05:37,764 INFO cli.py:129 -- Job finished successfully.
ray job logs raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:59,026 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:23,037 INFO worker.py:851 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
(pid=runtime_env) 2021-12-01 23:05:23,212 WARNING conda.py:54 -- Injecting /Users/jiaodong/Workspace/ray/python to environment /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5 because _inject_current_ray flag is on.
(pid=runtime_env) 2021-12-01 23:05:23,212 INFO conda.py:328 -- Finished setting up runtime environment at /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5
(pid=runtime_env) 2021-12-01 23:05:23,213 INFO working_dir.py:85 -- Setup working dir for gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip
1
2
3
4
5
2.26.0
ray job list
{'raysubmit_AYhLMgDJ6XBQFvFP': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908622, end_time=1645908623, metadata={}, runtime_env={}),
'raysubmit_su9UcdUviUZ86b1t': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908669, end_time=1645908670, metadata={}, runtime_env={})}
.. warning::
When using the CLI, do not wrap the entrypoint command in quotes. For example, use
``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``.
Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``.
.. tip::
If your job is stuck in `PENDING`, the runtime environment installation may be stuck.
(For example, the `pip` installation or `working_dir` download may be stalled due to internet issues.)
You can check the installation logs at `/tmp/ray/session_latest/logs/runtime_env_setup-*.log` for details.
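For example, you could follow those logs on the head node while the environment is being installed (the exact file name varies per job):

.. code-block:: bash

    tail -f /tmp/ray/session_latest/logs/runtime_env_setup-*.log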
Using the CLI on a remote cluster
"""""""""""""""""""""""""""""""""
Above, we ran the "Quick Start" example on a local Ray cluster. When connecting to a `remote` cluster via the CLI, you need to be able to access the Ray Dashboard port of the cluster over HTTP.
One way to do this is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node.
If you started your remote cluster with the :ref:`Ray Cluster Launcher <ref-cluster-quick-start>`, then the port forwarding can be set up automatically using the ``ray dashboard`` command (see :ref:`monitor-cluster` for details).
To use this, run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster:
.. code-block:: bash
ray dashboard cluster.yaml
Once this is running, check that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``.
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you will be able to use the Jobs CLI on your local machine as in the example above to interact with your remote Ray cluster.
Using the CLI on Kubernetes
"""""""""""""""""""""""""""
The instructions above still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/
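For example, if the Ray head node runs in a pod named ``ray-head`` (a hypothetical name; substitute your own head pod or service), the port forwarding could look like this:

.. code-block:: bash

    # Forward local port 8265 to the Ray Dashboard port on the head pod.
    kubectl port-forward ray-head 8265:8265

    # In another terminal, submit jobs as before.
    export RAY_ADDRESS="http://127.0.0.1:8265"
    ray job submit -- echo hello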
.. _ray-job-sdk-under-construction-cli:
Python SDK
^^^^^^^^^^
The Job Submission Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference<ray-job-submission-sdk-ref>`, or continue reading for a quick overview.
SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses.
.. code-block:: python
from ray.job_submission import JobSubmissionClient
# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
Then we can submit our application to the Ray cluster via the Job SDK.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python script.py",
# Runtime environment for the job, specifying a working directory and pip package
runtime_env={
"working_dir": "./",
"pip": ["requests==2.26.0"]
}
)
.. tip::
By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`.
In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster.
Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``), and gets the logs at the end.
We expect to see the numbers printed from our actor, as well as the correct version of the :code:`requests` module specified in the ``runtime_env``.
.. code-block:: python
from ray.job_submission import JobStatus
import time
def wait_until_finish(job_id):
start = time.time()
timeout = 5
while time.time() - start <= timeout:
status = client.get_job_status(job_id)
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
The output should be as follows:
.. code-block:: bash
status: JobStatus.PENDING
status: JobStatus.RUNNING
status: JobStatus.SUCCEEDED
1
2
3
4
5
2.26.0
.. tip::
Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details.
A submitted job can be stopped by the user before it finishes executing.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python -c 'import time; time.sleep(60)'",
runtime_env={}
)
wait_until_finish(job_id)
client.stop_job(job_id)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information.
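For example, using the return type described above, you could print the status of every job as follows:

.. code-block:: python

    jobs = client.list_jobs()
    for job_id, info in jobs.items():
        print(job_id, info.status)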
For full details, see the :ref:`API Reference<ray-job-submission-sdk-ref>`.
.. _ray-job-rest-api-under-construction-cli:
REST API
^^^^^^^^
Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed:
**Submit Job**
.. code-block:: python
import requests
import json
import time
from ray.job_submission import JobStatus  # used below when polling for a terminal job state
resp = requests.post(
"http://127.0.0.1:8265/api/jobs/",
json={
"entrypoint": "echo hello",
"runtime_env": {},
"job_id": None,
"metadata": {"job_submission_id": "123"}
}
)
rst = json.loads(resp.text)
job_id = rst["job_id"]
**Query and poll for Job status**
.. code-block:: python
start = time.time()
while time.time() - start <= 10:
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>"
)
rst = json.loads(resp.text)
status = rst["status"]
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
**Query for logs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>/logs"
)
rst = json.loads(resp.text)
logs = rst["logs"]
**List all jobs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/"
)
print(resp.json())
# {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...}
Job Submission Architecture
----------------------------
The following diagram shows the underlying structure and steps for each submitted job.
.. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/job/job_submission_arch_v2.png
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Submit jobs via the REST API
View file
@ -0,0 +1,385 @@
.. warning::
This page is under construction!
.. _jobs-overview-under-construction-rest:
==================
Ray Job Submission
==================
.. note::
This component is in **beta**. APIs may change before becoming stable. This feature requires a full installation of Ray using ``pip install "ray[default]"``.
Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
Jump to the :ref:`API Reference<ray-job-submission-api-ref>`, or continue reading for a quick overview.
Concepts
--------
- **Job**: A Ray application submitted to a Ray cluster for execution. Consists of (1) an entrypoint command and (2) a :ref:`runtime environment<runtime-environments>`, which may contain file and package dependencies.
- **Job Lifecycle**: When a job is submitted, it runs once to completion or failure. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated.
- **Job Manager**: An entity external to the Ray cluster that manages the lifecycle of a job (scheduling, killing, polling status, getting logs, and persisting inputs/outputs), and potentially also manages the lifecycle of Ray clusters. Can be any third-party framework with these abilities, such as Apache Airflow or Kubernetes Jobs.
Quick Start Example
-------------------
Let's start with a sample job that can be run locally. The following script uses Ray APIs to increment a counter and print its value, and print the version of the ``requests`` module it's using:
.. code-block:: python
# script.py
import ray
import requests
ray.init()
@ray.remote
class Counter:
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
def get_counter(self):
return self.counter
counter = Counter.remote()
for _ in range(5):
ray.get(counter.inc.remote())
print(ray.get(counter.get_counter.remote()))
print(requests.__version__)
Put this file in a local directory of your choice, with filename ``script.py``, so your working directory will look like:
.. code-block:: bash
| your_working_directory ("./")
| ├── script.py
Next, start a local Ray cluster:
.. code-block:: bash
ray start --head
Local node IP: 127.0.0.1
INFO services.py:1360 -- View the Ray dashboard at http://127.0.0.1:8265
Note the address and port returned in the terminal---this will be where we submit job requests to, as explained further in the examples below. If you do not see this, ensure the Ray Dashboard is installed by running :code:`pip install "ray[default]"`.
At this point, the job is ready to be submitted by one of the :ref:`Ray Job APIs<ray-job-apis>`.
Continue on to see examples of running and interacting with this sample job.
.. _ray-job-apis-under-construction-rest:
Ray Job Submission APIs
-----------------------
Ray provides three APIs for job submission:
* A :ref:`command line interface<ray-job-cli>`, the easiest way to get started.
* A :ref:`Python SDK<ray-job-sdk>`, the recommended way to submit jobs programmatically.
* An :ref:`HTTP REST API<ray-job-rest-api>`. Both the CLI and SDK call into the REST API under the hood.
All three APIs for job submission share the following key inputs:
* **Entrypoint**: The shell command to run the job.
* Example: :code:`python my_ray_script.py`
* Example: :code:`echo hello`
* **Runtime Environment**: Specifies files, packages, and other dependencies for your job. See :ref:`Runtime Environments<runtime-environments>` for details.
* Example: ``{working_dir="/data/my_files", pip=["requests", "pendulum==2.1.2"]}``
* Of special note: the field :code:`working_dir` specifies the files your job needs to run. The entrypoint command will be run in the remote cluster's copy of the `working_dir`, so for the entrypoint ``python my_ray_script.py``, the file ``my_ray_script.py`` must be in the directory specified by ``working_dir``.
* If :code:`working_dir` is a local directory: It will be automatically zipped and uploaded to the target Ray cluster, then unpacked to where your submitted application runs. This option has a size limit of 100 MB and is recommended for rapid iteration and experimentation.
* If :code:`working_dir` is a remote URI hosted on S3, GitHub or others: It will be downloaded and unpacked to where your submitted application runs. This option has no size limit and is recommended for production use. For details, see :ref:`remote-uris`.
.. _ray-job-cli-under-construction-rest:
CLI
^^^
The easiest way to get started with Ray job submission is to use the Job Submission CLI.
Jump to the :ref:`API Reference<ray-job-submission-cli-ref>`, or continue reading for a walkthrough.
Using the CLI on a local cluster
""""""""""""""""""""""""""""""""
First, start a local Ray cluster (e.g. with ``ray start --head``) and open a terminal (on the head node, which is your local machine).
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
This tells the jobs CLI how to find your Ray cluster. Here we are specifying port ``8265`` on the head node, the port that the Ray Dashboard listens on.
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you are ready to use the CLI.
Here are some examples of CLI commands from the Quick Start example and their output:
.. code-block::
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py
2021-12-01 23:04:52,672 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:04:52,809 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip.
2021-12-01 23:04:52,810 INFO packaging.py:352 -- Creating a file package for local directory './'.
2021-12-01 23:04:52,878 INFO cli.py:105 -- Job submitted successfully: raysubmit_RXhvSyEPbxhcXtm6.
2021-12-01 23:04:52,878 INFO cli.py:106 -- Query the status of the job using: `ray job status raysubmit_RXhvSyEPbxhcXtm6`.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:00,356 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:00,371 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': PENDING.
2021-12-01 23:05:00,371 INFO cli.py:129 -- Job has not started yet, likely waiting for the runtime_env to be set up.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:37,751 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:37,764 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': SUCCEEDED.
2021-12-01 23:05:37,764 INFO cli.py:129 -- Job finished successfully.
ray job logs raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:59,026 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:23,037 INFO worker.py:851 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
(pid=runtime_env) 2021-12-01 23:05:23,212 WARNING conda.py:54 -- Injecting /Users/jiaodong/Workspace/ray/python to environment /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5 because _inject_current_ray flag is on.
(pid=runtime_env) 2021-12-01 23:05:23,212 INFO conda.py:328 -- Finished setting up runtime environment at /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5
(pid=runtime_env) 2021-12-01 23:05:23,213 INFO working_dir.py:85 -- Setup working dir for gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip
1
2
3
4
5
2.26.0
ray job list
{'raysubmit_AYhLMgDJ6XBQFvFP': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908622, end_time=1645908623, metadata={}, runtime_env={}),
'raysubmit_su9UcdUviUZ86b1t': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908669, end_time=1645908670, metadata={}, runtime_env={})}
.. warning::
When using the CLI, do not wrap the entrypoint command in quotes. For example, use
``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``.
Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``.
.. tip::
If your job is stuck in `PENDING`, the runtime environment installation may be stuck.
(For example, the `pip` installation or `working_dir` download may be stalled due to internet issues.)
You can check the installation logs at `/tmp/ray/session_latest/logs/runtime_env_setup-*.log` for details.
Using the CLI on a remote cluster
"""""""""""""""""""""""""""""""""
Above, we ran the "Quick Start" example on a local Ray cluster. When connecting to a `remote` cluster via the CLI, you need to be able to access the Ray Dashboard port of the cluster over HTTP.
One way to do this is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node.
If you started your remote cluster with the :ref:`Ray Cluster Launcher <ref-cluster-quick-start>`, then the port forwarding can be set up automatically using the ``ray dashboard`` command (see :ref:`monitor-cluster` for details).
To use this, run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster:
.. code-block:: bash
ray dashboard cluster.yaml
Once this is running, check that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``.
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you will be able to use the Jobs CLI on your local machine as in the example above to interact with your remote Ray cluster.
Using the CLI on Kubernetes
"""""""""""""""""""""""""""
The instructions above still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/
.. _ray-job-sdk-under-construction-rest:
Python SDK
^^^^^^^^^^
The Job Submission Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference<ray-job-submission-sdk-ref>`, or continue reading for a quick overview.
SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses.
.. code-block:: python
from ray.job_submission import JobSubmissionClient
# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
Then we can submit our application to the Ray cluster via the Job SDK.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python script.py",
# Runtime environment for the job, specifying a working directory and pip package
runtime_env={
"working_dir": "./",
"pip": ["requests==2.26.0"]
}
)
.. tip::
By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`.
In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster.
Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``), and gets the logs at the end.
We expect to see the numbers printed from our actor, as well as the correct version of the :code:`requests` module specified in the ``runtime_env``.
.. code-block:: python
from ray.job_submission import JobStatus
import time
def wait_until_finish(job_id):
start = time.time()
timeout = 5
while time.time() - start <= timeout:
status = client.get_job_status(job_id)
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
The output should be as follows:
.. code-block:: bash
status: JobStatus.PENDING
status: JobStatus.RUNNING
status: JobStatus.SUCCEEDED
1
2
3
4
5
2.26.0
.. tip::
Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details.
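For instance, a working directory hosted on S3 could be passed as follows. The bucket and archive names here are hypothetical, and remote ``working_dir`` URIs are expected to point to a zip archive (see :ref:`remote-uris`):

.. code-block:: python

    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={
            # Hypothetical S3 URI; replace with an archive you control.
            "working_dir": "s3://my-bucket/my_job_files.zip",
        },
    )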
A submitted job can be stopped by the user before it finishes executing.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python -c 'import time; time.sleep(60)'",
runtime_env={}
)
wait_until_finish(job_id)
client.stop_job(job_id)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information.
For full details, see the :ref:`API Reference<ray-job-submission-sdk-ref>`.
.. _ray-job-rest-api-under-construction-rest:
REST API
^^^^^^^^
Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed:
**Submit Job**
.. code-block:: python
import requests
import json
import time
from ray.job_submission import JobStatus  # used below when polling for a terminal job state
resp = requests.post(
"http://127.0.0.1:8265/api/jobs/",
json={
"entrypoint": "echo hello",
"runtime_env": {},
"job_id": None,
"metadata": {"job_submission_id": "123"}
}
)
rst = json.loads(resp.text)
job_id = rst["job_id"]
**Query and poll for Job status**
.. code-block:: python
start = time.time()
while time.time() - start <= 10:
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>"
)
rst = json.loads(resp.text)
status = rst["status"]
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
**Query for logs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>/logs"
)
rst = json.loads(resp.text)
logs = rst["logs"]
**List all jobs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/"
)
print(resp.json())
# {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...}
Job Submission Architecture
----------------------------
The following diagram shows the underlying structure and steps for each submitted job.
.. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/job/job_submission_arch_v2.png
View file
@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Submit jobs via the SDK
View file
@ -0,0 +1,385 @@
.. warning::
This page is under construction!
.. _jobs-overview-under-construction-sdk:
==================
Ray Job Submission
==================
.. note::
This component is in **beta**. APIs may change before becoming stable. This feature requires a full installation of Ray using ``pip install "ray[default]"``.
Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
Jump to the :ref:`API Reference<ray-job-submission-api-ref>`, or continue reading for a quick overview.
Concepts
--------
- **Job**: A Ray application submitted to a Ray cluster for execution. Consists of (1) an entrypoint command and (2) a :ref:`runtime environment<runtime-environments>`, which may contain file and package dependencies.
- **Job Lifecycle**: When a job is submitted, it runs once to completion or failure. Retries or different runs with different parameters should be handled by the submitter. Jobs are bound to the lifetime of a Ray cluster, so if the cluster goes down, all running jobs on that cluster will be terminated.
- **Job Manager**: An entity external to the Ray cluster that manages the lifecycle of a job (scheduling, killing, polling status, getting logs, and persisting inputs/outputs), and potentially also manages the lifecycle of Ray clusters. Can be any third-party framework with these abilities, such as Apache Airflow or Kubernetes Jobs.
Quick Start Example
-------------------
Let's start with a sample job that can be run locally. The following script uses Ray APIs to increment a counter and print its value, and print the version of the ``requests`` module it's using:
.. code-block:: python
# script.py
import ray
import requests
ray.init()
@ray.remote
class Counter:
def __init__(self):
self.counter = 0
def inc(self):
self.counter += 1
def get_counter(self):
return self.counter
counter = Counter.remote()
for _ in range(5):
ray.get(counter.inc.remote())
print(ray.get(counter.get_counter.remote()))
print(requests.__version__)
Put this file in a local directory of your choice, with filename ``script.py``, so your working directory will look like:
.. code-block:: bash
| your_working_directory ("./")
| ├── script.py
Next, start a local Ray cluster:
.. code-block:: bash
ray start --head
Local node IP: 127.0.0.1
INFO services.py:1360 -- View the Ray dashboard at http://127.0.0.1:8265
Note the address and port returned in the terminal---this will be where we submit job requests to, as explained further in the examples below. If you do not see this, ensure the Ray Dashboard is installed by running :code:`pip install "ray[default]"`.
At this point, the job is ready to be submitted by one of the :ref:`Ray Job APIs<ray-job-apis>`.
Continue on to see examples of running and interacting with this sample job.
.. _ray-job-apis-under-construction-sdk:
Ray Job Submission APIs
-----------------------
Ray provides three APIs for job submission:
* A :ref:`command line interface<ray-job-cli>`, the easiest way to get started.
* A :ref:`Python SDK<ray-job-sdk>`, the recommended way to submit jobs programmatically.
* An :ref:`HTTP REST API<ray-job-rest-api>`. Both the CLI and SDK call into the REST API under the hood.
All three APIs for job submission share the following key inputs:
* **Entrypoint**: The shell command to run the job.
* Example: :code:`python my_ray_script.py`
* Example: :code:`echo hello`
* **Runtime Environment**: Specifies files, packages, and other dependencies for your job. See :ref:`Runtime Environments<runtime-environments>` for details.
* Example: ``{working_dir="/data/my_files", pip=["requests", "pendulum==2.1.2"]}``
* Of special note: the field :code:`working_dir` specifies the files your job needs to run. The entrypoint command will be run in the remote cluster's copy of the `working_dir`, so for the entrypoint ``python my_ray_script.py``, the file ``my_ray_script.py`` must be in the directory specified by ``working_dir``.
* If :code:`working_dir` is a local directory: It will be automatically zipped and uploaded to the target Ray cluster, then unpacked to where your submitted application runs. This option has a size limit of 100 MB and is recommended for rapid iteration and experimentation.
* If :code:`working_dir` is a remote URI hosted on S3, GitHub or others: It will be downloaded and unpacked to where your submitted application runs. This option has no size limit and is recommended for production use. For details, see :ref:`remote-uris`.
.. _ray-job-cli-under-construction-sdk:
CLI
^^^
The easiest way to get started with Ray job submission is to use the Job Submission CLI.
Jump to the :ref:`API Reference<ray-job-submission-cli-ref>`, or continue reading for a walkthrough.
Using the CLI on a local cluster
""""""""""""""""""""""""""""""""
First, start a local Ray cluster (e.g. with ``ray start --head``) and open a terminal (on the head node, which is your local machine).
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
This tells the jobs CLI how to find your Ray cluster. Here we are specifying port ``8265`` on the head node, the port that the Ray Dashboard listens on.
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you are ready to use the CLI.
Here are some examples of CLI commands from the Quick Start example and their output:
.. code-block::
ray job submit --runtime-env-json='{"working_dir": "./", "pip": ["requests==2.26.0"]}' -- python script.py
2021-12-01 23:04:52,672 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:04:52,809 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip.
2021-12-01 23:04:52,810 INFO packaging.py:352 -- Creating a file package for local directory './'.
2021-12-01 23:04:52,878 INFO cli.py:105 -- Job submitted successfully: raysubmit_RXhvSyEPbxhcXtm6.
2021-12-01 23:04:52,878 INFO cli.py:106 -- Query the status of the job using: `ray job status raysubmit_RXhvSyEPbxhcXtm6`.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:00,356 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:00,371 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': PENDING.
2021-12-01 23:05:00,371 INFO cli.py:129 -- Job has not started yet, likely waiting for the runtime_env to be set up.
ray job status raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:37,751 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:37,764 INFO cli.py:127 -- Job status for 'raysubmit_RXhvSyEPbxhcXtm6': SUCCEEDED.
2021-12-01 23:05:37,764 INFO cli.py:129 -- Job finished successfully.
ray job logs raysubmit_RXhvSyEPbxhcXtm6
2021-12-01 23:05:59,026 INFO cli.py:25 -- Creating JobSubmissionClient at address: http://127.0.0.1:8265
2021-12-01 23:05:23,037 INFO worker.py:851 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379
(pid=runtime_env) 2021-12-01 23:05:23,212 WARNING conda.py:54 -- Injecting /Users/jiaodong/Workspace/ray/python to environment /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5 because _inject_current_ray flag is on.
(pid=runtime_env) 2021-12-01 23:05:23,212 INFO conda.py:328 -- Finished setting up runtime environment at /tmp/ray/session_2021-12-01_23-04-44_771129_7693/runtime_resources/conda/99305e1352b2dcc9d5f38c2721c7c1f1cc0551d5
(pid=runtime_env) 2021-12-01 23:05:23,213 INFO working_dir.py:85 -- Setup working dir for gcs://_ray_pkg_bbcc8ca7e83b4dc0.zip
1
2
3
4
5
2.26.0
ray job list
{'raysubmit_AYhLMgDJ6XBQFvFP': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908622, end_time=1645908623, metadata={}, runtime_env={}),
'raysubmit_su9UcdUviUZ86b1t': JobInfo(status='SUCCEEDED', message='Job finished successfully.', error_type=None, start_time=1645908669, end_time=1645908670, metadata={}, runtime_env={})}
.. warning::
When using the CLI, do not wrap the entrypoint command in quotes. For example, use
``ray job submit --working-dir="." -- python script.py`` instead of ``ray job submit --working-dir="." -- "python script.py"``.
Otherwise you may encounter the error ``/bin/sh: 1: python script.py: not found``.
.. tip::
If your job is stuck in `PENDING`, the runtime environment installation may be stuck.
(For example, the `pip` installation or `working_dir` download may be stalled due to internet issues.)
You can check the installation logs at `/tmp/ray/session_latest/logs/runtime_env_setup-*.log` for details.
Using the CLI on a remote cluster
"""""""""""""""""""""""""""""""""
Above, we ran the "Quick Start" example on a local Ray cluster. When connecting to a `remote` cluster via the CLI, you need to be able to access the Ray Dashboard port of the cluster over HTTP.
One way to do this is to port forward ``127.0.0.1:8265`` on your local machine to ``127.0.0.1:8265`` on the head node.
If you started your remote cluster with the :ref:`Ray Cluster Launcher <ref-cluster-quick-start>`, then the port forwarding can be set up automatically using the ``ray dashboard`` command (see :ref:`monitor-cluster` for details).
To use this, run the following command on your local machine, where ``cluster.yaml`` is the configuration file you used to launch your cluster:
.. code-block:: bash
ray dashboard cluster.yaml
Once this is running, check that you can view the Ray Dashboard in your local browser at ``http://127.0.0.1:8265``.
Next, set the :code:`RAY_ADDRESS` environment variable:
.. code-block:: bash
export RAY_ADDRESS="http://127.0.0.1:8265"
(Note that this port is different from the port used to connect to the cluster via :ref:`Ray Client <ray-client>`, which is ``10001`` by default.)
Now you will be able to use the Jobs CLI on your local machine as in the example above to interact with your remote Ray cluster.
Using the CLI on Kubernetes
"""""""""""""""""""""""""""
The instructions above still apply, but you can achieve the dashboard port forwarding using ``kubectl port-forward``:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
Alternatively, you can set up Ingress to the dashboard port of the cluster over HTTP: https://kubernetes.io/docs/concepts/services-networking/ingress/
.. _ray-job-sdk-under-construction-sdk:
Python SDK
^^^^^^^^^^
The Job Submission Python SDK is the recommended way to submit jobs programmatically. Jump to the :ref:`API Reference<ray-job-submission-sdk-ref>`, or continue reading for a quick overview.
SDK calls are made via a ``JobSubmissionClient`` object. To initialize the client, provide the Ray cluster head node address and the port used by the Ray Dashboard (``8265`` by default). For this example, we'll use a local Ray cluster, but the same example will work for remote Ray cluster addresses.
.. code-block:: python
from ray.job_submission import JobSubmissionClient
# If using a remote cluster, replace 127.0.0.1 with the head node's IP address.
client = JobSubmissionClient("http://127.0.0.1:8265")
Then we can submit our application to the Ray cluster via the Job SDK.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python script.py",
# Runtime environment for the job, specifying a working directory and pip package
runtime_env={
"working_dir": "./",
"pip": ["requests==2.26.0"]
}
)
.. tip::
By default, the Ray job server will generate a new ``job_id`` and return it, but you can alternatively choose a unique ``job_id`` string first and pass it into :code:`submit_job`.
In this case, the Job will be executed with your given id, and will throw an error if the same ``job_id`` is submitted more than once for the same Ray cluster.
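A minimal sketch of passing your own id (the id string here is arbitrary):

.. code-block:: python

    job_id = client.submit_job(
        entrypoint="echo hello",
        # Submitting another job with the same id to this cluster raises an error.
        job_id="my_experiment_001",
    )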
Now we can write a simple polling loop that checks the job status until it reaches a terminal state (namely, ``JobStatus.SUCCEEDED``, ``JobStatus.STOPPED``, or ``JobStatus.FAILED``), and gets the logs at the end.
We expect to see the numbers printed from our actor, as well as the correct version of the :code:`requests` module specified in the ``runtime_env``.
.. code-block:: python
from ray.job_submission import JobStatus
import time
def wait_until_finish(job_id):
start = time.time()
timeout = 5
while time.time() - start <= timeout:
status = client.get_job_status(job_id)
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
The output should be as follows:
.. code-block:: bash
status: JobStatus.PENDING
status: JobStatus.RUNNING
status: JobStatus.SUCCEEDED
1
2
3
4
5
2.26.0
.. tip::
Instead of a local directory (``"./"`` in this example), you can also specify remote URIs for your job's working directory, such as S3 buckets or Git repositories. See :ref:`remote-uris` for details.
A submitted job can be stopped by the user before it finishes executing.
.. code-block:: python
job_id = client.submit_job(
# Entrypoint shell command to execute
entrypoint="python -c 'import time; time.sleep(60)'",
runtime_env={}
)
wait_until_finish(job_id)
client.stop_job(job_id)
wait_until_finish(job_id)
logs = client.get_job_logs(job_id)
To get information about all jobs, call ``client.list_jobs()``. This returns a ``Dict[str, JobInfo]`` object mapping Job IDs to their information.
For full details, see the :ref:`API Reference<ray-job-submission-sdk-ref>`.
.. _ray-job-rest-api-under-construction-sdk:
REST API
^^^^^^^^
Under the hood, both the Python SDK and the CLI make HTTP calls to the job server running on the Ray head node. You can also directly send requests to the corresponding endpoints via HTTP if needed:
**Submit Job**
.. code-block:: python
import requests
import json
import time
from ray.job_submission import JobStatus  # used below when polling for a terminal job state
resp = requests.post(
"http://127.0.0.1:8265/api/jobs/",
json={
"entrypoint": "echo hello",
"runtime_env": {},
"job_id": None,
"metadata": {"job_submission_id": "123"}
}
)
rst = json.loads(resp.text)
job_id = rst["job_id"]
**Query and poll for Job status**
.. code-block:: python
start = time.time()
while time.time() - start <= 10:
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>"
)
rst = json.loads(resp.text)
status = rst["status"]
print(f"status: {status}")
if status in {JobStatus.SUCCEEDED, JobStatus.STOPPED, JobStatus.FAILED}:
break
time.sleep(1)
**Query for logs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/<job_id>/logs"
)
rst = json.loads(resp.text)
logs = rst["logs"]
**List all jobs**
.. code-block:: python
resp = requests.get(
"http://127.0.0.1:8265/api/jobs/"
)
print(resp.json())
# {"job_id": {"metadata": ..., "status": ..., "message": ...}, ...}
Job Submission Architecture
----------------------------
The following diagram shows the underlying structure and steps for each submitted job.
.. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/job/job_submission_arch_v2.png
View file
@ -1,6 +0,0 @@
:::{warning}
This page is under construction!
:::
# Interacting with the cluster via the Ray Client
## When to use
## How to use
View file
@ -0,0 +1,283 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ray-client-under-construction:
Ray Client: Interactive Development
===================================
**What is the Ray Client?**
The Ray Client is an API that connects a Python script to a **remote** Ray cluster. Effectively, it allows you to leverage a remote Ray cluster just like you would with Ray running on your local machine.
By changing ``ray.init()`` to ``ray.init("ray://<head_node_host>:<port>")``, you can connect from your laptop (or anywhere) directly to a remote cluster and scale out your Ray code, while maintaining the ability to develop interactively in a Python shell. **This will only work with Ray 1.5+.** If you're using an older version of Ray, see the `1.4.1 docs <https://docs.ray.io/en/releases-1.4.1/cluster/ray-client.html>`_.
.. code-block:: python
# You can run this code outside of the Ray cluster!
import ray
# Starting the Ray client. This connects to a remote Ray cluster.
# If you're using a version of Ray prior to 1.5, use the 1.4.1 example
# instead: https://docs.ray.io/en/releases-1.4.1/cluster/ray-client.html
ray.init("ray://<head_node_host>:10001")
# Normal Ray code follows
@ray.remote
def do_work(x):
return x ** x
do_work.remote(2)
#....
Client arguments
----------------
Ray Client is used when the address passed into ``ray.init`` is prefixed with ``ray://``. Besides the address, Client mode currently accepts two other arguments:
- ``namespace`` (optional): Sets the namespace for the session.
- ``runtime_env`` (optional): Sets the `runtime environment <../ray-core/handling-dependencies.html#runtime-environments>`_ for the session, allowing you to dynamically specify environment variables, packages, local files, and more.
.. code-block:: python
# Connects to an existing cluster at 1.2.3.4 listening on port 10001, using
# the namespace "my_namespace". The Ray workers will run inside a cluster-side
# copy of the local directory "files/my_project", in a Python environment with
# `toolz` and `requests` installed.
ray.init(
"ray://1.2.3.4:10001",
namespace="my_namespace",
runtime_env={"working_dir": "files/my_project", "pip": ["toolz", "requests"]},
)
#....
When to use Ray Client
----------------------
Ray Client should be used when you want to connect a script or an interactive shell session to a **remote** cluster.
* Use ``ray.init("ray://<head_node_host>:10001")`` (Ray Client) if you've set up a remote cluster at ``<head_node_host>`` and you want to do interactive work. This will connect your local script or shell to the cluster. See the section on :ref:`using Ray Client<how-do-you-use-the-ray-client>` for more details on setting up your cluster.
* Use ``ray.init("localhost:<port>")`` (non-client connection, local address) if you're developing locally or on the head node of your cluster and you have already started the cluster (i.e., ``ray start --head`` has already been run).
* Use ``ray.init()`` (non-client connection, no address specified) if you're developing locally and want to automatically create a local cluster and attach directly to it, or if you are using Ray Job Submission.
.. _how-do-you-use-the-ray-client-under-construction:
How do you use the Ray Client?
------------------------------
Step 1: Set up your Ray cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have a running Ray cluster (version >= 1.5), Ray Client server is likely already running on port ``10001`` of the head node by default. Otherwise, you'll want to create a Ray cluster. To start a Ray cluster locally, you can run
.. code-block:: bash
ray start --head
To start a Ray cluster remotely, you can follow the directions in :ref:`ref-cluster-quick-start`.
If necessary, you can modify the Ray Client server port to be other than ``10001``, by specifying ``--ray-client-server-port=...`` to the ``ray start`` :ref:`command <ray-start-doc>`.
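For example, a hypothetical alternative port could be set like this when starting the head node:

.. code-block:: bash

    ray start --head --ray-client-server-port=20001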
Step 2: Check ports
~~~~~~~~~~~~~~~~~~~
Ensure that the Ray Client port on the head node is reachable from your local machine.
This means opening that port up by configuring security groups or other access controls (on `EC2 <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html>`_)
or proxying from your local machine to the cluster (on `K8s <https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/#forward-a-local-port-to-a-port-on-the-pod>`_).
.. tabbed:: AWS
With the Ray cluster launcher, you can configure the security group
to allow inbound access by defining :ref:`cluster-configuration-security-group`
in your `cluster.yaml`.
.. code-block:: yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal_security_group
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
security_group:
GroupName: ray_client_security_group
IpPermissions:
- FromPort: 10001
ToPort: 10001
IpProtocol: TCP
IpRanges:
# This will enable inbound access from ALL IPv4 addresses.
- CidrIp: 0.0.0.0/0
Step 3: Run Ray code
~~~~~~~~~~~~~~~~~~~~
Now, connect to the Ray Cluster with the following and then use Ray like you normally would:
..
.. code-block:: python
import ray
# replace with the appropriate host and port
ray.init("ray://<head_node_host>:10001")
# Normal Ray code follows
@ray.remote
def do_work(x):
return x ** x
do_work.remote(2)
#....
Alternative Approach: SSH Port Forwarding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As an alternative to configuring inbound traffic rules, you can also set up
Ray Client via port forwarding. While this approach does require an open SSH
connection, it can be useful in a test environment where the
``head_node_host`` often changes.
First, open up an SSH connection with your Ray cluster and forward the
listening port (``10001``).
.. code-block:: bash
$ ray up cluster.yaml
$ ray attach cluster.yaml -p 10001
Then, you can connect to the Ray cluster **from another terminal** using ``localhost`` as the
``head_node_host``.
.. code-block:: python
import ray
# This will connect to the cluster via the open SSH session.
ray.init("ray://localhost:10001")
# Normal Ray code follows
@ray.remote
def do_work(x):
return x ** x
do_work.remote(2)
#....
Connect to multiple Ray clusters (Experimental)
-----------------------------------------------
Ray Client allows connecting to multiple Ray clusters in one Python process. To do this, just pass ``allow_multiple=True`` to ``ray.init``:
.. code-block:: python
import ray
# Create a default client.
ray.init("ray://<head_node_host_cluster>:10001")
# Connect to other clusters.
cli1 = ray.init("ray://<head_node_host_cluster_1>:10001", allow_multiple=True)
cli2 = ray.init("ray://<head_node_host_cluster_2>:10001", allow_multiple=True)
# Data is put into the default cluster.
obj = ray.put("obj")
with cli1:
obj1 = ray.put("obj1")
with cli2:
obj2 = ray.put("obj2")
with cli1:
assert ray.get(obj1) == "obj1"
try:
ray.get(obj2) # Cross-cluster ops not allowed.
except:
print("Failed to get object which doesn't belong to this cluster")
with cli2:
assert ray.get(obj2) == "obj2"
try:
ray.get(obj1) # Cross-cluster ops not allowed.
except:
print("Failed to get object which doesn't belong to this cluster")
assert "obj" == ray.get(obj)
cli1.disconnect()
cli2.disconnect()
When using Ray multi-client, there are some different behaviors to pay attention to:
* The client won't be disconnected automatically. Call ``disconnect`` explicitly to close the connection.
* Object references can only be used by the client from which they were obtained.
* ``ray.init`` without ``allow_multiple`` will create a default global Ray client.
Things to know
--------------
Client disconnections
~~~~~~~~~~~~~~~~~~~~~
When the client disconnects, any object or actor references held by the server on behalf of the client are dropped, as if directly disconnecting from the cluster.
Versioning requirements
~~~~~~~~~~~~~~~~~~~~~~~
Generally, the client Ray version must match the server Ray version. An error will be raised if an incompatible version is used.
Similarly, the minor Python version (e.g., 3.6 vs. 3.7) must match between the client and server. An error will be raised if this is not the case.
Starting a connection on older Ray versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you encounter ``socket.gaierror: [Errno -2] Name or service not known`` when using ``ray.init("ray://...")`` then you may be on a version of Ray prior to 1.5 that does not support starting client connections through ``ray.init``. If this is the case, see the `1.4.1 docs <https://docs.ray.io/en/releases-1.4.1/cluster/ray-client.html>`_ for Ray Client.
Connection through the Ingress
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you encounter the following error message when connecting to the Ray cluster through an Ingress, it may be caused by the Ingress's configuration.
..
.. code-block:: python
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = ""
debug_error_string = "{"created":"@1628668820.164591000","description":"Error received from peer ipv4:10.233.120.107:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"","grpc_status":3}"
>
Got Error from logger channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = ""
debug_error_string = "{"created":"@1628668820.164713000","description":"Error received from peer ipv4:10.233.120.107:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"","grpc_status":3}"
>
If you are using the ``nginx-ingress-controller``, you may be able to resolve the issue by adding the following Ingress configuration.
.. code-block:: yaml
metadata:
annotations:
nginx.ingress.kubernetes.io/server-snippet: |
underscores_in_headers on;
ignore_invalid_headers on;
Ray client logs
~~~~~~~~~~~~~~~
Ray client logs can be found at ``/tmp/ray/session_latest/logs`` on the head node.
Uploads
~~~~~~~
If a ``working_dir`` is specified in the runtime env, then when ``ray.init()`` is called the Ray client uploads the ``working_dir`` from the laptop to ``/tmp/ray/session_latest/runtime_resources/_ray_pkg_<hash of directory contents>`` on the cluster.
Ray workers are started in the ``/tmp/ray/session_latest/runtime_resources/_ray_pkg_<hash of directory contents>`` directory on the cluster. This means that relative paths used in the remote tasks and actors in the code will work both on your laptop and on the cluster without any code changes. For example, if the ``working_dir`` on your laptop contains ``data.txt`` and ``run.py``, the remote task definitions in ``run.py`` can simply use the relative path ``"data.txt"``, and ``python run.py`` will work on your laptop as well as on the cluster. Since relative paths can be used in the code, absolute paths are only useful for debugging purposes.
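As a rough sketch of the scenario described above (the cluster address is a placeholder), ``run.py`` might look like this:

.. code-block:: python

    import ray

    # Upload the current directory (containing run.py and data.txt) to the cluster.
    ray.init("ray://<head_node_host>:10001",
             runtime_env={"working_dir": "."})

    @ray.remote
    def read_data():
        # The relative path works because workers start in the uploaded working_dir.
        with open("data.txt") as f:
            return f.read()

    print(ray.get(read_data.remote()))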

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Running a Ray cluster on-prem

View file

@ -0,0 +1,447 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction:
Launching Cloud Clusters
========================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``), then log in with ``az login``.
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the provided config, e.g. ``ray/python/ray/autoscaler/azure/example-full.yaml``.
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed, the deployment Outputs page provides the SSH command to connect and the link to the JupyterHub on the head node (username/password as specified in the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with an n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise you will receive an ``InvalidAccountStatus.NotEnoughBalance`` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
.. _cluster-private-setup-under-construction:
Local On Premise Cluster (List of nodes)
----------------------------------------
Use this mode if you want to run distributed Ray applications on a set of on-premise nodes.
The preferred way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.
There are two ways of running private clusters:
- Manually managed, i.e., the user explicitly specifies the head and worker IPs.
- Automatically managed, i.e., the user only specifies the address of a coordinator server, which automatically manages the head and worker IPs.
.. tip:: To avoid password prompts when running private clusters, make sure to set up your SSH keys on the private cluster as follows:
.. code-block:: bash
$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
.. tabbed:: Manually Managed
You can get started by filling out the fields in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.
Be sure to specify the proper ``head_ip``, list of ``worker_ips``, and the ``ssh_user`` field.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. tabbed:: Automatically Managed
Start by launching the coordinator server that will manage all the on-prem clusters. This server also isolates resources between different users. The script for running the coordinator server is `ray/python/ray/autoscaler/local/coordinator_server.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/coordinator_server.py>`__. To launch the coordinator server, run:
.. code-block:: bash
$ python coordinator_server.py --ips <list_of_node_ips> --port <PORT>
where ``<list_of_node_ips>`` is a comma-separated list of all the available nodes on the private cluster (for example, ``160.24.42.48,160.24.42.49,...``) and ``<PORT>`` is the port that the coordinator server will listen on.
Once started, the coordinator server prints its address. For example:
.. code-block:: bash
>> INFO:ray.autoscaler.local.coordinator_server:Running on prem coordinator server
on address <Host:PORT>
Next, specify the ``<Host:PORT>`` printed above in the ``coordinator_address`` entry of the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__ instead of specific head/worker IPs.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. _manual-cluster-under-construction:
Manual Ray Cluster Setup
------------------------
The preferred way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand.
This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.
.. _`installation instructions`: http://docs.ray.io/en/master/installation.html
Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On the head node (just choose one node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.
.. code-block:: bash
$ ray start --head --port=6379
...
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379'
If connection fails, check your firewall settings and network configuration.
The command will print out the address of the Ray GCS server that was started
(the local node IP address plus the port number you specified).
.. note::
If you already have remote Redis instances, you can set the environment variable
``RAY_REDIS_ADDRESS=ip1:port1,ip2:port2...`` to use them. The first one is
the primary and the rest are shards.
**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).
Note that if your compute nodes are on their own subnetwork with Network
Address Translation, to connect from a regular machine outside that subnetwork,
the command printed by the head node will not work. You need to find the
address that will reach the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on DNS.
.. code-block:: bash
$ ray start --address=<address>
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
If you see ``Unable to connect to GCS at ...``,
this means the head node is inaccessible at the given ``--address`` (because, for
example, the head node is not actually running, a different version of Ray is
running at the specified address, the specified address is wrong, or there are
firewall settings preventing access).
If you see ``Ray runtime started.``, then the node successfully connected to
the head node at the ``--address``. You should now be able to connect to the
cluster with ``ray.init()``.
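For example, a quick sanity check run on a machine that is part of the cluster might look like the following:

.. code-block:: python

    import ray

    # Attach to the Ray runtime already running on this node.
    ray.init(address="auto")

    # Print the resources (CPUs, GPUs, memory) aggregated across the cluster.
    print(ray.cluster_resources())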
If the connection fails, check your firewall settings and network configuration.
To check whether each port can be reached from a node, you can use a tool such as
``nmap`` or ``nc``.
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp open redis? syn-ack
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
If the node cannot access that port at that IP address, you might see
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp closed redis reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
Stopping Ray
~~~~~~~~~~~~
When you want to stop the Ray processes, run ``ray stop`` on each node.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
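As a rough sketch (the class and module names here are hypothetical, and only a subset of the interface is shown), a custom provider subclasses ``NodeProvider`` and implements the node lifecycle methods, and is then referenced from the cluster config via ``type: external`` and ``module: <package>.<ProviderClass>``:

.. code-block:: python

    from ray.autoscaler.node_provider import NodeProvider


    class MyCloudNodeProvider(NodeProvider):
        """Hypothetical provider; see node_provider.py for the full interface."""

        def __init__(self, provider_config, cluster_name):
            super().__init__(provider_config, cluster_name)
            # Create a client for your infrastructure API here.

        def non_terminated_nodes(self, tag_filters):
            # Return the IDs of all live nodes matching the given tags.
            raise NotImplementedError

        def create_node(self, node_config, tags, count):
            # Launch `count` nodes and apply `tags` to them.
            raise NotImplementedError

        def terminate_node(self, node_id):
            # Shut down the given node.
            raise NotImplementedError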
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
import ray

ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
import time

import ray
@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()
# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,4 +1,4 @@
.. include:: we_are_hiring.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-setup:

View file

@ -1,6 +1,6 @@
.. include:: /_includes/clusters/announcement.rst
.. include:: we_are_hiring.rst
.. include:: /_includes/clusters/we_are_hiring.rst
.. _ref-cluster-quick-start: