[Cluster-launcher doc] revamp the vm part (#27431)

Chen Shen 2022-08-10 02:43:28 -07:00 committed by GitHub
parent 853c859037
commit a1d80dc195
20 changed files with 678 additions and 1571 deletions


@@ -289,13 +289,10 @@ parts:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/index
title: User Guides
sections:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/installing-ray
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/launching-clusters/index
title: Launching Clusters
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/running-ray-cluster-on-prem
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/monitoring-and-observing-ray-cluster
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/large-cluster-best-practices
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/multi-tenancy-best-practices
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/configuring-autoscaling
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/community-supported-cluster-manager/index
title: Community-supported Cluster Managers


@@ -23,10 +23,11 @@ How can I use Ray clusters?
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ray clusters are officially supported on the following technology stacks:
* The :ref:`Ray Cluster Launcher on AWS and GCP<ref-cluster-quick-start-vms-under-construction>`. Community-supported Azure and Aliyun integrations also exist.
* The :ref:`Ray cluster launcher on AWS and GCP<ref-cluster-quick-start-vms-under-construction>`. Community-supported Azure and Aliyun integrations also exist.
* :ref:`KubeRay, the official way to run Ray on Kubernetes<kuberay-index>`.
Advanced users may want to :ref:`deploy Ray clusters on-premise<cluster-private-setup-under-construction>` or even onto infrastructure platforms not listed here by :ref:`providing a custom node provider<additional-cloud-providers-under-construction>`.
Advanced users may want to :ref:`deploy Ray clusters on-premise <on-prem>`
or onto infrastructure platforms not listed here by :ref:`providing a custom node provider <ref-cluster-setup-under-construction>`.
Where to go from here?
----------------------
@@ -48,7 +49,7 @@ Where to go from here?
---
**I want to run Ray on a cloud provider**
^^^
Take a sample application designed to run on a laptop and scale it up in the
cloud. Access to an AWS or GCP account is required.


@@ -12,7 +12,7 @@ Ray Clusters Quick Start
This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <cluster-private-setup>` guide.
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <on-prem>` guide.
About the demo
--------------
@@ -207,7 +207,7 @@ A minimal sample cluster configuration file looks as follows:
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
After defining our configuration, we will use the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
.. code-block:: shell


@@ -8,7 +8,7 @@
Cluster Launcher Commands
=========================
This document overviews common commands for using the Ray Cluster Launcher.
This document overviews common commands for using the Ray cluster launcher.
See the :ref:`Cluster Configuration <cluster-config>` docs on how to customize the configuration file.
Launching a cluster (``ray up``)


@@ -7,7 +7,9 @@ Community Supported Cluster Managers
.. note::
If you're using AWS, Azure or GCP you can use the :ref:`Ray Cluster Launcher <cluster-cloud>` to simplify the cluster setup process.
If you're using AWS, Azure or GCP, you can use the :ref:`Ray cluster launcher <cluster-cloud>` to simplify the cluster setup process.
The following is a list of community-supported cluster managers.
.. toctree::
:maxdepth: 2
@@ -16,3 +18,19 @@ Community Supported Cluster Managers
slurm.rst
lsf.rst
.. _ref-additional-cloud-providers-under-construction:
Using a custom cloud or cluster manager
=======================================
The Ray cluster launcher currently supports AWS, Azure, GCP, Aliyun and KubeRay out of the box. To use the Ray cluster launcher and autoscaler on other cloud providers or cluster managers, you can implement the ``NodeProvider`` interface defined in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`_ (about 100 LOC).
Once the node provider is implemented, you can register it in the `provider section <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/local/example-full.yaml#L18>`_ of the cluster launcher config.
.. code-block:: yaml
provider:
type: "external"
module: "my.module.MyCustomNodeProvider"
You can refer to `AWSNodeProvider <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L95>`_, `KuberayNodeProvider <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/kuberay/node_provider.py#L148>`_ and
`LocalNodeProvider <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/local/node_provider.py#L166>`_ for more examples.
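Once your provider is registered, the standard cluster launcher workflow applies unchanged. As a minimal sketch, assuming your config lives in a hypothetical ``my-custom-cluster.yaml`` that contains the ``provider`` section shown above:

.. code-block:: bash

    # Launch a cluster using the custom node provider configured in
    # my-custom-cluster.yaml (hypothetical file name).
    ray up my-custom-cluster.yaml

    # Tear the cluster down when you are done.
    ray down my-custom-cluster.yaml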


@@ -1,6 +0,0 @@
:::{warning}
This page is under construction!
:::
# Installing Ray
## Install Ray via `pip`
## Use the Ray docker images


@@ -1,11 +0,0 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _additional-cloud-providers-under-construction:
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!


@@ -0,0 +1,240 @@
.. include:: /_includes/clusters/we_are_hiring.rst
Monitor Ray using Amazon CloudWatch
===================================
Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed.
AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions.
Please direct any questions, comments, or issues to the `Amazon Ray Team <https://github.com/amzn/amazon-ray/issues/new/choose>`_.
The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `amazon-ray README <https://github.com/amzn/amazon-ray>`_.
.. list-table:: All available unified CloudWatch agent images
* - Base AMI
- AMI ID
- Region
- Unified CloudWatch Agent Version
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-069f2811478f86c20
- us-east-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-058cc0932940c2b8b
- us-east-2
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-044f95c9ef12883ef
- us-west-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-0d88d9cbe28fac870
- us-west-2
- v1.247348.0b251302
.. note::
Using Amazon CloudWatch will incur charges; please refer to `CloudWatch pricing <https://aws.amazon.com/cloudwatch/pricing/>`_ for details.
Getting started
---------------
1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: yaml
provider:
type: aws
region: us-west-2
availability_zone: us-west-2a
# Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster.
cloudwatch:
agent:
# Path to Unified CloudWatch Agent config file
config: "cloudwatch/example-cloudwatch-agent-config.json"
dashboard:
# CloudWatch Dashboard name
name: "example-dashboard-name"
# Path to the CloudWatch Dashboard config file
config: "cloudwatch/example-cloudwatch-dashboard-config.json"
auth:
ssh_user: ubuntu
available_node_types:
ray.head.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
resources: {}
ray.worker.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
IamInstanceProfile:
Name: ray-autoscaler-cloudwatch-v1
resources: {}
min_workers: 0
2. Download CloudWatch Agent and Dashboard config.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``.
Then, download the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_ and `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ config files to the ``cloudwatch`` directory.
.. code-block:: console
$ mkdir cloudwatch
$ cd cloudwatch
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json
3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This will launch your Ray cluster in ``us-west-2`` by default. To launch your cluster in a different region, change both the ``region`` and the ``ImageId`` in your cluster config YAML file.
See the "Unified CloudWatch Agent Images" table above for available AMIs by region.
4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console <https://console.aws.amazon.com/cloudwatch/>`_!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can tail all logs written to a CloudWatch log group by ensuring that you have the `AWS CLI V2+ installed <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_ and then running:
.. code-block:: bash
aws logs tail $log_group_name --follow
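If you are unsure what to pass as ``$log_group_name``, one way to find it is to list the log groups created for your cluster. This is a sketch that assumes the example agent config above, which names the groups ``{cluster_name}-ray_logs_out`` and ``{cluster_name}-ray_logs_err``, and a hypothetical cluster name of ``example-cluster``:

.. code-block:: bash

    # List CloudWatch log groups whose names start with the cluster name.
    aws logs describe-log-groups --log-group-name-prefix "example-cluster" --region us-west-2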
Advanced Setup
--------------
Refer to `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ for a complete example.
1. Choose an AMI with the Unified CloudWatch Agent pre-installed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ensure that you're launching your Ray EC2 cluster in the same region as the AMI,
then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file.
The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``:
.. code-block:: bash
aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1
.. code-block:: yaml
available_node_types:
ray.head.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870
ray.worker.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870
To build your own AMI with the Unified CloudWatch Agent installed:
1. Follow the `CloudWatch Agent Installation <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html>`_ user guide to install the Unified CloudWatch Agent on an EC2 instance.
2. Follow the `EC2 AMI Creation <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami>`_ user guide to create an AMI from this EC2 instance.
2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can start by using the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_, `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ and `CloudWatch Alarm <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-alarm-config.json>`_ config files.
These example config files include the following features:
**Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group,
and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group.
Log streams are named after the EC2 instance ID that emitted their logs.
Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace.
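As a quick check that the extended metrics are being published, you can list the metrics in that namespace with the AWS CLI. A sketch, assuming a hypothetical cluster name of ``example-cluster`` running in ``us-west-2``:

.. code-block:: bash

    # List the extended EC2 metrics published by the Unified CloudWatch Agent.
    aws cloudwatch list-metrics --namespace "example-cluster-ray-CWAgent" --region us-west-2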
**Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory.
Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages.
**Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set,
and must be manually provided in your alarm config file.
For more advanced options, see the `Agent <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_, `Dashboard <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html>`_ and `Alarm <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html>`_ config user guides.
CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables:
``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster.
``{region}``: Replaced with your Ray cluster's region.
``{cluster_name}``: Replaced with your Ray cluster name.
See CloudWatch Agent `Configuration File Details <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_ for additional variables supported natively by the Unified CloudWatch Agent.
.. note::
Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file!
.. code-block:: json
"AlarmActions":[
"TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"
]
3. Reference your CloudWatch JSON config files in your cluster config YAML.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from:
.. code-block:: yaml
provider:
cloudwatch:
agent:
config: "cloudwatch/example-cloudwatch-agent-config.json"
4. Set your IAM Role and EC2 Instance Profile.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile are created at Ray cluster launch time.
This role contains the additional permissions required to integrate CloudWatch with Ray: the ``CloudWatchAgentAdminPolicy`` and ``AmazonSSMManagedInstanceCore`` managed policies, along with the ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` permissions.
Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML:
.. code-block:: yaml
ray.worker.default:
node_config:
InstanceType: c5a.large
IamInstanceProfile:
Name: ray-autoscaler-cloudwatch-v1
5. Export Ray system metrics to CloudWatch.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the
Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in the `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ file.
You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace.
.. code-block:: yaml
head_setup_commands:
# Make `ray_prometheus_waiter.sh` executable.
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
# Copy `prometheus.yml` to Unified CloudWatch Agent folder
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc
# First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store.
- >-
nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
`cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'`
>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" &
6. Update CloudWatch Agent, Dashboard and Alarm config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``.
The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied.


@@ -0,0 +1,126 @@
# Launching Ray Clusters on AWS
This guide details the steps needed to start a Ray cluster on AWS.
To start an AWS Ray cluster, you should use the Ray cluster launcher with the AWS Python SDK.
## Install Ray cluster launcher
The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster using commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the Ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.
```bash
# install ray
pip install -U "ray[default]"
```
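A quick way to confirm that the CLI (and with it the cluster launcher commands) is installed:

```bash
# Verify that the Ray CLI is on your PATH.
ray --version
```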
## Install and Configure AWS Python SDK (Boto3)
Next, install the AWS Python SDK (Boto3) using `pip install -U boto3` and configure your AWS credentials following [the AWS guide](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).
```bash
# install AWS Python SDK (boto3)
pip install -U boto3
# set up AWS credentials using environment variables
export AWS_ACCESS_KEY_ID=foo
export AWS_SECRET_ACCESS_KEY=bar
export AWS_SESSION_TOKEN=baz
# alternatively, you can set up AWS credentials using the ~/.aws/credentials file
echo "[default]
aws_access_key_id=foo
aws_secret_access_key=bar
aws_session_token=baz" >> ~/.aws/credentials
```
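To check that Boto3 actually picks up these credentials, one quick sanity check is to call the STS API and print your account ID (this uses only Boto3, which you just installed):

```bash
# Confirm that Boto3 can authenticate against your AWS account.
python -c 'import boto3; print(boto3.client("sts").get_caller_identity()["Account"])'
```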
## Start Ray with the Ray cluster launcher
Once Boto3 is configured to manage resources in your AWS account, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml) will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large [spot-instance](https://aws.amazon.com/ec2/spot/) workers.
Test that it works by running the following commands from your local machine:
```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml
# Get a remote shell on the head node.
ray attach example-full.yaml
# Try running a Ray program.
python -c 'import ray; ray.init()'
exit
# Tear down the cluster.
ray down example-full.yaml
```
Congrats, you have started a Ray cluster on AWS!
If you want to learn more about the Ray cluster launcher, see this blog post for a [step by step guide](https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1).
## AWS Configurations
### Using Amazon EFS
To utilize Amazon EFS in the Ray cluster, you will need to install some additional utilities and mount the EFS in `setup_commands`. Note that these instructions only work if you are using the Ray cluster launcher on AWS.
```yaml
# Note: you need to replace the {{FileSystemId}} with your own EFS ID before using the config.
# You may also need to modify the SecurityGroupIds for the head and worker nodes in the config file.
setup_commands:
- sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
sudo pkill -9 apt-get;
sudo pkill -9 dpkg;
sudo dpkg --configure -a;
sudo apt-get -y install binutils;
cd $HOME;
git clone https://github.com/aws/efs-utils;
cd $HOME/efs-utils;
./build-deb.sh;
sudo apt-get -y install ./build/amazon-efs-utils*deb;
cd $HOME;
mkdir efs;
sudo mount -t efs {{FileSystemId}}:/ efs;
sudo chmod 777 efs;
```
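After the cluster is up, one way to confirm that EFS was mounted by the `setup_commands` is to run a command on the head node with `ray exec`. A sketch, assuming your cluster config file is `example-full.yaml` with the snippet above added:

```bash
# Check that the EFS file system is mounted at ~/efs on the head node.
ray exec example-full.yaml "df -h ~/efs"
```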
### Accessing S3
In various scenarios, worker nodes may need write access to an S3 bucket, e.g., Ray Tune has an option to write checkpoints to S3 instead of syncing them directly back to the driver.
If you see errors like “Unable to locate credentials”, make sure that the correct `IamInstanceProfile` is configured for worker nodes in your cluster config file. This may look like:
```yaml
worker_nodes:
InstanceType: m5.xlarge
ImageId: latest_dlami
IamInstanceProfile:
Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE
```
You can verify that the setup is correct by SSHing into a worker node and running
```bash
aws configure list
```
You should see something like
```bash
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************XXXX iam-role
secret_key ****************YYYY iam-role
region <not set> None None
```
Please refer to this [discussion](https://github.com/ray-project/ray/issues/9327) for more details.


@@ -1,573 +0,0 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-aws:
Launching Ray Clusters on AWS
=============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-aws:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is not less than 100 RMB, otherwise you will receive a `InvalidAccountStatus.NotEnoughBalance` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-aws:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
import time
@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()
# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
.. _aws-cluster-under-construction:
AWS Configurations
==================
.. _aws-cluster-efs-under-construction:
Using Amazon EFS
----------------
To use Amazon EFS, install some utilities and mount the EFS in ``setup_commands``. Note that these instructions only work if you are using the AWS Autoscaler.
.. note::
You need to replace the ``{{FileSystemId}}`` to your own EFS ID before using the config. You may also need to set correct ``SecurityGroupIds`` for the instances in the config file.
.. code-block:: yaml
setup_commands:
- sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
sudo pkill -9 apt-get;
sudo pkill -9 dpkg;
sudo dpkg --configure -a;
sudo apt-get -y install binutils;
cd $HOME;
git clone https://github.com/aws/efs-utils;
cd $HOME/efs-utils;
./build-deb.sh;
sudo apt-get -y install ./build/amazon-efs-utils*deb;
cd $HOME;
mkdir efs;
sudo mount -t efs {{FileSystemId}}:/ efs;
sudo chmod 777 efs;
.. _aws-cluster-s3-under-construction:
Configure worker nodes to access Amazon S3
------------------------------------------
In various scenarios, worker nodes may need write access to the S3 bucket.
E.g. Ray Tune has the option that worker nodes write distributed checkpoints to S3 instead of syncing back to the driver using rsync.
If you see errors like "Unable to locate credentials", make sure that the correct ``IamInstanceProfile`` is configured for worker nodes in ``cluster.yaml`` file.
This may look like:
.. code-block:: text
worker_nodes:
InstanceType: m5.xlarge
ImageId: latest_dlami
IamInstanceProfile:
Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE
You can verify if the set up is correct by entering one worker node and do
.. code-block:: bash
aws configure list
You should see something like
.. code-block:: text
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key ****************XXXX iam-role
secret_key ****************YYYY iam-role
region <not set> None None
Please refer to `this discussion <https://github.com/ray-project/ray/issues/9327>`__ for more details.
.. _aws-cluster-cloudwatch-under-construction:
Using Amazon CloudWatch
=======================
Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed.
AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions.
Please direct any questions, comments, or issues to the `Amazon Ray Team <https://github.com/amzn/amazon-ray/issues/new/choose>`_.
The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `amazon-ray README <https://github.com/amzn/amazon-ray>`_.
.. list-table:: All available unified CloudWatch agent images
* - Base AMI
- AMI ID
- Region
- Unified CloudWatch Agent Version
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-069f2811478f86c20
- us-east-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-058cc0932940c2b8b
- us-east-2
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-044f95c9ef12883ef
- us-west-1
- v1.247348.0b251302
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
- ami-0d88d9cbe28fac870
- us-west-2
- v1.247348.0b251302
.. note::
Using Amazon CloudWatch will incur charges, please refer to `CloudWatch pricing <https://aws.amazon.com/cloudwatch/pricing/>`_ for details.
Getting started
---------------
1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: yaml
provider:
type: aws
region: us-west-2
availability_zone: us-west-2a
# Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster.
cloudwatch:
agent:
# Path to Unified CloudWatch Agent config file
config: "cloudwatch/example-cloudwatch-agent-config.json"
dashboard:
# CloudWatch Dashboard name
name: "example-dashboard-name"
# Path to the CloudWatch Dashboard config file
config: "cloudwatch/example-cloudwatch-dashboard-config.json"
auth:
ssh_user: ubuntu
available_node_types:
ray.head.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
resources: {}
ray.worker.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
IamInstanceProfile:
Name: ray-autoscaler-cloudwatch-v1
resources: {}
min_workers: 0
2. Download CloudWatch Agent and Dashboard config.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``.
Then, download the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_ and `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ config files to the ``cloudwatch`` directory.
.. code-block:: console
$ mkdir cloudwatch
$ cd cloudwatch
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json
3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This will launch your Ray cluster in ``us-west-2`` by default. When launching a cluster for a different region, you'll need to change your cluster config YAML file's ``region`` AND ``ImageId``.
See the "Unified CloudWatch Agent Images" table above for available AMIs by region.
4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console <https://console.aws.amazon.com/cloudwatch/>`_!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A tail can be acquired on all logs written to a CloudWatch log group by ensuring that you have the `AWS CLI V2+ installed <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_ and then running:
.. code-block:: bash
aws logs tail $log_group_name --follow
Advanced Setup
--------------
Refer to `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ for a complete example.
1. Choose an AMI with the Unified CloudWatch Agent pre-installed.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ensure that you're launching your Ray EC2 cluster in the same region as the AMI,
then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file.
The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``:
.. code-block:: bash
aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1
.. code-block:: yaml
available_node_types:
ray.head.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870
ray.worker.default:
node_config:
InstanceType: c5a.large
ImageId: ami-0d88d9cbe28fac870
To build your own AMI with the Unified CloudWatch Agent installed:
1. Follow the `CloudWatch Agent Installation <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html>`_ user guide to install the Unified CloudWatch Agent on an EC2 instance.
2. Follow the `EC2 AMI Creation <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami>`_ user guide to create an AMI from this EC2 instance.
2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can start by using the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_, `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ and `CloudWatch Alarm <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-alarm-config.json>`_ config files.
These example config files include the following features:
**Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group,
and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group.
Log streams are named after the EC2 instance ID that emitted their logs.
Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace.
**Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory.
Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages.
**Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set,
and must be manually provided in your alarm config file.
For more advanced options, see the `Agent <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_, `Dashboard <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html>`_ and `Alarm <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html>`_ config user guides.
CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables:
``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster.
``{region}``: Replaced with your Ray cluster's region.
``{cluster_name}``: Replaced with your Ray cluster name.
See CloudWatch Agent `Configuration File Details <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_ for additional variables supported natively by the Unified CloudWatch Agent.
.. note::
Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file!
.. code-block:: json
"AlarmActions":[
"TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"
]
3. Reference your CloudWatch JSON config files in your cluster config YAML.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from:
.. code-block:: yaml
provider:
cloudwatch:
agent:
config: "cloudwatch/example-cloudwatch-agent-config.json"
4. Set your IAM Role and EC2 Instance Profile.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile is created at Ray cluster launch time.
This role contains all additional permissions required to integrate CloudWatch with Ray, namely the ``CloudWatchAgentAdminPolicy``, ``AmazonSSMManagedInstanceCore``, ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` managed policies.
Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML:
.. code-block:: yaml
ray.worker.default:
node_config:
InstanceType: c5a.large
IamInstanceProfile:
Name: ray-autoscaler-cloudwatch-v1
5. Export Ray system metrics to CloudWatch.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the
Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in `example-cloudwatch.yaml file <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ file.
You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace.
.. code-block:: yaml
head_setup_commands:
# Make `ray_prometheus_waiter.sh` executable.
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
# Copy `prometheus.yml` to Unified CloudWatch Agent folder
- >-
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
&& sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc
# First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store.
- >-
nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
`cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'`
>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" &
6. Update CloudWatch Agent, Dashboard and Alarm config files.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``.
The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied.
What's Next?
============
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: A end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
====================
.. include:: /_includes/_help.rst


@@ -0,0 +1,89 @@
# Launching Ray Clusters on Azure
This guide details the steps needed to start a Ray cluster on Azure.
There are two ways to start an Azure Ray cluster.
- Launch a cluster through the Ray cluster launcher.
- Deploy a cluster using the Azure portal.
```{note}
The Azure integration is community-maintained. Please reach out to the integration maintainers on GitHub if
you run into any problems: gramhagen, eisber, ijrsvt.
```
## Using Ray cluster launcher
### Install Ray cluster launcher
The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster using commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the Ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.
```bash
# install ray
pip install -U ray[default]
```
### Install and Configure Azure CLI
Next, install the Azure CLI (`pip install -U azure-cli azure-identity`) and log in using `az login`.
```bash
# Install azure cli.
pip install azure-cli azure-identity
# Login to azure. This will redirect you to your web browser.
az login
```
### Start Ray with the Ray cluster launcher
The provided [cluster config file](https://github.com/ray-project/ray/tree/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml) will create a small cluster with a Standard DS2v3 on-demand head node that is configured to autoscale up to two Standard DS2v3 [spot-instance](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms) worker nodes.
Note that you'll need to fill in your Azure [resource_group](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L42) and [location](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L41) in those templates. You also need to set the subscription to use. You can do this from the command line with `az account set -s <subscription_id>` or by filling in the [subscription_id](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L44) in the cluster config file.
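For example, to select the subscription from the command line and confirm the active account afterwards (replace the placeholder with your subscription ID):

```bash
# Select the subscription that the cluster resources should be created in.
az account set -s <subscription_id>

# Confirm which subscription is currently active.
az account show --output table
```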
Test that it works by running the following commands from your local machine:
```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/azure/example-full.yaml
# Update the example-full.yaml to update resource_group, location, and subscription_id.
# vi example-full.yaml
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml
# Get a remote screen on the head node.
ray attach example-full.yaml
# Try running a Ray program.
# Tear down the cluster.
ray down example-full.yaml
```
Congratulations, you have started a Ray cluster on Azure!
## Using Azure portal
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through the Ray autoscaler. This will deploy [Azure Data Science VMs (DSVM)](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) for both the head node and the auto-scalable cluster managed by [Azure Virtual Machine Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/).
The head node conveniently exposes both SSH as well as JupyterLab.
Once the template is successfully deployed, the deployment Outputs page provides the SSH command to connect and the link to the JupyterHub on the head node (username/password as specified in the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
```python
import ray; ray.init()
```
Under the hood, the [azure-init.sh](https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh) script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
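If you want to verify that the Ray service created by `azure-init.sh` is running, you can SSH into a node using the command from the deployment Outputs page and inspect the systemd unit. A minimal sketch (run on the node itself):

```bash
# Check the status of the ray systemd service set up by azure-init.sh.
systemctl status ray

# Optionally, follow its logs.
journalctl -u ray -f
```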


@@ -1,257 +0,0 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-azure:
Launching Ray Clusters on Azure
===============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-azure:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise, you will receive an `InvalidAccountStatus.NotEnoughBalance` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-azure:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
import time
@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()
# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -0,0 +1,58 @@
# Launching Ray Clusters on GCP
This guide details the steps needed to start a Ray cluster in GCP.
To start a GCP Ray cluster, you will use the Ray cluster launcher with the Google API client.
## Install Ray cluster launcher
The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster using commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the `ray` CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.
```bash
# install ray
pip install -U "ray[default]"
```
## Install and Configure Google API Client
If you have never created a Google APIs Console project, read Google Cloud's [Managing Projects page](https://cloud.google.com/resource-manager/docs/creating-managing-projects?visit_id=637952351450670909-433962807&rd=1) and create a project in the [Google API Console](https://console.developers.google.com/).
Next, install the Google API Client using `pip install -U google-api-python-client`.
```bash
# Install the Google API Client.
pip install -U google-api-python-client
```
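
The API client also needs credentials for your GCP account. One common way to provide them (an assumption here, not the only option) is to install the `gcloud` CLI and create Application Default Credentials, which `google-api-python-client` picks up automatically:

```bash
# Log in and create Application Default Credentials.
gcloud auth application-default login

# Optionally, set the default project for subsequent gcloud commands.
gcloud config set project <your-project-id>
```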
## Start Ray with the Ray cluster launcher
Once the Google API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided [cluster config file](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml) will create a small cluster with an on-demand n1-standard-2 head node, configured to autoscale up to two n1-standard-2 [preemptible workers](https://cloud.google.com/preemptible-vms/). Note that you'll need to fill in your GCP [project_id](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/gcp/example-full.yaml#L42) in the template.
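
For reference, the provider section of that config file looks roughly like the following; treat the values here as illustrative and check the file you actually download:

```yaml
# Cloud-provider-specific settings in example-full.yaml (GCP).
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: null # Replace null with your GCP project ID.
```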
Test that it works by running the following commands from your local machine:
```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/gcp/example-full.yaml
# Edit the example-full.yaml to update project_id.
# vi example-full.yaml
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml
# Get a remote screen on the head node.
ray attach example-full.yaml
# Try running a Ray program.
python -c 'import ray; ray.init()'
exit
# Tear down the cluster.
ray down example-full.yaml
```
Congrats, you have started a Ray cluster on GCP!

View file

@ -1,257 +0,0 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction-gcp:
Launching Ray Clusters on GCP
=============================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction-gcp:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise, you will receive an `InvalidAccountStatus.NotEnoughBalance` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction-gcp:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
import time
@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()
# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -3,10 +3,19 @@
.. include:: /_includes/clusters/we_are_hiring.rst
Launching Ray Clusters
======================
In this section, you can find guides for launching Ray clusters on various cluster management frameworks and clouds.
Table of Contents
-----------------
.. toctree::
:maxdepth: 2
aws.rst
gcp.rst
azure.rst
add-your-own-cloud-provider.rst
aws.md
aws-cloud-watch.rst
gcp.md
azure.md
on-premises.md

View file

@ -0,0 +1,119 @@
(on-prem)=
# Launching an On-Premise Cluster
This document describes how to set up an on-premise Ray cluster, i.e., how to run Ray on bare metal machines or in a private cloud. We provide two ways to start an on-premise cluster.
* You can [manually set up](manual-setup-cluster) the Ray cluster by installing the Ray package and starting the Ray processes on each node.
* Alternatively, if you know all the nodes in advance and have SSH access to them, you should start the Ray cluster using the [cluster launcher](manual-cluster-launcher).
(manual-setup-cluster)=
## Manually Set up a Ray Cluster
This section assumes that you have a list of machines and that the nodes in the cluster share the same network. It also assumes that Ray is installed on each machine. You can use pip to install the ray command line tool with cluster launcher support. Follow the [Ray installation instructions](installation) for more details.
```bash
# install ray
pip install -U "ray[default]"
```
### Start the Head Node
Choose any node to be the head node and run the following. If the `--port` argument is omitted, Ray will first choose port 6379, and then fall back to a random port if 6379 is in use.
```bash
ray start --head --port=6379
```
The command will print out the Ray cluster address, which can be passed to `ray start` on other machines to start the worker nodes (see below). If you receive a ConnectionError, check your firewall settings and network configuration.
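Once the head node is up, you can sanity-check it before adding workers. Running `ray status` on the head node prints the nodes and resources the cluster currently sees:

```bash
# Run this on the head node: show current cluster nodes and resource usage.
ray status
```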
### Start Worker Nodes
Then on each of the other nodes, run the following command to connect to the head node you just created.
```bash
ray start --address=<head-node-address:port>
```
Make sure to replace `head-node-address:port` with the value printed by the command on the head node (it should look something like 123.45.67.89:6379).
Note that if your compute nodes are on their own subnetwork with Network Address Translation, the address printed by the head node will not work if connecting from a machine outside that subnetwork. You will need to use a head node address reachable from the remote machine. If the head node has a domain address like compute04.berkeley.edu, you can simply use that in place of an IP address and rely on DNS.
Ray autodetects the resources (e.g., CPU) available on each node, but you can also manually override this by passing custom resources to the `ray start` command. For example, if you wish to specify that a machine has 10 CPUs and 1 GPU available for use by Ray, you can do this with the flags `--num-cpus=10` and `--num-gpus=1`.
See the [Configuration page](../../ray-core/configure.html#configuring-ray) for more information.
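For example, to start a worker that advertises 10 CPUs and 1 GPU to Ray (the address below is a placeholder for the value printed by your head node):

```bash
# Join the cluster and override the autodetected resources on this node.
ray start --address=<head-node-address:port> --num-cpus=10 --num-gpus=1
```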
### Troubleshooting
If you see `Unable to connect to GCS at ...`, this means the head node is inaccessible at the given `--address`.
Some possible causes include:
- the head node is not actually running;
- a different version of Ray is running at the specified address;
- the specified address is wrong;
- or there are firewall settings preventing access.
If the connection fails, you can use a tool such as `nmap` or `nc` to check whether each port can be reached from a node.
```bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp open redis? syn-ack
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
```
If the node cannot access that port at that IP address, you might see
```bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp closed redis reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
```
(manual-cluster-launcher)=
## Using Ray cluster launcher
The Ray cluster launcher is part of the `ray` command line tool. It allows you to start, stop, and attach to a running Ray cluster using commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install it; see the [Ray installation instructions](installation) for more details.
```bash
# install ray
pip install "ray[default]"
```
### Start Ray with the Ray cluster launcher
The provided [example-full.yaml](https://github.com/ray-project/ray/tree/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml) cluster config file will create a Ray cluster given a list of nodes.
Note that you'll need to fill in your [head_ip](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L20), a list of [worker_ips](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L26), and the [ssh_user](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L34) field in the template.
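Schematically, the fields to fill in live in the `provider` and `auth` sections of that file; the hostnames and user below are placeholders, so check the file you actually download:

```yaml
# Skeleton of the fields to edit in example-full.yaml (local provider).
provider:
    type: local
    head_ip: YOUR_HEAD_NODE_HOSTNAME
    # List every worker node you want the cluster to use.
    worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME]
auth:
    ssh_user: YOUR_SSH_USER
```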
Test that it works by running the following commands from your local machine:
```bash
# Download the example-full.yaml
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/local/example-full.yaml
# Edit the example-full.yaml to set head_ip, worker_ips, and ssh_user.
# vi example-full.yaml
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
ray up example-full.yaml
# Get a remote screen on the head node.
ray attach example-full.yaml
# Try running a Ray program.
# Tear down the cluster.
ray down example-full.yaml
```
Congrats, you have started a local Ray cluster!
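
As a further check, you can verify that all nodes have joined by running a small Ray program from the head node (adapted from the verification snippet used elsewhere in these docs):

```python
import time

import ray

ray.init()

@ray.remote
def f():
    time.sleep(0.01)
    return ray._private.services.get_node_ip_address()

# Print the set of IP addresses of the nodes that have joined the cluster.
print(set(ray.get([f.remote() for _ in range(1000)])))
```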

View file

@ -3,12 +3,15 @@
Monitoring and observability
----------------------------
Ray comes with 3 main observability features:
Ray comes with the following observability features:
1. :ref:`The dashboard <Ray-dashboard>`
2. :ref:`ray status <monitor-cluster>`
3. :ref:`Prometheus metrics <multi-node-metrics>`
Please refer to :ref:`the observability documentation <observability>` for more on Ray's observability features.
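For example, assuming the cluster was launched with ``ray up cluster.yaml`` (where ``cluster.yaml`` is a hypothetical config file name), you can print the autoscaler status and resource usage from your local machine:

.. code-block:: bash

    # Run `ray status` on the head node of the cluster defined by cluster.yaml.
    ray exec cluster.yaml 'ray status'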
Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

View file

@ -1,4 +0,0 @@
:::{warning}
This page is under construction!
:::
# Best practices for multi-tenancy

View file

@ -1,447 +0,0 @@
.. warning::
This page is under construction!
.. include:: /_includes/clusters/we_are_hiring.rst
.. _cluster-cloud-under-construction:
Launching Cloud Clusters
========================
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
.. _ref-cloud-setup-under-construction:
Ray with cloud providers
------------------------
.. toctree::
:hidden:
/cluster/aws-tips.rst
.. tabbed:: AWS
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
$ # Try running a Ray program.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
.. tabbed:: Azure
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
# test ray setup
$ python -c 'import ray; ray.init()'
$ exit
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
**Azure Portal**:
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
The head node conveniently exposes both SSH as well as JupyterLab.
.. image:: https://aka.ms/deploytoazurebutton
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
:alt: Deploy to Azure
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
.. code-block:: python
import ray
ray.init()
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
1. Activates one of the conda environments available on DSVM
2. Installs Ray and any other user-specified dependencies
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: GCP
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
.. tabbed:: Aliyun
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
Make sure your account balance is at least 100 RMB; otherwise, you will receive an `InvalidAccountStatus.NotEnoughBalance` error.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster.
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
.. tabbed:: Custom
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
You can specify the external node provider using the yaml config:
.. code-block:: yaml
provider:
type: external
module: mypackage.myclass
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
.. _cluster-private-setup-under-construction:
Local On Premise Cluster (List of nodes)
----------------------------------------
You would use this mode if you want to run distributed Ray applications on some local nodes available on premise.
The most preferable way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.
There are two ways of running private clusters:
- Manually managed, i.e., the user explicitly specifies the head and worker ips.
- Automatically managed, i.e., the user only specifies a coordinator address to a coordinating server that automatically coordinates its head and worker ips.
.. tip:: To avoid getting the password prompt when running private clusters make sure to setup your ssh keys on the private cluster as follows:
.. code-block:: bash
$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
.. tabbed:: Manually Managed
You can get started by filling out the fields in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.
Be sure to specify the proper ``head_ip``, list of ``worker_ips``, and the ``ssh_user`` field.
Test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. tabbed:: Automatically Managed
Start by launching the coordinator server that will manage all the on prem clusters. This server also makes sure to isolate the resources between different users. The script for running the coordinator server is `ray/python/ray/autoscaler/local/coordinator_server.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/coordinator_server.py>`__. To launch the coordinator server run:
.. code-block:: bash
$ python coordinator_server.py --ips <list_of_node_ips> --port <PORT>
where ``list_of_node_ips`` is a comma separated list of all the available nodes on the private cluster. For example, ``160.24.42.48,160.24.42.49,...`` and ``<PORT>`` is the port that the coordinator server will listen on.
After running the coordinator server it will print the address of the coordinator server. For example:
.. code-block:: bash
>> INFO:ray.autoscaler.local.coordinator_server:Running on prem coordinator server
on address <Host:PORT>
Next, the user only specifies the ``<Host:PORT>`` printed above in the ``coordinator_address`` entry instead of specific head/worker ips in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.
Now we can test that it works by running the following commands from your local machine:
.. code-block:: bash
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to get a remote shell into the head node.
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
$ # Try running a Ray program with 'ray.init()'.
# Tear down the cluster
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. _manual-cluster-under-construction:
Manual Ray Cluster Setup
------------------------
The most preferable way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand.
This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.
.. _`installation instructions`: http://docs.ray.io/en/master/installation.html
Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On the head node (just choose one node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.
.. code-block:: bash
$ ray start --head --port=6379
...
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379'
If connection fails, check your firewall settings and network configuration.
The command will print out the address of the Ray GCS server that was started
(the local node IP address plus the port number you specified).
.. note::
If you already have remote Redis instances, you can set the environment variable
`RAY_REDIS_ADDRESS=ip1:port1,ip2:port2...` to use them. The first one is
the primary and the rest are shards.
**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).
Note that if your compute nodes are on their own subnetwork with Network
Address Translation, to connect from a regular machine outside that subnetwork,
the command printed by the head node will not work. You need to find the
address that will reach the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on the DNS.
.. code-block:: bash
$ ray start --address=<address>
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
If you see ``Unable to connect to GCS at ...``,
this means the head node is inaccessible at the given ``--address`` (because, for
example, the head node is not actually running, a different version of Ray is
running at the specified address, the specified address is wrong, or there are
firewall settings preventing access).
If you see ``Ray runtime started.``, then the node successfully connected to
the head node at the ``--address``. You should now be able to connect to the
cluster with ``ray.init()``.
.. code-block:: bash
If connection fails, check your firewall settings and network configuration.
If the connection fails, to check whether each port can be reached from a node,
you can use a tool such as ``nmap`` or ``nc``.
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp open redis? syn-ack
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
If the node cannot access that port at that IP address, you might see
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp closed redis reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
Stopping Ray
~~~~~~~~~~~~
When you want to stop the Ray processes, run ``ray stop`` on each node.
Additional Cloud Providers
--------------------------
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
Security
--------
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster-under-construction:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabbed:: Python
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
For example:
.. code-block:: python
ray.init()
# Connecting to existing Ray cluster at address: <IP address>...
.. tabbed:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. tabbed:: C++
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
RAY_ADDRESS=<address> ./<binary> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
import time
@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()
# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst

View file

@ -1,5 +1,7 @@
.. _observability:
Observability
===============
=============
.. toctree::
:maxdepth: 2