[Cluster-launcher doc] revamp the vm part (#27431)
This commit is contained in: parent 853c859037, commit a1d80dc195
20 changed files with 678 additions and 1571 deletions
@ -289,13 +289,10 @@ parts:
- file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/index
  title: User Guides
  sections:
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/installing-ray
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/launching-clusters/index
    title: Launching Clusters
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/running-ray-cluster-on-prem
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/monitoring-and-observing-ray-cluster
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/large-cluster-best-practices
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/multi-tenancy-best-practices
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/configuring-autoscaling
  - file: cluster/cluster_under_construction/ray-clusters-on-vms/user-guides/community-supported-cluster-manager/index
    title: Community-supported Cluster Managers

@ -23,10 +23,11 @@ How can I use Ray clusters?
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ray clusters are officially supported on the following technology stacks:

* The :ref:`Ray Cluster Launcher on AWS and GCP<ref-cluster-quick-start-vms-under-construction>`. Community-supported Azure and Aliyun integrations also exist.
* The :ref:`Ray cluster launcher on AWS and GCP<ref-cluster-quick-start-vms-under-construction>`. Community-supported Azure and Aliyun integrations also exist.
* :ref:`KubeRay, the official way to run Ray on Kubernetes<kuberay-index>`.

Advanced users may want to :ref:`deploy Ray clusters on-premise<cluster-private-setup-under-construction>` or even onto infrastructure platforms not listed here by :ref:`providing a custom node provider<additional-cloud-providers-under-construction>`.
Advanced users may want to :ref:`deploy Ray clusters on-premise <on-prem>`
or onto infrastructure platforms not listed here by :ref:`providing a custom node provider <ref-cluster-setup-under-construction>`.

Where to go from here?
----------------------

@ -48,7 +49,7 @@ Where to go from here?

---

**I want to run Ray on a cloud provider**
^^^
Take a sample application designed to run on a laptop and scale it up in the
cloud. Access to an AWS or GCP account is required.

@ -12,7 +12,7 @@ Ray Clusters Quick Start
|
|||
|
||||
This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
|
||||
|
||||
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <cluster-private-setup>` guide.
|
||||
For launching a Ray cluster manually, you can refer to the :ref:`on-premise cluster setup <on-prem>` guide.
|
||||
|
||||
About the demo
|
||||
--------------
|
||||
|
@ -207,7 +207,7 @@ A minimal sample cluster configuration file looks as follows:
|
|||
|
||||
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
|
||||
|
||||
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
|
||||
After defining our configuration, we will use the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
|
|
|
@ -8,7 +8,7 @@
|
|||
Cluster Launcher Commands
|
||||
=========================
|
||||
|
||||
This document overviews common commands for using the Ray Cluster Launcher.
|
||||
This document overviews common commands for using the Ray cluster launcher.
|
||||
See the :ref:`Cluster Configuration <cluster-config>` docs on how to customize the configuration file.
|
||||
|
||||
Launching a cluster (``ray up``)
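As a quick sketch, assuming a cluster configuration file named ``cluster.yaml`` (the name is a placeholder), the basic launch, attach, and teardown cycle looks like this:

.. code-block:: bash

    # Create or update the cluster described in the config file.
    ray up cluster.yaml --yes

    # Open an interactive shell on the head node.
    ray attach cluster.yaml

    # Tear the cluster down when you are done.
    ray down cluster.yaml --yes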
|
||||
|
|
|
@ -7,7 +7,9 @@ Community Supported Cluster Managers
|
|||
|
||||
.. note::
|
||||
|
||||
If you're using AWS, Azure or GCP you can use the :ref:`Ray Cluster Launcher <cluster-cloud>` to simplify the cluster setup process.
|
||||
If you're using AWS, Azure or GCP you can use the :ref:`Ray cluster launcher <cluster-cloud>` to simplify the cluster setup process.
|
||||
|
||||
The following is a list of community-supported cluster managers.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
@ -16,3 +18,19 @@ Community Supported Cluster Managers
|
|||
slurm.rst
|
||||
lsf.rst
|
||||
|
||||
.. _ref-additional-cloud-providers-under-construction:
|
||||
|
||||
Using a custom cloud or cluster manager
|
||||
=======================================
|
||||
|
||||
The Ray cluster launcher currently supports AWS, Azure, GCP, Aliyun, and KubeRay out of the box. To use the Ray cluster launcher and autoscaler on other cloud providers or cluster managers, you can implement the `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`_ interface (about 100 LOC).
|
||||
Once the node provider is implemented, you can register it in the `provider section <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/local/example-full.yaml#L18>`_ of the cluster launcher config.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: "external"
|
||||
module: "my.module.MyCustomNodeProvider"
|
||||
|
||||
You can refer to `AWSNodeProvider <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/node_provider.py#L95>`_, `KuberayNodeProvider <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/kuberay/node_provider.py#L148>`_ and
|
||||
`LocalNodeProvider <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/local/node_provider.py#L166>`_ for more examples.
|
||||
|
|
|
@ -1,6 +0,0 @@
|
|||
:::{warning}
|
||||
This page is under construction!
|
||||
:::
|
||||
# Installing Ray
|
||||
## Install Ray via `pip`
|
||||
## Use the Ray docker images
|
|
@ -1,11 +0,0 @@
|
|||
.. warning::
|
||||
This page is under construction!
|
||||
|
||||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
.. _additional-cloud-providers-under-construction:
|
||||
|
||||
Additional Cloud Providers
|
||||
--------------------------
|
||||
|
||||
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
|
|
@ -0,0 +1,240 @@
|
|||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
Monitor Ray using Amazon CloudWatch
|
||||
===================================
|
||||
|
||||
Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
|
||||
CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed.
|
||||
|
||||
AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions.
|
||||
Please direct any questions, comments, or issues to the `Amazon Ray Team <https://github.com/amzn/amazon-ray/issues/new/choose>`_.
|
||||
|
||||
The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `amazon-ray README <https://github.com/amzn/amazon-ray>`_.
|
||||
|
||||
.. list-table:: All available unified CloudWatch agent images
|
||||
|
||||
* - Base AMI
|
||||
- AMI ID
|
||||
- Region
|
||||
- Unified CloudWatch Agent Version
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-069f2811478f86c20
|
||||
- us-east-1
|
||||
- v1.247348.0b251302
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-058cc0932940c2b8b
|
||||
- us-east-2
|
||||
- v1.247348.0b251302
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-044f95c9ef12883ef
|
||||
- us-west-1
|
||||
- v1.247348.0b251302
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-0d88d9cbe28fac870
|
||||
- us-west-2
|
||||
- v1.247348.0b251302
|
||||
|
||||
.. note::
|
||||
|
||||
Using Amazon CloudWatch will incur charges; refer to `CloudWatch pricing <https://aws.amazon.com/cloudwatch/pricing/>`_ for details.
|
||||
|
||||
Getting started
|
||||
---------------
|
||||
|
||||
1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents:
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: aws
|
||||
region: us-west-2
|
||||
availability_zone: us-west-2a
|
||||
# Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster.
|
||||
cloudwatch:
|
||||
agent:
|
||||
# Path to Unified CloudWatch Agent config file
|
||||
config: "cloudwatch/example-cloudwatch-agent-config.json"
|
||||
dashboard:
|
||||
# CloudWatch Dashboard name
|
||||
name: "example-dashboard-name"
|
||||
# Path to the CloudWatch Dashboard config file
|
||||
config: "cloudwatch/example-cloudwatch-dashboard-config.json"
|
||||
|
||||
auth:
|
||||
ssh_user: ubuntu
|
||||
|
||||
available_node_types:
|
||||
ray.head.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
|
||||
resources: {}
|
||||
ray.worker.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
|
||||
IamInstanceProfile:
|
||||
Name: ray-autoscaler-cloudwatch-v1
|
||||
resources: {}
|
||||
min_workers: 0
|
||||
|
||||
2. Download CloudWatch Agent and Dashboard config.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``.
|
||||
Then, download the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_ and `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ config files to the ``cloudwatch`` directory.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mkdir cloudwatch
|
||||
$ cd cloudwatch
|
||||
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json
|
||||
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json
|
||||
|
||||
3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This will launch your Ray cluster in ``us-west-2`` by default. When launching a cluster for a different region, you'll need to change your cluster config YAML file's ``region`` AND ``ImageId``.
|
||||
See the "Unified CloudWatch Agent Images" table above for available AMIs by region.
|
||||
|
||||
4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console <https://console.aws.amazon.com/cloudwatch/>`_!
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To tail all logs written to a CloudWatch log group, ensure that you have the `AWS CLI V2+ installed <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_ and then run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
aws logs tail $log_group_name --follow
|
||||
|
||||
Advanced Setup
|
||||
--------------
|
||||
|
||||
Refer to `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ for a complete example.
|
||||
|
||||
1. Choose an AMI with the Unified CloudWatch Agent pre-installed.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Ensure that you're launching your Ray EC2 cluster in the same region as the AMI,
|
||||
then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file.
|
||||
|
||||
The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
available_node_types:
|
||||
ray.head.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870
|
||||
ray.worker.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870
|
||||
|
||||
To build your own AMI with the Unified CloudWatch Agent installed:
|
||||
|
||||
1. Follow the `CloudWatch Agent Installation <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html>`_ user guide to install the Unified CloudWatch Agent on an EC2 instance.
|
||||
2. Follow the `EC2 AMI Creation <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami>`_ user guide to create an AMI from this EC2 instance.
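As a sketch, once the agent is installed you can also create the AMI from the CLI; the instance ID and image name below are placeholders:

.. code-block:: bash

    # Create an AMI from the prepared EC2 instance.
    aws ec2 create-image \
        --instance-id i-0123456789abcdef0 \
        --name "ray-unified-cloudwatch-agent" \
        --description "Base image with the Unified CloudWatch Agent pre-installed"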
|
||||
|
||||
2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can start by using the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_, `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ and `CloudWatch Alarm <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-alarm-config.json>`_ config files.
|
||||
|
||||
These example config files include the following features:
|
||||
|
||||
**Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group,
|
||||
and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group.
|
||||
Log streams are named after the EC2 instance ID that emitted their logs.
|
||||
Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace.
|
||||
|
||||
**Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory.
|
||||
Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages.
|
||||
|
||||
**Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set,
|
||||
and must be manually provided in your alarm config file.
|
||||
|
||||
For more advanced options, see the `Agent <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_, `Dashboard <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html>`_ and `Alarm <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html>`_ config user guides.
|
||||
|
||||
CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables:
|
||||
|
||||
``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster.
|
||||
|
||||
``{region}``: Replaced with your Ray cluster's region.
|
||||
|
||||
``{cluster_name}``: Replaced with your Ray cluster name.
|
||||
|
||||
See CloudWatch Agent `Configuration File Details <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_ for additional variables supported natively by the Unified CloudWatch Agent.
|
||||
|
||||
.. note::
|
||||
Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file!
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
"AlarmActions":[
|
||||
"TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"
|
||||
]
|
||||
|
||||
3. Reference your CloudWatch JSON config files in your cluster config YAML.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
cloudwatch:
|
||||
agent:
|
||||
config: "cloudwatch/example-cloudwatch-agent-config.json"
|
||||
|
||||
|
||||
4. Set your IAM Role and EC2 Instance Profile.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
By default, the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile are created at Ray cluster launch time.
This role contains the additional permissions required to integrate CloudWatch with Ray, namely the ``CloudWatchAgentAdminPolicy`` and ``AmazonSSMManagedInstanceCore`` managed policies along with the ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` permissions.
|
||||
|
||||
Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
ray.worker.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
IamInstanceProfile:
|
||||
Name: ray-autoscaler-cloudwatch-v1
|
||||
|
||||
5. Export Ray system metrics to CloudWatch.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the
Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in the `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ file.
|
||||
You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
head_setup_commands:
|
||||
# Make `ray_prometheus_waiter.sh` executable.
|
||||
- >-
|
||||
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
|
||||
&& sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
|
||||
# Copy `prometheus.yml` to Unified CloudWatch Agent folder
|
||||
- >-
|
||||
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
|
||||
&& sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc
|
||||
# First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store.
|
||||
- >-
|
||||
nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
|
||||
`cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'`
|
||||
>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" &
|
||||
|
||||
6. Update CloudWatch Agent, Dashboard and Alarm config files.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``.
|
||||
The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied.
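For example, a typical update cycle with the file names used above might look like this:

.. code-block:: bash

    # Edit the agent, dashboard, or alarm JSON configs referenced by the cluster YAML,
    # then re-apply them to the running cluster; the agent is restarted automatically.
    ray up example-cloudwatch.yaml --yes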
|
|
@ -0,0 +1,126 @@
|
|||
|
||||
# Launching Ray Clusters on AWS
|
||||
|
||||
This guide details the steps needed to start a Ray cluster on AWS.
|
||||
|
||||
To start an AWS Ray cluster, you should use the Ray cluster launcher with the AWS Python SDK.
|
||||
|
||||
## Install Ray cluster launcher
|
||||
|
||||
The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster using commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the `ray` CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.
|
||||
|
||||
```bash
|
||||
# install ray
|
||||
pip install -U ray[default]
|
||||
```
|
||||
|
||||
## Install and Configure AWS Python SDK (Boto3)
|
||||
|
||||
Next, install the AWS SDK using `pip install -U boto3` and configure your AWS credentials following [the AWS guide](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).
|
||||
|
||||
```bash
|
||||
# install AWS Python SDK (boto3)
|
||||
pip install -U boto3
|
||||
|
||||
# setup AWS credentials using environment variables
|
||||
export AWS_ACCESS_KEY_ID=foo
|
||||
export AWS_SECRET_ACCESS_KEY=bar
|
||||
export AWS_SESSION_TOKEN=baz
|
||||
|
||||
# alternatively, you can setup AWS credentials using ~/.aws/credentials file
|
||||
echo "[default]
|
||||
aws_access_key_id=foo
|
||||
aws_secret_access_key=bar
|
||||
aws_session_token=baz" >> ~/.aws/credentials
|
||||
```
|
||||
|
||||
## Start Ray with the Ray cluster launcher
|
||||
|
||||
Once Boto3 is configured to manage resources in your AWS account, you should be ready to launch your cluster using the cluster launcher. The provided [cluster config file](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml) will create a small cluster with an m5.large head node (on-demand) configured to autoscale to up to two m5.large [spot-instance](https://aws.amazon.com/ec2/spot/) workers.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
```bash
|
||||
# Download the example-full.yaml
|
||||
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
ray up example-full.yaml
|
||||
|
||||
# Get a remote shell on the head node.
|
||||
ray attach example-full.yaml
|
||||
|
||||
# Try running a Ray program.
|
||||
python -c 'import ray; ray.init()'
|
||||
exit
|
||||
|
||||
# Tear down the cluster.
|
||||
ray down example-full.yaml
|
||||
```
|
||||
|
||||
Congrats, you have started a Ray cluster on AWS!
|
||||
|
||||
|
||||
If you want to learn more about the Ray cluster launcher, see this blog post for a [step by step guide](https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1).
|
||||
|
||||
|
||||
## AWS Configurations
|
||||
|
||||
### Using Amazon EFS
|
||||
|
||||
To utilize Amazon EFS in the Ray cluster, you will need to install some additional utilities and mount the EFS in `setup_commands`. Note that these instructions only work if you are using the Ray cluster launcher on AWS.
|
||||
|
||||
```yaml
|
||||
# Note: replace {{FileSystemId}} with your own EFS ID before using this config.
|
||||
# You may also need to modify the SecurityGroupIds for the head and worker nodes in the config file.
|
||||
|
||||
setup_commands:
|
||||
- sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
|
||||
sudo pkill -9 apt-get;
|
||||
sudo pkill -9 dpkg;
|
||||
sudo dpkg --configure -a;
|
||||
sudo apt-get -y install binutils;
|
||||
cd $HOME;
|
||||
git clone https://github.com/aws/efs-utils;
|
||||
cd $HOME/efs-utils;
|
||||
./build-deb.sh;
|
||||
sudo apt-get -y install ./build/amazon-efs-utils*deb;
|
||||
cd $HOME;
|
||||
mkdir efs;
|
||||
sudo mount -t efs {{FileSystemId}}:/ efs;
|
||||
sudo chmod 777 efs;
|
||||
```
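Once the cluster is up, you can spot-check that the file system is mounted on a node (for example, after `ray attach`):

```bash
# Confirm that EFS is mounted at the location used in the setup commands above.
df -h "$HOME/efs"
```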
|
||||
|
||||
### Accessing S3
|
||||
|
||||
In various scenarios, worker nodes may need write access to an S3 bucket; for example, Ray Tune has an option to write checkpoints to S3 instead of syncing them directly back to the driver.
|
||||
|
||||
If you see errors like “Unable to locate credentials”, make sure that the correct `IamInstanceProfile` is configured for worker nodes in your cluster config file. This may look like:
|
||||
|
||||
```yaml
|
||||
worker_nodes:
|
||||
InstanceType: m5.xlarge
|
||||
ImageId: latest_dlami
|
||||
IamInstanceProfile:
|
||||
Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE
|
||||
```
|
||||
|
||||
You can verify that the setup is correct by SSHing into a worker node and running
|
||||
|
||||
```bash
|
||||
aws configure list
|
||||
```
|
||||
|
||||
You should see something like
|
||||
|
||||
```bash
|
||||
Name Value Type Location
|
||||
---- ----- ---- --------
|
||||
profile <not set> None None
|
||||
access_key ****************XXXX iam-role
|
||||
secret_key ****************YYYY iam-role
|
||||
region <not set> None None
|
||||
```
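To further confirm write access, you can try a small test upload from the worker node; the bucket name below is a placeholder:

```bash
# Write a test object to the bucket your workers need to access.
echo "ray-s3-access-test" | aws s3 cp - s3://your-bucket/ray-s3-access-test.txt
```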
|
||||
|
||||
Please refer to this [discussion](https://github.com/ray-project/ray/issues/9327) for more details.
|
|
@ -1,573 +0,0 @@
|
|||
.. warning::
|
||||
This page is under construction!
|
||||
|
||||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
.. _cluster-cloud-under-construction-aws:
|
||||
|
||||
Launching Ray Clusters on AWS
|
||||
=============================
|
||||
|
||||
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
|
||||
|
||||
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
|
||||
|
||||
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
|
||||
|
||||
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
|
||||
|
||||
.. _ref-cloud-setup-under-construction-aws:
|
||||
|
||||
Ray with cloud providers
|
||||
------------------------
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
|
||||
/cluster/aws-tips.rst
|
||||
|
||||
.. tabbed:: AWS
|
||||
|
||||
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
|
||||
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
|
||||
|
||||
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
$ # Try running a Ray program.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
|
||||
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
|
||||
|
||||
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
|
||||
.. tabbed:: Azure
|
||||
|
||||
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
|
||||
|
||||
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
|
||||
|
||||
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
# test ray setup
|
||||
$ python -c 'import ray; ray.init()'
|
||||
$ exit
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
**Azure Portal**:
|
||||
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
|
||||
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
|
||||
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
|
||||
The head node conveniently exposes both SSH as well as JupyterLab.
|
||||
|
||||
.. image:: https://aka.ms/deploytoazurebutton
|
||||
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
|
||||
:alt: Deploy to Azure
|
||||
|
||||
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
|
||||
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import ray
|
||||
ray.init()
|
||||
|
||||
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
|
||||
|
||||
1. Activates one of the conda environments available on DSVM
|
||||
2. Installs Ray and any other user-specified dependencies
|
||||
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
|
||||
|
||||
|
||||
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
|
||||
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
.. tabbed:: GCP
|
||||
|
||||
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
|
||||
|
||||
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
|
||||
|
||||
.. tabbed:: Aliyun
|
||||
|
||||
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
|
||||
|
||||
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
|
||||
|
||||
Make sure your account balance is not less than 100 RMB, otherwise you will receive a `InvalidAccountStatus.NotEnoughBalance` error.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
|
||||
|
||||
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
|
||||
.. tabbed:: Custom
|
||||
|
||||
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
|
||||
You can specify the external node provider using the yaml config:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: external
|
||||
module: mypackage.myclass
|
||||
|
||||
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
|
||||
|
||||
Additional Cloud Providers
|
||||
--------------------------
|
||||
|
||||
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
|
||||
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
|
||||
|
||||
.. _using-ray-on-a-cluster-under-construction-aws:
|
||||
|
||||
Running a Ray program on the Ray cluster
|
||||
----------------------------------------
|
||||
|
||||
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
|
||||
|
||||
.. tabbed:: Python
|
||||
|
||||
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
|
||||
For example:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
ray.init()
|
||||
# Connecting to existing Ray cluster at address: <IP address>...
|
||||
|
||||
.. tabbed:: Java
|
||||
|
||||
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
|
||||
|
||||
To connect your program to the Ray cluster, run it like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
java -classpath <classpath> \
|
||||
-Dray.address=<address> \
|
||||
<classname> <args>
|
||||
|
||||
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
|
||||
|
||||
.. tabbed:: C++
|
||||
|
||||
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
|
||||
|
||||
To connect your program to the Ray cluster, run it like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
RAY_ADDRESS=<address> ./<binary> <args>
|
||||
|
||||
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
|
||||
|
||||
|
||||
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
|
||||
|
||||
To verify that the correct number of nodes have joined the cluster, you can run the following.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import time
|
||||
|
||||
@ray.remote
|
||||
def f():
|
||||
time.sleep(0.01)
|
||||
return ray._private.services.get_node_ip_address()
|
||||
|
||||
# Get a list of the IP addresses of the nodes that have joined the cluster.
|
||||
set(ray.get([f.remote() for _ in range(1000)]))
|
||||
|
||||
|
||||
.. _aws-cluster-under-construction:
|
||||
|
||||
AWS Configurations
|
||||
==================
|
||||
|
||||
.. _aws-cluster-efs-under-construction:
|
||||
|
||||
Using Amazon EFS
|
||||
----------------
|
||||
|
||||
To use Amazon EFS, install some utilities and mount the EFS in ``setup_commands``. Note that these instructions only work if you are using the AWS Autoscaler.
|
||||
|
||||
.. note::
|
||||
|
||||
You need to replace ``{{FileSystemId}}`` with your own EFS ID before using the config. You may also need to set the correct ``SecurityGroupIds`` for the instances in the config file.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
setup_commands:
|
||||
- sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1`;
|
||||
sudo pkill -9 apt-get;
|
||||
sudo pkill -9 dpkg;
|
||||
sudo dpkg --configure -a;
|
||||
sudo apt-get -y install binutils;
|
||||
cd $HOME;
|
||||
git clone https://github.com/aws/efs-utils;
|
||||
cd $HOME/efs-utils;
|
||||
./build-deb.sh;
|
||||
sudo apt-get -y install ./build/amazon-efs-utils*deb;
|
||||
cd $HOME;
|
||||
mkdir efs;
|
||||
sudo mount -t efs {{FileSystemId}}:/ efs;
|
||||
sudo chmod 777 efs;
|
||||
|
||||
.. _aws-cluster-s3-under-construction:
|
||||
|
||||
Configure worker nodes to access Amazon S3
|
||||
------------------------------------------
|
||||
|
||||
In various scenarios, worker nodes may need write access to the S3 bucket.
|
||||
E.g. Ray Tune has the option that worker nodes write distributed checkpoints to S3 instead of syncing back to the driver using rsync.
|
||||
|
||||
If you see errors like "Unable to locate credentials", make sure that the correct ``IamInstanceProfile`` is configured for worker nodes in ``cluster.yaml`` file.
|
||||
This may look like:
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
worker_nodes:
|
||||
InstanceType: m5.xlarge
|
||||
ImageId: latest_dlami
|
||||
IamInstanceProfile:
|
||||
Arn: arn:aws:iam::YOUR_AWS_ACCOUNT:YOUR_INSTANCE_PROFILE
|
||||
|
||||
You can verify that the setup is correct by SSHing into a worker node and running
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
aws configure list
|
||||
|
||||
You should see something like
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
Name Value Type Location
|
||||
---- ----- ---- --------
|
||||
profile <not set> None None
|
||||
access_key ****************XXXX iam-role
|
||||
secret_key ****************YYYY iam-role
|
||||
region <not set> None None
|
||||
|
||||
Please refer to `this discussion <https://github.com/ray-project/ray/issues/9327>`__ for more details.
|
||||
|
||||
|
||||
.. _aws-cluster-cloudwatch-under-construction:
|
||||
|
||||
Using Amazon CloudWatch
|
||||
=======================
|
||||
|
||||
Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
|
||||
CloudWatch integration with Ray requires an AMI (or Docker image) with the Unified CloudWatch Agent pre-installed.
|
||||
|
||||
AMIs with the Unified CloudWatch Agent pre-installed are provided by the Amazon Ray Team, and are currently available in the us-east-1, us-east-2, us-west-1, and us-west-2 regions.
|
||||
Please direct any questions, comments, or issues to the `Amazon Ray Team <https://github.com/amzn/amazon-ray/issues/new/choose>`_.
|
||||
|
||||
The table below lists AMIs with the Unified CloudWatch Agent pre-installed in each region, and you can also find AMIs at `amazon-ray README <https://github.com/amzn/amazon-ray>`_.
|
||||
|
||||
.. list-table:: All available unified CloudWatch agent images
|
||||
|
||||
* - Base AMI
|
||||
- AMI ID
|
||||
- Region
|
||||
- Unified CloudWatch Agent Version
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-069f2811478f86c20
|
||||
- us-east-1
|
||||
- v1.247348.0b251302
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-058cc0932940c2b8b
|
||||
- us-east-2
|
||||
- v1.247348.0b251302
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-044f95c9ef12883ef
|
||||
- us-west-1
|
||||
- v1.247348.0b251302
|
||||
* - AWS Deep Learning AMI (Ubuntu 18.04, 64-bit)
|
||||
- ami-0d88d9cbe28fac870
|
||||
- us-west-2
|
||||
- v1.247348.0b251302
|
||||
|
||||
.. note::
|
||||
|
||||
Using Amazon CloudWatch will incur charges, please refer to `CloudWatch pricing <https://aws.amazon.com/cloudwatch/pricing/>`_ for details.
|
||||
|
||||
Getting started
|
||||
---------------
|
||||
|
||||
1. Create a minimal cluster config YAML named ``cloudwatch-basic.yaml`` with the following contents:
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: aws
|
||||
region: us-west-2
|
||||
availability_zone: us-west-2a
|
||||
# Start by defining a `cloudwatch` section to enable CloudWatch integration with your Ray cluster.
|
||||
cloudwatch:
|
||||
agent:
|
||||
# Path to Unified CloudWatch Agent config file
|
||||
config: "cloudwatch/example-cloudwatch-agent-config.json"
|
||||
dashboard:
|
||||
# CloudWatch Dashboard name
|
||||
name: "example-dashboard-name"
|
||||
# Path to the CloudWatch Dashboard config file
|
||||
config: "cloudwatch/example-cloudwatch-dashboard-config.json"
|
||||
|
||||
auth:
|
||||
ssh_user: ubuntu
|
||||
|
||||
available_node_types:
|
||||
ray.head.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
|
||||
resources: {}
|
||||
ray.worker.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870 # Unified CloudWatch agent pre-installed AMI, us-west-2
|
||||
IamInstanceProfile:
|
||||
Name: ray-autoscaler-cloudwatch-v1
|
||||
resources: {}
|
||||
min_workers: 0
|
||||
|
||||
2. Download CloudWatch Agent and Dashboard config.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
First, create a ``cloudwatch`` directory in the same directory as ``cloudwatch-basic.yaml``.
|
||||
Then, download the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_ and `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ config files to the ``cloudwatch`` directory.
|
||||
|
||||
.. code-block:: console
|
||||
|
||||
$ mkdir cloudwatch
|
||||
$ cd cloudwatch
|
||||
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json
|
||||
$ wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json
|
||||
|
||||
3. Run ``ray up cloudwatch-basic.yaml`` to start your Ray Cluster.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This will launch your Ray cluster in ``us-west-2`` by default. When launching a cluster for a different region, you'll need to change your cluster config YAML file's ``region`` AND ``ImageId``.
|
||||
See the "Unified CloudWatch Agent Images" table above for available AMIs by region.
|
||||
|
||||
4. Check out your Ray cluster's logs, metrics, and dashboard in the `CloudWatch Console <https://console.aws.amazon.com/cloudwatch/>`_!
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A tail can be acquired on all logs written to a CloudWatch log group by ensuring that you have the `AWS CLI V2+ installed <https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html>`_ and then running:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
aws logs tail $log_group_name --follow
|
||||
|
||||
Advanced Setup
|
||||
--------------
|
||||
|
||||
Refer to `example-cloudwatch.yaml <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ for a complete example.
|
||||
|
||||
1. Choose an AMI with the Unified CloudWatch Agent pre-installed.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Ensure that you're launching your Ray EC2 cluster in the same region as the AMI,
|
||||
then specify the ``ImageId`` to use with your cluster's head and worker nodes in your cluster config YAML file.
|
||||
|
||||
The following CLI command returns the latest available Unified CloudWatch Agent Image for ``us-west-2``:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
aws ec2 describe-images --region us-west-2 --filters "Name=owner-id,Values=160082703681" "Name=name,Values=*cloudwatch*" --query 'Images[*].[ImageId,CreationDate]' --output text | sort -k2 -r | head -n1
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
available_node_types:
|
||||
ray.head.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870
|
||||
ray.worker.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
ImageId: ami-0d88d9cbe28fac870
|
||||
|
||||
To build your own AMI with the Unified CloudWatch Agent installed:
|
||||
|
||||
1. Follow the `CloudWatch Agent Installation <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance.html>`_ user guide to install the Unified CloudWatch Agent on an EC2 instance.
|
||||
2. Follow the `EC2 AMI Creation <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html#creating-an-ami>`_ user guide to create an AMI from this EC2 instance.
|
||||
|
||||
2. Define your own CloudWatch Agent, Dashboard, and Alarm JSON config files.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can start by using the example `CloudWatch Agent <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-agent-config.json>`_, `CloudWatch Dashboard <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-dashboard-config.json>`_ and `CloudWatch Alarm <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/cloudwatch/example-cloudwatch-alarm-config.json>`_ config files.
|
||||
|
||||
These example config files include the following features:
|
||||
|
||||
**Logs and Metrics**: Logs written to ``/tmp/ray/session_*/logs/**.out`` will be available in the ``{cluster_name}-ray_logs_out`` log group,
|
||||
and logs written to ``/tmp/ray/session_*/logs/**.err`` will be available in the ``{cluster_name}-ray_logs_err`` log group.
|
||||
Log streams are named after the EC2 instance ID that emitted their logs.
|
||||
Extended EC2 metrics including CPU/Disk/Memory usage and process statistics can be found in the ``{cluster_name}-ray-CWAgent`` metric namespace.
|
||||
|
||||
**Dashboard**: You will have a cluster-level dashboard showing total cluster CPUs and available object store memory.
|
||||
Process counts, disk usage, memory usage, and CPU utilization will be displayed as both cluster-level sums and single-node maximums/averages.
|
||||
|
||||
**Alarms**: Node-level alarms tracking prolonged high memory, disk, and CPU usage are configured. Alarm actions are NOT set,
|
||||
and must be manually provided in your alarm config file.
|
||||
|
||||
For more advanced options, see the `Agent <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_, `Dashboard <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/CloudWatch-Dashboard-Body-Structure.html>`_ and `Alarm <https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html>`_ config user guides.
|
||||
|
||||
CloudWatch Agent, Dashboard, and Alarm JSON config files support the following variables:
|
||||
|
||||
``{instance_id}``: Replaced with each EC2 instance ID in your Ray cluster.
|
||||
|
||||
``{region}``: Replaced with your Ray cluster's region.
|
||||
|
||||
``{cluster_name}``: Replaced with your Ray cluster name.
|
||||
|
||||
See CloudWatch Agent `Configuration File Details <https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html>`_ for additional variables supported natively by the Unified CloudWatch Agent.
|
||||
|
||||
.. note::
|
||||
Remember to replace the ``AlarmActions`` placeholder in your CloudWatch Alarm config file!
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
"AlarmActions":[
|
||||
"TODO: Add alarm actions! See https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html"
|
||||
]
|
||||
|
||||
3. Reference your CloudWatch JSON config files in your cluster config YAML.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Specify the file path to your CloudWatch JSON config files relative to the working directory that you will run ``ray up`` from:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
cloudwatch:
|
||||
agent:
|
||||
config: "cloudwatch/example-cloudwatch-agent-config.json"
|
||||
|
||||
|
||||
4. Set your IAM Role and EC2 Instance Profile.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
By default the ``ray-autoscaler-cloudwatch-v1`` IAM role and EC2 instance profile is created at Ray cluster launch time.
|
||||
This role contains all additional permissions required to integrate CloudWatch with Ray, namely the ``CloudWatchAgentAdminPolicy``, ``AmazonSSMManagedInstanceCore``, ``ssm:SendCommand``, ``ssm:ListCommandInvocations``, and ``iam:PassRole`` managed policies.
|
||||
|
||||
Ensure that all worker nodes are configured to use the ``ray-autoscaler-cloudwatch-v1`` EC2 instance profile in your cluster config YAML:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
ray.worker.default:
|
||||
node_config:
|
||||
InstanceType: c5a.large
|
||||
IamInstanceProfile:
|
||||
Name: ray-autoscaler-cloudwatch-v1
|
||||
|
||||
5. Export Ray system metrics to CloudWatch.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To export Ray's Prometheus system metrics to CloudWatch, first ensure that your cluster has the
|
||||
Ray Dashboard installed, then uncomment the ``head_setup_commands`` section in `example-cloudwatch.yaml file <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-cloudwatch.yaml>`_ file.
|
||||
You can find Ray Prometheus metrics in the ``{cluster_name}-ray-prometheus`` metric namespace.
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
head_setup_commands:
|
||||
# Make `ray_prometheus_waiter.sh` executable.
|
||||
- >-
|
||||
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
|
||||
&& sudo chmod +x $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
|
||||
# Copy `prometheus.yml` to Unified CloudWatch Agent folder
|
||||
- >-
|
||||
RAY_INSTALL_DIR=`pip show ray | grep -Po "(?<=Location:).*"`
|
||||
&& sudo cp -f $RAY_INSTALL_DIR/ray/autoscaler/aws/cloudwatch/prometheus.yml /opt/aws/amazon-cloudwatch-agent/etc
|
||||
# First get current cluster name, then let the Unified CloudWatch Agent restart and use `AmazonCloudWatch-ray_agent_config_{cluster_name}` parameter at SSM Parameter Store.
|
||||
- >-
|
||||
nohup sudo sh -c "`pip show ray | grep -Po "(?<=Location:).*"`/ray/autoscaler/aws/cloudwatch/ray_prometheus_waiter.sh
|
||||
`cat ~/ray_bootstrap_config.yaml | jq '.cluster_name'`
|
||||
>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.out' 2>> '/opt/aws/amazon-cloudwatch-agent/logs/ray_prometheus_waiter.err'" &
|
||||
|
||||
6. Update CloudWatch Agent, Dashboard and Alarm config files.
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can apply changes to the CloudWatch Logs, Metrics, Dashboard, and Alarms for your cluster by simply modifying the CloudWatch config files referenced by your Ray cluster config YAML and re-running ``ray up example-cloudwatch.yaml``.
|
||||
The Unified CloudWatch Agent will be automatically restarted on all cluster nodes, and your config changes will be applied.
|
||||
|
||||
|
||||
|
||||
What's Next?
|
||||
============
|
||||
|
||||
Now that you have a working understanding of the cluster launcher, check out:
|
||||
|
||||
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
|
||||
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
|
||||
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
|
||||
|
||||
|
||||
|
||||
Questions or Issues?
|
||||
====================
|
||||
|
||||
.. include:: /_includes/_help.rst
|
|
@ -0,0 +1,89 @@
|
|||
|
||||
# Launching Ray Clusters on Azure
|
||||
|
||||
This guide details the steps needed to start a Ray cluster on Azure.
|
||||
|
||||
There are two ways to start an Azure Ray cluster:
|
||||
- Launch through Ray cluster launcher.
|
||||
- Deploy a cluster using Azure portal.
|
||||
|
||||
```{note}
|
||||
The Azure integration is community-maintained. Please reach out to the integration maintainers on GitHub if
|
||||
you run into any problems: gramhagen, eisber, ijrsvt.
|
||||
```
|
||||
|
||||
## Using Ray cluster launcher
|
||||
|
||||
|
||||
### Install Ray cluster launcher
|
||||
|
||||
The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster with commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the Ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.
|
||||
|
||||
```bash
|
||||
# install ray
|
||||
pip install -U "ray[default]"
|
||||
```
|
||||
|
||||
### Install and Configure Azure CLI
|
||||
|
||||
Next, install the Azure CLI (`pip install -U azure-cli azure-identity`) and log in using `az login`.
|
||||
|
||||
```bash
|
||||
# Install azure cli.
|
||||
pip install azure-cli azure-identity
|
||||
|
||||
# Login to azure. This will redirect you to your web browser.
|
||||
az login
|
||||
```
|
||||
|
||||
### Start Ray with the Ray cluster launcher
|
||||
|
||||
|
||||
The provided [cluster config file](https://github.com/ray-project/ray/tree/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml) will create a small cluster with a Standard DS2v3 on-demand head node that is configured to autoscale up to two Standard DS2v3 [spot-instance](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms) worker nodes.
|
||||
|
||||
Note that you'll need to fill in your Azure [resource_group](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L42) and [location](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L41) in those templates. You also need to set the subscription to use. You can do this from the command line with `az account set -s <subscription_id>` or by filling in the [subscription_id](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/azure/example-full.yaml#L44) in the cluster config file.
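
For reference, the relevant `provider` section of the config roughly takes the following shape; the values shown here are placeholders, so substitute your own resource group, location, and subscription ID:

```yaml
provider:
    type: azure
    # Placeholder values. Replace with your own.
    location: westus2
    resource_group: my-ray-cluster-rg
    subscription_id: 00000000-0000-0000-0000-000000000000
```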
|
||||
|
||||
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
```bash
|
||||
# Download the example-full.yaml
|
||||
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
# Edit example-full.yaml to set resource_group, location, and subscription_id.
|
||||
# vi example-full.yaml
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
ray up example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
ray attach example-full.yaml
|
||||
# Try running a Ray program.
python -c 'import ray; ray.init()'
exit
|
||||
|
||||
# Tear down the cluster.
|
||||
ray down example-full.yaml
|
||||
```
|
||||
|
||||
Congratulations, you have started a Ray cluster on Azure!
|
||||
|
||||
## Using Azure portal
|
||||
|
||||
Alternatively, you can deploy a cluster using the Azure portal directly. Note that autoscaling is handled by Azure VM Scale Sets, not by the Ray autoscaler. This deploys [Azure Data Science VMs (DSVM)](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/) for both the head node and the autoscaling worker pool managed by [Azure Virtual Machine Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/).
|
||||
The head node conveniently exposes both SSH and JupyterLab.
|
||||
|
||||
|
||||
|
||||
Once the template is successfully deployed, the deployment Outputs page provides the SSH command to connect and the link to JupyterHub on the head node (username/password as specified in the template input).
|
||||
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
|
||||
|
||||
```python
|
||||
import ray; ray.init()
|
||||
```
|
||||
|
||||
Under the hood, the [azure-init.sh](https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh) script is executed and performs the following actions:
|
||||
|
||||
1. Activates one of the conda environments available on DSVM
|
||||
2. Installs Ray and any other user-specified dependencies
|
||||
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
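
If you need to check whether Ray came up on a node deployed this way, a quick sanity check (assuming the systemd unit name matches the service file above) is:

```bash
# Check the status of the Ray systemd service set up by azure-init.sh.
systemctl status ray
# Follow its logs if something looks wrong.
journalctl -u ray -f
```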
|
|
@ -1,257 +0,0 @@
|
|||
.. warning::
|
||||
This page is under construction!
|
||||
|
||||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
.. _cluster-cloud-under-construction-azure:
|
||||
|
||||
Launching Ray Clusters on Azure
|
||||
===============================
|
||||
|
||||
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
|
||||
|
||||
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
|
||||
|
||||
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
|
||||
|
||||
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
|
||||
|
||||
.. _ref-cloud-setup-under-construction-azure:
|
||||
|
||||
Ray with cloud providers
|
||||
------------------------
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
|
||||
/cluster/aws-tips.rst
|
||||
|
||||
.. tabbed:: AWS
|
||||
|
||||
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
|
||||
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
|
||||
|
||||
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
$ # Try running a Ray program.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
|
||||
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
|
||||
|
||||
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
|
||||
.. tabbed:: Azure
|
||||
|
||||
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
|
||||
|
||||
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
|
||||
|
||||
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
# test ray setup
|
||||
$ python -c 'import ray; ray.init()'
|
||||
$ exit
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
**Azure Portal**:
|
||||
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
|
||||
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
|
||||
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
|
||||
The head node conveniently exposes both SSH as well as JupyterLab.
|
||||
|
||||
.. image:: https://aka.ms/deploytoazurebutton
|
||||
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
|
||||
:alt: Deploy to Azure
|
||||
|
||||
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
|
||||
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import ray
|
||||
ray.init()
|
||||
|
||||
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
|
||||
|
||||
1. Activates one of the conda environments available on DSVM
|
||||
2. Installs Ray and any other user-specified dependencies
|
||||
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
|
||||
|
||||
|
||||
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
|
||||
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
.. tabbed:: GCP
|
||||
|
||||
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
|
||||
|
||||
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
|
||||
|
||||
.. tabbed:: Aliyun
|
||||
|
||||
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
|
||||
|
||||
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
|
||||
|
||||
Make sure your account balance is not less than 100 RMB, otherwise you will receive a `InvalidAccountStatus.NotEnoughBalance` error.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
|
||||
|
||||
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
|
||||
.. tabbed:: Custom
|
||||
|
||||
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
|
||||
You can specify the external node provider using the yaml config:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: external
|
||||
module: mypackage.myclass
|
||||
|
||||
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
|
||||
|
||||
Additional Cloud Providers
|
||||
--------------------------
|
||||
|
||||
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
|
||||
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
|
||||
|
||||
.. _using-ray-on-a-cluster-under-construction-azure:
|
||||
|
||||
Running a Ray program on the Ray cluster
|
||||
----------------------------------------
|
||||
|
||||
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
|
||||
|
||||
.. tabbed:: Python
|
||||
|
||||
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
|
||||
For example:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
ray.init()
|
||||
# Connecting to existing Ray cluster at address: <IP address>...
|
||||
|
||||
.. tabbed:: Java
|
||||
|
||||
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
|
||||
|
||||
To connect your program to the Ray cluster, run it like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
java -classpath <classpath> \
|
||||
-Dray.address=<address> \
|
||||
<classname> <args>
|
||||
|
||||
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
|
||||
|
||||
.. tabbed:: C++
|
||||
|
||||
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
|
||||
|
||||
To connect your program to the Ray cluster, run it like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
RAY_ADDRESS=<address> ./<binary> <args>
|
||||
|
||||
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
|
||||
|
||||
|
||||
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
|
||||
|
||||
To verify that the correct number of nodes have joined the cluster, you can run the following.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import time
|
||||
|
||||
@ray.remote
|
||||
def f():
|
||||
time.sleep(0.01)
|
||||
return ray._private.services.get_node_ip_address()
|
||||
|
||||
# Get a list of the IP addresses of the nodes that have joined the cluster.
|
||||
set(ray.get([f.remote() for _ in range(1000)]))
|
||||
|
||||
|
||||
What's Next?
|
||||
-------------
|
||||
|
||||
Now that you have a working understanding of the cluster launcher, check out:
|
||||
|
||||
* :ref:`ref-cluster-quick-start`: A end-to-end demo to run an application that autoscales.
|
||||
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
|
||||
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
|
||||
|
||||
|
||||
|
||||
Questions or Issues?
|
||||
--------------------
|
||||
|
||||
.. include:: /_includes/_help.rst
|
|
@ -0,0 +1,58 @@
|
|||
|
||||
# Launching Ray Clusters on GCP
|
||||
|
||||
This guide details the steps needed to start a Ray cluster in GCP.
|
||||
|
||||
To start a GCP Ray cluster, you will use the Ray cluster launcher with the Google API client.
|
||||
|
||||
## Install Ray cluster launcher
|
||||
|
||||
The Ray cluster launcher is part of the `ray` CLI. Use the CLI to start, stop, and attach to a running Ray cluster with commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install the Ray CLI with cluster launcher support. Follow [the Ray installation documentation](installation) for more detailed instructions.
|
||||
|
||||
```bash
|
||||
# install ray
|
||||
pip install -U "ray[default]"
|
||||
```
|
||||
|
||||
## Install and Configure Google API Client
|
||||
|
||||
If you have never created a Google APIs Console project, read Google Cloud's [Managing Projects page](https://cloud.google.com/resource-manager/docs/creating-managing-projects?visit_id=637952351450670909-433962807&rd=1) and create a project in the [Google API Console](https://console.developers.google.com/).
|
||||
Next, install the Google API Client using `pip install -U google-api-python-client`.
|
||||
|
||||
|
||||
```bash
|
||||
# Install the Google API Client.
|
||||
pip install -U google-api-python-client
|
||||
```
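
The API client also needs credentials it can pick up automatically. One common approach, assuming you have the `gcloud` CLI installed, is to create application default credentials:

```bash
# Create application default credentials that the Google API client can use.
gcloud auth application-default login

# Optionally set the default project for gcloud commands.
gcloud config set project <your-project-id>
```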
|
||||
|
||||
## Start Ray with the Ray cluster launcher
|
||||
|
||||
Once the Google API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided [cluster config file](https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml) will create a small cluster with an on-demand n1-standard-2 head node that is configured to autoscale up to two n1-standard-2 [preemptible workers](https://cloud.google.com/preemptible-vms/). Note that you'll need to fill in your GCP [project_id](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/gcp/example-full.yaml#L42) in those templates.
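
For reference, the `provider` section you need to edit roughly takes the following shape; the region, zone, and project ID below are placeholders:

```yaml
provider:
    type: gcp
    # Placeholder values. Replace with your own.
    region: us-west1
    availability_zone: us-west1-a
    project_id: my-gcp-project-id
```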
|
||||
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
```bash
|
||||
# Download the example-full.yaml
|
||||
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
# Edit the example-full.yaml to update project_id.
|
||||
# vi example-full.yaml
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
ray up example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
ray attach example-full.yaml
|
||||
|
||||
# Try running a Ray program.
|
||||
python -c 'import ray; ray.init()'
|
||||
exit
|
||||
|
||||
# Tear down the cluster.
|
||||
ray down example-full.yaml
|
||||
```
|
||||
|
||||
Congrats, you have started a Ray cluster on GCP!
|
||||
|
|
@ -1,257 +0,0 @@
|
|||
.. warning::
|
||||
This page is under construction!
|
||||
|
||||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
.. _cluster-cloud-under-construction-gcp:
|
||||
|
||||
Launching Ray Clusters on GCP
|
||||
=============================
|
||||
|
||||
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
|
||||
|
||||
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
|
||||
|
||||
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
|
||||
|
||||
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
|
||||
|
||||
.. _ref-cloud-setup-under-construction-gcp:
|
||||
|
||||
Ray with cloud providers
|
||||
------------------------
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
|
||||
/cluster/aws-tips.rst
|
||||
|
||||
.. tabbed:: AWS
|
||||
|
||||
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
|
||||
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
|
||||
|
||||
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
$ # Try running a Ray program.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
|
||||
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
|
||||
|
||||
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
|
||||
.. tabbed:: Azure
|
||||
|
||||
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
|
||||
|
||||
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
|
||||
|
||||
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
# test ray setup
|
||||
$ python -c 'import ray; ray.init()'
|
||||
$ exit
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
**Azure Portal**:
|
||||
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
|
||||
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
|
||||
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
|
||||
The head node conveniently exposes both SSH as well as JupyterLab.
|
||||
|
||||
.. image:: https://aka.ms/deploytoazurebutton
|
||||
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
|
||||
:alt: Deploy to Azure
|
||||
|
||||
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
|
||||
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import ray
|
||||
ray.init()
|
||||
|
||||
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
|
||||
|
||||
1. Activates one of the conda environments available on DSVM
|
||||
2. Installs Ray and any other user-specified dependencies
|
||||
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
|
||||
|
||||
|
||||
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
|
||||
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
.. tabbed:: GCP
|
||||
|
||||
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
|
||||
|
||||
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
|
||||
|
||||
.. tabbed:: Aliyun
|
||||
|
||||
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
|
||||
|
||||
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
|
||||
|
||||
Make sure your account balance is not less than 100 RMB, otherwise you will receive a `InvalidAccountStatus.NotEnoughBalance` error.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
|
||||
|
||||
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
|
||||
.. tabbed:: Custom
|
||||
|
||||
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
|
||||
You can specify the external node provider using the yaml config:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: external
|
||||
module: mypackage.myclass
|
||||
|
||||
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
|
||||
|
||||
Additional Cloud Providers
|
||||
--------------------------
|
||||
|
||||
To use Ray autoscaling on other Cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!
|
||||
|
||||
|
||||
Security
|
||||
--------
|
||||
|
||||
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
|
||||
|
||||
.. _using-ray-on-a-cluster-under-construction-gcp:
|
||||
|
||||
Running a Ray program on the Ray cluster
|
||||
----------------------------------------
|
||||
|
||||
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
|
||||
|
||||
.. tabbed:: Python
|
||||
|
||||
Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
|
||||
For example:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
ray.init()
|
||||
# Connecting to existing Ray cluster at address: <IP address>...
|
||||
|
||||
.. tabbed:: Java
|
||||
|
||||
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
|
||||
|
||||
To connect your program to the Ray cluster, run it like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
java -classpath <classpath> \
|
||||
-Dray.address=<address> \
|
||||
<classname> <args>
|
||||
|
||||
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
|
||||
|
||||
.. tabbed:: C++
|
||||
|
||||
You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).
|
||||
|
||||
To connect your program to the Ray cluster, run it like this:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
RAY_ADDRESS=<address> ./<binary> <args>
|
||||
|
||||
.. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
|
||||
|
||||
|
||||
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
|
||||
|
||||
To verify that the correct number of nodes have joined the cluster, you can run the following.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import time
|
||||
|
||||
@ray.remote
|
||||
def f():
|
||||
time.sleep(0.01)
|
||||
return ray._private.services.get_node_ip_address()
|
||||
|
||||
# Get a list of the IP addresses of the nodes that have joined the cluster.
|
||||
set(ray.get([f.remote() for _ in range(1000)]))
|
||||
|
||||
|
||||
What's Next?
|
||||
-------------
|
||||
|
||||
Now that you have a working understanding of the cluster launcher, check out:
|
||||
|
||||
* :ref:`ref-cluster-quick-start`: A end-to-end demo to run an application that autoscales.
|
||||
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
|
||||
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
|
||||
|
||||
|
||||
|
||||
Questions or Issues?
|
||||
--------------------
|
||||
|
||||
.. include:: /_includes/_help.rst
|
|
@ -3,10 +3,19 @@
|
|||
|
||||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
Launching Ray Clusters
|
||||
======================
|
||||
|
||||
In this section, you can find guides for launching Ray clusters on various cluster management frameworks and clouds.
|
||||
|
||||
Table of Contents
|
||||
-----------------
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
aws.rst
|
||||
gcp.rst
|
||||
azure.rst
|
||||
add-your-own-cloud-provider.rst
|
||||
aws.md
|
||||
aws-cloud-watch.rst
|
||||
gcp.md
|
||||
azure.md
|
||||
on-premises.md
|
|
@ -0,0 +1,119 @@
|
|||
(on-prem)=
|
||||
|
||||
# Launching an On-Premise Cluster
|
||||
|
||||
This document describes how to set up an on-premise Ray cluster, i.e., how to run Ray on bare metal machines or in a private cloud. We provide two ways to start an on-premise cluster.
|
||||
|
||||
* You can [manually set up](manual-setup-cluster) the Ray cluster by installing the Ray package and starting the Ray processes on each node.
|
||||
* Alternatively, if you know all the nodes in advance and have SSH access to them, you should start the Ray cluster using the [cluster launcher](manual-cluster-launcher).
|
||||
|
||||
(manual-setup-cluster)=
|
||||
|
||||
## Manually Set up a Ray Cluster
|
||||
This section assumes that you have a list of machines and that the nodes in the cluster share the same network. It also assumes that Ray is installed on each machine. You can use pip to install the ray command line tool with cluster launcher support. Follow the [Ray installation instructions](installation) for more details.
|
||||
|
||||
```bash
|
||||
# install ray
|
||||
pip install -U "ray[default]"
|
||||
```
|
||||
|
||||
### Start the Head Node
|
||||
Choose any node to be the head node and run the following. If the `--port` argument is omitted, Ray will first choose port 6379, and then fall back to a random port if 6379 is in use.
|
||||
|
||||
```bash
|
||||
ray start --head --port=6379
|
||||
```
|
||||
|
||||
The command will print out the Ray cluster address, which can be passed to `ray start` on other machines to start the worker nodes (see below). If you receive a ConnectionError, check your firewall settings and network configuration.
|
||||
|
||||
### Start Worker Nodes
|
||||
Then on each of the other nodes, run the following command to connect to the head node you just created.
|
||||
|
||||
```bash
|
||||
ray start --address=<head-node-address:port>
|
||||
```
|
||||
Make sure to replace `head-node-address:port` with the value printed by the command on the head node (it should look something like 123.45.67.89:6379).
|
||||
|
||||
Note that if your compute nodes are on their own subnetwork with Network Address Translation, the address printed by the head node will not work if connecting from a machine outside that subnetwork. You will need to use a head node address reachable from the remote machine. If the head node has a domain address like compute04.berkeley.edu, you can simply use that in place of an IP address and rely on DNS.
|
||||
|
||||
Ray autodetects the resources (e.g., CPU) available on each node, but you can also manually override this by passing custom resources to the `ray start` command. For example, if you wish to specify that a machine has 10 CPUs and 1 GPU available for use by Ray, you can do this with the flags `--num-cpus=10` and `--num-gpus=1`.
|
||||
See the [Configuration page](../../ray-core/configure.html#configuring-ray) for more information.
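
For example, a worker node started with the following command advertises 10 CPUs and 1 GPU to Ray instead of the autodetected values:

```bash
# Override resource autodetection when joining the cluster.
ray start --address=<head-node-address:port> --num-cpus=10 --num-gpus=1
```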
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
If you see `Unable to connect to GCS at ...`, this means the head node is inaccessible at the given `--address`.
|
||||
Some possible causes include:
|
||||
|
||||
- the head node is not actually running;
|
||||
- a different version of Ray is running at the specified address;
|
||||
- the specified address is wrong;
|
||||
- or there are firewall settings preventing access.
|
||||
|
||||
If the connection fails, you can use a tool such as `nmap` or `nc` to check whether each port can be reached from a node.
|
||||
|
||||
```bash
|
||||
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
|
||||
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
|
||||
Host is up, received echo-reply ttl 60 (0.00087s latency).
|
||||
rDNS record for 123.456.78.910: compute04.berkeley.edu
|
||||
PORT STATE SERVICE REASON VERSION
|
||||
6379/tcp open redis? syn-ack
|
||||
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
|
||||
$ nc -vv -z $HEAD_ADDRESS $PORT
|
||||
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
|
||||
```
|
||||
|
||||
If the node cannot access that port at that IP address, you might see
|
||||
|
||||
```bash
|
||||
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
|
||||
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
|
||||
Host is up (0.0011s latency).
|
||||
rDNS record for 123.456.78.910: compute04.berkeley.edu
|
||||
PORT STATE SERVICE REASON VERSION
|
||||
6379/tcp closed redis reset ttl 60
|
||||
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
|
||||
$ nc -vv -z $HEAD_ADDRESS $PORT
|
||||
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
|
||||
```
|
||||
(manual-cluster-launcher)=
|
||||
|
||||
## Using Ray cluster launcher
|
||||
|
||||
The Ray cluster launcher is part of the `ray` command line tool. It allows you to start, stop, and attach to a running Ray cluster using commands such as `ray up`, `ray down`, and `ray attach`. You can use pip to install it, or follow the [Ray installation instructions](installation) for more details.
|
||||
|
||||
```bash
|
||||
# install ray
|
||||
pip install "ray[default]"
|
||||
```
|
||||
|
||||
### Start Ray with the Ray cluster launcher
|
||||
|
||||
The provided [example-full.yaml](https://github.com/ray-project/ray/tree/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml) cluster config file will create a Ray cluster given a list of nodes.
|
||||
|
||||
Note that you'll need to fill in your [head_ip](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L20), a list of [worker_ips](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L26), and the [ssh_user](https://github.com/ray-project/ray/blob/eacc763c84d47c9c5b86b26a32fd62c685be84e6/python/ray/autoscaler/local/example-full.yaml#L34) field in those templates.
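
The fields to edit roughly take the following shape; the IP addresses and SSH user below are placeholders:

```yaml
provider:
    type: local
    head_ip: 192.168.0.1
    worker_ips: [192.168.0.2, 192.168.0.3]

auth:
    ssh_user: ubuntu
```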
|
||||
|
||||
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
```bash
|
||||
# Download the example-full.yaml
|
||||
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/local/example-full.yaml
|
||||
|
||||
# Update the example-full.yaml to update head_ip, worker_ips, and ssh_user.
|
||||
# vi example-full.yaml
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
ray up example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
ray attach example-full.yaml
|
||||
# Try running a Ray program.
python -c 'import ray; ray.init()'
exit
|
||||
|
||||
# Tear down the cluster.
|
||||
ray down example-full.yaml
|
||||
```
|
||||
|
||||
Congrats, you have started a local Ray cluster!
|
|
@ -3,12 +3,15 @@
|
|||
Monitoring and observability
|
||||
----------------------------
|
||||
|
||||
Ray comes with 3 main observability features:
|
||||
Ray comes with the following observability features:
|
||||
|
||||
1. :ref:`The dashboard <Ray-dashboard>`
|
||||
2. :ref:`ray status <monitor-cluster>`
|
||||
3. :ref:`Prometheus metrics <multi-node-metrics>`
|
||||
|
||||
Please refer to :ref:`the observability documentation <observability>` for more on Ray's observability features.
|
||||
|
||||
|
||||
Monitoring the cluster via the dashboard
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
|
|
|
@ -1,4 +0,0 @@
|
|||
:::{warning}
|
||||
This page is under construction!
|
||||
:::
|
||||
# Best practices for multi-tenancy
|
|
@ -1,447 +0,0 @@
|
|||
.. warning::
|
||||
This page is under construction!
|
||||
|
||||
.. include:: /_includes/clusters/we_are_hiring.rst
|
||||
|
||||
.. _cluster-cloud-under-construction:
|
||||
|
||||
Launching Cloud Clusters
|
||||
========================
|
||||
|
||||
This section provides instructions for configuring the Ray Cluster Launcher to use with various cloud providers or on a private cluster of host machines.
|
||||
|
||||
See this blog post for a `step by step guide`_ to using the Ray Cluster Launcher.
|
||||
|
||||
To learn about deploying Ray on an existing Kubernetes cluster, refer to the guide :ref:`here<kuberay-index>`.
|
||||
|
||||
.. _`step by step guide`: https://medium.com/distributed-computing-with-ray/a-step-by-step-guide-to-scaling-your-first-python-application-in-the-cloud-8761fe331ef1
|
||||
|
||||
.. _ref-cloud-setup-under-construction:
|
||||
|
||||
Ray with cloud providers
|
||||
------------------------
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
|
||||
/cluster/aws-tips.rst
|
||||
|
||||
.. tabbed:: AWS
|
||||
|
||||
First, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials``,
|
||||
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
|
||||
|
||||
Once boto is configured to manage resources on your AWS account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/aws/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example-full.yaml>`__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
$ # Try running a Ray program.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aws/example-full.yaml
|
||||
|
||||
|
||||
AWS Node Provider Maintainers (GitHub handles): pdames, Zyiqin-Miranda, DmitriGekhtman, wuisawesome
|
||||
|
||||
See :ref:`aws-cluster` for recipes on customizing AWS clusters.
|
||||
.. tabbed:: Azure
|
||||
|
||||
First, install the Azure CLI (``pip install azure-cli azure-identity``) then login using (``az login``).
|
||||
|
||||
Set the subscription to use from the command line (``az account set -s <subscription_id>``) or by modifying the provider section of the config provided e.g: `ray/python/ray/autoscaler/azure/example-full.yaml`
|
||||
|
||||
Once the Azure CLI is configured to manage resources on your Azure account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/azure/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/azure/example-full.yaml>`__ cluster config file will create a small cluster with a Standard DS2v3 head node (on-demand) configured to autoscale up to two Standard DS2v3 `spot workers <https://docs.microsoft.com/en-us/azure/virtual-machines/windows/spot-vms>`__. Note that you'll need to fill in your resource group and location in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
# test ray setup
|
||||
$ python -c 'import ray; ray.init()'
|
||||
$ exit
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/azure/example-full.yaml
|
||||
|
||||
**Azure Portal**:
|
||||
Alternatively, you can deploy a cluster using Azure portal directly. Please note that autoscaling is done using Azure VM Scale Sets and not through
|
||||
the Ray autoscaler. This will deploy `Azure Data Science VMs (DSVM) <https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/>`_
|
||||
for both the head node and the auto-scalable cluster managed by `Azure Virtual Machine Scale Sets <https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets/>`_.
|
||||
The head node conveniently exposes both SSH as well as JupyterLab.
|
||||
|
||||
.. image:: https://aka.ms/deploytoazurebutton
|
||||
:target: https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fray-project%2Fray%2Fmaster%2Fdoc%2Fazure%2Fazure-ray-template.json
|
||||
:alt: Deploy to Azure
|
||||
|
||||
Once the template is successfully deployed the deployment Outputs page provides the ssh command to connect and the link to the JupyterHub on the head node (username/password as specified on the template input).
|
||||
Use the following code in a Jupyter notebook (using the conda environment specified in the template input, py38_tensorflow by default) to connect to the Ray cluster.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import ray
|
||||
ray.init()
|
||||
|
||||
Note that on each node the `azure-init.sh <https://github.com/ray-project/ray/blob/master/doc/azure/azure-init.sh>`_ script is executed and performs the following actions:
|
||||
|
||||
1. Activates one of the conda environments available on DSVM
|
||||
2. Installs Ray and any other user-specified dependencies
|
||||
3. Sets up a systemd task (``/lib/systemd/system/ray.service``) to start Ray in head or worker mode
|
||||
|
||||
|
||||
Azure Node Provider Maintainers (GitHub handles): gramhagen, eisber, ijrsvt
|
||||
.. note:: The Azure Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
.. tabbed:: GCP
|
||||
|
||||
First, install the Google API client (``pip install google-api-python-client``), set up your GCP credentials, and create a new GCP project.
|
||||
|
||||
Once the API client is configured to manage resources on your GCP account, you should be ready to launch your cluster. The provided `ray/python/ray/autoscaler/gcp/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/gcp/example-full.yaml>`__ cluster config file will create a small cluster with a n1-standard-2 head node (on-demand) configured to autoscale up to two n1-standard-2 `preemptible workers <https://cloud.google.com/preemptible-vms/>`__. Note that you'll need to fill in your project id in those templates.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
|
||||
|
||||
GCP Node Provider Maintainers (GitHub handles): wuisawesome, DmitriGekhtman, ijrsvt
|
||||
|
||||
.. tabbed:: Aliyun
|
||||
|
||||
First, install the aliyun client package (``pip install aliyun-python-sdk-core aliyun-python-sdk-ecs``). Obtain the AccessKey pair of the Aliyun account as described in `the docs <https://www.alibabacloud.com/help/en/doc-detail/175967.htm>`__ and grant AliyunECSFullAccess/AliyunVPCFullAccess permissions to the RAM user. Finally, set the AccessKey pair in your cluster config file.
|
||||
|
||||
Once the above is done, you should be ready to launch your cluster. The provided `aliyun/example-full.yaml </ray/python/ray/autoscaler/aliyun/example-full.yaml>`__ cluster config file will create a small cluster with an ``ecs.n4.large`` head node (on-demand) configured to autoscale up to two ``ecs.n4.2xlarge`` nodes.
|
||||
|
||||
Make sure your account balance is not less than 100 RMB, otherwise you will receive a `InvalidAccountStatus.NotEnoughBalance` error.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to SSH into the cluster head node.
|
||||
$ ray up ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster.
|
||||
$ ray down ray/python/ray/autoscaler/aliyun/example-full.yaml
|
||||
|
||||
Aliyun Node Provider Maintainers (GitHub handles): zhuangzhuang131419, chenk008
|
||||
|
||||
.. note:: The Aliyun Node Provider is community-maintained. It is maintained by its authors, not the Ray team.
|
||||
|
||||
|
||||
.. tabbed:: Custom
|
||||
|
||||
Ray also supports external node providers (check `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__ implementation).
|
||||
You can specify the external node provider using the yaml config:
|
||||
|
||||
.. code-block:: yaml
|
||||
|
||||
provider:
|
||||
type: external
|
||||
module: mypackage.myclass
|
||||
|
||||
The module needs to be in the format ``package.provider_class`` or ``package.sub_package.provider_class``.
|
||||
|
||||
|
||||
.. _cluster-private-setup-under-construction:
|
||||
|
||||
Local On Premise Cluster (List of nodes)
|
||||
----------------------------------------
|
||||
You would use this mode if you want to run distributed Ray applications on some local nodes available on premise.
|
||||
|
||||
The most preferable way to run a Ray cluster on a private cluster of hosts is via the Ray Cluster Launcher.
|
||||
|
||||
There are two ways of running private clusters:
|
||||
|
||||
- Manually managed, i.e., the user explicitly specifies the head and worker ips.
|
||||
|
||||
- Automatically managed, i.e., the user only specifies a coordinator address to a coordinating server that automatically coordinates its head and worker ips.
|
||||
|
||||
.. tip:: To avoid getting the password prompt when running private clusters make sure to setup your ssh keys on the private cluster as follows:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ ssh-keygen
|
||||
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
|
||||
|
||||
.. tabbed:: Manually Managed
|
||||
|
||||
|
||||
You can get started by filling out the fields in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.
|
||||
Be sure to specify the proper ``head_ip``, list of ``worker_ips``, and the ``ssh_user`` field.
|
||||
|
||||
Test that it works by running the following commands from your local machine:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Create or update the cluster. When the command finishes, it will print
|
||||
# out the command that can be used to get a remote shell into the head node.
|
||||
$ ray up ray/python/ray/autoscaler/local/example-full.yaml
|
||||
|
||||
# Get a remote screen on the head node.
|
||||
$ ray attach ray/python/ray/autoscaler/local/example-full.yaml
|
||||
$ # Try running a Ray program with 'ray.init()'.
|
||||
|
||||
# Tear down the cluster
|
||||
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
|
||||
|
||||
.. tabbed:: Automatically Managed
|
||||
|
||||
|
||||
Start by launching the coordinator server that will manage all the on prem clusters. This server also makes sure to isolate the resources between different users. The script for running the coordinator server is `ray/python/ray/autoscaler/local/coordinator_server.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/coordinator_server.py>`__. To launch the coordinator server run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ python coordinator_server.py --ips <list_of_node_ips> --port <PORT>
|
||||
|
||||
where ``list_of_node_ips`` is a comma separated list of all the available nodes on the private cluster. For example, ``160.24.42.48,160.24.42.49,...`` and ``<PORT>`` is the port that the coordinator server will listen on.
|
||||
After running the coordinator server it will print the address of the coordinator server. For example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
>> INFO:ray.autoscaler.local.coordinator_server:Running on prem coordinator server
|
||||
on address <Host:PORT>
|
||||
|
||||
Next, the user only specifies the ``<Host:PORT>`` printed above in the ``coordinator_address`` entry instead of specific head/worker ips in the provided `ray/python/ray/autoscaler/local/example-full.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/local/example-full.yaml>`__.

    Now we can test that it works by running the following commands from your local machine:

    .. code-block:: bash

        # Create or update the cluster. When the command finishes, it will print
        # out the command that can be used to get a remote shell into the head node.
        $ ray up ray/python/ray/autoscaler/local/example-full.yaml

        # Get a remote screen on the head node.
        $ ray attach ray/python/ray/autoscaler/local/example-full.yaml
        $ # Try running a Ray program with 'ray.init()'.

        # Tear down the cluster
        $ ray down ray/python/ray/autoscaler/local/example-full.yaml

.. _manual-cluster-under-construction:

Manual Ray Cluster Setup
------------------------

The preferred way to run a Ray cluster is via the Ray cluster launcher. However, it is also possible to start a Ray cluster by hand.

This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.

.. _`installation instructions`: http://docs.ray.io/en/master/installation.html

Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the head node (just choose one node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port if 6379 is unavailable.

.. code-block:: bash

    $ ray start --head --port=6379
    ...
    Next steps
      To connect to this Ray runtime from another node, run
        ray start --address='<ip address>:6379'

      If connection fails, check your firewall settings and network configuration.

The command will print out the address of the Ray GCS server that was started
(the local node IP address plus the port number you specified).

.. note::

    If you already have remote Redis instances, you can set the environment variable
    ``RAY_REDIS_ADDRESS=ip1:port1,ip2:port2...`` to use them. The first address is
    treated as the primary and the rest as shards.
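
    For example, a minimal sketch of starting the head node against two existing Redis instances (the addresses are placeholders):

    .. code-block:: bash

        $ RAY_REDIS_ADDRESS=10.0.0.5:6379,10.0.0.6:6379 ray start --head --port=6379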

**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).

Note that if your compute nodes are on their own subnetwork with Network
Address Translation, the command printed by the head node will not work when
you connect from a regular machine outside that subnetwork. You need to find
an address that reaches the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on DNS.

.. code-block:: bash

    $ ray start --address=<address>
    --------------------
    Ray runtime started.
    --------------------

    To terminate the Ray runtime, run
      ray stop

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
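
For example, to start a worker node that advertises 10 CPUs and 1 GPU, the command might look like:

.. code-block:: bash

    $ ray start --address=<address> --num-cpus=10 --num-gpus=1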

If you see ``Unable to connect to GCS at ...``,
this means the head node is inaccessible at the given ``--address`` (because, for
example, the head node is not actually running, a different version of Ray is
running at the specified address, the specified address is wrong, or there are
firewall settings preventing access).

If you see ``Ray runtime started.``, then the node successfully connected to
the head node at the ``--address``. You should now be able to connect to the
cluster with ``ray.init()``.

If the connection fails, you can use a tool such as ``nmap`` or ``nc`` to
check whether each port can be reached from a node.

.. code-block:: bash

    $ nmap -sV --reason -p $PORT $HEAD_ADDRESS
    Nmap scan report for compute04.berkeley.edu (123.456.78.910)
    Host is up, received echo-reply ttl 60 (0.00087s latency).
    rDNS record for 123.456.78.910: compute04.berkeley.edu
    PORT     STATE SERVICE REASON  VERSION
    6379/tcp open  redis?  syn-ack
    Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
    $ nc -vv -z $HEAD_ADDRESS $PORT
    Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!

If the node cannot access that port at that IP address, you might see

.. code-block:: bash

    $ nmap -sV --reason -p $PORT $HEAD_ADDRESS
    Nmap scan report for compute04.berkeley.edu (123.456.78.910)
    Host is up (0.0011s latency).
    rDNS record for 123.456.78.910: compute04.berkeley.edu
    PORT     STATE  SERVICE REASON       VERSION
    6379/tcp closed redis   reset ttl 60
    Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
    $ nc -vv -z $HEAD_ADDRESS $PORT
    nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused

Stopping Ray
~~~~~~~~~~~~

When you want to stop the Ray processes, run ``ray stop`` on each node.

Additional Cloud Providers
--------------------------

To use Ray autoscaling on other cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface (about 100 lines of code) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__. Contributions are welcome!

Security
--------

On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.

.. _using-ray-on-a-cluster-under-construction:

Running a Ray program on the Ray cluster
----------------------------------------

To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.

.. tabbed:: Python

    Within your program/script, ``ray.init()`` will now automatically find and connect to the latest Ray cluster.
    For example:

    .. code-block:: python

        import ray

        ray.init()
        # Connecting to existing Ray cluster at address: <IP address>...

.. tabbed:: Java

    You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).

    To connect your program to the Ray cluster, run it like this:

    .. code-block:: bash

        java -classpath <classpath> \
          -Dray.address=<address> \
          <classname> <args>

    .. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.

.. tabbed:: C++

    You need to add the ``RAY_ADDRESS`` env var to your command line (like ``RAY_ADDRESS=...``).

    To connect your program to the Ray cluster, run it like this:

    .. code-block:: bash

        RAY_ADDRESS=<address> ./<binary> <args>

    .. note:: Specifying ``auto`` as the address hasn't been implemented in C++ yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.

.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.

To verify that the correct number of nodes have joined the cluster, you can run the following.

.. code-block:: python

    import time

    import ray

    # Connect to the running cluster.
    ray.init()

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray._private.services.get_node_ip_address()

    # Get a list of the IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))

What's Next?
------------

Now that you have a working understanding of the cluster launcher, check out:

* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.


Questions or Issues?
--------------------

.. include:: /_includes/_help.rst

@@ -1,5 +1,7 @@
.. _observability:

Observability
=============

.. toctree::
   :maxdepth: 2