.. include:: /_includes/clusters/we_are_hiring.rst

Monitoring and observability
----------------------------

Ray ships with the following observability features:

1. :ref:`The dashboard <ray-dashboard>`, for viewing cluster state.
2. CLI tools such as the :ref:`Ray state APIs <state-api-overview-ref>` and :ref:`ray status <monitor-cluster>`, for checking application and cluster status.
3. :ref:`Prometheus metrics <multi-node-metrics>` for internal and custom user-defined metrics.

For more information on these tools, check out the more comprehensive :ref:`Observability guide <observability>`.

The rest of this page will focus on how to access these services when running a Ray Cluster.

Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`The dashboard <ray-dashboard>` provides detailed information about the state of the cluster,
including the running jobs, actors, workers, nodes, etc.
By default, the :ref:`cluster launcher <ref-cluster-quick-start-vms-under-construction>` and :ref:`KubeRay operator <kuberay-quickstart>` will launch the dashboard, but will
not publicly expose the port.

.. tabbed:: If using the VM cluster launcher

    You can securely port-forward local traffic to the dashboard via the ``ray
    dashboard`` command.

    .. code-block:: shell

        $ ray dashboard [-p <port, 8265 by default>] <cluster config file>

    The dashboard will now be visible at ``http://localhost:8265``.

.. tabbed:: If using Kubernetes

    The KubeRay operator makes the dashboard available via a Service targeting
    the Ray head pod, named ``<RayCluster name>-head-svc``. You can access the
    dashboard from within the Kubernetes cluster at ``http://<RayCluster name>-head-svc:8265``.

    You can also view the dashboard from outside the Kubernetes cluster by
    using port-forwarding:

    .. code-block:: shell

        $ kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265

    For more information about configuring network access to a Ray cluster on
    Kubernetes, see the :ref:`networking notes <kuberay-networking>`.


Using Ray Cluster CLI tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ray provides several CLI tools for observing the current cluster and
application state. The ``ray status`` command provides information about the
current status of nodes in the cluster, as well as information about
autoscaling. Other Ray CLI tools allow you to read logs across the cluster and
summarize application state such as the currently running tasks and actors.
These tools are summarized :ref:`here <state-api-overview-ref>`.
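
For example, the commands below check the cluster's status, summarize running
tasks, list actors, and fetch logs. Treat this as a sketch: the exact
subcommands and flags can differ slightly between Ray versions, so check
``ray --help`` on your cluster.

.. code-block:: shell

    # Overall node and autoscaler status.
    $ ray status

    # Summarize and list live application state (Ray state API CLI).
    $ ray summary tasks
    $ ray list actors

    # Fetch logs across the cluster; the file name here is illustrative.
    $ ray logs raylet.out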

These CLI commands can be run on any node in a Ray Cluster. Examples for
executing these commands from a machine outside the Ray Cluster are provided
below.

.. tabbed:: If using the VM cluster launcher

    Execute a command on the cluster using ``ray exec``:

    .. code-block:: shell

        $ ray exec <cluster config file> "ray status"

.. tabbed:: If using Kubernetes

    Execute a command on the cluster using ``kubectl exec`` and the configured
    RayCluster name. We will run the CLI command inside the Ray head pod.

    .. code-block:: shell

        # First, find the name of the Ray head pod.
        $ kubectl get pod | grep <RayCluster name>-head
        # NAME                            READY   STATUS    RESTARTS   AGE
        # <RayCluster name>-head-xxxxx    2/2     Running   0          XXs

        # Then, use the name of the Ray head pod to run `ray status`.
        $ kubectl exec <RayCluster name>-head-xxxxx -- ray status

.. _multi-node-metrics:

Prometheus metrics
^^^^^^^^^^^^^^^^^^

Ray runs a metrics agent per node to export :ref:`metrics <ray-metrics>` about Ray core as well as
custom user-defined metrics. Each metrics agent collects metrics from the local
node and exposes these in a Prometheus format. You can then scrape each
endpoint to access Ray's metrics.
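
To sanity-check a single endpoint, you can fetch it directly from a node that
serves it. This is only a sketch: the port ``8080`` below is a placeholder for
whatever metrics export port the node is using, and depending on your setup the
metrics may be served at the root path or at ``/metrics``.

.. code-block:: shell

    # Run on (or with network access to) a cluster node; expects
    # Prometheus-format text output.
    $ curl http://<node ip>:8080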

To scrape the endpoints, we need to ensure service discovery, allowing
Prometheus to find the metrics agents' endpoints on each node.

Auto-discovering metrics endpoints
##################################

You can allow Prometheus to dynamically find the endpoints it should scrape by using Prometheus' `file-based service discovery <https://prometheus.io/docs/guides/file-sd/#installing-configuring-and-running-prometheus>`_.
This is the recommended way to export Prometheus metrics when using the Ray :ref:`cluster launcher <ref-cluster-quick-start-vms-under-construction>`, as node IP addresses can often change as the cluster scales up and down.

Ray auto-generates a Prometheus `service discovery file <https://prometheus.io/docs/guides/file-sd/#installing-configuring-and-running-prometheus>`_ on the head node to facilitate metrics agents' service discovery.
This allows you to scrape all metrics in the cluster without knowing their IPs. Let's walk through how to achieve this.

The service discovery file is generated on the :ref:`head node <cluster-head-node-under-construction>`. On this node, look for ``/tmp/ray/prom_metrics_service_discovery.json`` (or the equivalent file if using a custom Ray ``temp_dir``).
Ray will periodically update this file with the addresses of all metrics agents in the cluster.
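
You can inspect this file to confirm that discovery is working. It follows
Prometheus' ``file_sd`` JSON format, i.e. a list of entries with ``targets``
(``<ip>:<port>`` pairs) and optional ``labels``; the exact contents shown below
are illustrative:

.. code-block:: shell

    # On the head node:
    $ cat /tmp/ray/prom_metrics_service_discovery.json
    # Example (illustrative) contents:
    # [{"labels": {"job": "ray"}, "targets": ["192.168.1.82:8080", "192.168.1.83:8080"]}]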

Now, on the same node, modify a Prometheus config to scrape the file for service discovery.
Prometheus will automatically update the addresses that it scrapes based on the contents of Ray's service discovery file.

.. code-block:: yaml

    # Prometheus config file

    # my global config
    global:
      scrape_interval: 2s
      evaluation_interval: 2s

    # Scrape from Ray.
    scrape_configs:
    - job_name: 'ray'
      file_sd_configs:
      - files:
        - '/tmp/ray/prom_metrics_service_discovery.json'
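
With the config above saved as, say, ``prometheus.yml`` (the file name is only
an example), start Prometheus on the head node; it will then add and remove
scrape targets automatically as the cluster scales:

.. code-block:: shell

    # Assumes the Prometheus binary is in the current directory.
    $ ./prometheus --config.file=prometheus.yml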

Manually discovering metrics endpoints
######################################

If you already know the IP addresses of all nodes in your Ray Cluster, you can
configure Prometheus to read metrics from a static list of endpoints. To
do this, first set a fixed port that Ray should use to export metrics. If
using the cluster launcher, pass ``--metrics-export-port=<port>`` to ``ray
start``. If using KubeRay, you can specify
``rayStartParams.metrics-export-port`` in the RayCluster configuration file.
The port must be specified on all nodes in the cluster.
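
The corresponding static Prometheus scrape configuration then looks something
like the following sketch, where the node IPs and port ``8080`` are
placeholders for your own values:

.. code-block:: yaml

    # Excerpt from a Prometheus config file with a static target list.
    scrape_configs:
    - job_name: 'ray'
      static_configs:
      - targets: ['<node1 ip>:8080', '<node2 ip>:8080']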

If you do not know the IP addresses of the nodes in your Ray Cluster,
you can also programmatically discover the endpoints by reading the
Ray Cluster information. Here, we will use a Python script and the
``ray.nodes()`` API to find the metrics agents' URLs, by combining the
``NodeManagerAddress`` with the ``MetricsExportPort``. For example:

.. code-block:: python

    # On a cluster node:
    import ray
    ray.init()
    from pprint import pprint
    pprint(ray.nodes())

    """
    The <NodeManagerAddress>:<MetricsExportPort> from each of these entries
    should be passed to Prometheus.
    [{'Alive': True,
      'MetricsExportPort': 8080,
      'NodeID': '2f480984702a22556b90566bdac818a4a771e69a',
      'NodeManagerAddress': '192.168.1.82',
      'NodeManagerHostname': 'host2.attlocal.net',
      'NodeManagerPort': 61760,
      'ObjectManagerPort': 61454,
      'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store',
      'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet',
      'Resources': {'CPU': 1.0,
                    'memory': 123.0,
                    'node:192.168.1.82': 1.0,
                    'object_store_memory': 2.0},
      'alive': True},
     {'Alive': True,
      'MetricsExportPort': 8080,
      'NodeID': 'ce6f30a7e2ef58c8a6893b3df171bcd464b33c77',
      'NodeManagerAddress': '192.168.1.82',
      'NodeManagerHostname': 'host1.attlocal.net',
      'NodeManagerPort': 62052,
      'ObjectManagerPort': 61468,
      'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store.1',
      'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet.1',
      'Resources': {'CPU': 1.0,
                    'memory': 134.0,
                    'node:192.168.1.82': 1.0,
                    'object_store_memory': 2.0},
      'alive': True}]
    """