Cluster Monitoring
------------------

Ray ships with the following observability features:

1. :ref:`The dashboard <ray-dashboard>`, for viewing cluster state.
2. CLI tools such as the :ref:`Ray state APIs <state-api-overview-ref>` and :ref:`ray status <monitor-cluster>`, for checking application and cluster status.
3. :ref:`Prometheus metrics <multi-node-metrics>`, for internal and custom user-defined metrics.

For more information on these tools, check out the more comprehensive :ref:`Observability guide <observability>`.

The rest of this page focuses on how to access these services when running a Ray Cluster.

Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`The dashboard <ray-dashboard>` provides detailed information about the state of the cluster,
including the running jobs, actors, workers, and nodes.
By default, the :ref:`cluster launcher <vm-cluster-quick-start>` and :ref:`KubeRay operator <kuberay-quickstart>` launch the dashboard, but do
not publicly expose the port.

.. tabbed:: If using the VM cluster launcher

    You can securely port-forward local traffic to the dashboard via the ``ray
    dashboard`` command.

    .. code-block:: shell

        $ ray dashboard [-p <port, 8265 by default>] <cluster config file>

    The dashboard will now be visible at ``http://localhost:8265``.

.. tabbed:: If using Kubernetes

    The KubeRay operator makes the dashboard available via a Service targeting
    the Ray head pod, named ``<RayCluster name>-head-svc``. You can access the
    dashboard from within the Kubernetes cluster at ``http://<RayCluster name>-head-svc:8265``.

    You can also view the dashboard from outside the Kubernetes cluster by
    using port-forwarding:

    .. code-block:: shell

        $ kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265

    For more information about configuring network access to a Ray cluster on
    Kubernetes, see the :ref:`networking notes <kuberay-networking>`.

Using Ray Cluster CLI tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ray provides several CLI tools for observing the current cluster and
application state. The ``ray status`` command provides information about the
current status of nodes in the cluster, as well as information about
autoscaling. Other Ray CLI tools allow you to read logs across the cluster and
summarize application state, such as the currently running tasks and actors.
These tools are summarized :ref:`here <state-api-overview-ref>`.

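For example, a quick look at cluster and application state might use the following commands (a sketch; ``ray list`` and ``ray logs`` belong to the Ray state CLI, and the exact subcommands and output depend on your Ray version):

.. code-block:: shell

    # Node status and autoscaler activity.
    $ ray status

    # Summaries of application state via the Ray state CLI.
    $ ray list actors
    $ ray list tasks

    # Browse logs available across the cluster.
    $ ray logs
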
These CLI commands can be run on any node in a Ray Cluster. Examples for
executing these commands from a machine outside the Ray Cluster are provided
below.

.. tabbed:: If using the VM cluster launcher

    Execute a command on the cluster using ``ray exec``:

    .. code-block:: shell

        $ ray exec <cluster config file> "ray status"

.. tabbed:: If using Kubernetes

    Execute a command on the cluster using ``kubectl exec`` and the configured
    RayCluster name. The CLI command runs on the Ray head pod, whose name is
    derived from the RayCluster name.

    .. code-block:: shell

        # First, find the name of the Ray head pod.
        $ kubectl get pod | grep <RayCluster name>-head
        # NAME                            READY   STATUS    RESTARTS   AGE
        # <RayCluster name>-head-xxxxx    2/2     Running   0          XXs

        # Then, use the name of the Ray head pod to run `ray status`.
        $ kubectl exec <RayCluster name>-head-xxxxx -- ray status

.. _multi-node-metrics:

Prometheus metrics
^^^^^^^^^^^^^^^^^^

Ray runs a metrics agent per node to export :ref:`metrics <ray-metrics>` about Ray core as well as
custom user-defined metrics. Each metrics agent collects metrics from the local
node and exposes these in a Prometheus format. You can then scrape each
endpoint to access Ray's metrics.

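To illustrate the user-defined side, the sketch below emits a custom counter through the ``ray.util.metrics`` API; the local metrics agent then exports it in Prometheus format alongside Ray's built-in metrics. The metric name and the loop are made up for this example:

.. code-block:: python

    import ray
    from ray.util.metrics import Counter

    ray.init()

    # A custom application-level metric, exported by the node's metrics agent.
    requests_processed = Counter(
        "example_requests_processed",
        description="Requests handled by the example driver.",
    )

    for _ in range(10):
        requests_processed.inc()  # increment the counter by 1

See :ref:`metrics <ray-metrics>` for the full metrics API.
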
To scrape the endpoints, we need to set up service discovery so that
Prometheus can find the metrics agents' endpoints on each node.

Auto-discovering metrics endpoints
##################################

You can allow Prometheus to dynamically find the endpoints it should scrape by using Prometheus' `file-based service discovery <https://prometheus.io/docs/guides/file-sd/#installing-configuring-and-running-prometheus>`_.
This is the recommended way to export Prometheus metrics when using the Ray :ref:`cluster launcher <vm-cluster-quick-start>`, as node IP addresses can often change as the cluster scales up and down.

Ray auto-generates a Prometheus `service discovery file <https://prometheus.io/docs/guides/file-sd/#installing-configuring-and-running-prometheus>`_ on the head node to facilitate metrics agents' service discovery.
This allows you to scrape all metrics in the cluster without knowing their IPs. Let's walk through how to achieve this.

The service discovery file is generated on the :ref:`head node <cluster-head-node>`. On this node, look for ``/tmp/ray/prom_metrics_service_discovery.json`` (or the equivalent file if using a custom Ray ``temp_dir``).
Ray periodically updates this file with the addresses of all metrics agents in the cluster.

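The file follows Prometheus' file-based service discovery format, so its contents look roughly like the sketch below; the addresses are placeholders, and the exact labels Ray writes may differ:

.. code-block:: json

    [
        {
            "labels": {"job": "ray"},
            "targets": ["192.168.1.82:54311", "192.168.1.83:54312"]
        }
    ]
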
Now, on the same node, modify a Prometheus config to scrape the file for service discovery.
Prometheus will automatically update the addresses that it scrapes based on the contents of Ray's service discovery file.

.. code-block:: yaml

    # Prometheus config file

    # my global config
    global:
      scrape_interval: 2s
      evaluation_interval: 2s

    # Scrape from Ray.
    scrape_configs:
    - job_name: 'ray'
      file_sd_configs:
      - files:
        - '/tmp/ray/prom_metrics_service_discovery.json'

Manually discovering metrics endpoints
######################################

If you already know the IP addresses of all nodes in your Ray Cluster, you can
configure Prometheus to read metrics from a static list of endpoints. To
do this, first set a fixed port that Ray should use to export metrics. If
using the cluster launcher, pass ``--metrics-export-port=<port>`` to ``ray
start``. If using KubeRay, you can specify
``rayStartParams.metrics-export-port`` in the RayCluster configuration file.
The port must be specified on all nodes in the cluster.

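With a fixed export port in place, the endpoints can be listed directly in the Prometheus configuration. A minimal sketch, assuming two nodes at placeholder addresses and ``--metrics-export-port=8080``:

.. code-block:: yaml

    # Prometheus config file with a static target list.
    scrape_configs:
    - job_name: 'ray'
      static_configs:
      - targets:
        - '192.168.1.82:8080'  # placeholder node addresses
        - '192.168.1.83:8080'
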
If you do not know the IP addresses of the nodes in your Ray Cluster,
you can also programmatically discover the endpoints by reading the
Ray Cluster information. Here, we will use a Python script and the
``ray.nodes()`` API to find the metrics agents' URLs by combining the
``NodeManagerAddress`` with the ``MetricsExportPort``. For example:

.. code-block:: python

    # On a cluster node:
    import ray
    ray.init()
    from pprint import pprint
    pprint(ray.nodes())

    """
    The <NodeManagerAddress>:<MetricsExportPort> from each of these entries
    should be passed to Prometheus.

    [{'Alive': True,
      'MetricsExportPort': 8080,
      'NodeID': '2f480984702a22556b90566bdac818a4a771e69a',
      'NodeManagerAddress': '192.168.1.82',
      'NodeManagerHostname': 'host2.attlocal.net',
      'NodeManagerPort': 61760,
      'ObjectManagerPort': 61454,
      'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store',
      'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet',
      'Resources': {'CPU': 1.0,
                    'memory': 123.0,
                    'node:192.168.1.82': 1.0,
                    'object_store_memory': 2.0},
      'alive': True},
     {'Alive': True,
      'MetricsExportPort': 8080,
      'NodeID': 'ce6f30a7e2ef58c8a6893b3df171bcd464b33c77',
      'NodeManagerAddress': '192.168.1.82',
      'NodeManagerHostname': 'host1.attlocal.net',
      'NodeManagerPort': 62052,
      'ObjectManagerPort': 61468,
      'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store.1',
      'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet.1',
      'Resources': {'CPU': 1.0,
                    'memory': 134.0,
                    'node:192.168.1.82': 1.0,
                    'object_store_memory': 2.0},
      'alive': True}]
    """

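To turn those entries into scrape targets programmatically, a helper along these lines works (a sketch that simply joins the two fields shown above):

.. code-block:: python

    import ray

    ray.init()

    # "<NodeManagerAddress>:<MetricsExportPort>" for every live node.
    metrics_targets = [
        f"{node['NodeManagerAddress']}:{node['MetricsExportPort']}"
        for node in ray.nodes()
        if node["Alive"]
    ]
    print(metrics_targets)  # e.g. ['192.168.1.82:8080', ...]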