Cluster Monitoring
------------------

Ray ships with the following observability features:

1. :ref:`The dashboard`, for viewing cluster state.
2. CLI tools such as the :ref:`Ray state APIs` and :ref:`ray status`, for checking application and cluster status.
3. :ref:`Prometheus metrics` for internal and custom user-defined metrics.

For more information on these tools, check out the more comprehensive :ref:`Observability guide`.
The rest of this page focuses on how to access these services when running a Ray Cluster.

Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`The dashboard` provides detailed information about the state of the cluster, including the running jobs, actors, workers, nodes, etc.

By default, the :ref:`cluster launcher` and :ref:`KubeRay operator` will launch the dashboard, but will not publicly expose the port.

.. tabbed:: If using the VM cluster launcher

    You can securely port-forward local traffic to the dashboard via the ``ray dashboard`` command.

    .. code-block:: shell

        $ ray dashboard [-p <port, 8265 by default>] <cluster config file>

    The dashboard will now be visible at ``http://localhost:8265``.

.. tabbed:: If using Kubernetes

    The KubeRay operator makes the dashboard available via a Service targeting the Ray head pod, named ``<RayCluster name>-head-svc``. You can access the dashboard from within the Kubernetes cluster at ``http://<RayCluster name>-head-svc:8265``.

    You can also view the dashboard from outside the Kubernetes cluster by using port-forwarding:

    .. code-block:: shell

        $ kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265

    For more information about configuring network access to a Ray cluster on Kubernetes, see the :ref:`networking notes`.

Using Ray Cluster CLI tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ray provides several CLI tools for observing the current cluster and application state.

The ``ray status`` command reports the current status of nodes in the cluster, as well as autoscaling information. Other Ray CLI tools let you read logs across the cluster and summarize application state, such as the currently running tasks and actors. These tools are summarized :ref:`here`.

These CLI commands can be run on any node in a Ray Cluster. Examples of executing them from a machine outside the Ray Cluster are provided below.

.. tabbed:: If using the VM cluster launcher

    Execute a command on the cluster using ``ray exec``:

    .. code-block:: shell

        $ ray exec <cluster config file> "ray status"

.. tabbed:: If using Kubernetes

    Execute a command on the cluster using ``kubectl exec`` and the configured RayCluster name. We will exec into the Ray head pod to run a CLI command on the cluster.

    .. code-block:: shell

        # First, find the name of the Ray head pod.
        $ kubectl get pod | grep <RayCluster name>-head
        # NAME                           READY   STATUS    RESTARTS   AGE
        # <RayCluster name>-head-xxxxx   2/2     Running   0          XXs

        # Then, use the name of the Ray head pod to run `ray status`.
        $ kubectl exec <RayCluster name>-head-xxxxx -- ray status

.. _multi-node-metrics:

Prometheus metrics
^^^^^^^^^^^^^^^^^^

Ray runs a metrics agent per node to export :ref:`metrics` about Ray core as well as custom user-defined metrics. Each metrics agent collects metrics from the local node and exposes them in a Prometheus format. You can then scrape each endpoint to access Ray's metrics.

To scrape the endpoints, we need to ensure service discovery, allowing Prometheus to find the metrics agents' endpoints on each node.
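Before moving on to service discovery, here is a minimal sketch of what a custom user-defined metric looks like in application code (the metric name, description, and tag key below are illustrative, not part of Ray's API). Metrics defined through ``ray.util.metrics`` are exported by the local metrics agent alongside Ray's built-in metrics.

.. code-block:: python

    import ray
    from ray.util.metrics import Counter

    ray.init()

    # Illustrative application-level metric; the name and tag key are arbitrary.
    request_counter = Counter(
        "num_requests",
        description="Number of requests processed by the application.",
        tag_keys=("route",),
    )

    # Each call increments the counter; the per-node metrics agent exposes it
    # in Prometheus format along with Ray's internal metrics.
    request_counter.inc(1.0, tags={"route": "/predict"})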
Auto-discovering metrics endpoints
##################################

You can allow Prometheus to dynamically find the endpoints it should scrape by using Prometheus' `file-based service discovery <https://prometheus.io/docs/guides/file-sd/>`_. This is the recommended way to export Prometheus metrics when using the Ray :ref:`cluster launcher`, as node IP addresses can often change as the cluster scales up and down.

Ray auto-generates a Prometheus service discovery file on the head node to facilitate metrics agents' service discovery. This allows you to scrape all metrics in the cluster without knowing their IPs. Let's walk through how to achieve this.

The service discovery file is generated on the :ref:`head node`. On this node, look for ``/tmp/ray/prom_metrics_service_discovery.json`` (or the equivalent file if using a custom Ray ``temp_dir``). Ray periodically updates this file with the addresses of all metrics agents in the cluster.

Now, on the same node, modify a Prometheus config to scrape the file for service discovery. Prometheus will automatically update the addresses that it scrapes based on the contents of Ray's service discovery file.

.. code-block:: yaml

    # Prometheus config file

    # my global config
    global:
      scrape_interval: 2s
      evaluation_interval: 2s

    # Scrape from Ray.
    scrape_configs:
    - job_name: 'ray'
      file_sd_configs:
      - files:
        - '/tmp/ray/prom_metrics_service_discovery.json'

Manually discovering metrics endpoints
######################################

If you already know the IP addresses of all nodes in your Ray Cluster, you can configure Prometheus to read metrics from a static list of endpoints. To do this, first set a fixed port that Ray should use to export metrics. If using the VM cluster launcher, pass ``--metrics-export-port=<port>`` to ``ray start``. If using KubeRay, specify ``rayStartParams.metrics-export-port`` in the RayCluster configuration file. The port must be specified on all nodes in the cluster.

If you do not know the IP addresses of the nodes in your Ray Cluster, you can also programmatically discover the endpoints by reading the Ray Cluster information. Here, we use a Python script and the ``ray.nodes()`` API to find the metrics agents' URLs by combining the ``NodeManagerAddress`` with the ``MetricsExportPort``. For example:

.. code-block:: python

    # On a cluster node:
    import ray
    ray.init()
    from pprint import pprint
    pprint(ray.nodes())

    """
    Pass the <NodeManagerAddress>:<MetricsExportPort> from each of these entries
    to Prometheus.
    [{'Alive': True,
      'MetricsExportPort': 8080,
      'NodeID': '2f480984702a22556b90566bdac818a4a771e69a',
      'NodeManagerAddress': '192.168.1.82',
      'NodeManagerHostname': 'host2.attlocal.net',
      'NodeManagerPort': 61760,
      'ObjectManagerPort': 61454,
      'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store',
      'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet',
      'Resources': {'CPU': 1.0,
                    'memory': 123.0,
                    'node:192.168.1.82': 1.0,
                    'object_store_memory': 2.0},
      'alive': True},
     {'Alive': True,
      'MetricsExportPort': 8080,
      'NodeID': 'ce6f30a7e2ef58c8a6893b3df171bcd464b33c77',
      'NodeManagerAddress': '192.168.1.82',
      'NodeManagerHostname': 'host1.attlocal.net',
      'NodeManagerPort': 62052,
      'ObjectManagerPort': 61468,
      'ObjectStoreSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/plasma_store.1',
      'RayletSocketName': '/tmp/ray/session_2020-08-04_18-18-16_481195_34255/sockets/raylet.1',
      'Resources': {'CPU': 1.0,
                    'memory': 134.0,
                    'node:192.168.1.82': 1.0,
                    'object_store_memory': 2.0},
      'alive': True}]
    """
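As a follow-up sketch (not part of the original example), the same ``ray.nodes()`` output can be turned into a static list of Prometheus scrape targets programmatically:

.. code-block:: python

    # On a cluster node: build Prometheus scrape targets from the cluster info
    # by combining NodeManagerAddress with MetricsExportPort for each live node.
    import ray

    ray.init(address="auto")  # attach to the running Ray cluster

    targets = [
        f"{node['NodeManagerAddress']}:{node['MetricsExportPort']}"
        for node in ray.nodes()
        if node["Alive"]
    ]
    print(targets)  # e.g. ['192.168.1.82:8080', '192.168.1.82:8080']

These ``<ip>:<port>`` pairs can then be listed under ``static_configs`` in the Prometheus scrape configuration.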