.. include:: we_are_hiring.rst
.. _ray-yarn-deploy:
Deploying on YARN
=================
.. warning::
Running Ray on YARN is still a work in progress. If you have a
suggestion for how to improve this documentation or want to request
a missing feature, please feel free to create a pull request or get in touch
using one of the channels in the `Questions or Issues?`_ section below.
This document assumes that you have access to a YARN cluster and will walk
you through using `Skein`_ to deploy a YARN job that starts a Ray cluster and
runs an example script on it.
Skein uses a declarative specification (written as a YAML file or via the Python API) and allows users to launch jobs and scale applications without writing any Java code.
You will first need to install Skein: ``pip install skein``.
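Skein can also be driven from Python rather than the CLI. The following is a
minimal sketch, not taken from the Ray repository; the spec filename is a
placeholder for whichever Skein ``yaml`` file you use:

.. code-block:: python

    import skein

    # Load a declarative application spec from a YAML file (placeholder name).
    spec = skein.ApplicationSpec.from_file("ray-skein.yaml")

    # Submit the application to YARN; submit() returns the application ID.
    with skein.Client() as client:
        app_id = client.submit(spec)
        print("Submitted application", app_id)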
The Skein ``yaml`` file and example Ray program used here are provided in the
`Ray repository`_ to get you started. Refer to the provided ``yaml``
files to be sure that you maintain important configuration options for Ray to
function properly.
.. _`Ray repository`: https://github.com/ray-project/ray/tree/master/doc/yarn
Skein Configuration
-------------------
A Ray job is configured to run as two `Skein services`:
1. The ``ray-head`` service that starts the Ray head node and then runs the
application.
2. The ``ray-worker`` service that starts worker nodes that join the Ray cluster.
You can change the number of instances in this configuration or at runtime
using ``skein container scale`` to scale the cluster up/down.
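For example, the worker service of a running application might be scaled up to
five instances like this (the application ID is a placeholder):

.. code-block:: bash

    # Scale the ray-worker service of a running application to 5 instances.
    skein container scale <app_id> --service ray-worker --number 5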
The specification for each service consists of the files and commands needed to start the service.
.. code-block:: yaml
services:
ray-head:
# There should only be one instance of the head node per cluster.
instances: 1
resources:
      # The resources for the head node.
vcores: 1
memory: 2048
files:
...
script:
...
ray-worker:
# Number of ray worker nodes to start initially.
# This can be scaled using 'skein container scale'.
instances: 3
resources:
# The resources for the worker node.
vcores: 1
memory: 2048
files:
...
script:
...
Packaging Dependencies
----------------------
Use the ``files`` option to specify files that will be copied into the YARN container for the application to use. See `the Skein file distribution page <https://jcrist.github.io/skein/distributing-files.html>`_ for more information.
.. code-block:: yaml
services:
ray-head:
# There should only be one instance of the head node per cluster.
instances: 1
resources:
# The resources for the head node.
vcores: 1
memory: 2048
files:
# ray/doc/yarn/example.py
example.py: example.py
      # # A packaged python environment using `conda-pack`. Note that Skein
# # doesn't require any specific way of distributing files, but this
# # is a good one for python projects. This is optional.
# # See https://jcrist.github.io/skein/distributing-files.html
# environment: environment.tar.gz
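If you do distribute a packaged environment this way, one way to build the
archive is with ``conda-pack``; a sketch, assuming an existing conda
environment named ``ray-env`` (a placeholder name):

.. code-block:: bash

    # Install conda-pack and package the environment into a relocatable archive.
    pip install conda-pack
    conda pack -n ray-env -o environment.tar.gz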
Ray Setup in YARN
-----------------
Below is a walkthrough of the bash commands used to start the ``ray-head`` and ``ray-worker`` services. Note that this configuration will launch a new Ray cluster for each application, not reuse the same cluster.
Head node commands
~~~~~~~~~~~~~~~~~~
Start by activating a pre-existing environment for dependency management.
.. code-block:: bash
source environment/bin/activate
Register the Ray head address needed by the workers in the Skein key-value store.
.. code-block:: bash
skein kv put --key=RAY_HEAD_ADDRESS --value=$(hostname -i) current
Start all the processes needed on the Ray head node. By default, we set object store memory
and heap memory to roughly 200 MB. This is conservative and should be set according to application needs.
.. code-block:: bash
ray start --head --port=6379 --object-store-memory=200000000 --memory 200000000 --num-cpus=1
Execute the user script containing the Ray program.
.. code-block:: bash
python example.py
Clean up all started processes even if the application fails or is killed.
.. code-block:: bash
ray stop
skein application shutdown current
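How that guarantee is implemented is up to your entrypoint script; one possible
approach (a sketch, not part of the provided spec) is a shell ``trap`` so that
the cleanup commands run even if the user script exits with an error:

.. code-block:: bash

    # Run cleanup on any exit, whether example.py succeeds or fails.
    cleanup() {
        ray stop
        skein application shutdown current
    }
    trap cleanup EXIT

    python example.py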
Putting things together, we have:
.. literalinclude:: /../yarn/ray-skein.yaml
:language: yaml
:start-after: # Head service
:end-before: # Worker service
Worker node commands
~~~~~~~~~~~~~~~~~~~~
Fetch the address of the head node from the Skein key-value store.
.. code-block:: bash
RAY_HEAD_ADDRESS=$(skein kv get current --key=RAY_HEAD_ADDRESS)
Start all of the processes needed on a Ray worker node, blocking until killed by Skein/YARN via SIGTERM. After receiving SIGTERM, all started processes should also die (``ray stop``).
.. code-block:: bash
ray start --object-store-memory=200000000 --memory 200000000 --num-cpus=1 --address=$RAY_HEAD_ADDRESS:6379 --block; ray stop
Putting things together, we have:
.. literalinclude:: /../yarn/ray-skein.yaml
:language: yaml
:start-after: # Worker service
Running a Job
-------------
Within your Ray script, use the following to connect to the started Ray cluster:
.. literalinclude:: /../yarn/example.py
:language: python
:start-after: if __name__ == "__main__"
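The key step is the ``ray.init`` call. A minimal sketch of such a script is
shown below; it assumes address discovery via ``address="auto"`` and uses a
placeholder task, so the provided ``example.py`` may differ in detail:

.. code-block:: python

    import ray

    if __name__ == "__main__":
        # Connect to the Ray cluster started by the ray-head service.
        ray.init(address="auto")

        @ray.remote
        def square(x):
            return x * x

        # A trivial check that remote tasks execute on the cluster.
        print(ray.get([square.remote(i) for i in range(4)]))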
You can use the following command to launch the application as specified by the Skein YAML file.
.. code-block:: bash
skein application submit [TEST.YAML]
Once it has been submitted, you can see the job running on the YARN dashboard.
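You can also check on it from the command line (the application ID below is a
placeholder):

.. code-block:: bash

    # Query the state of a submitted Skein application.
    skein application status <app_id>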
.. image:: /images/yarn-job.png
Cleaning Up
-----------
To clean up a running job, use the following command with the application ID:
.. code-block:: bash
skein application shutdown $appid
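If you don't have the application ID at hand, you can look it up first:

.. code-block:: bash

    # List running applications to find the ID of the job to shut down.
    skein application ls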
Questions or Issues?
--------------------
.. include:: /_includes/_help.rst
.. _`Skein`: https://jcrist.github.io/skein/