Long Running Tests
==================

This directory contains scripts for starting long-running workloads that are
intended to run indefinitely, stopping only when they fail.

Running the Workloads
---------------------

To run the workloads, run

.. code-block:: bash

    ./start_workloads.sh <ray-branch> <ray-version> <ray-commit>

using the appropriate values of ``<ray-branch>``, ``<ray-version>``, and
``<ray-commit>``. This will start one EC2 instance per workload and will start
the workloads running (one per instance). Running the ``./start_workloads.sh``
script again will clean up any state from the previous runs and will start the
workloads again.
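
Because the script takes three positional arguments, a small wrapper can catch
a missing or empty value before any instances are started. The following is
only a sketch: ``launch_workloads`` is a hypothetical helper that is not part
of this directory, and it assumes it is run from the directory containing
``start_workloads.sh``.

.. code-block:: bash

    # Hypothetical wrapper (not part of this directory): validate the three
    # positional arguments, then hand them to start_workloads.sh unchanged.
    launch_workloads() {
        if [ "$#" -ne 3 ]; then
            echo "usage: launch_workloads <ray-branch> <ray-version> <ray-commit>" >&2
            return 1
        fi
        local arg
        for arg in "$@"; do
            if [ -z "$arg" ]; then
                echo "error: arguments must not be empty" >&2
                return 1
            fi
        done
        ./start_workloads.sh "$1" "$2" "$3"
    }

For example, ``launch_workloads <ray-branch> <ray-version> <ray-commit>``
behaves like the direct invocation above, but returns early with a usage
message if an argument is missing.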

Check Workload Statuses
-----------------------

To check up on the workloads, run either ``./check_workloads.sh --load``, which
will print the load on each machine, or ``./check_workloads.sh --logs``, which
will print the tail of the output for each workload.
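
Since the workloads are meant to run for a long time, it can be convenient to
run both checks on a schedule. The sketch below assumes it is run from this
directory; ``poll_workloads`` is a hypothetical helper, and the interval and
iteration count are arbitrary values chosen by the caller.

.. code-block:: bash

    # Hypothetical helper (not part of this directory).
    # Usage: poll_workloads <interval-seconds> <iterations>
    poll_workloads() {
        local interval="$1" iterations="$2" i=1
        while [ "$i" -le "$iterations" ]; do
            echo "=== check $i of $iterations at $(date) ==="
            ./check_workloads.sh --load   # load on each machine
            ./check_workloads.sh --logs   # tail of each workload's output
            # Sleep between checks, but not after the last one.
            if [ "$i" -lt "$iterations" ]; then
                sleep "$interval"
            fi
            i=$((i + 1))
        done
    }

For instance, ``poll_workloads 600 6`` would run both checks six times, ten
minutes apart.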

To debug workloads that have failed, you may find it useful to ssh to the
relevant machine, attach to the tmux session (usually ``tmux a -t 0``), inspect
the logs under ``/tmp/ray/session*/logs/``, and also inspect
``/tmp/ray/session*/debug_state.txt``.
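
When several session directories match ``/tmp/ray/session*``, the most
recently modified one is usually the one of interest. The following sketch is
meant to be run on the instance after connecting over ssh; the function names
are hypothetical, and it assumes only the paths mentioned above.

.. code-block:: bash

    # Hypothetical helpers (not part of this directory).

    # Print the most recently modified session directory under the given
    # base directory (default: /tmp/ray). Returns 1 if none exists.
    newest_session_dir() {
        local base="${1:-/tmp/ray}" dir
        dir="$(ls -dt "$base"/session*/ 2>/dev/null | head -n 1)"
        [ -n "$dir" ] || return 1
        printf '%s\n' "${dir%/}"
    }

    # List the most recently written log files and show the end of
    # debug_state.txt for the newest session.
    show_session_debug_info() {
        local session
        session="$(newest_session_dir "$@")" || {
            echo "no session directory found" >&2
            return 1
        }
        echo "session: $session"
        ls -lt "$session/logs" | head -n 20
        tail -n 50 "$session/debug_state.txt"
    }

Calling ``show_session_debug_info`` with no arguments looks under
``/tmp/ray``. Attaching to the tmux session remains a separate, interactive
step.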

Shut Down the Workloads
-----------------------

The instances running the workloads can all be killed by running
``./shut_down_workloads.sh``.

Adding a Workload
-----------------

To create a new workload, simply add a new Python file under ``workloads/``.