ray/ci/long_running_tests at 14ff402d707c770e9ba6577ce2c65506f8e3285f - hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 12:56:46 -04:00

Robert Nishihara 75504b9586 Add script for running infinitely long stress tests. (#4163 ) Running `./ci/long_running_tests/start_workloads.sh` will start several workloads running (each in their own EC2 instance). - The workloads run forever. - The workloads all simulate multiple nodes but use a single machine. - You can get the tail of each workload by running `./ci/long_running_tests/check_workloads.sh`. - You have to manually shut down the instances. As discussed with @ericl @richardliaw, the idea here is to optimize for the debuggability of the tests. If one of them fails, you can ssh to the relevant instance and see all of the logs.		2019-02-27 14:33:06 -08:00
workloads	Add script for running infinitely long stress tests. (#4163 )	2019-02-27 14:33:06 -08:00
check_workloads.sh	Add script for running infinitely long stress tests. (#4163 )	2019-02-27 14:33:06 -08:00
config.yaml	Add script for running infinitely long stress tests. (#4163 )	2019-02-27 14:33:06 -08:00
README.rst	Add script for running infinitely long stress tests. (#4163 )	2019-02-27 14:33:06 -08:00
start_workloads.sh	Add script for running infinitely long stress tests. (#4163 )	2019-02-27 14:33:06 -08:00

README.rst

Long Running Tests
==================

This directory contains scripts for starting long-running workloads that are
intended to run indefinitely, stopping only when they fail.

Running the Workloads
---------------------

To run the workloads, run ``./start_workloads.sh``. This starts one EC2
instance per workload and launches each workload on its own instance.
Running ``./start_workloads.sh`` again cleans up any state from previous runs
and restarts the workloads. The instances are not shut down automatically; you
have to shut them down manually.

Check Workload Statuses
-----------------------

To check on the workloads, run ``./check_workloads.sh``. This prints the tail
of each workload's output, from which you may be able to tell whether something
has failed.

To debug workloads that have failed, you may find it useful to ssh to the
relevant machine, attach to the tmux session (usually ``tmux a -t 0``), inspect
the logs under ``/tmp/ray/session*/logs/``, and also inspect
``/tmp/ray/session*/debug_state.txt``.
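
Once you are on the instance, a small helper can save some typing when there
are many log files. The following is only a sketch and is not part of this
directory: ``tail_logs`` and ``print_tails`` are hypothetical names, and the
default glob pattern simply assumes the log location mentioned above.

```python
import glob
import os
from collections import deque


def tail_logs(pattern="/tmp/ray/session*/logs/*", num_lines=20):
    """Return {path: last `num_lines` lines} for each regular file matching `pattern`."""
    tails = {}
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-file entries
        with open(path, errors="replace") as f:
            # A deque with maxlen keeps only the final `num_lines` lines.
            tails[path] = [line.rstrip("\n") for line in deque(f, maxlen=num_lines)]
    return tails


def print_tails(tails):
    """Print each file's tail under a header naming the file."""
    for path, lines in tails.items():
        print("==> {} <==".format(path))
        for line in lines:
            print(line)
        print()
```

Calling ``print_tails(tail_logs())`` prints the last 20 lines of every log file
that matches the default pattern; pass a different ``pattern`` if the logs live
elsewhere.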

Adding a Workload
-----------------

To create a new workload, simply add a new Python file under ``workloads/``.
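
As a rough illustration of the shape such a file might take, here is a minimal
sketch of a main loop that runs until something raises and prints progress on
every iteration, so that the tail of the output shows how far the workload got.
The names ``run_workload`` and ``example_step`` are hypothetical, and the
placeholder step does no Ray work; a real workload would submit Ray tasks or
actor calls in its place. The ``max_iterations`` parameter exists only so the
loop can be tried out locally.

```python
import itertools
import time


def run_workload(step, max_iterations=None):
    """Call step(iteration) repeatedly, printing progress after each call.

    With max_iterations=None the loop ends only when `step` raises, which
    gives the "run until failure" behavior. Returns the number of completed
    iterations when a limit is given.
    """
    start_time = time.time()
    completed = 0
    for iteration in itertools.count():
        if max_iterations is not None and iteration >= max_iterations:
            break
        step(iteration)  # any exception propagates and stops the workload
        completed += 1
        elapsed = time.time() - start_time
        # flush=True so the progress line reaches the output promptly
        print("Iteration {}. Cumulative time {:.1f} seconds".format(iteration, elapsed),
              flush=True)
    return completed


def example_step(iteration):
    # Placeholder work (an assumption for illustration only).
    assert sum(range(1000)) == 499500
```

A workload file built this way would end with a call such as
``run_workload(example_step)``, which loops until the step raises.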