ray/release/long_running_distributed_tests
SangBin Cho 6649f078e5
[Internal Observability] Move debug_state.txt to the log dir + support gcs_server debug state (#20722)
Move debug_state.txt to the log directory. This will help us find debug_state.txt from the dashboard. See below.
Add debug_state_gcs.txt, which displays the GCS debug state. GCS also dumps its debug state to this file every 10 seconds.
Periodic printing of the debug state now happens every 1 minute, because printing every 10 seconds is usually very spammy.
2021-11-28 20:42:37 -08:00

Long Running Distributed Tests
==============================

This directory contains long-running multi-node workloads that are intended to run
indefinitely until they fail. To set up the project, run:

.. code-block:: bash

    $ pip install anyscale
    $ anyscale init

Running the Workloads
---------------------
The easiest approach is to use the `Anyscale UI <https://www.anyscale.dev/>`_. First, run ``anyscale snapshot create`` from the command line to create a project snapshot. Then, from the UI, launch an individual session and execute the ``test_workload`` command for each test.

You can also start the workloads using the CLI with:

.. code-block:: bash

    $ anyscale start --ray-wheel=<RAY_WHEEL_LINK>
    $ anyscale run test_workload --workload=<WORKLOAD_NAME>


Doing this for each workload starts one EC2 instance per workload, with one
workload running on each instance. The available workloads are listed in the
``ray-project/project.yaml`` file.
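
Repeating the two commands by hand for every workload gets tedious, so the loop
below is a minimal sketch of how they might be scripted. It uses only the
``anyscale start`` and ``anyscale run`` invocations shown above. The function
name ``launch_workloads`` and the ``DRY_RUN`` variable are hypothetical helpers
introduced for this example, not part of the Anyscale CLI.

.. code-block:: bash

    # Hypothetical helper: start one session and one workload per name given.
    # Usage: launch_workloads <RAY_WHEEL_LINK> <WORKLOAD_NAME>...
    launch_workloads() {
        local wheel="$1"
        shift
        local workload
        for workload in "$@"; do
            # When DRY_RUN is set and non-empty, the commands are only printed.
            ${DRY_RUN:+echo} anyscale start --ray-wheel="$wheel"
            ${DRY_RUN:+echo} anyscale run test_workload --workload="$workload"
        done
    }

Setting ``DRY_RUN=1`` before calling the function prints the commands instead of
running them, which is a cheap way to check the workload names before any
instances are started.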


Debugging
---------
The primary way to debug a test while it is running is to view the logs and the dashboard from the UI. After a test has failed, you can still view the stdout logs in the UI.
You can also inspect the logs under ``/tmp/ray/session*/logs/``, including
``/tmp/ray/session*/logs/debug_state.txt``.
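
Because the ``session*`` glob can match more than one session directory, it
helps to pick out the most recently written ``debug_state.txt``. The function
below is a minimal sketch for doing that on a node. The name
``latest_debug_state`` is hypothetical, and the optional argument assumes the
logs live under a root directory laid out like ``/tmp/ray``.

.. code-block:: bash

    # Hypothetical helper: print the path of the newest debug_state.txt.
    # Usage: latest_debug_state [ROOT]   (ROOT defaults to /tmp/ray)
    latest_debug_state() {
        local root="${1:-/tmp/ray}"
        # ls -t sorts by modification time, newest first; keep the first entry.
        # Prints nothing if no session directory contains the file.
        ls -t "$root"/session*/logs/debug_state.txt 2>/dev/null | head -n 1
    }

For example, ``tail -n 50 "$(latest_debug_state)"`` shows the end of the most
recent debug state dump.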

Shut Down the Workloads
-----------------------

The instances running the workloads can all be killed by running
``anyscale stop <SESSION_NAME>``.

Adding a Workload
-----------------

To create a new workload, add a new Python file under ``workloads/`` and
add the workload to the run command in ``ray-project/project.yaml``.