Long Running Distributed Tests
==============================

This directory contains the long-running multi-node workloads, which are
intended to run forever until they fail. To set up the project, run:

.. code-block:: bash

    $ pip install anyscale
    $ anyscale init

Running the Workloads
---------------------

The easiest approach is to use the `Anyscale UI <https://www.anyscale.dev/>`_.
First run ``anyscale snapshot create`` from the command line to create a
project snapshot. Then, from the UI, you can launch an individual session and
execute the ``test_workload`` command for each test.

You can also start the workloads using the CLI with:

.. code-block:: bash

    $ anyscale start --ray-wheel=<RAY_WHEEL_LINK>
    $ anyscale run test_workload --workload=<WORKLOAD_NAME>

Doing this for each workload will start one EC2 instance per workload and
start the workloads running (one per instance). A list of available workload
options can be found in the ``ray-project/project.yaml`` file.

Debugging
---------

The primary way to debug a test while it is running is to view the logs and
the dashboard from the UI. After a test has failed, you can still view the
stdout logs in the UI and also inspect the logs under
``/tmp/ray/session*/logs/`` and ``/tmp/ray/session*/debug_state.txt``.

Shut Down the Workloads
-----------------------

The instances running the workloads can all be killed by running
``anyscale stop <SESSION_NAME>``.

Adding a Workload
-----------------

To create a new workload, add a new Python file under ``workloads/`` and add
the workload to the run command in ``ray-project/project.yaml``; a minimal
sketch of such a file is shown below.
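Workloads in this suite are plain Python scripts that connect to the running
cluster and loop indefinitely, failing loudly when something goes wrong. The
sketch below illustrates the general shape only; the file name
``example_workload.py``, the task body, and the reporting interval are
hypothetical placeholders, not part of the existing suite.

.. code-block:: python

    # workloads/example_workload.py (hypothetical file name)
    import time

    import ray

    # Connect to the cluster started for this session.
    ray.init(address="auto")

    @ray.remote
    def f():
        # Placeholder task body; a real workload would exercise the feature
        # under test (actors, placement groups, object transfers, ...).
        return 1

    iteration = 0
    while True:
        # Run a batch of tasks and assert on the result so the script
        # raises (and the workload is marked failed) on any regression.
        results = ray.get([f.remote() for _ in range(100)])
        assert sum(results) == 100
        iteration += 1
        if iteration % 10 == 0:
            print("Finished iteration", iteration)
        time.sleep(1)

After adding the file, register it as a workload option in the ``run`` command
of ``ray-project/project.yaml`` so it can be launched with
``anyscale run test_workload --workload=<WORKLOAD_NAME>``.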