ray/release/xgboost_tests at 58d7398246726a6e1752b9ad0486355efa839448 - hiro/ray

Kai Fricke 1e113d2e6e [tune/xgboost] Update release test docs (#13880 ) * Update release test docs * Update		2021-02-04 13:10:56 +01:00
workloads	[xgboost] Add XGBoost release tests (#13456 )	2021-01-20 18:40:23 +01:00
cluster_cpu_moderate.yaml	deprecate useless fields in the cluster yaml. (#13637 )	2021-01-23 12:06:51 -08:00
cluster_cpu_small.yaml	deprecate useless fields in the cluster yaml. (#13637 )	2021-01-23 12:06:51 -08:00
cluster_gpu_small.yaml	deprecate useless fields in the cluster yaml. (#13637 )	2021-01-23 12:06:51 -08:00
create_test_data.py	[xgboost] Add XGBoost release tests (#13456 )	2021-01-20 18:40:23 +01:00
README.rst	[tune/xgboost] Update release test docs (#13880 )	2021-02-04 13:10:56 +01:00
requirements.txt	[xgboost] Add XGBoost release tests (#13456 )	2021-01-20 18:40:23 +01:00
run.sh	[xgboost] Add XGBoost release tests (#13456 )	2021-01-20 18:40:23 +01:00
wait_cluster.py	[xgboost] Add XGBoost release tests (#13456 )	2021-01-20 18:40:23 +01:00

README.rst

XGBoost on Ray tests
====================

This directory contains various XGBoost on Ray release tests.

You should run these tests with the `releaser <https://github.com/ray-project/releaser>`_ tool.

Overview
--------
There are four kinds of tests:

1. ``distributed_api_test`` - checks general API functionality and should finish very quickly (< 1 minute)
2. ``train_*`` - checks single-trial training on different setups.
3. ``tune_*`` - checks multi-trial training via Ray Tune.
4. ``ft_*`` - checks fault tolerance. **These tests are currently flaky.**

Generally, the releaser tool runs all tests in parallel. If you run them
sequentially instead, run them in the order listed above: if ``train_*``
fails, ``tune_*`` will fail, too.
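The ordering rule above can be expressed as a small sort key. This is a minimal sketch, not part of the releaser tool: it assumes test names follow the four patterns listed in the overview, and the concrete names used in the example (``train_small``, ``tune_small``, ``ft_small``) are hypothetical.

```python
import fnmatch
from typing import Iterable, List

# Test kinds in the order the README asks for when running sequentially.
# The glob patterns come from the README; concrete test names are hypothetical.
TEST_ORDER = ["distributed_api_test", "train_*", "tune_*", "ft_*"]


def kind_rank(name: str) -> int:
    """Return the index in TEST_ORDER of the first pattern matching ``name``."""
    for rank, pattern in enumerate(TEST_ORDER):
        if fnmatch.fnmatchcase(name, pattern):
            return rank
    raise ValueError(f"unknown test kind: {name!r}")


def sequential_order(names: Iterable[str]) -> List[str]:
    """Sort tests so ``train_*`` runs before the ``tune_*`` tests that depend on it."""
    return sorted(names, key=kind_rank)


if __name__ == "__main__":
    tests = ["tune_small", "ft_small", "train_small", "distributed_api_test"]
    print(sequential_order(tests))
    # prints ['distributed_api_test', 'train_small', 'tune_small', 'ft_small']
```

Python's ``sorted`` is stable, so tests of the same kind keep their relative order, and an unrecognized name raises ``ValueError`` instead of being silently scheduled last.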

Flaky fault tolerance tests
---------------------------
The fault tolerance tests are currently flaky. In some runs, more nodes die
than expected, causing the test to fail. In other cases, the re-scheduled
actors become available too soon after crashing, causing the assertions to
fail. If a test fails, consider re-running it a couple of times. For further
questions, contact the test owner and include the output of the failed runs.
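When re-running a flaky test by hand, a small retry wrapper keeps the number of attempts bounded and still surfaces a persistent failure. This is a hedged sketch: ``rerun_until_pass`` and ``command_runner`` are hypothetical helpers, and the command you pass to ``command_runner`` depends on how you launch the test in your setup.

```python
import subprocess
from typing import Callable, Sequence


def rerun_until_pass(run_once: Callable[[], bool], attempts: int = 3) -> int:
    """Run a flaky test up to ``attempts`` times.

    Returns the 1-based number of the first passing attempt. If every attempt
    fails, raises RuntimeError so the failure is not hidden and the output can
    be shared with the test owner.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        if run_once():
            return attempt
    raise RuntimeError(f"test failed in all {attempts} attempts")


def command_runner(args: Sequence[str]) -> Callable[[], bool]:
    """Hypothetical adapter: each attempt runs ``args``; exit code 0 counts as a pass."""
    return lambda: subprocess.run(list(args)).returncode == 0


if __name__ == "__main__":
    # Simulated flaky test: the first attempt fails, the second one passes.
    outcomes = iter([False, True])
    print(rerun_until_pass(lambda: next(outcomes)))  # prints 2
```

Raising after the last attempt, rather than returning a flag, keeps a consistently failing test visible instead of letting it pass unnoticed.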

Acceptance criteria
-------------------
A test is considered passing when no error is thrown at the end of its
output log.
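If you inspect logs outside the releaser tool, this criterion can be approximated by scanning the end of the log for a Python traceback or a final ``SomeError: message`` line. This is a heuristic sketch under that assumption; ``log_passed`` and its 20-line window are hypothetical choices, not how the releaser tool itself decides.

```python
import re

# Assumption: a Python exception that ends a run leaves a traceback header and a
# final "SomeError: message" line (possibly module-qualified, e.g.
# "ray.exceptions.RayActorError: ...") near the end of the log.
TRACEBACK_HEADER = "Traceback (most recent call last):"
ERROR_LINE = re.compile(r"[\w.]*(Error|Exception)(:|$)")


def log_passed(log_text: str, tail_lines: int = 20) -> bool:
    """Heuristic check: True if the last ``tail_lines`` non-empty lines show no error."""
    if tail_lines < 1:
        raise ValueError("tail_lines must be >= 1")
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    for line in lines[-tail_lines:]:
        if line.startswith(TRACEBACK_HEADER) or ERROR_LINE.match(line):
            return False
    return True
```

Because only the tail is scanned, errors that were logged and recovered from earlier in a run do not fail the check, which matches the "at the end of the output log" wording; a non-zero exit code from the test process, where available, is a more reliable signal.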