ray/release/lightgbm_tests/workloads/train_small.py
Kai Fricke 331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

By default, Buildkite waits for the wheels of the current commit to become available. Alternatively, users can specify a) a different commit hash, b) a wheels URL (which we will also wait for to be available), or c) a branch (or user/branch combination), in which case the latest available wheels are used (e.g. if master is passed, the behavior matches the old default).
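
The selection rules above can be sketched as a small resolver. This is only an illustration and not the code in `ray_release`: `WheelsRequest`, `resolve_wheels_request`, and all parameter names are hypothetical, and waiting for an explicitly given commit hash is an assumption, since the description only states waiting for the current commit and for a wheels URL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WheelsRequest:
    """Hypothetical record of which wheels to use and whether to wait."""

    kind: str  # "commit", "url", or "branch"
    value: str
    wait: bool  # True: block until the wheels are available


def resolve_wheels_request(
    current_commit: str,
    commit: Optional[str] = None,
    wheels_url: Optional[str] = None,
    branch: Optional[str] = None,
) -> WheelsRequest:
    """Pick the wheels source following the rules in the description."""
    given = [v for v in (commit, wheels_url, branch) if v]
    if len(given) > 1:
        raise ValueError("Specify at most one of commit, wheels_url, branch.")

    if wheels_url:
        # b) An explicit wheels URL; we wait until it is available.
        return WheelsRequest("url", wheels_url, wait=True)
    if commit:
        # a) A different commit hash (waiting here is an assumption).
        return WheelsRequest("commit", commit, wait=True)
    if branch:
        # c) A branch or user/branch: use the latest available wheels.
        return WheelsRequest("branch", branch, wait=False)

    # Default: wait for the wheels of the current commit.
    return WheelsRequest("commit", current_commit, wait=True)
```

For example, `resolve_wheels_request("abc123", branch="master")` yields a request that does not wait, which corresponds to the old default behavior mentioned above.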

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)
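
To illustrate how the four subpackages might fit together, here is a minimal sketch. Every class, method, and parameter name in it is hypothetical and chosen only for illustration; the actual interfaces in `ray_release` may look quite different.

```python
from typing import Dict, List, Protocol


class ClusterManager(Protocol):
    def start_cluster(self) -> None: ...
    def terminate_cluster(self) -> None: ...


class CommandRunner(Protocol):
    def run_command(self, command: str) -> int: ...


class FileManager(Protocol):
    def download(self, source: str, target: str) -> None: ...


class Reporter(Protocol):
    def report_result(self, result: Dict) -> None: ...


def run_release_test(
    cluster: ClusterManager,
    runner: CommandRunner,
    files: FileManager,
    reporters: List[Reporter],
    command: str,
    remote_result: str,
    local_result: str,
) -> Dict:
    """Start the cluster, run the command, fetch results, then report."""
    cluster.start_cluster()
    try:
        exit_code = runner.run_command(command)
        files.download(remote_result, local_result)
        result = {"command": command, "exit_code": exit_code}
    finally:
        # Always terminate the cluster, even if the command failed.
        cluster.terminate_cluster()
    for reporter in reporters:
        reporter.report_result(result)
    return result


class RecordingStub:
    """In-memory stand-in that satisfies all four interfaces above."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def start_cluster(self) -> None:
        self.calls.append("start_cluster")

    def terminate_cluster(self) -> None:
        self.calls.append("terminate_cluster")

    def run_command(self, command: str) -> int:
        self.calls.append(f"run_command:{command}")
        return 0

    def download(self, source: str, target: str) -> None:
        self.calls.append(f"download:{source}")

    def report_result(self, result: Dict) -> None:
        self.calls.append("report_result")
```

Keeping each responsibility behind a narrow interface is one way to make the pieces unit-testable in isolation: a stub such as `RecordingStub` can stand in for a real cluster or database.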

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00

"""Small cluster training
This training run will start 4 workers on 4 nodes (including head node).
Test owner: Yard1 (primary), krfricke
Acceptance criteria: Should run through and report final results.
"""
import json
import os
import time

import ray
from lightgbm_ray import RayParams

from ray.util.lightgbm.release_test_util import train_ray

if __name__ == "__main__":
    # Default to "" so that a missing RAY_ADDRESS falls through to the "auto"
    # branch instead of raising AttributeError on None.
    addr = os.environ.get("RAY_ADDRESS", "")
    job_name = os.environ.get("RAY_JOB_NAME", "train_small")
    if addr.startswith("anyscale://"):
        ray.init(address=addr, job_name=job_name)
    else:
        ray.init(address="auto")

    # Read on the driver so the path can be forwarded to the remote task.
    output = os.environ["TEST_OUTPUT_JSON"]
    ray_params = RayParams(
        elastic_training=False,
        max_actor_restarts=2,
        num_actors=4,
        cpus_per_actor=4,
        gpus_per_actor=0,
    )

    start = time.time()

    # num_cpus=0: the orchestrating task reserves no CPUs itself.
    @ray.remote(num_cpus=0)
    def train():
        # Make the output path visible in the worker process running this task.
        os.environ["TEST_OUTPUT_JSON"] = output
        train_ray(
            path="/data/classification.parquet",
            num_workers=4,
            num_boost_rounds=100,
            num_files=25,
            regression=False,
            use_gpu=False,
            ray_params=ray_params,
            lightgbm_params=None,
        )

    ray.get(train.remote())
    taken = time.time() - start

    result = {
        "time_taken": taken,
    }
    test_output_json = os.environ.get("TEST_OUTPUT_JSON", "/tmp/train_small.json")
    with open(test_output_json, "wt") as f:
        json.dump(result, f)

    print("PASSED.")