ray/release/ml_user_tests/xgboost/train_gpu_connect.py

"""Small cluster training
This training run will start 4 workers on 4 nodes (including head node).
Test owner: krfricke
Acceptance criteria: Should run through and report final results.
"""
import json
import os
import time
import ray
if __name__ == "__main__":
    os.environ["RXGB_PLACEMENT_GROUP_TIMEOUT_S"] = "1200"

    addr = os.environ.get("RAY_ADDRESS")
    job_name = os.environ.get("RAY_JOB_NAME", "train_gpu_connect")

    # Manually set NCCL_SOCKET_IFNAME to "ens3" so NCCL training works on
    # anyscale_default_cloud.
    # See https://github.com/pytorch/pytorch/issues/68893 for more details.
    # Passing in runtime_env to ray.init() will also set it for all the
    # workers.
    runtime_env = {
        "env_vars": {
            "RXGB_PLACEMENT_GROUP_TIMEOUT_S": "1200",
            "NCCL_SOCKET_IFNAME": "ens3",
        },
        "working_dir": os.path.dirname(__file__),
    }

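    # Connect via the Ray Client when an Anyscale address is given;
    # otherwise attach to the already-running local cluster.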
    # Guard against RAY_ADDRESS being unset, in which case `addr` is None
    # and calling .startswith() on it would raise an AttributeError.
    if addr and addr.startswith("anyscale://"):
        ray.init(address=addr, job_name=job_name, runtime_env=runtime_env)
    else:
        ray.init(address="auto", runtime_env=runtime_env)

    from xgboost_ray import RayParams

    from release_test_util import train_ray, get_parquet_files

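    # Fixed-size (non-elastic) training: 4 actors, each reserving 4 CPUs
    # and 1 GPU; failed actors may be restarted up to twice.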
    ray_params = RayParams(
        elastic_training=False,
        max_actor_restarts=2,
        num_actors=4,
        cpus_per_actor=4,
        gpus_per_actor=1,
    )
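
    # Enumerate the parquet shards in a remote task so the lookup runs on
    # the cluster, where /data is mounted (the driver may run locally when
    # connected via the Ray Client).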
    @ray.remote
    def ray_get_parquet_files():
        return get_parquet_files(
            path="/data/classification.parquet",
            num_files=25,
        )
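
    # Time the end-to-end distributed training run.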
    start = time.time()
    train_ray(
        path=ray.get(ray_get_parquet_files.remote()),
        num_workers=4,
        num_boost_rounds=100,
        regression=False,
        use_gpu=True,
        ray_params=ray_params,
        xgboost_params=None,
    )
    taken = time.time() - start

    result = {
        "time_taken": taken,
    }

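    # Write the timing result where the release test harness looks for it.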
    test_output_json = os.environ.get(
        "TEST_OUTPUT_JSON", "/tmp/train_gpu_connect.json"
    )
    with open(test_output_json, "wt") as f:
        json.dump(result, f)

    print("PASSED.")