ray/release/long_running_tests/workloads/many_drivers.py

# This workload tests many drivers using the same cluster.
import json
import os
import time
import argparse

import ray
from ray.cluster_utils import Cluster
from ray._private.test_utils import run_string_as_driver


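def update_progress(result):
    """Stamp `result` with the current time and write it as JSON.

    The output path comes from the TEST_OUTPUT_JSON environment variable and
    falls back to /tmp/release_test_output.json.
    """
    result["last_update"] = time.time()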
    test_output_json = os.environ.get(
        "TEST_OUTPUT_JSON", "/tmp/release_test_output.json"
    )
    with open(test_output_json, "wt") as f:
        json.dump(result, f)


num_redis_shards = 5
redis_max_memory = 10 ** 8
object_store_memory = 10 ** 8
num_nodes = 4

message = (
    "Make sure there is enough memory on this machine to run this "
    "workload. We divide the system memory by 2 to provide a buffer."
)
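# With the values above, the workload needs 4 * 10**8 + 5 * 10**8 = 9 * 10**8
# bytes, so the machine must have more than 1.8 * 10**9 bytes of system memory.
assert (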
    num_nodes * object_store_memory + num_redis_shards * redis_max_memory
    < ray._private.utils.get_system_memory() / 2
), message

# Simulate a cluster on one machine.

cluster = Cluster()
for i in range(num_nodes):
    cluster.add_node(
        redis_port=6379 if i == 0 else None,
        num_redis_shards=num_redis_shards if i == 0 else None,
        num_cpus=4,
        num_gpus=0,
        resources={str(i): 5},
        object_store_memory=object_store_memory,
        redis_max_memory=redis_max_memory,
        dashboard_host="0.0.0.0",
    )
ray.init(address=cluster.address)

# Run the workload.

# Define a driver script that runs a few tasks and actors on each node in the
# cluster.
driver_script = """
import ray

ray.init(address="{}")

num_nodes = {}


@ray.remote
def f():
    return 1


@ray.remote
class Actor:
    def method(self):
        return 1


for _ in range(5):
    for i in range(num_nodes):
        assert ray.get(
            f._remote(args=[], kwargs={{}}, resources={{str(i): 1}})
        ) == 1
        actor = Actor._remote(
            args=[], kwargs={{}}, resources={{str(i): 1}})
        assert ray.get(actor.method.remote()) == 1

# Test that datasets don't leak workers.
ray.data.range(100).map(lambda x: x).take()

print("success")
""".format(
    cluster.address, num_nodes
)


@ray.remote
def run_driver():
    """Run driver_script as a separate driver and check that it printed "success"."""
    output = run_string_as_driver(driver_script, encode="utf-8")
    assert "success" in output


iteration = 0
running_ids = [
    run_driver._remote(args=[], kwargs={}, num_cpus=0, resources={str(i): 0.01})
    for i in range(num_nodes)
]
start_time = time.time()
previous_time = start_time

parser = argparse.ArgumentParser(prog="Many Drivers long running tests")
parser.add_argument(
    "--iteration-num", type=int, help="How many iterations to run", required=False
)
parser.add_argument(
    "--smoke-test",
    action="store_true",
    help="Whether to run as a smoke test.",
    default=False,
)
args = parser.parse_args()

iteration_num = args.iteration_num
if args.smoke_test:
    iteration_num = 400
while True:
    if iteration_num is not None and iteration >= iteration_num:
        break
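    # Wait for a driver to finish, surface any failure via ray.get, and start a
    # new driver. New drivers are assigned round-robin across the nodes through
    # the per-node custom resource.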
    [ready_id], running_ids = ray.wait(running_ids, num_returns=1)
    ray.get(ready_id)

    running_ids.append(
        run_driver._remote(
            args=[], kwargs={}, num_cpus=0, resources={str(iteration % num_nodes): 0.01}
        )
    )

    new_time = time.time()
    print(
        "Iteration {}:\n"
        "  - Iteration time: {}.\n"
        "  - Absolute time: {}.\n"
        "  - Total elapsed time: {}.".format(
            iteration, new_time - previous_time, new_time, new_time - start_time
        )
    )
    update_progress(
        {
            "iteration": iteration,
            "iteration_time": new_time - previous_time,
            "absolute_time": new_time,
            "elapsed_time": new_time - start_time,
        }
    )
    previous_time = new_time
    iteration += 1