ray/benchmarks/distributed/test_many_tasks.py
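
"""Benchmark that launches many small Ray tasks concurrently and checks that the
cluster's CPUs are actually occupied while the tasks run and released afterwards."""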


import click
import json
import os
import ray
import ray._private.test_utils as test_utils
import time
import tqdm
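
# Duration (in seconds) each task sleeps, so every task launched within this
# window is running concurrently while the driver samples CPU usage.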
sleep_time = 300


def test_max_running_tasks(num_tasks):
    cpus_per_task = 0.25

    @ray.remote(num_cpus=cpus_per_task)
    def task():
        time.sleep(sleep_time)

    refs = [task.remote() for _ in tqdm.trange(num_tasks, desc="Launching tasks")]
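
    # Sample the available CPUs while the tasks sleep and keep the minimum,
    # i.e. the point of maximum concurrent task execution.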
    max_cpus = ray.cluster_resources()["CPU"]
    min_cpus_available = max_cpus
    for _ in tqdm.trange(int(sleep_time / 0.1), desc="Waiting"):
        try:
            cur_cpus = ray.available_resources().get("CPU", 0)
            min_cpus_available = min(min_cpus_available, cur_cpus)
        except Exception:
            # There is a race condition: `.get` can fail if a new heartbeat
            # arrives at the same time.
            pass
        time.sleep(0.1)

    # There are some relevant magic numbers in this check: each task requires
    # 1/4 CPU, so e.g. 10k tasks should ideally occupy 2.5k CPUs at once.
    # The assertion below only requires 75% of that to have been in use.
    err_str = f"Only {max_cpus - min_cpus_available}/{max_cpus} cpus used."
    threshold = num_tasks * cpus_per_task * 0.75
    assert max_cpus - min_cpus_available > threshold, err_str

    for _ in tqdm.trange(num_tasks, desc="Ensuring all tasks have finished"):
        done, refs = ray.wait(refs)
        assert ray.get(done[0]) is None
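

# The cluster has no resource leaks when every resource is reported as
# available again.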
def no_resource_leaks():
    return ray.available_resources() == ray.cluster_resources()


@click.command()
@click.option(
    "--num-tasks", required=True, type=int, help="Number of tasks to launch."
)
def test(num_tasks):
    ray.init(address="auto")

    test_utils.wait_for_condition(no_resource_leaks)
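
    # Track memory usage for the duration of the run via the monitoring actor
    # from Ray's test utils; peak usage is reported after it is stopped.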
    monitor_actor = test_utils.monitor_memory_usage()
    start_time = time.time()
    test_max_running_tasks(num_tasks)
    end_time = time.time()
    ray.get(monitor_actor.stop_run.remote())
    used_gb, usage = ray.get(monitor_actor.get_peak_memory_info.remote())
    print(f"Peak memory usage: {round(used_gb, 2)}GB")
    print(f"Peak memory usage per process:\n {usage}")
    del monitor_actor
    test_utils.wait_for_condition(no_resource_leaks)
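
    # Each task sleeps for `sleep_time` seconds, so subtract that from the
    # elapsed time to approximate the pure task-submission throughput.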
    rate = num_tasks / (end_time - start_time - sleep_time)
    print(
        f"Success! Started {num_tasks} tasks in {end_time - start_time}s. "
        f"({rate} tasks/s)"
    )
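
    # Dump the results as JSON when an output path is provided via
    # the TEST_OUTPUT_JSON environment variable.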
if "TEST_OUTPUT_JSON" in os.environ:
out_file = open(os.environ["TEST_OUTPUT_JSON"], "w")
results = {
"tasks_per_second": rate,
"num_tasks": num_tasks,
"time": end_time - start_time,
"success": "1",
"_peak_memory": round(used_gb, 2),
"_peak_process_memory": usage,
}
json.dump(results, out_file)


if __name__ == "__main__":
    test()