ray/release/benchmarks
Tao Wang a051e693c1
[Test]Add a time check for task benchmark (#23170)
The test_many_tasks.py case was failing frequently, and we found the reason.

We sleep for sleep_time seconds to wait for all tasks to finish, and the actual sleep time is computed as 0.1 * #rounds, where 0.1 is the sleep time of each round.
This looks correct, but it misses one factor: the computation time elapsed in each round. In this case, that is the time consumed by

            cur_cpus = ray.available_resources().get("CPU", 0)
            min_cpus_available = min(min_cpus_available, cur_cpus)
especially ray.available_resources(), which takes quite a long time when the cluster is large (in our case it took over 1 s with 1500 nodes).
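As a rough illustration (a hypothetical snippet, not part of the test itself), the latency of the call can be measured directly:

    import time

    import ray

    ray.init(address="auto")  # connect to the already-running cluster

    start = time.monotonic()
    cur_cpus = ray.available_resources().get("CPU", 0)
    elapsed = time.monotonic() - start

    # On a cluster of ~1500 nodes, this call reportedly exceeded 1 s.
    print(f"available_resources() took {elapsed:.3f}s (CPU={cur_cpus})")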

The situation we assumed:

    for _ in range(int(sleep_time / 0.1)):
        sleep(0.1)

The situation that actually happens:

    for _ in range(int(sleep_time / 0.1)):
        do_something()  # this takes time, sometimes a lot
        sleep(0.1)
We don't know why ray.available_resources() is slow, or whether that is expected, but we can add a time check so that the total sleep time stays precise; a sketch of the idea follows.
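A minimal sketch of such a time check (a hypothetical helper, not the actual patch in #23170): bound the wait by a wall-clock deadline instead of a fixed number of 0.1-second rounds, so slow calls inside the loop no longer stretch the total wait.

    import time

    import ray

    def wait_with_deadline(sleep_time: float) -> float:
        # Wait roughly sleep_time seconds in total while sampling the
        # minimum number of available CPUs, as the benchmark loop does.
        deadline = time.monotonic() + sleep_time
        min_cpus_available = float("inf")
        while time.monotonic() < deadline:
            cur_cpus = ray.available_resources().get("CPU", 0)
            min_cpus_available = min(min_cpus_available, cur_cpus)
            # Sleep at most 0.1 s per round, but never past the deadline,
            # so the cost of available_resources() cannot inflate the wait.
            time.sleep(max(0.0, min(0.1, deadline - time.monotonic())))
        return min_cpus_available

Whatever form the check takes, the key point is that the per-round work is charged against the deadline rather than added on top of it.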
2022-04-11 06:27:04 -07:00
| Name | Last commit | Date |
| --- | --- | --- |
| distributed | [Test]Add a time check for task benchmark (#23170) | 2022-04-11 06:27:04 -07:00 |
| object_store | [Release Test] Add perf metrics for core scalability tests (#23110) | 2022-03-14 10:20:39 +09:00 |
| single_node | [Release Test] Add perf metrics for core scalability tests (#23110) | 2022-03-14 10:20:39 +09:00 |
| app_config.yaml | Migrate scalability tests (#22901) | 2022-03-08 17:22:41 -08:00 |
| distributed.yaml | Migrate scalability tests (#22901) | 2022-03-08 17:22:41 -08:00 |
| distributed_smoke_test.yaml | Migrate scalability tests (#22901) | 2022-03-08 17:22:41 -08:00 |
| many_nodes.yaml | [Nightly tests] Improve k8s testing (#23108) | 2022-03-14 03:49:15 -07:00 |
| object_store.yaml | [Nightly tests] Improve k8s testing (#23108) | 2022-03-14 03:49:15 -07:00 |
| README.md | Migrate scalability tests (#22901) | 2022-03-08 17:22:41 -08:00 |
| scheduling.yaml | Migrate scalability tests (#22901) | 2022-03-08 17:22:41 -08:00 |
| single_node.yaml | Migrate scalability tests (#22901) | 2022-03-08 17:22:41 -08:00 |

Ray Scalability Envelope

Distributed Benchmarks

All distributed tests are run on 64 nodes with 64 cores per node. The maximum number of nodes is reached by adding 4-core nodes.

| Dimension | Quantity |
| --- | --- |
| # nodes in cluster (with trivial task workload) | 250+ |
| # actors in cluster (with trivial workload) | 10k+ |
| # simultaneously running tasks | 10k+ |
| # simultaneously running placement groups | 1k+ |

Object Store Benchmarks

| Dimension | Quantity |
| --- | --- |
| 1 GiB object broadcast (# of nodes) | 50+ |

Single Node Benchmarks

All single-node benchmarks are run on a single m4.16xlarge instance.

| Dimension | Quantity |
| --- | --- |
| # of object arguments to a single task | 10,000+ |
| # of objects returned from a single task | 3,000+ |
| # of plasma objects in a single ray.get call | 10,000+ |
| # of tasks queued on a single node | 1,000,000+ |
| Maximum ray.get numpy object size | 100 GiB+ |