In the `test_many_tasks.py` case we frequently saw the test fail and tracked down the reason. We sleep for `sleep_time` seconds to wait for all tasks to finish, but the actual sleep time is computed as `0.1 * #rounds`, where 0.1 is the sleep time per round. That looks correct, but it misses one factor: the computation time spent in each round. Here that is the time consumed by

```python
cur_cpus = ray.available_resources().get("CPU", 0)
min_cpus_available = min(min_cpus_available, cur_cpus)
```

In particular, `ray.available_resources()` takes quite a while when the cluster is large (in our case it took more than 1 s with 1500 nodes).

The situation we assumed:

```python
for _ in range(int(sleep_time / 0.1)):
    sleep(0.1)
```

The situation that actually happens:

```python
for _ in range(int(sleep_time / 0.1)):
    do_something()  # costs time, sometimes quite a lot
    sleep(0.1)
```

We don't know why `ray.available_resources()` is slow, or whether that is expected, but we can add a time check to make the total wait time precise.
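A time-based check of the kind described above could look like the following sketch. The `wait_for_condition` helper, its signature, and the polling interval are illustrative assumptions, not the actual test code; only `ray.available_resources()` comes from the test itself.

```python
import time

import ray


def wait_for_condition(condition, timeout_s, poll_interval_s=0.1):
    """Poll ``condition`` until it returns True or ``timeout_s`` elapses.

    The deadline is based on wall-clock time, so slow work inside
    ``condition`` (e.g. ray.available_resources() on a large cluster)
    cannot stretch the total wait beyond ``timeout_s``.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval_s)
    return False


# Example usage (assumes ray.init()/cluster setup happened elsewhere):
# wait_for_condition(
#     lambda: ray.available_resources().get("CPU", 0) > 0, timeout_s=60)
```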
# Ray Scalability Envelope

## Distributed Benchmarks

All distributed tests are run on 64 nodes with 64 cores/node. The maximum node count is reached by adding 4-core nodes.

| Dimension | Quantity |
| --- | --- |
| # nodes in cluster (with trivial task workload) | 250+ |
| # actors in cluster (with trivial workload) | 10k+ |
| # simultaneously running tasks | 10k+ |
| # simultaneously running placement groups | 1k+ |
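For context, the following is a minimal sketch of the kind of workload these rows describe, using the public Ray API (`ray.init`, `@ray.remote`, `ray.get`). The task and actor bodies and the counts are illustrative, not the actual release-test code.

```python
import ray

ray.init(address="auto")  # connect to an existing multi-node cluster


@ray.remote
def trivial_task():
    return 1


@ray.remote
class TrivialActor:
    def ping(self):
        return "pong"


# Many simultaneously running trivial tasks.
task_refs = [trivial_task.remote() for _ in range(10_000)]
assert sum(ray.get(task_refs)) == 10_000

# Many actors with a trivial workload, each pinged once.
actors = [TrivialActor.remote() for _ in range(1_000)]
ray.get([actor.ping.remote() for actor in actors])
```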
## Object Store Benchmarks

| Dimension | Quantity |
| --- | --- |
| 1 GiB object broadcast (# of nodes) | 50+ |
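As a rough illustration of what this row measures, the sketch below places a ~1 GiB numpy array in the object store once and has many tasks read it. In the real test the consumer tasks are spread across the cluster's nodes; the counts and array contents here are illustrative.

```python
import numpy as np
import ray

ray.init(address="auto")

# One ~1 GiB float64 array placed in the object store.
big_ref = ray.put(np.zeros(1024 * 1024 * 1024 // 8, dtype=np.float64))


@ray.remote
def consume(arr):
    # Ray resolves the ObjectRef argument, transferring the object to
    # the node running this task if it is not already local.
    return arr.nbytes


# One consumer per node would exercise the broadcast path; 50 matches
# the 50+ nodes reported above, but the number here is illustrative.
refs = [consume.remote(big_ref) for _ in range(50)]
print(ray.get(refs))
```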
## Single Node Benchmarks

All single node benchmarks are run on a single m4.16xlarge.

| Dimension | Quantity |
| --- | --- |
| # of object arguments to a single task | 10000+ |
| # of objects returned from a single task | 3000+ |
| # of plasma objects in a single `ray.get` call | 10000+ |
| # of tasks queued on a single node | 1,000,000+ |
| Maximum `ray.get` numpy object size | 100 GiB+ |
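To make the single-node dimensions concrete, here is a small sketch of two of them: many object arguments to a single task, and fetching a numpy object with `ray.get`. Counts and sizes are scaled down and illustrative; the table reports the actual envelope.

```python
import numpy as np
import ray

ray.init()

# Many object-store objects passed as arguments to a single task
# (the table above reports 10000+; 1000 is used here for brevity).
arg_refs = [ray.put(i) for i in range(1_000)]


@ray.remote
def count_args(*values):
    # Ray resolves each ObjectRef argument before the task body runs.
    return len(values)


assert ray.get(count_args.remote(*arg_refs)) == 1_000

# A numpy object fetched with ray.get (kept small here; the table
# above reports 100 GiB+ on an m4.16xlarge).
big_ref = ray.put(np.zeros(10**6, dtype=np.float64))
print(ray.get(big_ref).nbytes)  # 8000000
```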