This PR fixes broken k8s tests:

- Use exponential backoff on the unstable HTTP path (getting the job status sometimes hits a broken connection from the server; I couldn't find the relevant logs to figure out why this happens, unfortunately). A sketch follows below.
- Fix the benchmark tests' resource leak check. The existing check was broken because job submission consumes 0.001 of the node IP resource, which means `cluster_resources` can never equal `available_resources`. Fixed by excluding node IP resources from the check.
- K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but not by much.
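A minimal sketch of the exponential-backoff retry, assuming a generic HTTP GET against a job-status endpoint (the URL path, timeout, and retry counts here are illustrative assumptions, not Ray's actual job submission API):

```python
import time

import requests


def get_job_status(address: str, job_id: str, max_attempts: int = 5) -> dict:
    """Fetch job status, retrying with exponential backoff on connection errors."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            # Hypothetical endpoint; the real job submission API may differ.
            resp = requests.get(f"{address}/api/jobs/{job_id}", timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= 2  # double the wait between attempts
    raise RuntimeError("unreachable")
```

And a sketch of the adjusted leak check, assuming the fix simply skips the per-node `node:<ip>` resources that job submission partially consumes (`ray.cluster_resources()` and `ray.available_resources()` are real Ray APIs; the helper itself is hypothetical):

```python
import ray


def resources_released() -> bool:
    """Compare total vs. available resources, ignoring node IP resources."""
    total = {k: v for k, v in ray.cluster_resources().items()
             if not k.startswith("node:")}
    avail = {k: v for k, v in ray.available_resources().items()
             if not k.startswith("node:")}
    return total == avail
```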
# Ray Scalability Envelope
## Distributed Benchmarks
All distributed tests are run on 64 nodes with 64 cores/node. The maximum number of nodes is achieved by adding 4-core nodes.
| Dimension | Quantity |
|---|---|
| # nodes in cluster (with trivial task workload) | 250+ |
| # actors in cluster (with trivial workload) | 10k+ |
| # simultaneously running tasks | 10k+ |
| # simultaneously running placement groups | 1k+ |
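As a rough illustration of how one of these dimensions might be exercised, here is a hedged sketch of the trivial-actor case (the actor class, resource request, and count are assumptions for illustration, not the actual benchmark code):

```python
import ray

ray.init(address="auto")  # connect to the running cluster


@ray.remote(num_cpus=0)  # trivial actors; don't reserve a full core each
class Trivial:
    def ping(self):
        return "ok"


# Create 10k actors and confirm each one responds.
actors = [Trivial.remote() for _ in range(10_000)]
assert ray.get([a.ping.remote() for a in actors]) == ["ok"] * 10_000
```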
## Object Store Benchmarks
| Dimension | Quantity |
|---|---|
| 1 GiB object broadcast (# of nodes) | 50+ |
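A minimal sketch of the broadcast pattern this benchmark measures (the array size and task count here are illustrative assumptions, not the actual benchmark script):

```python
import numpy as np
import ray

ray.init(address="auto")

# Put a ~1 GiB array into the object store once.
obj_ref = ray.put(np.zeros(2**27, dtype=np.float64))  # 2**27 * 8 bytes = 1 GiB


@ray.remote
def fetch(arr):
    # Passing the ObjectRef as a task argument makes Ray resolve it before the
    # task runs, broadcasting the object to whichever node hosts the task.
    return arr.nbytes


results = ray.get([fetch.remote(obj_ref) for _ in range(50)])
assert all(n == 2**30 for n in results)
```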
## Single Node Benchmarks
All single node benchmarks are run on a single m4.16xlarge.
| Dimension | Quantity |
|---|---|
| # of object arguments to a single task | 10,000+ |
| # of objects returned from a single task | 3,000+ |
| # of plasma objects in a single `ray.get` call | 10,000+ |
| # of tasks queued on a single node | 1,000,000+ |
| Maximum `ray.get` numpy object size | 100 GiB+ |
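As a hedged sketch of the first dimension above, here is one way to pass many plasma object arguments to a single task (the count and payloads are illustrative, not the benchmark's actual code):

```python
import ray

ray.init()


@ray.remote
def consume(*parts):
    # Each argument is resolved from the object store before the task runs.
    return len(parts)


# Pass 10,000 small objects as arguments to one task.
refs = [ray.put(i) for i in range(10_000)]
assert ray.get(consume.remote(*refs)) == 10_000
```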