ray/benchmarks
SangBin Cho 2c2d96eeb1
[Nightly tests] Improve k8s testing (#23108)
This PR fixes broken k8s tests.

Use exponential backoff on the unstable HTTP path (getting job status sometimes hits a broken connection from the server; I couldn't find the relevant logs to figure out why this happens, unfortunately).
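The retry logic described above could be sketched roughly as follows. This is a minimal illustration, not the actual test code; `with_backoff` and `flaky_get_job_status` are hypothetical names, and the delays are placeholders.

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() with exponential backoff plus jitter on connection errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the failure
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay,
            # with a small random jitter to avoid synchronized retries.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.1))


# Example: a call that fails twice with a broken connection, then succeeds.
calls = {"n": 0}


def flaky_get_job_status():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broken connection from the server")
    return "SUCCEEDED"


status = with_backoff(flaky_get_job_status, base_delay=0.01)
print(status)  # SUCCEEDED
```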
Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, so cluster_resources can never equal available_resources. I fixed the issue by excluding node IP resources from the check.
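Ray reports resources as dicts keyed by resource name, with one `node:<ip>` entry per node. A leak check that ignores those keys, as described above, could be sketched like this (an illustrative helper under those assumptions, not the actual test code):

```python
def resources_leaked(cluster_resources, available_resources):
    """Compare total vs. available resources, ignoring per-node IP resources.

    Job submission holds a tiny fraction (e.g. 0.001) of the 'node:<ip>'
    resource, so those keys never match exactly and must be excluded.
    """
    def strip_node_ip(resources):
        return {k: v for k, v in resources.items() if not k.startswith("node:")}

    return strip_node_ip(cluster_resources) != strip_node_ip(available_resources)


# The 0.001 node-IP usage alone no longer counts as a leak...
cluster = {"CPU": 64.0, "node:10.0.0.1": 1.0}
idle = {"CPU": 64.0, "node:10.0.0.1": 0.999}
print(resources_leaked(cluster, idle))  # False

# ...but a genuinely held CPU still trips the check.
leaky = {"CPU": 63.0, "node:10.0.0.1": 0.999}
print(resources_leaked(cluster, leaky))  # True
```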
K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. This increases the cost a bit, but not by much.
2022-03-14 03:49:15 -07:00

Ray Scalability Envelope

Distributed Benchmarks

All distributed tests are run on 64 nodes with 64 cores/node. The maximum number of nodes is achieved by adding 4-core nodes.

Dimension                                        Quantity
# nodes in cluster (with trivial task workload)  250+
# actors in cluster (with trivial workload)      10k+
# simultaneously running tasks                   10k+
# simultaneously running placement groups        1k+

Object Store Benchmarks

Dimension                            Quantity
1 GiB object broadcast (# of nodes)  50+

Single Node Benchmarks

All single node benchmarks are run on a single m4.16xlarge.

Dimension                                     Quantity
# of object arguments to a single task        10,000+
# of objects returned from a single task      3,000+
# of plasma objects in a single ray.get call  10,000+
# of tasks queued on a single node            1,000,000+
Maximum ray.get numpy object size             100 GiB+