This PR fixes broken k8s tests:

- Use exponential backoff on the unstable HTTP path (getting the job status sometimes hits a broken connection from the server; I couldn't find the relevant logs to figure out why this happens, unfortunately). A sketch follows below.
- Fix the benchmark tests' resource leak check. The existing check was broken because job submission consumes 0.001 of the node IP resource, which means `cluster_resources` can never equal `available_resources`. Fixed by excluding node IP resources from the check.
- K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but not by much.
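A minimal sketch of the exponential-backoff retry, assuming a generic HTTP GET against a job-status endpoint (the URL path, timeout, and retry counts here are illustrative assumptions, not Ray's actual job submission API):

```python
import time

import requests


def get_job_status(address: str, job_id: str, max_attempts: int = 5) -> dict:
    """Fetch job status, retrying with exponential backoff on connection errors."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            # Hypothetical endpoint; the real job submission API may differ.
            resp = requests.get(f"{address}/api/jobs/{job_id}", timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= 2  # double the wait between attempts
    raise RuntimeError("unreachable")
```

And a sketch of the adjusted leak check, assuming the fix simply skips the per-node `node:<ip>` resources that job submission partially consumes (`ray.cluster_resources()` and `ray.available_resources()` are real Ray APIs; the helper itself is hypothetical):

```python
import ray


def resources_released() -> bool:
    """Compare total vs. available resources, ignoring node IP resources."""
    total = {k: v for k, v in ray.cluster_resources().items()
             if not k.startswith("node:")}
    avail = {k: v for k, v in ray.available_resources().items()
             if not k.startswith("node:")}
    return total == avail
```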
# Ray Scalability Envelope
## Distributed Benchmarks
All distributed tests are run on 64 nodes with 64 cores/node. The maximum number of nodes is achieved by adding 4-core nodes.
| Dimension | Quantity |
|---|---|
| # nodes in cluster (with trivial task workload) | 250+ |
| # actors in cluster (with trivial workload) | 10k+ |
| # simultaneously running tasks | 10k+ |
| # simultaneously running placement groups | 1k+ |
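As a rough illustration of how one of these dimensions might be exercised, here is a hedged sketch of the trivial-actor case (the actor class, resource request, and count are assumptions for illustration, not the actual benchmark code):

```python
import ray

ray.init(address="auto")  # connect to the running cluster


@ray.remote(num_cpus=0)  # trivial actors; don't reserve a full core each
class Trivial:
    def ping(self):
        return "ok"


# Create 10k actors and confirm each one responds.
actors = [Trivial.remote() for _ in range(10_000)]
assert ray.get([a.ping.remote() for a in actors]) == ["ok"] * 10_000
```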
## Object Store Benchmarks
| Dimension | Quantity |
|---|---|
| 1 GiB object broadcast (# of nodes) | 50+ |
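A minimal sketch of the broadcast pattern this benchmark measures (the array size and task count here are illustrative assumptions, not the actual benchmark script):

```python
import numpy as np
import ray

ray.init(address="auto")

# Put a ~1 GiB array into the object store once.
obj_ref = ray.put(np.zeros(2**27, dtype=np.float64))  # 2**27 * 8 bytes = 1 GiB


@ray.remote
def fetch(arr):
    # Passing the ObjectRef as a task argument makes Ray resolve it before the
    # task runs, broadcasting the object to whichever node hosts the task.
    return arr.nbytes


results = ray.get([fetch.remote(obj_ref) for _ in range(50)])
assert all(n == 2**30 for n in results)
```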
## Single Node Benchmarks
All single node benchmarks are run on a single m4.16xlarge.
| Dimension | Quantity |
|---|---|
| # of object arguments to a single task | 10,000+ |
| # of objects returned from a single task | 3,000+ |
| # of plasma objects in a single `ray.get` call | 10,000+ |
| # of tasks queued on a single node | 1,000,000+ |
| Maximum `ray.get` numpy object size | 100 GiB+ |
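As a hedged sketch of the first dimension above, here is one way to pass many plasma object arguments to a single task (the count and payloads are illustrative, not the benchmark's actual code):

```python
import ray

ray.init()


@ray.remote
def consume(*parts):
    # Each argument is resolved from the object store before the task runs.
    return len(parts)


# Pass 10,000 small objects as arguments to one task.
refs = [ray.put(i) for i in range(10_000)]
assert ray.get(consume.remote(*refs)) == 10_000
```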