In the test_many_tasks.py case, we often saw the test fail, and we tracked down the reason.
We sleep for sleep_time seconds to wait for all tasks to finish, but the actual sleep time is computed as 0.1 * #rounds, where 0.1 is the sleep time per round.
It looks correct, but one factor was missed: the computation time elapsed in each round. In this case, it is the time consumed by
cur_cpus = ray.available_resources().get("CPU", 0)
min_cpus_available = min(min_cpus_available, cur_cpus)
In particular, ray.available_resources() takes quite a long time when the cluster is large (in our case it took more than 1s with 1500 nodes).
The situation we thought we had:
for _ in range(int(sleep_time / 0.1)):
    sleep(0.1)
The situation that actually happens:
for _ in range(int(sleep_time / 0.1)):
    do_something()  # this costs time, sometimes quite a lot
    sleep(0.1)
We don't know why ray.available_resources() is slow, or whether that is expected, but we can track elapsed time so the total wait duration stays accurate. A sketch of this fix follows.
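A minimal sketch of the idea, assuming a 0.1s poll interval (wait_for_tasks and its parameters are illustrative names, not the actual test code): the deadline is computed once from wall-clock time, so slow loop bodies eat into the wait budget instead of extending it.

```python
import time

import ray


def wait_for_tasks(timeout: float, interval: float = 0.1) -> float:
    """Poll until `timeout` seconds of wall-clock time have elapsed.

    Unlike `for _ in range(int(timeout / interval)): sleep(interval)`,
    this charges the time spent in the loop body against the total wait,
    so a slow ray.available_resources() call cannot stretch the wait.
    """
    min_cpus_available = float("inf")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        cur_cpus = ray.available_resources().get("CPU", 0)
        min_cpus_available = min(min_cpus_available, cur_cpus)
        # Sleep only for what is left of this round, never past the deadline.
        remaining = deadline - time.monotonic()
        time.sleep(max(0.0, min(interval, remaining)))
    return min_cpus_available
```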
This PR improves broken k8s tests.
Use exponential backoff on the unstable HTTP path (getting the job status sometimes hits a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why this is happening). A sketch of the backoff pattern is below.
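A minimal sketch of the backoff pattern, assuming a plain requests GET (get_with_backoff, its parameters, and the retried exception type are illustrative, not the actual job-submission client code):

```python
import time

import requests


def get_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0):
    """GET `url`, retrying with exponential backoff on connection errors."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url)
            resp.raise_for_status()
            return resp
        except requests.exceptions.ConnectionError:
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, 8s, ... between attempts.
            time.sleep(base_delay * 2 ** attempt)
```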
Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal available_resources. I fixed the issue by excluding node IP resources from the check (see the sketch below).
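A minimal sketch of what the fixed check could look like, assuming node IP resources are the resource keys prefixed with node: (Ray's convention for per-node IP resources); no_resource_leak is an illustrative name, not the actual test helper:

```python
import ray


def no_resource_leak() -> bool:
    """Check that all resources are released, ignoring node IP resources.

    Job submission holds 0.001 of the `node:<ip>` resource, so those
    keys can never match between total and available resources.
    """
    total = ray.cluster_resources()
    available = ray.available_resources()
    for key, amount in total.items():
        if key.startswith("node:"):
            continue  # skip node IP resources held by job submission
        if available.get(key, 0) != amount:
            return False
    return True
```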
The K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of m5.xlarge. It will increase the cost a bit, but not by much.
This PR migrates scalability tests to the new infra.
I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we could support a path other than /release in the tool, but I think this way is cleaner. I am open to suggestions though. cc @krfricke