This PR fixes broken k8s tests.
Use exponential backoff on the unstable HTTP path (fetching job status sometimes fails because the server drops the connection; unfortunately, I couldn't find the relevant logs to figure out why this happens).
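For reviewers, the retry pattern looks roughly like the sketch below. This is a minimal illustration, not the code in this PR: `retry_with_backoff`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on retryable errors.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Double the delay on each failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A status poll would then be wrapped as `retry_with_backoff(lambda: get_status(job_id))`, where `get_status` stands in for whatever function performs the HTTP request.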
Fix the resource leak check in the benchmark tests. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal the available resources. I fixed the issue by excluding node IP resources from the check.
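A minimal sketch of the idea, assuming node IP resources appear under keys with a `node:` prefix. `has_resource_leak` and `ignored_prefixes` are hypothetical names for illustration, not the function changed in this PR.

```python
def has_resource_leak(cluster_resources, available_resources,
                      ignored_prefixes=("node:",)):
    """Return True if total and available resources differ.

    Keys starting with an ignored prefix (assumed here to be the node IP
    resources) are dropped from both sides before comparing.
    """
    def strip(resources):
        return {
            key: value
            for key, value in resources.items()
            if not key.startswith(ignored_prefixes)
        }

    return strip(cluster_resources) != strip(available_resources)
```

In a test, the two arguments would come from `ray.cluster_resources()` and `ray.available_resources()`. With the node IP keys removed, the 0.001 held by job submission no longer makes the two dictionaries differ.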
The K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. This will increase the cost a bit, but not by much.
This PR adds four tests for many tasks:
many short tasks sent from a single node
many short tasks sent from multiple nodes
many long tasks sent from multiple nodes
many long tasks sent from a single node
TODO: migrate many nodes actor tests to this one.
The scheduling envelope should contain:
(tasks): scheduling_test_many_xx_tasks_yy_nodes
(actors): many_nodes_actor_test (to be combined with this one)
(shuffle): pipelined_ingestion_1500_gb_15_windows
(shuffle): dask_on_ray_1tb_sort
I added a memory monitor to the scalability tests. This broke the tests because creating a memory monitor requires node resources (so that it is scheduled on the head node), and that broke the "resource leak" check. Ideally, the resource leak check should be more robust, but I fixed the issue in a simpler way for now. In the near future, the memory monitor will become a fixture, and at that point we should fix the resource leak function itself.
This adds memory monitoring to the scalability envelope tests so that we can compare peak memory usage between non-HA and HA.
NOTE: the current way of adding the memory monitor is not great, and we should implement a fixture to support this better, but that work has not started yet.
* Do not divide by zero
* Don't take min or mean of an empty list
* max workers 0 for head node in distributed benchmark
* test
* Correct the type annotation
* comment grammar tweak
* message
* docs
* test
* Move test cli to large tests.
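
The first two fixes in the list above (no division by zero, no min or mean of an empty list) follow a common guard pattern. A minimal sketch, with hypothetical helper names that are not taken from this PR:

```python
from statistics import mean


def summarize(values):
    """Return min and mean of values, or None for each if the list is empty."""
    if not values:
        # min() and mean() both raise on an empty sequence, so bail out early.
        return {"min": None, "mean": None}
    return {"min": min(values), "mean": mean(values)}


def throughput(count, elapsed_seconds):
    """Return items per second, or 0.0 when no time has elapsed."""
    if elapsed_seconds <= 0:
        return 0.0
    return count / elapsed_seconds
```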