Commit graph

10 commits

Author SHA1 Message Date
mwtian
513881584d
[Core] install jemalloc in Ray docker and use jemalloc in benchmark release tests (#26112)
There is mysterious memory usage growth in Ray clusters that disappears when running with jemalloc. Until we figure out the root cause, using jemalloc by default seems like a good workaround. Because of its efficiency, using jemalloc by default may also be beneficial in its own right, but we need to run more benchmarks to verify.
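A minimal sketch of the general mechanism only, not the actual release-test tooling: on Linux, jemalloc can be swapped in for a benchmark process via LD_PRELOAD. The library path and workload script below are assumptions for illustration.

```python
# Hedged sketch: run a benchmark workload with jemalloc preloaded.
# The library path and script name are assumptions, not Ray's actual setup.
import os
import subprocess

JEMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"  # assumed install location

env = dict(os.environ)
env["LD_PRELOAD"] = JEMALLOC_PATH  # replaces the default allocator for the child process

subprocess.run(["python", "many_tasks_benchmark.py"], env=env, check=True)
```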
2022-06-27 23:26:56 -07:00
mwtian
1483c4553c
use smaller instance for scheduling tests (#25635)
m5.16xlarge instances have 64 CPUs and 256 GB of memory, which is overkill for scheduling tests that do not involve much computation. Use the smaller m5.4xlarge instance to save cost and make allocating instances easier.
2022-06-10 17:09:35 +00:00
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions (#24843)
OSS release tests currently run with a hardcoded Python 3.7 base. In the future we will want to run tests on different Python versions.
This PR adds support for a new `python` field in the test configuration. The `python` field will determine both the base image used in the Buildkite runner Docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.

Note that in Buildkite, we will still only wait for the Python 3.7 base image before kicking off tests. That is acceptable: we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and then kick off a 3.8 test, that runner will only wait for perhaps 5-10 more minutes.
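A minimal sketch, assuming a simplified config shape (the actual release test schema and image names are not shown in this message): how a per-test `python` field could drive both image choices.

```python
# Hedged sketch: the dict shape, default, and image name patterns below are
# assumptions for illustration, not the actual release test schema.
from typing import Dict

DEFAULT_PYTHON = "3.7"

def resolve_images(test_config: Dict) -> Dict[str, str]:
    py = test_config.get("python", DEFAULT_PYTHON)  # e.g. "3.8"
    tag = "py" + py.replace(".", "")                 # e.g. "py38"
    return {
        "buildkite_runner_image": f"some-registry/ray-runner:{tag}",
        "anyscale_cluster_env_base": f"some-registry/ray-base:{tag}",
    }

print(resolve_images({"name": "example_test", "python": "3.8"}))
```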
2022-05-17 17:03:12 +01:00
Jian Xiao
ba500133af
lower the utilization threshold in many tasks scheduling test by 5% (#24758)
Fix the failure to unbreak nightly and unblock 1.13 release.

The root cause is that the upgrade of gRPC to 1.45.2 made things slightly slower; this is an acceptable regression, which is needed to make the upgrade.
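For illustration only, a hypothetical sketch of what relaxing a utilization threshold by 5% means for the test's assertion; the names and numbers below are assumptions, not the real test code.

```python
# Hypothetical threshold check; 0.90 vs. an assumed earlier 0.95 illustrates
# "lower the threshold by 5%", not the test's actual numbers.
UTILIZATION_THRESHOLD = 0.90

def check_cpu_utilization(used_cpus: float, total_cpus: float) -> None:
    utilization = used_cpus / total_cpus
    assert utilization >= UTILIZATION_THRESHOLD, (
        f"CPU utilization {utilization:.1%} is below {UTILIZATION_THRESHOLD:.0%}"
    )
```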

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-05-13 10:44:58 -07:00
Tao Wang
a051e693c1
[Test] Add a time check for task benchmark (#23170)
In the test_many_tasks.py case, we often see the test failing, and we have found the reason.

We sleep for sleep_time seconds to wait for all tasks to finish, but the actual sleep time is computed as 0.1 * #rounds, where 0.1 is the sleep time of each round.
That looks right, but one factor was missed: the computation time elapsed in each round. In this case, that is the time consumed by

            cur_cpus = ray.available_resources().get("CPU", 0)
            min_cpus_available = min(min_cpus_available, cur_cpus)
and especially ray.available_resources(), which takes quite a long time when the cluster is large (in our case it took more than 1 s with 1500 nodes).

The situation we thought we had:

    for _ in range(int(sleep_time / 0.1)):  # one round per 0.1 s of the budget
        time.sleep(0.1)
The situation that actually happens:

    for _ in range(int(sleep_time / 0.1)):
        do_something()  # costs time, sometimes quite a lot (e.g., ray.available_resources())
        time.sleep(0.1)
We don't know why ray.available_resources() is slow, or whether that is expected, but we can add a time check to make the total sleep time precise.
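A minimal sketch of the idea described above, assuming ray.init() has already been called; this is not the exact test code. It waits on a wall-clock deadline so the time spent in ray.available_resources() no longer stretches the total wait.

```python
import time
import ray

def wait_for_tasks(sleep_time: float, poll_interval: float = 0.1) -> float:
    """Poll available CPUs until a wall-clock deadline, not a fixed round count."""
    min_cpus_available = float("inf")
    deadline = time.monotonic() + sleep_time
    while time.monotonic() < deadline:
        cur_cpus = ray.available_resources().get("CPU", 0)  # can take >1 s on large clusters
        min_cpus_available = min(min_cpus_available, cur_cpus)
        # Sleep only for what remains of this round (or of the whole deadline).
        time.sleep(max(0.0, min(poll_interval, deadline - time.monotonic())))
    return min_cpus_available
```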
2022-04-11 06:27:04 -07:00
Jiajun Yao
bab19e8e68
Add perf metrics for test_many_tasks.py (#23318)
Add perf metrics for test_many_tasks.py
Use the new smoke test structure
2022-03-22 16:16:42 -07:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
SangBin Cho
2c2d96eeb1
[Nightly tests] Improve k8s testing (#23108)
This PR improves broken k8s tests.

* Use exponential backoff on the unstable HTTP path (fetching the job status sometimes gets a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why this happens). A minimal sketch follows after this list.
* Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal available_resources. Fixed by excluding node IP resources from the check.
* The K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. It increases the cost a bit, but not by much.
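A minimal sketch of the exponential backoff mentioned above; the callable and delay values are placeholders, not the actual test code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(fetch: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0) -> T:
    """Retry a flaky call (e.g. fetching job status) with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1 s, 2 s, 4 s, ...
```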
2022-03-14 03:49:15 -07:00
Jiajun Yao
e4620669a1
[Release Test] Add perf metrics for core scalability tests (#23110)
* Add perf metrics for core scalability tests

* lint
2022-03-14 10:20:39 +09:00
SangBin Cho
549527687f
Migrate scalability tests (#22901)
This PR migrates scalability tests to the new infra.

I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we could support a path other than /release in the tool, but I think this way is cleaner. I am open to suggestions though, cc @krfricke
2022-03-08 17:22:41 -08:00