hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00

Author	SHA1	Message	Date
mwtian	513881584d	[Core] install jemalloc in Ray docker and use jemalloc in `benchmark` release tests (#26112 ) There is mysterious memory usage growth in Ray clusters that disappears when running with jemalloc. Until we are able to figure out the root cause, using jemalloc by default seems to be a good workaround. Because of its efficiency, using jemalloc by default can be beneficial, but we need to run more benchmarks to verify.	2022-06-27 23:26:56 -07:00
mwtian	1483c4553c	use smaller instance for scheduling tests (#25635 ) m5.16xlarge instances have 64 CPUs and 256GB memory, which is overkill for scheduling tests that do not involve much computation. Use the smaller m5.4xlarge instance to save cost and make allocating instances easier.	2022-06-10 17:09:35 +00:00
Kai Fricke	6c5229295e	[ci/release] Support running tests with different python versions (#24843 ) OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.	2022-05-17 17:03:12 +01:00
Jian Xiao	ba500133af	lower the utilization threshold in many tasks scheduling test by 5% (#24758 ) Fix the failure to unbreak nightly and unblock the 1.13 release. The root cause is that the upgrade of gRPC to 1.45.2 made it slightly slower; this is an acceptable regression that is needed to make this upgrade. Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>	2022-05-13 10:44:58 -07:00
Tao Wang	a051e693c1	[Test]Add a time check for task benchmark (#23170 ) The test_many_tasks.py case frequently fails, and we found the reason. We sleep for sleep_time seconds to wait for all tasks to finish, but the actual sleep time is computed as 0.1 * #rounds, where 0.1 is the sleep time of each round. This looks correct, but one factor was missed: the computation time elapsed in each round. In this case, that is the time consumed by `cur_cpus = ray.available_resources().get("CPU", 0)` and `min_cpus_available = min(min_cpus_available, cur_cpus)`; in particular, ray.available_resources() takes quite a long time when the cluster is large (in our case more than 1s with 1500 nodes). What we expected to happen: `for _ in range(sleep_time / 0.1): sleep(0.1)`. What actually happens: `for _ in range(sleep_time / 0.1): do_something(); sleep(0.1)`, where do_something() costs time, sometimes quite a lot. We don't know why ray.available_resources() is slow or whether that is expected, but we can add a time check to make the sleep time precise.	2022-04-11 06:27:04 -07:00
Jiajun Yao	bab19e8e68	Add perf metrics for test_many_tasks.py (#23318 ) Add perf metrics for test_many_tasks.py Use the new smoke test structure	2022-03-22 16:16:42 -07:00
Kai Fricke	8608b64885	[ci/release] Remove old OSS release test infrastructure (#23134 ) Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.	2022-03-14 15:10:52 +00:00
SangBin Cho	2c2d96eeb1	[Nightly tests] Improve k8s testing (#23108 ) This PR improves broken k8s tests. Use exponential backoff on the unstable HTTP path (getting the job status sometimes fails with a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why this is happening). Fix the benchmark tests' resource leak check. The existing one was broken because the job submission uses 0.001 of the node IP resource, which means the cluster_resources can never be the same as the available resources. I fixed the issue by not checking node IP resources. K8s infra doesn't support instances with < 8 CPUs, so I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but not by much.	2022-03-14 03:49:15 -07:00
Jiajun Yao	e4620669a1	[Release Test] Add perf metrics for core scalability tests (#23110 ) * Add perf metrics for core scalability tests * lint	2022-03-14 10:20:39 +09:00
SangBin Cho	549527687f	Migrate scalability tests (#22901 ) This PR migrates scalability tests to the new infra. I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we can support a different path than /release from the tool, but I think this way is cleaner. I am open to suggestions, though. cc @krfricke	2022-03-08 17:22:41 -08:00

10 commits
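
The timing problem described in commit a051e693c1 (a fixed number of 0.1 s rounds runs long when each round also does slow work) can be illustrated with a deadline-based loop. This is only a minimal sketch of that idea, not the code from test_many_tasks.py: the function name `poll_until_deadline`, the `probe` callback, and the injectable `clock` and `sleep` parameters are hypothetical and chosen for illustration.

```python
import time


def poll_until_deadline(sleep_time, probe, interval=0.1,
                        clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly for about sleep_time seconds.

    Returns the minimum value probe() produced. The loop is bounded by a
    wall-clock deadline rather than by a fixed number of rounds, so a slow
    probe() cannot stretch the total wait far beyond sleep_time.
    """
    deadline = clock() + sleep_time
    minimum = float("inf")
    while True:
        # probe() stands in for a slow call such as querying cluster resources.
        minimum = min(minimum, probe())
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
    return minimum
```

With a probe that takes about 1 s, a loop of `sleep_time / 0.1` fixed rounds would run for roughly eleven times `sleep_time`, whereas the deadline-based loop overshoots by at most one probe call.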