This PR **enables the usage stats only on the release test infrastructure** (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.
Try to clear the result dir before running the e2e.py script, to avoid failures where the directory already exists, or a file cannot be overwritten due to permission issue.
Many release tests have error messages when copying results with `shutil.copytree()`. e.g.
https://buildkite.com/ray-project/periodic-ci/builds/2511#131c0d22-61a3-4dcf-b80a-de37b68ec591/139-450
This PR tries to make the copying process tolerate existing destination directory. There is logic to remove the destination directory, but I'm not sure why it failed.
This error should not be failing the tests though.
This fixes the previous problems from team column revert.
This has 2 additional changes;
alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289
Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time
When a session startup times out due to resources not being available, the session may still come up after that timeout. At that time the control script (e2e.py) is already terminated, so the session runs until the autosuspend limit is hit, incurring unnecessary costs. Instead, we should always trigger session termination on session timeout.
Previous changes failed because a) permission errors b) unzip being unavailable at remote nodes. Instead we are using tar gzip archives now.
This reverts commit 42bcab27e8.
Please review **e2e.py and test_suite belonging to your team**!
This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#
This PR adds a team name to each test suite.
If the name is not specified, it will be reported as unspecified.
If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).
Note that we will aggregate all of test config into a single file, nightly_test.yaml.
This PR adds four staging nightly tests for gcs :
- many_actors
- many_tasks
- many_pgs
- many_nodes
These are benchmark tests that are highly related to gcs ha.
To make it easier to add tests, this PR also change e2e.py a little bit to include testing flags to app config.
Quotation marks were needed in Anyscale app configs to avoid install errors when # were used e.g. in URLs.
Since this has been fixed on the Anyscale side, we can get rid of these.