Long-running tests are cheap and low-overhead (they use only a small number of nodes). We should promote them to run every day so we can catch regressions quickly.
Fix the Dask-on-Ray large-scale test on K8s. Basically, `chmod` requires root access, which we don't have by default in the K8s cluster. We don't need `chmod` anyway (I verified the test passes without it).
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following ways:
- [x] We reverted the renaming of the existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. My consolidation efforts with the `ray-` prefix were a little overeager in that regard; it's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so the change should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: one Sphinx dependency (`sphinx-tabs`) needs to be replaced by a newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
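For reference, the planned swap would look roughly like this in the docs' `conf.py` (a hedged sketch; the surrounding extension list is illustrative, not from this PR):

```python
# docs conf.py -- planned follow-up: swap sphinx-tabs for sphinx-panels,
# since sphinx-tabs prevents algolia.js from loading.
extensions = [
    # "sphinx_tabs.tabs",  # old: conflicts with the Algolia docsearch script
    "sphinx_panels",       # new: provides tabbed content without the conflict
    # ... remaining extensions unchanged ...
]
```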
This PR adds four many-task tests (a minimal sketch of the pattern follows the list):
- many short tasks sent from a single node
- many short tasks sent from multiple nodes
- many long tasks sent from multiple nodes
- many long tasks sent from a single node
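A minimal sketch of the single-node short-task pattern (the task count and task body are illustrative, not the actual benchmark values):

```python
import ray

# Assumes an already-running cluster; the driver sits on a single node.
ray.init(address="auto")

@ray.remote
def short_task() -> int:
    # Trivial task body; the benchmark measures scheduling throughput.
    return 1

# Submit many short tasks from one driver and wait for all of them.
refs = [short_task.remote() for _ in range(10_000)]
ray.get(refs)
```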
TODO: migrate the many-nodes actor tests into this suite.
The scheduling envelope should contain:
(tasks): scheduling_test_many_xx_tasks_yy_nodes
(actors): many_nodes_actor_test (to be combined with this one)
(shuffle): pipelined_ingestion_1500_gb_15_windows
(shuffle): dask_on_ray_1tb_sort
The first migration of tests into K8s. We are taking a conservative approach (migrating slowly while keeping the existing test suites). Once things are confirmed to be stable, we will migrate faster.
This fixes the problems from the previous team-column revert.
It includes two additional changes:
- The alert handler now receives the `team` argument; its absence was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289
- Previously, tests without a team column raised an exception; I weakened the condition to a warning log. I will eventually change it back to raising an exception, but for a smoother transition we will log a warning instead for a short time.
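A hedged sketch of the weakened check (function and field names are assumptions, not the actual release-test code):

```python
import logging

logger = logging.getLogger(__name__)

def check_team(test: dict) -> None:
    # Transitional behavior: warn instead of raising so existing configs
    # keep running while team names are being filled in.
    if "team" not in test:
        logger.warning(
            "Test %s has no team specified; please add a 'team' field.",
            test.get("name", "<unknown>"),
        )
```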
When a session startup times out because resources are not available, the session may still come up after the timeout. By then the control script (e2e.py) has already terminated, so the session runs until the autosuspend limit is hit, incurring unnecessary costs. Instead, we should always trigger session termination on session timeout.
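Roughly, the intended control flow (the helper names here are hypothetical stand-ins, not the actual e2e.py functions):

```python
# Hypothetical helpers standing in for the real session-control calls.
def wait_until_session_running(session_id: str, timeout: float) -> None: ...
def terminate_session(session_id: str) -> None: ...

def start_session_or_terminate(session_id: str, timeout_s: float) -> None:
    try:
        wait_until_session_running(session_id, timeout=timeout_s)
    except TimeoutError:
        # Always tear the session down on timeout so it cannot come up later
        # and idle until the autosuspend limit is reached.
        terminate_session(session_id)
        raise
```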
The previous changes failed because of a) permission errors and b) `unzip` being unavailable on remote nodes. We now use gzipped tar archives instead.
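The switch is straightforward with the standard library; a sketch (paths are illustrative):

```python
import tarfile

# Create a gzipped tar archive (no root access or external tools needed).
with tarfile.open("payload.tar.gz", "w:gz") as tar:
    tar.add("workdir", arcname=".")

# Extract on the remote node without relying on an installed `unzip`.
with tarfile.open("payload.tar.gz", "r:gz") as tar:
    tar.extractall("dest")
```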
This reverts commit 42bcab27e8.
`RAY_GCS_ACTOR_SCHEDULING_ENABLED` is wrong; it should be `RAY_gcs_actor_scheduling_enabled`. Since GCS-based actor scheduling is not enabled yet, I just removed this flag.
Expands the `to_torch` method for Datasets with:
* An ability to output a list/dict of feature tensors instead of just one (by setting `feature_columns` to a list of lists or a dict of lists)
* An ability to choose whether the label should be unsqueezed or not
* An ability to pass `None` as the label (for prediction).
Furthermore, this changes how the `feature_column_dtypes` argument works. Previously, it took a list of dtypes, one per feature. However, since the features were concatenated into a single tensor in the end, only one dtype mattered (the widest one). Now, this argument expects a single dtype, which is applied to the feature tensor (or to each tensor in the list/dict if `feature_columns` is a list of lists or a dict of lists).
Unit tests for all cases are included.
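A hedged usage sketch of the expanded API (column names and data are illustrative; check the exact signature in your Ray version):

```python
import ray

ds = ray.data.from_items(
    [{"a": i, "b": 2 * i, "c": 3 * i, "label": i % 2} for i in range(8)]
)

# Dict of feature tensors: each key yields one tensor built from its columns.
train_ds = ds.to_torch(
    label_column="label",
    feature_columns={"group1": ["a", "b"], "group2": ["c"]},
    batch_size=4,
)

# Prediction mode: no label column, so batches carry features only.
pred_ds = ds.to_torch(label_column=None, feature_columns=["a", "b", "c"])
```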
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Please review **e2e.py and the test suites belonging to your team**!
This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#
This PR adds a team name to each test suite.
If the name is not specified, it will be reported as "unspecified".
If you run a test locally and the new test suite doesn't have a team name specified, it will raise an exception (this way, we can avoid missing team names in the future).
Note that we will aggregate all of the test configs into a single file, nightly_test.yaml.
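For illustration, an entry might carry the team like this (a hypothetical shape shown as the Python dict the YAML parses into; fields other than `team` are assumptions):

```python
test_entry = {
    "name": "many_nodes_actor_test",
    "team": "core",  # new field; reported as "unspecified" when absent
    # ... remaining test configuration unchanged ...
}
```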
This adds memory monitoring to the scalability envelope tests so that we can compare the peak memory usage for both non-HA and HA.
NOTE: the current way of adding the memory monitor is not great; we should implement a fixture to support this better, but that work is not in progress yet.
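A hedged sketch of the fixture we'd like to have eventually (names and the sampling approach are assumptions, not the current code):

```python
import threading
import time

import psutil
import pytest


class PeakRssMonitor:
    """Samples this process's RSS in a background thread and tracks the peak."""

    def __init__(self, interval_s: float = 0.5):
        self.interval_s = interval_s
        self.peak_bytes = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        proc = psutil.Process()
        while not self._stop.is_set():
            self.peak_bytes = max(self.peak_bytes, proc.memory_info().rss)
            time.sleep(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


@pytest.fixture
def memory_monitor():
    # Wraps each test with peak-memory tracking; read peak_bytes afterwards.
    monitor = PeakRssMonitor()
    monitor.start()
    yield monitor
    monitor.stop()
```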
This PR adds four staging nightly tests for GCS:
- many_actors
- many_tasks
- many_pgs
- many_nodes
These are benchmark tests that are highly relevant to GCS HA.
To make it easier to add tests, this PR also changes e2e.py a little to include testing flags in the app config.
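A hedged sketch of how such flags could be threaded into an app config (the `env_vars` key and flag names are assumptions, not necessarily what e2e.py does):

```python
def apply_test_flags(app_config: dict, flags: dict) -> dict:
    # Merge per-test flags into the app config's environment so the
    # cluster comes up with them set, e.g. GCS feature toggles.
    env = app_config.setdefault("env_vars", {})
    env.update(flags)
    return app_config

config = apply_test_flags(
    {"base_image": "anyscale/ray:nightly"},
    {"RAY_gcs_actor_scheduling_enabled": "true"},
)
```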
We fixed the groupby issue in cuj2; this syncs the change into the nightly test. The test doesn't need to use GPUs at all; it returns soon after data ingestion finishes.
Quotation marks were needed in Anyscale app configs to avoid install errors when `#` was used, e.g. in URLs.
Since this has been fixed on the Anyscale side, we can get rid of them.