Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matt@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
My experiments (the script) showed that dataset.split_at_indices() with SPREAD tasks has more predictable performance.
Concretely, on 10 m5.4xlarge nodes with 5000 IOPS disks, calling ds.split_at_indices(81) on a 200GB dataset with 400 blocks takes 7-19 seconds without this PR and 7-12 seconds with SPREAD.
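For context, a minimal sketch of the kind of call being benchmarked (a small in-memory dataset stands in for the 200GB input, and the split indices are illustrative):
```python
import ray

# Small stand-in dataset; the benchmark above used ~200GB across 400 blocks.
ds = ray.data.range(1_000_000)

# Split into 81 shards at evenly spaced row indices.
num_splits = 81
step = ds.count() // num_splits
splits = ds.split_at_indices([i * step for i in range(1, num_splits)])
```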
Currently, it's not very easy to figure out why a DatasetPipeline may be underperforming. Add some warnings to help guide the user. As a next step, we can try to default to a good pipeline setting based on these constraints.
This is an experimental feature, so the following changes are added only to the WandbLoggerCallback. We are planning to collect feedback about usage and accordingly update or add these changes to the other W&B integration interfaces.
- Allow reading the W&B project name and group name from environment variables if they are not passed to the callback (a sketch follows below)
- Add external hooks to fetch the W&B API key and to process information about the W&B run
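A rough sketch of the environment-variable fallback from the first bullet; the variable names and import path here are assumptions for illustration, not the authoritative API:
```python
import os

# Import path may differ by Ray version; shown here as an assumption.
from ray.air.callbacks.wandb import WandbLoggerCallback

# If project/group are not passed explicitly, fall back to environment
# variables (names are illustrative).
callback = WandbLoggerCallback(
    project=os.environ.get("WANDB_PROJECT_NAME"),
    group=os.environ.get("WANDB_GROUP_NAME"),
)
```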
Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>
The current Dataset.split_at_indices() implementation suffers from O(n^2) memory usage in the small-split case (see issue) due to recursive splitting of the same blocks. This PR implements a split_at_indices() algorithm that minimizes the number of split tasks and data movement while ensuring that at most one block is used in each split task, for the sake of memory stability.
Co-authored-by: scv119 <scv119@gmail.com>
This PR adds a new experimental flag to the placement group API to avoid a placement group taking all CPUs on each node. It is used internally by AIR to prevent the placement group created by Tune from using all of the CPU resources, which are needed for Dataset tasks.
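A hedged usage sketch; the flag name `_max_cpu_fraction_per_node` and its value are assumptions about the experimental API described above:
```python
from ray.util.placement_group import placement_group

# Reserve CPU bundles but (via the experimental flag) leave a fraction of each
# node's CPUs unreserved so Dataset tasks can still be scheduled.
pg = placement_group(
    bundles=[{"CPU": 2}] * 4,
    strategy="SPREAD",
    _max_cpu_fraction_per_node=0.8,  # assumed flag name; leaves ~20% of CPUs free
)
```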
This PR resolves #20888, where users raised concerns about the dataset-like methods used in DatasetPipeline (such as map_batches, random_shuffle_each_window, etc.). Currently we define those dataset-like methods implicitly through Python setattr/getattr, delegating the real work from the dataset pipeline to the dataset. This does not work well for external developers/users who want to navigate to a method's definition or determine its return type.
So this PR explicitly defines every dataset-like API in the DatasetPipeline class. This gives us an upper bound on how much code we need to duplicate. If we go with this direction, it means that whenever we update or add a method in Dataset, we need to update or add the same method in DatasetPipeline.
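A simplified sketch of the explicit-definition pattern (not the actual DatasetPipeline source; the delegation via `foreach_window` is an assumption for illustration):
```python
class DatasetPipeline:
    def map_batches(self, fn, **kwargs) -> "DatasetPipeline":
        # Explicit definition: IDEs can jump to it and see the return type,
        # instead of routing through a dynamic __getattr__ delegate.
        return self.foreach_window(lambda ds: ds.map_batches(fn, **kwargs))

    def random_shuffle_each_window(self, **kwargs) -> "DatasetPipeline":
        return self.foreach_window(lambda ds: ds.random_shuffle(**kwargs))
```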
When a trial is resumed, it is useful for the user to know which checkpoint it was resumed from.
Signed-off-by: sustr-equi <sustr@equilibretechnologies.com>
Co-authored-by: sustr-equi <sustr@equilibretechnologies.com>
In the previous PR #25883, a subtle regression was introduced for the case where data sizes blow up significantly.
For example, suppose you're reading JPEG image files into a Dataset; they increase in size substantially on decompression. On a small-core cluster (e.g., 4 cores), you end up with 4-8 blocks of ~200MiB each when reading a 1GiB dataset. This can blow up and OOM the node when decompressed (e.g., a 25x size increase).
Previously, the heuristic of using parallelism=200 avoided this small-cluster problem. This PR fixes the issue by raising the minimum parallelism back to 200. As an optimization, we also introduce a minimum block size threshold, which allows using fewer blocks if the data size is very small (<100KiB per block).
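A hedged sketch of the heuristic described above; the constants mirror the numbers in this description but are illustrative rather than the exact implementation:
```python
MIN_PARALLELISM = 200            # floor on the number of read blocks
MIN_BLOCK_SIZE = 100 * 1024      # ~100KiB; below this, fewer blocks are allowed

def choose_parallelism(dataset_size_bytes: int) -> int:
    # Default to the minimum parallelism so small clusters don't end up with a
    # handful of huge blocks that OOM when the data inflates on decompression.
    max_blocks_for_size = max(1, dataset_size_bytes // MIN_BLOCK_SIZE)
    return min(MIN_PARALLELISM, max_blocks_for_size)
```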
There are small typos in:
- doc/source/data/faq.rst
- python/ray/serve/replica.py
Fixes:
- Should read `successfully` rather than `succssifully`.
- Should read `pseudo` rather than `psuedo`.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
As discussed offline, allow feature columns and keep columns to be configured in BatchPredictor for a better scoring UX on test datasets.
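A hedged usage sketch; the argument names follow the description above, and `batch_predictor` / `test_dataset` are assumed to already exist:
```python
# feature_columns: columns fed to the model; keep_columns: columns carried
# through unchanged so predictions can be scored against them.
predictions = batch_predictor.predict(
    test_dataset,
    feature_columns=["f1", "f2"],
    keep_columns=["label"],
)
```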
As discussed on Ray Slack (https://ray-distributed.slack.com/archives/CNECXMW22/p1657051287814569), the changes introduced in #18770 and #20822 caused the concurrency limiting logic in BOHB to work incorrectly. This PR restores the old logic while making use of the set_max_concurrency API (as e.g. HEBO does), maintaining backwards compatibility.
It should be noted that the old logic this PR reintroduces is essentially a hack and should be refactored in the future. This PR is intended to rapidly fix a bug causing search performance to be suboptimal.
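For reference, a minimal sketch of limiting BOHB concurrency through the ConcurrencyLimiter wrapper, which can delegate to a searcher's set_max_concurrency where supported (import paths may differ by Ray version):
```python
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.bohb import TuneBOHB

# Cap the number of concurrently evaluated trials suggested by BOHB.
searcher = ConcurrencyLimiter(TuneBOHB(), max_concurrent=4)
```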
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Embed a console message to gather dogfooding feedback.
With the CLI:
```
(dev) ➜ ray git:(ricky/obs-feedback) ray list --help
Usage: ray list [OPTIONS] {actors|jobs|placement-
groups|nodes|workers|tasks|objects|runtime-envs}
List RESOURCE used by Ray.
RESOURCE is the name of the possible resources from `StateResource`,
i.e. 'jobs', 'actors', 'nodes', ...
==========ALPHA PREVIEW, FEEDBACK NEEDED ===============
State Observability APIs is currently in Alpha-Preview.
If you have any feedback, you could do so at either way as below:
1. Comment on API specification: https://tinyurl.com/api-spec
2. Report bugs/issues with details: https://forms.gle/gh77mwjEskjhN8G46
3. Follow up in #proj-state-obs-dogfooding slack channel.
==========================================================
```
With the Python SDK API:
```
In [3]: from ray.experimental.state.api import list_nodes
In [6]: list_nodes()
2022-07-11 19:45:18,973 INFO api.py:69 --
==========ALPHA PREVIEW, FEEDBACK NEEDED ===============
State Observability APIs is currently in Alpha-Preview.
If you have any feedback, you could do so at either way as below:
1. Comment on API specification: https://tinyurl.com/api-spec
2. Report bugs/issues with details: https://forms.gle/gh77mwjEskjhN8G46
3. Follow up in #proj-state-obs-dogfooding slack channel.
==========================================================
Out[6]:
[{'node_name': '172.31.47.143',
'node_ip': '172.31.47.143',
'resources_total': {'CPU': 8.0,
'object_store_memory': 9149783654.0,
'memory': 18299567310.0,
'node:172.31.47.143': 1.0},
'node_id': '513a3ca212403d234f6dfbe1f7523052637a06e0ee9e4502144f2da3',
'state': 'ALIVE'}]
```
This PR does 2 things.
(1) Use api_server_url as the address, which is consistent with other submission APIs.
(2) When the API does not respond in a timely manner, print a warning every 5 seconds. Below is an example. This is useful when the API responds slowly (e.g., when there are partial failures). Without this, users would see the API hang for 30 seconds, which is pretty bad UX.
(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
Actor tasks are sometimes failed while their dependencies are still being resolved. This can cause hangs or crashes when we resolve the dependencies for a task that has already been canceled. For example, it can lead to a crash in the ref counter when, for the same actor, actor task 1 depends on actor task 2. The sequence is:
1. Actor tasks 1 and 2 are queued; task 1 depends on task 2.
2. Fail actor task 1. We clear its refs, including its dependency on task 2.
3. Fail actor task 2. We store an error as its return value. Since task 1 depends on it, we inline the dependency and try to clear task 1's refs again, causing a ref counting error because we already cleared them in step 2.
This PR fixes the issue by canceling dependency resolution for tasks before failing them. This involves some refactoring of the LocalDependencyResolver. Most of the changes are for testing (split out the unit tests for LocalDependencyResolver into their own suite).
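A schematic of the ordering fix in pseudocode-style Python (the real change is in the C++ core worker's LocalDependencyResolver; all names here are illustrative):
```python
def fail_actor_task(task, resolver, ref_counter):
    # Cancel dependency resolution first, so a later-failing dependency can no
    # longer be inlined into this task and trigger a second ref cleanup.
    resolver.cancel_dependency_resolution(task)
    ref_counter.clear_task_refs(task)
    task.store_error_as_return_value()
```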
Related issue number
Closes #18908.
This limits the maximum number of HTTP requests the dashboard (API server) will accept before rejecting further requests.
This makes sure observability requests do not overload the downstream systems (raylet/GCS) when too many concurrent state observability requests are delegated to the cluster.
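An illustrative sketch of the limiting behavior as an aiohttp middleware (not the actual dashboard code; the limit value and response code are assumptions):
```python
from aiohttp import web

MAX_IN_FLIGHT = 100   # assumed limit for illustration
_in_flight = 0

@web.middleware
async def limit_in_flight_requests(request, handler):
    global _in_flight
    if _in_flight >= MAX_IN_FLIGHT:
        # Reject early instead of piling load onto the raylet/GCS.
        return web.Response(status=429, text="Too many concurrent requests; retry later.")
    _in_flight += 1
    try:
        return await handler(request)
    finally:
        _in_flight -= 1
```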
Fixes the failing hyperopt notebook in CI (as found in #26410). The cause was a mismatch between the keys in points_to_evaluate and the search space; now, an informative exception is raised.
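A hedged sketch of the kind of validation described (not the actual implementation):
```python
def validate_points_to_evaluate(points_to_evaluate, search_space):
    # Every key in each point must also exist in the search space; otherwise
    # raise an informative error instead of failing deep inside hyperopt.
    for i, point in enumerate(points_to_evaluate):
        unknown = set(point) - set(search_space)
        if unknown:
            raise ValueError(
                f"points_to_evaluate[{i}] contains keys not present in the "
                f"search space: {sorted(unknown)}"
            )
```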
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>