The current Dataset.split_at_indices() implementation suffers from O(n^2) memory usage in the small-split case (see issue), because the same blocks are split recursively. This PR implements a split_at_indices() algorithm that minimizes the number of split tasks and data movement, while ensuring that at most one block is consumed per split task for memory stability. Co-authored-by: scv119 <scv119@gmail.com>
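The per-block planning idea can be sketched in plain Python (illustrative only; `plan_splits` and its signature are hypothetical stand-ins, not the actual Ray implementation):

```python
import bisect

def plan_splits(block_sizes, indices):
    """Given per-block row counts and global split indices, return which
    blocks need splitting and at which local offsets. Each block appears
    at most once, so each block is consumed by at most one split task;
    blocks whose boundaries align with a split index are moved whole."""
    # Prefix sums give the global starting row of each block.
    starts = [0]
    for n in block_sizes:
        starts.append(starts[-1] + n)
    plan = {}  # block index -> sorted local split offsets within that block
    for idx in indices:
        b = min(bisect.bisect_right(starts, idx) - 1, len(block_sizes) - 1)
        local = idx - starts[b]
        if 0 < local < block_sizes[b]:  # falls strictly inside the block
            plan.setdefault(b, []).append(local)
    return plan

# Example: three blocks of 10 rows each, split at global rows 5 and 20.
# Row 20 is already a block boundary, so only block 0 needs a split task.
print(plan_splits([10, 10, 10], [5, 20]))  # {0: [5]}
```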
This PR adds a new experimental flag to the placement group API to prevent a placement group from taking all CPUs on each node. It is used internally by AIR to stop the placement group created by Tune from claiming all CPU resources, which are needed by Datasets.
This PR resolves #20888, where users raised concerns about the Dataset-like methods on DatasetPipeline (such as map_batches, random_shuffle_each_window, etc.). Currently we define those methods implicitly through Python setattr/getattr, delegating the real work from DatasetPipeline to Dataset. This does not work well for external developers/users who want to navigate to a method's definition or determine its return type.
So this PR explicitly defines every Dataset-like API on the DatasetPipeline class. This gives us an upper bound on how much code we would need to duplicate. If we go in this direction, whenever we update or add a method in Dataset, we need to make the same change in DatasetPipeline.
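The difference between the implicit delegation and the explicit definitions can be illustrated with a minimal sketch (the class names below are hypothetical stand-ins, not Ray's actual classes):

```python
class Dataset:
    def map_batches(self, fn):
        return f"mapped with {fn.__name__}"

class ImplicitPipeline:
    """Forwards unknown attributes to Dataset dynamically. IDEs cannot
    navigate to map_batches or infer its return type from this class."""
    def __init__(self, ds):
        self._ds = ds
    def __getattr__(self, name):
        return getattr(self._ds, name)

class ExplicitPipeline:
    """Defines the method by hand, so go-to-definition and type
    annotations work, at the cost of duplicating the API surface."""
    def __init__(self, ds):
        self._ds = ds
    def map_batches(self, fn) -> str:
        return self._ds.map_batches(fn)

def double(batch):
    return batch

print(ImplicitPipeline(Dataset()).map_batches(double))  # mapped with double
print(ExplicitPipeline(Dataset()).map_batches(double))  # mapped with double
```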
When a trial is resumed, it is useful for the user to know which checkpoint it was resumed from.
Signed-off-by: sustr-equi <sustr@equilibretechnologies.com>
Co-authored-by: sustr-equi <sustr@equilibretechnologies.com>
Following up on #26436, this PR adds a distributed benchmark test for TensorFlow FashionMNIST training. It compares training with Ray AIR against training with vanilla TensorFlow.
Signed-off-by: Kai Fricke <kai@anyscale.com>
In the previous PR #25883, a subtle regression was introduced for the case where data sizes blow up significantly after reading.
For example, suppose you're reading JPEG image files into a Dataset, which increase substantially in size on decompression. On a small-core cluster (e.g., 4 cores), reading a 1 GiB dataset yields 4-8 blocks of ~200 MiB each, which can blow up and OOM the node when decompressed (e.g., a 25x size increase).
Previously, the heuristic of using parallelism=200 avoided this small-node problem. This PR fixes the issue by (1) raising the minimum parallelism back to 200. As an optimization, it also (2) introduces a minimum block size threshold, which allows using fewer blocks if the data size is really small (<100 KiB per block).
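The heuristic can be sketched as follows (the constant names and the helper `choose_parallelism` are illustrative assumptions, not Ray's actual code):

```python
MIN_PARALLELISM = 200           # floor on the number of blocks
MIN_BLOCK_SIZE = 100 * 1024     # 100 KiB threshold per block

def choose_parallelism(data_size_bytes, requested=None):
    """Pick a read parallelism: honor an explicit request, otherwise
    default to MIN_PARALLELISM, but shrink for tiny datasets so each
    block stays at least MIN_BLOCK_SIZE."""
    if requested is not None:
        return requested
    parallelism = MIN_PARALLELISM
    if data_size_bytes // parallelism < MIN_BLOCK_SIZE:
        # Data is very small: use fewer, larger blocks instead.
        parallelism = max(1, data_size_bytes // MIN_BLOCK_SIZE)
    return parallelism

print(choose_parallelism(1024**3))    # 1 GiB -> 200 blocks of ~5 MiB each
print(choose_parallelism(1_000_000))  # 1 MB  -> 9 blocks of >= 100 KiB
```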
There are small typos in:
- doc/source/data/faq.rst
- python/ray/serve/replica.py
Fixes:
- Should read `successfully` rather than `succssifully`.
- Should read `pseudo` rather than `psuedo`.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
As discussed offline, allow configuring feature columns and keep columns in BatchPredictor for a better scoring UX on test datasets.
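A rough illustration of the intended semantics, using plain pandas rather than the real BatchPredictor API (`predict_with_columns` is a hypothetical stand-in): the model sees only the feature columns, while the keep columns are carried through unchanged alongside the predictions.

```python
import pandas as pd

def predict_with_columns(df, predict_fn, feature_columns, keep_columns):
    """Score df with predict_fn on feature_columns only, then attach
    keep_columns (e.g., labels or IDs) to the prediction output."""
    preds = predict_fn(df[feature_columns])
    out = pd.DataFrame({"predictions": preds})
    for col in keep_columns:
        out[col] = df[col].values
    return out

df = pd.DataFrame({"x": [1, 2], "y": [3, 4], "label": [0, 1]})
result = predict_with_columns(
    df,
    lambda feats: feats.sum(axis=1),  # toy "model": sum of features
    feature_columns=["x", "y"],
    keep_columns=["label"],
)
print(list(result.columns))  # ['predictions', 'label']
```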
As discussed on Ray Slack (https://ray-distributed.slack.com/archives/CNECXMW22/p1657051287814569), the changes introduced in #18770 and #20822 caused the concurrency limiting logic in BOHB to work incorrectly. This PR restores the old logic while making use of the set_max_concurrency API (as e.g. HEBO does), maintaining backwards compatibility.
It should be noted that the old logic this PR reintroduces is essentially a hack and should be refactored in the future. This PR is intended to rapidly fix a bug causing search performance to be suboptimal.
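A minimal sketch of the concurrency-limiting pattern via a `set_max_concurrency` hook (hypothetical and heavily simplified; not Tune's or BOHB's actual implementation):

```python
class Searcher:
    """Toy searcher that limits in-flight trials itself, instead of
    relying on an external concurrency limiter."""
    def __init__(self):
        self._max_concurrent = 0  # 0 means unlimited
        self._live = set()

    def set_max_concurrency(self, max_concurrent) -> bool:
        # Returning True signals that the searcher handles the
        # limit internally, so the outer limiter can stand down.
        self._max_concurrent = max_concurrent
        return True

    def suggest(self, trial_id):
        if self._max_concurrent and len(self._live) >= self._max_concurrent:
            return None  # defer: too many in-flight trials
        self._live.add(trial_id)
        return {"lr": 0.01}  # dummy config

    def on_trial_complete(self, trial_id):
        self._live.discard(trial_id)

searcher = Searcher()
searcher.set_max_concurrency(1)
print(searcher.suggest("trial_1"))  # {'lr': 0.01}
print(searcher.suggest("trial_2"))  # None (deferred: limit reached)
```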
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Embed a console print to gather dogfooding feedback.
With CLIs:
```
(dev) ➜ ray git:(ricky/obs-feedback) ray list --help
Usage: ray list [OPTIONS] {actors|jobs|placement-
groups|nodes|workers|tasks|objects|runtime-envs}
List RESOURCE used by Ray.
RESOURCE is the name of the possible resources from `StateResource`,
i.e. 'jobs', 'actors', 'nodes', ...
==========ALPHA PREVIEW, FEEDBACK NEEDED ===============
State Observability APIs is currently in Alpha-Preview.
If you have any feedback, you could do so at either way as below:
1. Comment on API specification: https://tinyurl.com/api-spec
2. Report bugs/issues with details: https://forms.gle/gh77mwjEskjhN8G46
3. Follow up in #proj-state-obs-dogfooding slack channel.
==========================================================
```
With the Python SDK API:
```
In [3]: from ray.experimental.state.api import list_nodes
In [6]: list_nodes()
2022-07-11 19:45:18,973 INFO api.py:69 --
==========ALPHA PREVIEW, FEEDBACK NEEDED ===============
State Observability APIs is currently in Alpha-Preview.
If you have any feedback, you could do so at either way as below:
1. Comment on API specification: https://tinyurl.com/api-spec
2. Report bugs/issues with details: https://forms.gle/gh77mwjEskjhN8G46
3. Follow up in #proj-state-obs-dogfooding slack channel.
==========================================================
Out[6]:
[{'node_name': '172.31.47.143',
'node_ip': '172.31.47.143',
'resources_total': {'CPU': 8.0,
'object_store_memory': 9149783654.0,
'memory': 18299567310.0,
'node:172.31.47.143': 1.0},
'node_id': '513a3ca212403d234f6dfbe1f7523052637a06e0ee9e4502144f2da3',
'state': 'ALIVE'}]
```
This PR does two things:
(1) Use api_server_url as the address, which is consistent with the other submission APIs.
(2) When the API does not respond in a timely manner, print a warning every 5 seconds. This is useful when the API responds slowly (e.g., when there are partial failures); without it, users would see the API hang for 30 seconds, which is a pretty bad UX. Below is an example:
```
(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
```
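The warning loop might look roughly like this (a hypothetical sketch of the described behavior, not the actual implementation; `wait_with_warnings` and its parameters are illustrative):

```python
import time

WARN_INTERVAL = 5.0  # seconds between progress warnings

def wait_with_warnings(poll_fn, timeout=30.0,
                       address="http://127.0.0.1:8265"):
    """Poll poll_fn until it returns a non-None result, printing an
    elapsed/timeout warning every WARN_INTERVAL seconds instead of
    hanging silently; raise TimeoutError past the deadline."""
    start = time.time()
    next_warn = WARN_INTERVAL
    while True:
        result = poll_fn()
        if result is not None:
            return result
        elapsed = time.time() - start
        if elapsed >= timeout:
            raise TimeoutError(
                f"No response from {address} after {timeout} seconds")
        if elapsed >= next_warn:
            print(f"({elapsed:.2f} / {timeout} seconds) Waiting for the "
                  f"response from the API server address {address}.")
            next_warn += WARN_INTERVAL
        time.sleep(0.1)

# A poll function that succeeds immediately returns with no warning.
print(wait_with_warnings(lambda: "ok"))  # ok
```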