This PR is to resolve#20888, where users have concern for the dataset-like methods used in dataset pipeline (such as map_batches, random_shuffle_each_window, etc). The reason is currently we define those dataset-like methods implicitly through Python setattr/getattr, to delegate the real work from dataset pipelien to dataset. This does not work very well with external developers/users if they want to navigate to the definition of method, or determine the method's return value data type.
So this PR is to explicitly define every dataset-like APIs in dataset pipeline class. This gives us a view of how much code we need to duplicate in upper bound. If we go with this direction, this means whenever we update or add a new method in Dataset, we need to update or add the same in DatasetPipeline.
When trail is resumed, it is useful for the user to know from which checkpoint it happened.
Signed-off-by: sustr-equi <sustr@equilibretechnologies.com>
Co-authored-by: sustr-equi <sustr@equilibretechnologies.com>
Following up from #26436, this PR adds a distributed benchmark test for Tensorflow FashionMNIST training. It compares training with Ray AIR with training with vanilla PyTorch.
Signed-off-by: Kai Fricke <kai@anyscale.com>
In the previous PR #25883, a subtle regression was introduced in the case where data sizes blow up significantly.
For example, suppose you're reading jpeg-image files from a Dataset, which increase in size substantially on decompression. On a small-core cluster (e.g., 4 cores), you end up with 4-8 blocks of ~200MiB each when reading a 1GiB dataset. This can blow up to OOM the node when decompressed (e.g., 25x size increase).
Previously the heuristic to use parallelism=200 avoids this small-node problem. This PR avoids this issue by (1) raising the min parallelism back to 200. As an optimization, we also introduce the min block size threshold, which allows using fewer blocks if the data size is really small (<100KiB per block).
There are small typos in:
- doc/source/data/faq.rst
- python/ray/serve/replica.py
Fixes:
- Should read `successfully` rather than `succssifully`.
- Should read `pseudo` rather than `psuedo`.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets.
As discussed on Ray Slack (https://ray-distributed.slack.com/archives/CNECXMW22/p1657051287814569), the changes introduced in #18770 and #20822 have caused the concurrency limiting logic in BOHB to work incorrectly. This PR restores the old logic, while making use of the set_max_concurrency API (as eg. HEBO), maintaining backwards compatibility.
It should be noted that the old logic this PR reintroduces is essentially a hack and should be refactored in the future. This PR is intended to rapidly fix a bug causing search performance to be suboptimal.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Embed console print to gather dogfooding feedback.
With CLIs:
```
(dev) ➜ ray git:(ricky/obs-feedback) ray list --help
Usage: ray list [OPTIONS] {actors|jobs|placement-
groups|nodes|workers|tasks|objects|runtime-envs}
List RESOURCE used by Ray.
RESOURCE is the name of the possible resources from `StateResource`,
i.e. 'jobs', 'actors', 'nodes', ...
==========ALPHA PREVIEW, FEEDBACK NEEDED ===============
State Observability APIs is currently in Alpha-Preview.
If you have any feedback, you could do so at either way as below:
1. Comment on API specification: https://tinyurl.com/api-spec
2. Report bugs/issues with details: https://forms.gle/gh77mwjEskjhN8G46
3. Follow up in #proj-state-obs-dogfooding slack channel.
==========================================================
```
With running SDK python api:
```
In [3]: from ray.experimental.state.api import list_nodes
In [6]: list_nodes()
2022-07-11 19:45:18,973 INFO api.py:69 --
==========ALPHA PREVIEW, FEEDBACK NEEDED ===============
State Observability APIs is currently in Alpha-Preview.
If you have any feedback, you could do so at either way as below:
1. Comment on API specification: https://tinyurl.com/api-spec
2. Report bugs/issues with details: https://forms.gle/gh77mwjEskjhN8G46
3. Follow up in #proj-state-obs-dogfooding slack channel.
==========================================================
Out[6]:
[{'node_name': '172.31.47.143',
'node_ip': '172.31.47.143',
'resources_total': {'CPU': 8.0,
'object_store_memory': 9149783654.0,
'memory': 18299567310.0,
'node:172.31.47.143': 1.0},
'node_id': '513a3ca212403d234f6dfbe1f7523052637a06e0ee9e4502144f2da3',
'state': 'ALIVE'}]
```
This PR is doing 2 things.
(1) Use api_server_url to address which is consistent to other submission APIs.
(2) When the API is not responded timely, it prints a warning every 5 seconds. Below is an example. This is useful when the API is slowly responded (e.g., when there are partial failures). Without this users will see hanging API for 30 seconds, which is a pretty bad UX.
(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Actor tasks are sometimes failed while their dependencies are still being resolved. This can cause hanging or crashes when we resolve the dependencies for a task that has already been canceled. It can lead to a crash from the ref counter when, for the same actor, actor task 1 depends on actor task 2. The sequence is:
Actor tasks 1 and 2 queued, 1 depends on 2.
Fail actor task 1. We clear its refs, including its dependency on 2.
Fail actor task 2. We store an error as its return value. Since task 1 depends on it, we inline the dependency and try to clear task 1's refs again, causing a ref counting error because we already cleared them in step 2.
This PR fixes the issue by canceling dependency resolution for tasks before failing them. This involves some refactoring of the LocalDependencyResolver. Most of the changes are for testing (split out the unit tests for LocalDependencyResolver into their own suite).
Related issue number
Closes#18908.