Removes deprecated APIs:
- serve.start()
- get_handle()
Rewrites the ServeHandle doc snippet to use the recommended workflow for ServeHandles: access them only from other deployments, and pass Deployments as input args to `.bind()`, which resolves them to ServeHandles at runtime.
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
- Currently not all code under ray-core/doc_code is covered by CI.
- tf_example.py and torch_example.py are not used anywhere.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
This PR:
- adds a page of guidance on GPU deployment with Ray/K8s. This page is a modified and slightly expanded version of the existing page https://docs.ray.io/en/latest/cluster/kubernetes-gpu.html
- moves managed K8s service intro links to their own page
We decided to allow escaping the parent pg via `PlacementGroupSchedulingStrategy(placement_group=None)` instead of using "DEFAULT". The docs already describe this behavior, but the code still does not allow it.
1. Add docs for the Python SDK and docstrings for the public SDK.
2. Rename list -> ray_list and get -> ray_get for better naming.
3. Fix some typos.
4. Automatically translate the address to the API server URL.
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Why are these changes needed?
Editing pass over the tensor support docs for clarity:
- Make heavy use of tabbed guides to condense the content
- Rewrite examples to be more organized around creating vs reading tensors
- Use doc_code for testing
There is a small bug in the docs example for custom command-based syncers. This PR fixes it and adds a test that covers the change.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Replaces more uses of `tune.run()` in examples and docstrings with `Tuner.fit()`.
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
This PR adds the `--keep-going` flag to the `make html` target for building the Ray docs. When there is a lint failure in CI, the BuildKite log will now show all lint failures instead of just the first one. Although the build continues past the first lint error, it still fails at the end.
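The "report everything, still fail" behavior can be sketched in a few lines of Python. This is only an illustration of the control flow, not how Sphinx or the Ray docs build implements it; `run_checks` and its arguments are hypothetical names introduced for this example.

```python
from typing import Callable, List, Tuple


def run_checks(
    checks: List[Tuple[str, Callable[[], None]]], keep_going: bool
) -> Tuple[List[str], int]:
    """Run each check and return (failure messages, exit code).

    With keep_going=False the loop stops at the first failure.
    With keep_going=True every check runs, but the exit code is
    still non-zero if anything failed.
    """
    failures: List[str] = []
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # a failing check raises
            failures.append(f"{name}: {exc}")
            if not keep_going:
                break
    exit_code = 1 if failures else 0
    return failures, exit_code
```

With `keep_going=True`, the returned list contains one entry per failing check, which mirrors a CI log that shows every lint failure while the build is still marked as failed.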
Signed-off-by: Cade Daniel <cade@anyscale.com>
We previously added automatic tensor extension casting on Datasets transformation outputs to allow the user to not have to worry about tensor column casting; however, this current state creates several issues:
1. Not all tensors are supported, which means that we’ll need to have an opaque object dtype (i.e. ndarray of ndarray pointers) fallback for the Pandas-only case. Known unsupported tensor use cases:
a. Heterogeneous-shaped (i.e. ragged) tensors
b. Struct arrays
2. UDFs will expect a NumPy column and won’t know what to do with our TensorArray type. E.g., torchvision transforms don’t respect the array protocol (which they should), and instead only support Torch tensors and NumPy ndarrays; passing a TensorArray column or a TensorArrayElement (a single item in the TensorArray column) fails.
3. Implicit casting with object dtype fallback on UDF outputs can make the input type to downstream UDFs nondeterministic, where the user won’t know if they’ll get a TensorArray column or an object dtype column.
4. The tensor extension cast fallback warning spams the logs.
This PR:
1. Adds automatic casting of tensor extension columns to NumPy ndarray columns for Datasets UDF inputs, meaning the UDFs will never have to see tensor extensions and that the UDF input column types will be consistent and deterministic; this fixes both (2) and (3).
2. No longer implicitly falls back to an opaque object dtype when TensorArray casting fails (e.g. for ragged tensors), and instead raises an error; this fixes (4) but removes our support for (1).
3. Adds a global `enable_tensor_extension_casting` config flag, which is True by default, that controls whether we perform this automatic casting. Turning off the implicit casting provides a path for (1), where the tensor extension can be avoided if working with ragged tensors in Pandas land. Turning off this flag also allows users to control their tensor extension casting explicitly, if they want to work with the extension type in their UDFs to reap the benefits of fewer data copies, more efficient slicing, stronger column typing, etc.
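The "raise instead of silently falling back" rule from items 2 and 3 can be illustrated with a small, standard-library-only sketch. It is not the Datasets implementation: `tensor_shape` and `check_castable` are hypothetical helpers, nested Python lists stand in for tensors, and the `enable_casting` argument only mimics the role of the global flag.

```python
from typing import Optional, Sequence, Tuple


def tensor_shape(value) -> Tuple[int, ...]:
    """Return the shape of a rectangular nested list.

    Raises ValueError if any nesting level is ragged.
    """
    if not isinstance(value, (list, tuple)):
        return ()  # scalar
    if len(value) == 0:
        return (0,)
    child_shapes = {tensor_shape(child) for child in value}
    if len(child_shapes) != 1:
        raise ValueError("ragged tensor element")
    return (len(value),) + child_shapes.pop()


def check_castable(
    column: Sequence, enable_casting: bool = True
) -> Optional[Tuple[int, ...]]:
    """Return the common element shape if the column can be cast.

    Returns None when casting is disabled (the column is left alone).
    Raises ValueError for heterogeneous shapes instead of falling back
    to an opaque object column.
    """
    if not enable_casting:
        return None
    shapes = {tensor_shape(element) for element in column}
    if len(shapes) > 1:
        raise ValueError(
            f"cannot cast column with heterogeneous shapes {sorted(shapes)}; "
            "disable casting to keep ragged tensors as-is"
        )
    return shapes.pop() if shapes else None
```

A column of uniformly shaped elements yields a single shape, a ragged column raises, and disabling the flag skips the check entirely, which is the escape hatch described for ragged tensors.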
The Serve CLI and REST API always set the host to `0.0.0.0` and the port to Serve's default. This change adds `host` and `port` as top-level options in the Serve config file, so users can manually set the host and port of their Serve application to different values.
This change introduces a new Serve config file format:
```yaml
import_path: ...
runtime_env: ...
host: ...
port: ...
deployments: ...
...
```
`host` and `port` are optional and can be omitted. A running Serve application's `host` and `port` cannot be changed. If a user tries to `serve deploy` a config file with different `host` and `port` options than an already-running Serve application, `serve deploy` will fail without making any changes to the application. The user must `serve shutdown` their application and restart it with `serve deploy` to change their `host` and `port`.
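The rule above (a deploy with different `host`/`port` values fails without changing anything) can be sketched as a small check. This is a minimal illustration under assumptions, not Serve's actual code: `validate_deploy` and `HostPortConflictError` are hypothetical names, and the sketch assumes both values have already been resolved (it does not model omitted fields).

```python
from typing import Optional, Tuple

HostPort = Tuple[str, int]


class HostPortConflictError(RuntimeError):
    """Raised when a deploy requests a different host/port than the running app."""


def validate_deploy(running: Optional[HostPort], requested: HostPort) -> HostPort:
    """Return the host/port the application will use after the deploy.

    running is None when no Serve application is up yet; in that case
    the requested values are accepted. Otherwise the requested values
    must match the running ones exactly, and a mismatch raises before
    any change is made.
    """
    if running is None:
        return requested
    if running != requested:
        raise HostPortConflictError(
            f"Serve is already running on {running[0]}:{running[1]}; "
            f"shut it down before deploying on {requested[0]}:{requested[1]}"
        )
    return running
```

Raising before touching any state is what keeps the running application unchanged when the options differ; the user then shuts down and redeploys to pick up the new values.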
**Follow-Up Items**
* The following CLI commands should **not** start Serve automatically. They should check whether Serve is running and perform some sort of no-op if it's not. That would alleviate the concern that the user starts Serve by accident through a `GET` request and needs to deal with default `host`/`port` options. Corresponding docs should also be updated.
* `serve status`
* `serve config`
* `serve shutdown`
This PR:
- only prints train_loop info strings (e.g. `train_loop_utils.py:298 -- Moving model to device: cpu`) for rank 0 workers for torch
- renames `BaseWorkerMixin` to `RayTrainWorker` as the name comes up often in output and is more meaningful
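Rank-0-only logging follows a simple pattern, sketched here with the standard `logging` module. This is an illustration rather than the code in `train_loop_utils.py`: `log_on_rank_zero` and the logger name are hypothetical, and the rank is passed in explicitly instead of being read from the training session.

```python
import logging

# Hypothetical logger name used only for this sketch.
logger = logging.getLogger("train_loop_sketch")


def log_on_rank_zero(rank: int, message: str) -> bool:
    """Emit an info message only on the rank 0 worker.

    Returns True if the message was handed to the logger, False if it
    was suppressed because this worker is not rank 0.
    """
    if rank != 0:
        return False
    logger.info(message)
    return True
```

With N workers, a message such as "Moving model to device: cpu" then appears once in the output instead of N times.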
Signed-off-by: Kai Fricke <kai@anyscale.com>
Removes all ML-related code from `ray.util`
Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`
Moves `ray.util.ml_utils` to other locations
Closes #23900
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
This PR puts the Ray Clusters (under construction) docs section (see #26754) under Ray Clusters as a subpage.
- This keeps the master branch docs clean and presentable for users.
- Ray Clusters doc writers can use existing CI to iterate on the docs, without needing one massive PR once we're done.
Signed-off-by: Cade Daniel <cade@anyscale.com>
If you read a folder with differently-sized images, `ImageFolderDatasource` raises an error. This PR fixes the issue by resizing images to a user-specified size.
This PR:
- Creates a new chapter in the docs titled "Ray Clusters (Under Construction)".
- The new chapter makes the Ray Clusters docs follow the same structure as the other docs (https://diataxis.fr/).
- The new chapter will eventually replace the old chapter.

I want to merge this now so that @DmitriGekhtman can put his Kuberay docs into the new structure.
Signed-off-by: Cade Daniel <cade@anyscale.com>
## Why are these changes needed?
This PR ensures that Ray Workflow works properly with Ray Client.
Regular workflow tests will also run under client mode (as a pytest parameter). Some tests are moved and reorganized: the Ray Client tests require starting the cluster, so some tests require isolation or related changes.
Tests that take down the cluster are not run under Ray Client, since Ray Client would fail in this scenario.
Limitations of Ray Workflow under Ray Client are noted in the docs.
## Related issue number
Closes #21595
# Why are these changes needed?
The dashboard can display the message `<actor> cannot be created because the Ray cluster cannot satisfy its resource requirements` when the runtime env setup is stalled. This PR updates the message to include the possibility that the runtime env setup is failing.
This PR adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files which should have more information.
The runtime env setup cannot stall forever; it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env setup can still stall forever.
# Related issue number
Closes #26332
This PR just applies the changes from the following PRs:
- [Datasets] Automatically cast tensor columns when building Pandas blocks. #26684
  - reverted by Revert "[Datasets] Automatically cast tensor columns when building Pandas blocks." #26921
- [AIR - Datasets] Fix TensorDtype construction from string and fix example. #26904
This fixes the test failures introduced in the originally reverted PRs.