The Tune resources user guide contained broken code snippets. This PR fixes them, adds some clarifying comments, and improves the code style for readability.
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
CI is red because of a dependency issue around `dataclass_transform`.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Improves the HF notebook by making use of preprocessors and adding a section on tuning. Brings it in line with the Ray Summit 2022 demo.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Improve docstring for ResultGrid and show API reference and docstring in Tune API section.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.
The main problems here are:
1. Tracebacks are printed multiple times: once in the remote worker and again on the driver
2. Tracebacks include many internal wrappers
The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).
The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.
### Deduplicating traceback printing
The solution here is to use `logger.error` instead of `logger.exception` in `function_trainable.py`, so that the trainable itself no longer prints a traceback.
Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).
To be discussed: We could also default this to 0, but I think users expect to see at least some failure output in the console logs by default.
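The difference between the two logging calls, and the environment variable gate, can be sketched roughly as follows. This is a minimal standalone illustration, not the code in `function_trainable.py`: the helpers `make_logger`, `should_print_trial_errors`, and `log_failure` are hypothetical names, and the check assumes that any value other than `0` keeps error printing enabled.

```python
import io
import logging
import os


def make_logger(name):
    """Return a logger that writes into an in-memory buffer (demo only)."""
    buffer = io.StringIO()
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.ERROR)
    logger.addHandler(logging.StreamHandler(buffer))
    return logger, buffer


def should_print_trial_errors(environ=None):
    """Hypothetical helper: read TUNE_PRINT_ALL_TRIAL_ERRORS, defaulting to 1."""
    environ = os.environ if environ is None else environ
    # Assumption: only the literal value "0" turns console error output off.
    return environ.get("TUNE_PRINT_ALL_TRIAL_ERRORS", "1") != "0"


def log_failure(logger, with_traceback):
    """Log a failure either with or without the attached traceback."""
    try:
        raise ValueError("training failed")
    except ValueError as exc:
        if with_traceback:
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception("Trial errored: %s", exc)
        else:
            # logger.error logs only the message, so no traceback is printed.
            logger.error("Trial errored: %s", exc)
```

With `logger.error`, the message still reaches the log, but the traceback is left for the driver to print once.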
### Removing internal wrappers from tracebacks
The solution here is to introduce a magic local variable, `_ray_start_tb`. In two places, we use this magic local variable to shorten the stack trace. A utility, `shorten_tb`, looks for the last occurrence of `_ray_start_tb` in the stack trace and starts the traceback from there. This takes only linear time.
If the magic variable is not present, the full traceback is returned. This means that if the error does not originate in user code, the full traceback is still shown, giving visibility into possible internal bugs. Additionally, there is an environment variable, `RAY_AIR_FULL_TRACEBACKS`, which disables traceback shortening.
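The marker-based shortening can be sketched as below. This is a simplified illustration under stated assumptions, not the utility added in this PR: the real `shorten_tb` may have a different signature and may cut at a slightly different frame, and `user_function`, `internal_wrapper`, `outer_wrapper`, and `run_and_format` are hypothetical stand-ins for user code and internal wrappers. The `RAY_AIR_FULL_TRACEBACKS` switch is not modeled here.

```python
import traceback


def shorten_tb(tb, attr="_ray_start_tb"):
    """Return the traceback starting at the last frame whose locals define `attr`.

    Walks the linked list of traceback objects once (linear time). If no frame
    defines the marker, the original traceback is returned unchanged.
    """
    start = None
    current = tb
    while current is not None:
        if attr in current.tb_frame.f_locals:
            start = current  # keep overwriting so the last occurrence wins
        current = current.tb_next
    return tb if start is None else start


def user_function():
    raise RuntimeError("bug in user code")


def internal_wrapper():
    _ray_start_tb = True  # magic marker: the traceback should start here
    return user_function()


def outer_wrapper():
    return internal_wrapper()


def run_and_format(shorten):
    """Trigger the error and return the formatted traceback as a string."""
    try:
        outer_wrapper()
    except RuntimeError as exc:
        tb = exc.__traceback__
        if shorten:
            tb = shorten_tb(tb)
        return "".join(traceback.format_exception(type(exc), exc, tb))
```

In this sketch, `run_and_format(True)` drops the frames above `internal_wrapper`, while `run_and_format(False)` keeps the full chain of wrappers.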
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Fixes the "legacy operator" link to point to master, rather than the 2.0.0 branch. The migration README exists in master but not in the 2.0.0 branch.
Adds a sentence explaining that the Ray container has to go first in the container list.
Adds a sentence to the config guide mentioning min/max replicas and linking to autoscaling.
Documents a bug related to GPU auto-detection in KubeRay 0.3.0.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For a more balanced table of contents, makes CloudWatch instructions a subsection of AWS instructions.
Takes the CLI reference out of the core API subsection. This follows the same CLI reference pattern as other libraries (e.g., Serve has the Serve CLI under the Serve API section).
Move the code to doc_code
Fix the code example so that batching is faster than the serial run.
Related issue number
#27048
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
An attempt at making the docs shorter and sweeter, including various small cleanup items.
- Reorder the TOC on the sidebar for the user guides to be more linear based on a user's journey.
- Put the batching content under the performance guide.
- Remove the AIR guide (AIR users already have a serving guide).
- Combine the `ServeHandle` and model composition pages into a single guide. We may want to revisit this in the future, but for now it is better to have the content in a single place instead of duplicated (with links going to both).
- Fix the index page for the user guides to match the TOC sidebar.
- Rename a few pages for clarity & consistency.
- Remove some now-redundant content (old ML models user guide).
- Adds KubeRay information to the production guide.
- Consolidates the two user guides we had related to production deployment.
- Adds information about the experimental GCS HA feature.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Why are these changes needed?
This PR updates the workflow doc to reflect the recent changes.
It focuses on the position change, among other updates.
Different metrics are collected in Ray Serve when the deployments are called from HTTP vs Python. This needs to be mentioned in the documentation and each metric marked accordingly.
Enables better usage with GCP.
By default, the head node runs with the `ray-autoscaler-sa-v1` service account, but workers do not. Workers can run with this service account by copying and uncommenting L114->L117 from example-full.
Signed-off-by: Ian <ian.rodney@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, as setup overhead for vanilla training cannot be accurately expressed and was instead just disregarded.
By comparing the raw training times in the actual training loop, we will get a more accurate expression of any potential overhead or benefit in using Ray vs. vanilla tensorflow/torch.
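The two measurements can be separated roughly as in the sketch below. It is only an illustration of the timing approach, not the benchmark code: `timed_training_run`, `setup`, and `train_loop` are hypothetical names, and the actual benchmarks may record their timestamps differently.

```python
import time


def timed_training_run(setup, train_loop):
    """Time a run end to end and, separately, only the training loop.

    `setup` is a hypothetical callable covering everything before training
    (cluster start, data loading, model creation) and returns a state object.
    `train_loop` is a hypothetical callable that runs the actual training.
    """
    start_total = time.perf_counter()
    state = setup()

    # Start the second clock only once setup is done, so setup overhead
    # is excluded from the training-loop measurement.
    start_loop = time.perf_counter()
    train_loop(state)
    end = time.perf_counter()

    return {
        "end_to_end_s": end - start_total,
        "training_loop_s": end - start_loop,
    }
```

Comparing `training_loop_s` across Ray and vanilla runs leaves out the setup overhead that cannot be measured consistently on the vanilla side.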
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR restores notes for migration from the legacy Ray operator to the new KubeRay operator.
To avoid disrupting the flow of the Ray documentation, these notes are placed in a README accompanying the old operator's code.
These notes are linked from the new docs.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>