These changes are part of a series intended to improve integration with notebooks. This PR modifies the tune progress status shown to the user if tuning is run from a notebook.
Previously, only part of the trial progress was reported in an HTML table; now, all progress is displayed in an organized HTML template.
Signed-off-by: pdmurray <peynmurray@gmail.com>
PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this.
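For reference, the pattern PyTorch recommends looks like this (plain PyTorch, independent of Ray):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Recommended: persist only the state dict...
torch.save(model.state_dict(), "model_state.pt")

# ...and restore it into a freshly constructed module.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("model_state.pt"))

# Saving the module object itself pickles the whole class, which is
# brittle across code refactors and library versions.
torch.save(model, "model.pt")
```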
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Improves the HF notebook by making use of preprocessors and adding a section on tuning. Brings it in line with the Ray Summit 2022 demo.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
This starts breaking the Mac Java build with new errors; I think it is the same issue that made us revert this PR before.
…ment from Java. …" (#27945)"
This reverts commit af488e1.
If this passes, it should be preferred over #28098.
Adjust moto setup to use new API.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change.
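One way such a race can be tolerated is sketched below; this is an illustration, not necessarily the exact fix in this PR, and `move_rank_files` is a hypothetical helper name:

```python
import os
import shutil

def move_rank_files(rank_dir: str, checkpoint_dir: str) -> None:
    """Move all files from a worker's rank folder into the shared
    checkpoint folder, tolerating files that another worker on the
    same node already moved."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for name in os.listdir(rank_dir):
        src = os.path.join(rank_dir, name)
        dst = os.path.join(checkpoint_dir, name)
        try:
            shutil.move(src, dst)
        except (FileExistsError, shutil.Error):
            # Another worker on this node won the race for this file.
            pass
```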
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
The minimum size is 300 GB.
Signed-off-by: Alex Wu <alex@anyscale.io>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
This PR adds a Serve HA test. The flow of the test is:
1. Check the KubeRay build.
2. Start the Ray service.
3. Warm up the cluster.
4. Start killing nodes.
5. Get the stats and make sure they look good.
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.
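A sketch of what this enables (the import path and `checkpoint` are assumptions for illustration; `checkpoint` would come from a prior training run):

```python
from ray import serve
from ray.serve import PredictorDeployment  # import path assumed for Ray 2.x
from ray.train.torch import TorchPredictor

# `checkpoint` is assumed to be e.g. `result.checkpoint` from a TorchTrainer.
serve.run(
    PredictorDeployment.options(
        name="my_predictor",
        # With this change, setting num_gpus here should be enough to make
        # the wrapped Predictor run on GPU; no separate flag is needed.
        ray_actor_options={"num_gpus": 1},
    ).bind(TorchPredictor, checkpoint)
)
```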
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
This is the first split PR of #25075, which tried to enable the GCS scheduler by default.
This split PR mainly includes:
In GcsPlacementGroupScheduler::CommitBundleResources() and ReturnBundleResources(), we have to trigger pending actors (in GCS) because resources have been updated.
Still in the above two functions, we have to update PG wildcard resources in a special way. A PG's wildcard resources (on a certain node) have to be the sum of all related bundle resources; see the sketch after this list. Even though CommitBundleResources() uses ToNodeBundleResourcesMap() to sum up bundle resources, it does not handle the scenario where a single bundle (or a subset of bundles) is rescheduled, in which case that single bundle's wildcard resources would wrongly override the existing ones (see test_placement_group_reschedule_when_node_dead for such a scenario).
Fix the remaining issues from "[Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler" (#24842 (comment)).
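To make the wildcard invariant concrete, here is a small Python sketch; the real logic lives in the C++ GCS scheduler, and the resource-name convention is simplified to CPU only:

```python
def update_wildcard_resources(node_resources: dict, pg_id: str) -> None:
    """Recompute a node's wildcard resources for one placement group as
    the SUM of all committed bundle resources of that group on the node.

    Bundle keys look like "CPU_group_<bundle_index>_<pg_id>", while the
    wildcard key is "CPU_group_<pg_id>". Overriding the wildcard with a
    single rescheduled bundle's resources, instead of summing, is the
    bug described above."""
    wildcard_key = f"CPU_group_{pg_id}"
    node_resources[wildcard_key] = sum(
        amount
        for name, amount in node_resources.items()
        if name.startswith("CPU_group_")
        and name.endswith(pg_id)
        and name != wildcard_key
    )

# Example: bundle 0 (1 CPU) is already committed on the node; bundle 1
# (2 CPUs) is rescheduled alone. The wildcard must become 3, not 2.
resources = {"CPU_group_0_abc": 1.0, "CPU_group_1_abc": 2.0}
update_wildcard_resources(resources, "abc")
assert resources["CPU_group_abc"] == 3.0
```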
Improve docstring for ResultGrid and show API reference and docstring in Tune API section.
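A minimal usage example of the documented class, using the Ray 2.x Tuner API:

```python
from ray import tune

def trainable(config):
    # Report a single metric so the ResultGrid has something to rank.
    tune.report(loss=config["x"] ** 2)

tuner = tune.Tuner(trainable, param_space={"x": tune.grid_search([-2, 0, 2])})
result_grid = tuner.fit()  # returns a ResultGrid

best = result_grid.get_best_result(metric="loss", mode="min")
print(best.config, best.metrics["loss"])
```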
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
The API cleanup in #27060 introduced a regression when merging latest master: changes from #26967 were effectively disabled, retaining cluttered output in RLlib with verbose=2.
Signed-off-by: Kai Fricke <kai@anyscale.com>
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.
The main problems here are:
1. Tracebacks are printed multiple times: in the remote worker and on the driver
2. Tracebacks include many internal wrappers
The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).
The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.
### Deduplicating traceback printing
The solution here is to use `logger.error` instead of `logger.exception` in `function_trainable.py` to avoid printing a traceback in the trainable.
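The difference in a nutshell, using the standard library `logging` module:

```python
import logging

logger = logging.getLogger(__name__)

try:
    raise RuntimeError("trial failed")
except RuntimeError as exc:
    # logger.exception() would attach the full traceback to the record,
    # so the worker would print it in addition to the driver.
    # logger.error() logs the message only, leaving the driver as the
    # single place that prints the traceback.
    logger.error(f"Trial raised an exception: {exc}")
```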
Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).
To be discussed: we could also default this to 0, but I think the expectation is to see at least some failure output in the console logs by default.
### Removing internal wrappers from tracebacks
The solution here is to introduce a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stack trace. A utility `shorten_tb` looks for the last occurrence of `_ray_start_tb` in the stack trace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned; this means that if the error does not come up in user code, the full traceback is returned, giving visibility into possible internal bugs. Additionally, there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening.
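A minimal sketch of how such a utility could look; the names `shorten_tb` and `_ray_start_tb` come from the description above, but the exact implementation in the PR may differ:

```python
import os
import types
from typing import Optional

def shorten_tb(
    tb: Optional[types.TracebackType],
) -> Optional[types.TracebackType]:
    # Shortening can be disabled entirely via the env variable.
    if os.environ.get("RAY_AIR_FULL_TRACEBACKS", "0") == "1":
        return tb
    # Single linear pass over the linked list of traceback frames:
    # remember the last frame that defines the magic local variable.
    shortened = None
    current = tb
    while current is not None:
        if "_ray_start_tb" in current.tb_frame.f_locals:
            shortened = current
        current = current.tb_next
    # If the magic variable never appears (e.g. an internal error),
    # return the full traceback for visibility into possible internal bugs.
    return shortened if shortened is not None else tb
```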
Signed-off-by: Kai Fricke <kai@anyscale.com>
According to https://peps.python.org/pep-0338/
> The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory containing the main module.
We should follow this and not add the driver script's directory to the workers' sys.path. I couldn't find a way to detect that the driver is run via `python -m`, so instead we don't add the script directory to a worker's sys.path if it doesn't exist in the driver's sys.path.
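A sketch of the check; the helper name is hypothetical:

```python
import os
import sys

def should_add_script_dir_to_workers(driver_script_path: str) -> bool:
    """Only propagate the driver script's directory to the workers'
    sys.path if the driver itself has it on sys.path. When the driver
    is launched via `python -m pkg.module`, Python inserts the current
    working directory instead of the script directory, so this check
    returns False in that case."""
    script_dir = os.path.dirname(os.path.abspath(driver_script_path))
    return script_dir in sys.path
```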
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Fixes the "legacy operator" link to point to master, rather than the 2.0.0 branch. The migration README exists in master but not in the 2.0.0 branch.
Adds a sentence explaining that the Ray container has to go first in the container list.
Adds a sentence to the config guide mentioning min/max replicas and linking to autoscaling.
Documents a bug related to GPU auto-detection in KubeRay 0.3.0.
For people who want better control over node failures and want to handle errors such as RayActorError themselves, it's necessary to make things like the actor_id an attribute of the error.
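For example (the `actor_id` attribute is the kind of addition proposed here; treat the attribute name as illustrative):

```python
import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:
    def ping(self):
        return "pong"

worker = Worker.remote()
ray.kill(worker, no_restart=True)

try:
    ray.get(worker.ping.remote())
except RayActorError as err:
    # With this change, metadata such as the failed actor's ID would be
    # available directly on the error object.
    print(err.actor_id)
```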
Signed-off-by: Jiajie Li <ljjsalt@gmail.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For a more balanced table of contents, makes CloudWatch instructions a subsection of AWS instructions.
This is needed since we are stress-testing the State APIs in a release test, and we will need a larger max limit than the system default max limit; otherwise, the APIs would return an error.
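For context, the Python state API takes an explicit `limit` per call (module path as of Ray 2.0; the raised server-side cap this change configures is separate from this argument):

```python
from ray.experimental.state.api import list_tasks

# Ask for more entries than the default per-call cap; without a matching
# server-side max limit, requests like this would error out.
tasks = list_tasks(limit=10_000)
```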