This PR renames OVERRIDE_NODE_ID_FOR_TESTING to RAYLET_NODE_ID and turns it into a supported feature: a raylet can now be started with a given node ID by setting the environment variable RAY_RAYLET_NODE_ID.
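As a rough illustration of how a caller might prepare that environment variable, here is a minimal sketch. The helper `make_raylet_env` is hypothetical and not part of Ray, and the ID format (28 random bytes written as 56 hex characters) is an assumption that should be checked against the Ray version in use.

```python
import os
import secrets

# Assumption: a node ID is a hex string of 28 bytes (56 hex characters).
NODE_ID_NUM_BYTES = 28


def make_raylet_env(node_id_hex=None, base_env=None):
    """Return a copy of the environment with RAY_RAYLET_NODE_ID set.

    Hypothetical helper for illustration; it does not start a raylet.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Use the caller's ID if given, otherwise generate a random one.
    env["RAY_RAYLET_NODE_ID"] = node_id_hex or secrets.token_hex(NODE_ID_NUM_BYTES)
    return env
```

The returned mapping could then be passed as the `env` argument of whatever process launcher starts the raylet.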
Setting the async flag for a Python actor from Java is important for us.
So we added the API "PyActorCreator setAsync(boolean enabled)" to PyActorCreator.
To prevent misuse, we check the flag before the ActorCreationTask is executed.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
This PR updates the quickstart configuration in the Ray docs to reflect the fixes from
ray-project/kuberay#529
To provide access to the fixed version, we update the link to point to KubeRay master rather than the 0.3.0 branch.
After the next KubeRay release (0.4.0), we can update these links to point to a fixed release version again.
Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.
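One way such a timeout could look is sketched below. This is not the implementation in this PR: `run_with_timeout` and `SyncTimeoutError` are hypothetical names, and the approach (running the call in a daemon thread and giving up after a deadline) is an assumption. A hung call is not cancelled; it is only abandoned so the caller can continue.

```python
import threading


class SyncTimeoutError(TimeoutError):
    """Raised when a sync call does not finish within the allowed time."""


def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread, waiting at most timeout_s seconds."""
    outcome = {}

    def target():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # forward any failure to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The thread cannot be killed; it keeps running in the background.
        raise SyncTimeoutError(f"sync did not finish within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

A caller would wrap the potentially hanging sync function, for example `run_with_timeout(sync_fn, 1800, source, target)`, where `sync_fn`, `source`, `target`, and the 1800-second limit are placeholders.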
Signed-off-by: Kai Fricke <kai@anyscale.com>
Upgrade to a more recent Ubuntu version, based on user feedback and to match the version we use for testing in CI.
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Adds back more Ray Train APIs to Ray Train docs.
Also makes updates to the user guide for better references.
This test is very unstable, and it is hard to stabilize because it depends heavily on timing.
Fixing it on one platform makes it fail on another. Given that
- we did not have this test for a long time and things were OK
- we are replacing the heartbeat with pull mode in the near future
this test does not seem important for now, so just disable it on the broken platform.
Speeds up HuggingFaceTrainer/Predictor tests in CI by around 20% by switching to a different GPT model. This is the same model the Hugging Face team uses for their own CI.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
The tune resources user guide contained broken code snippets. This PR fixes those, adds some extra clarifying comments, and improves the code style for readability.
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Release tests are failing in the Buildkite run but succeed reliably on manual retry.
The suspected cause is that not all nodes are available when running with a large number of actors.
CI is red because of a dependency issue around dataclass_transform.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
We already had these in the docs; it was an oversight not to add them to the autoscaler itself too.
Signed-off-by: Alex Wu <alex@anyscale.io>
These changes are part of a series intended to improve integration with notebooks. This PR modifies the tune progress status shown to the user if tuning is run from a notebook.
Previously, part of the trial progress was reported in an HTML table; now, all progress is displayed in an organized HTML template.
Signed-off-by: pdmurray <peynmurray@gmail.com>
PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this.
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Improves the HF notebook by making use of preprocessors and adding a section on tuning. Brings it in line with the Ray Summit 2022 demo.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
This starts breaking the Mac Java build with new errors; I think it is the same issue that led us to revert this PR before.
…ment from Java. …" (#27945)"
This reverts commit af488e1.
If this passes, it should be preferred over #28098.
Adjust moto setup to use new API.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change.
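To make the failure mode concrete, here is a minimal sketch of a merge step that tolerates other workers moving the same files concurrently. It is not the fix applied in this PR: `merge_rank_dir` is a hypothetical helper, and it assumes the rank folders contain plain files on the same filesystem as the target folder.

```python
import os


def merge_rank_dir(rank_dir, target_dir):
    """Move every file from rank_dir into target_dir, tolerating concurrent movers."""
    # exist_ok=True avoids an error if another worker created the folder first.
    os.makedirs(target_dir, exist_ok=True)
    try:
        names = os.listdir(rank_dir)
    except FileNotFoundError:
        return  # the rank folder is already gone
    for name in names:
        src = os.path.join(rank_dir, name)
        dst = os.path.join(target_dir, name)
        try:
            # os.replace overwrites dst and is atomic on the same filesystem.
            os.replace(src, dst)
        except FileNotFoundError:
            # Another worker moved this file between listdir() and replace().
            continue
```

The key design choice in this sketch is to treat "source already gone" as success rather than as an error, so that two workers merging the same folder do not raise.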
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
The minimum size is 300GB
Signed-off-by: Alex Wu <alex@anyscale.io>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
This PR adds a Serve HA test. The flow of the test is:
1. check the KubeRay build
2. start ray service
3. warm up the cluster
4. start killing nodes
5. get the stats and make sure they are good
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.
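The decision described here can be sketched roughly as follows. `should_use_gpu` is a hypothetical helper, not code from this PR, and it assumes the deployment's actor options are available as a dict with an optional `num_gpus` entry.

```python
def should_use_gpu(ray_actor_options=None):
    """Return True when the given actor options request at least a fraction of a GPU."""
    options = ray_actor_options or {}
    # Treat a missing or None num_gpus as zero.
    return (options.get("num_gpus") or 0) > 0
```

With this sketch, `should_use_gpu({"num_gpus": 1})` is true, while `should_use_gpu({})` and `should_use_gpu(None)` are false.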
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>