The `serve status` command allows users to get their deployments' status info through the CLI. This change adds a tip to the health-checking documentation to inform users about `serve status`.
Added ensemble model examples to the Documentation. That was needed, due to a user request and there was no methodology outlining the creation of higher level ensemble models.
Co-authored-by: Jiao Dong <sophchess@gmail.com>
Getting or creating a named actor is a common pattern, however it is somewhat esoteric in how to achieve this. Add a utility function and test that it doesn't cause any scary error messages.
Actor.options(name="my_singleton", get_if_exists=True).remote(args)
Algolia search now does not overflow on mobile devices anymore, making the nav scrollable again.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
This adds a multinode checkpoint/restore test for Ray Tune. This covers some of the functionality of the release tests, but in a more controlled environment. In a follow-up PR, we should test (mocked) cloud checkpointing, too.
- Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@SongGuyang @Catch-Bull @edoakes I know we discussed this earlier, but after thinking about it some more I think a more reasonable default is for `pip check` to be `False` by default. My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise. So a lot of users will hit an error when trying a simple `runtime_env` `pip` example, and possibly give up. Another less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR.
This also matches the default behavior of pip which allows this situation to occur in the first place: `pip install` doesn't error when there's a dependency conflict; rather the command succeeds, the package is installed and usable, and it prints a warning (which is confusingly titled "ERROR")
We don't support Windows entirely now.
## Checks
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.
RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.
Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.
Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
This PR makes a number of major overhauls to the Ray core docs:
Add a key-concepts section for {Tasks, Actors, Objects, Placement Groups, Env Deps}.
Re-org the user guide to align with key concepts.
Rewrite the walkthrough to link to mini-walkthroughs in the key concept sections.
Minor tweaks and additional transition material.
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster. One gap in this is if the local Python module is in the form of a wheel (`.whl` file.) This PR adds the missing support for uploading and installing the `.whl` file.
The current multi node tests use a hardcoded mapping for local development mounts. With this PR, a new environment variable is introduced to be able to control this dynamically. Additionally, some minor improvements to the test utilities and monitor are added.
This PR adds a feature that allows user to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).
These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
Interface for DataParallelTrainer and updates to ScalingConfig definition.
Depends on #22986
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
the include of content for md files like our central getting started page didn't render. fixed here.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
Using Ray on SLURM system is documented but missing some pitfalls about network. This PR adds some information about port binding and address binding (I will open a feature request with more and link it here later).
I did not put any real recommendation on this last point since `--address` did not work. I had cannot resolve issue after setting an internal IP although it's reachable.