- Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improves the Job Submission main page
- Adds a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@SongGuyang @Catch-Bull @edoakes I know we discussed this earlier, but after thinking about it some more I think a more reasonable default for `pip check` is `False`. My guess is that a lot of users (including myself) work inside an environment where `python -m pip check` fails, but the environment doesn't cause them any problems otherwise. So a lot of users would hit an error when trying a simple `runtime_env` `pip` example and possibly give up. Another, less important piece of evidence is that we had to set `pip_check = False` to make some CI tests pass in the original PR.
This also matches the default behavior of pip, which allows this situation to occur in the first place: `pip install` doesn't error when there's a dependency conflict; rather, the command succeeds, the package is installed and usable, and it prints a warning (which is confusingly titled "ERROR").
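For context, a minimal sketch of the dict form of the `pip` runtime_env, assuming the `packages`/`pip_check` field names of that syntax; with the proposed default, users would only get strict checking by opting in:

```python
import ray

# Hedged sketch: the dict form of the `pip` runtime_env option.
# With the proposed default of pip_check=False, a plain package list no
# longer fails on dependency conflicts; users opt back in explicitly.
ray.init(
    runtime_env={
        "pip": {
            "packages": ["requests"],
            "pip_check": True,  # opt back into strict `python -m pip check`
        }
    }
)
```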
`serve shutdown` is not idempotent with the new Serve CLI. When serve shuts down, it kills the controller. The REST API does not refresh its cached controller handle, so it attempts to make requests to a dead actor, which fail.
This change updates the REST API and `serve.start()` to refresh the controller handle if the controller has been killed.
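Conceptually, the fix is to stop trusting a cached handle and look the controller actor up again when it has died. A minimal sketch of that pattern, with an illustrative actor name and liveness probe rather than the actual Serve internals:

```python
import ray

CONTROLLER_NAME = "SERVE_CONTROLLER"  # illustrative name, not the real constant


def get_live_controller(cached_handle):
    """Return a usable controller handle, refreshing it if the cached one is dead."""
    try:
        # Cheap liveness probe against the cached handle (illustrative method).
        ray.get(cached_handle.check_alive.remote(), timeout=5)
        return cached_handle
    except Exception:
        # Controller was killed (e.g. by `serve shutdown`); fetch the new one
        # by name instead of reusing the stale handle.
        return ray.get_actor(CONTROLLER_NAME)
```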
These changes expose `Application` as a public API. They also introduce a new public method, `serve.run()`, which allows users to deploy their `Applications` or `DeploymentNodes`. Additionally, the Serve CLI's `run` command and Serve's REST API are updated to use `Applications` and `serve.run()`.
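A minimal usage sketch of the new API (the deployment body is illustrative):

```python
from ray import serve


@serve.deployment
class Hello:
    def __call__(self, request) -> str:
        return "Hello, world!"


# `.bind()` produces a DeploymentNode; `serve.run()` deploys it and returns
# once the application is running.
serve.run(Hello.bind())
```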
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR adds support for checkpoint ser/de. In particular, it special-cases the local data representation, which is converted into a bytes checkpoint on serialization. This way, checkpoint objects sent to remote tasks are guaranteed to always point to a valid data location within the remote task.
For now, we do not detect pickling to/from disk (e.g., to pickle files).
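A rough sketch of the idea (class and method names are illustrative, not the actual Checkpoint implementation): when a checkpoint backed by a local directory is pickled, its contents are packed into bytes so the deserialized object no longer depends on the sender's filesystem.

```python
import io
import os
import tarfile


class DirBackedCheckpoint:
    """Illustrative checkpoint backed by either a local directory or raw bytes."""

    def __init__(self, local_path=None, data_bytes=None):
        self.local_path = local_path
        self.data_bytes = data_bytes

    def __getstate__(self):
        state = self.__dict__.copy()
        if self.local_path is not None:
            # Special-case the local representation: pack the directory into
            # bytes so the remote side never sees a dangling local path.
            buf = io.BytesIO()
            with tarfile.open(fileobj=buf, mode="w:gz") as tar:
                tar.add(self.local_path, arcname=os.path.basename(self.local_path))
            state["data_bytes"] = buf.getvalue()
            state["local_path"] = None
        return state
```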
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will be dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
Windows is not fully supported yet.
## Checks
- [X] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
On Linux, a thread name cannot be longer than 15 characters.
When using a command like `top`, it is easy to confuse similar thread names such as `resource_report_poller` and `resource_report_broadcaster`, because both show up as `resource_report`.
This PR abbreviates the thread names so they stay distinguishable within the limit.
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.
RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.
Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.
Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
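A small usage sketch (the dataset contents and key column are illustrative):

```python
import ray

# Build a dataset with a sortable key column.
ds = ray.data.from_items([{"key": i, "value": str(i)} for i in range(1000)])

# Sort and partition the data across 4 worker actors, keyed by "key".
rad = ds.to_random_access_dataset(key="key", num_workers=4)

# Point lookups: get_async() returns a future resolved with ray.get(),
# while multiget() fetches a batch of keys in a single call.
record = ray.get(rad.get_async(42))
records = rad.multiget([1, 2, 3])
```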
Redis password should not be needed in the connection info printed by `ray start --head`.
A follow-up cleanup could remove the flags and arguments related to the Redis password, but that is a bit riskier (it affects external Redis) and needs more care.
Implements `TensorflowTrainer`. Depends on https://github.com/ray-project/ray/pull/23211 (review only files with `tensorflow` in the name).
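A minimal usage sketch (import paths follow more recent Ray releases; this PR predates them and used `ray.ml`, so adjust as needed, and the training function body is illustrative):

```python
import numpy as np

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_loop_per_worker():
    # Illustrative per-worker training logic; a real workload would also use
    # a distribution strategy and report metrics/checkpoints.
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer="sgd", loss="mse")
    model.fit(np.array([[1.0], [2.0]]), np.array([[2.0], [4.0]]), epochs=1, verbose=0)


trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
```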
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
This PR makes a number of major overhauls to the Ray core docs:
- Add a key-concepts section for {Tasks, Actors, Objects, Placement Groups, Env Deps}.
- Re-org the user guide to align with the key concepts.
- Rewrite the walkthrough to link to mini-walkthroughs in the key-concept sections.
- Minor tweaks and additional transition material.
Skips 404s on node termination for GCP node provider.
Also resets the internal `self.nodes_to_terminate` state at the start of an autoscaler iteration; that's necessary for correct cleanup in the event of a failed node termination.
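The 404 handling boils down to catching the "not found" error from the GCP API client during deletion and treating the node as already gone; a rough sketch of the pattern (the helper and its arguments are illustrative, not the actual node provider code):

```python
from googleapiclient.errors import HttpError  # google-api-python-client


def terminate_node_ignoring_404(compute, project, zone, instance_name):
    """Delete a GCP instance, treating a 404 as 'already terminated'."""
    try:
        compute.instances().delete(
            project=project, zone=zone, instance=instance_name
        ).execute()
    except HttpError as e:
        if e.resp.status == 404:
            # The node no longer exists (e.g. terminated out of band); skip it.
            return
        raise
```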
When using the DataParallelTrainer, the BackendExecutor already runs inside a Trainable actor, so we don't need to create a new actor.
However, when using Ray Train directly, we still want to run BackendExecutor in an actor for performance with Ray Client.
This PR does some refactoring to support both cases, as sketched below.
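Schematically, the refactor lets the caller decide whether the executor runs inline (already inside a Trainable actor) or in its own actor (direct Ray Train usage, where the actor keeps Ray Client round trips off the critical path); a simplified sketch with illustrative names:

```python
import ray


def create_backend_executor(backend_executor_cls, backend_config, inside_trainable: bool):
    """Illustrative helper; `backend_executor_cls` stands in for Ray Train's
    internal BackendExecutor, whose import path varies across versions."""
    if inside_trainable:
        # DataParallelTrainer case: we already run inside a Trainable actor,
        # so instantiate the executor directly and avoid an extra actor hop.
        return backend_executor_cls(backend_config)
    # Direct Ray Train usage: run the executor in its own actor so that
    # Ray Client users don't pay a round trip for every executor call.
    return ray.remote(backend_executor_cls).remote(backend_config)
```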