1690 commits

Author SHA1 Message Date
Sven Mika
8e00537b65
[RLlib] SlateQ: framework=tf fixes and SlateQ documentation update (#22543) 2022-02-23 13:03:45 +01:00
mwtian
9a157dfe82
[GCS-Ray] update doc and error message for GCS-Ray (#22528)
Update documentation to reflect that Ray no longer starts Redis by default.
2022-02-22 17:56:30 -08:00
Dmitri Gekhtman
a402e956a4
[KubeRay] Format autoscaling config based on RayCluster CR (#22348)
Closes #21655. At the start of each autoscaler iteration, we read the Ray Cluster CR from K8s and use it to extract the autoscaling config.
2022-02-22 11:06:37 -08:00
Antoni Baum
4a15c6f8f3
[tune] Preparation for deadline schedulers (#22006) 2022-02-22 11:05:28 -08:00
Guyang Song
5783cdb254
[runtime env] runtime env inheritance refactor (#22244)
Runtime Environments has been GA since Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). We currently support an [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior as follows (copied from the doc):
- The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment.
- Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent.

We think this runtime env merging logic is complex and confusing to users, because they can't know the final runtime env before the jobs run.

This PR refactors Runtime Environments inheritance and changes its behavior. Here is the new behavior:
- **If there is no runtime env option when we create an actor, inherit the parent runtime env.**
- **Otherwise, use the specified runtime env directly, without merging.**

Add a new API named `ray.runtime_env.get_current_runtime_env()` that returns the parent runtime env as a dict, which you can then modify yourself. For example:
```Actor.options(runtime_env={**ray.runtime_env.get_current_runtime_env(), "X": "Y"})```
This new API can also be used in Ray Client.
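
The difference between the previous and the new inheritance rules can be sketched in plain Python. This is only an illustration of the rules as described in this message, not Ray's implementation; `resolve_old` and `resolve_new` are hypothetical helper names, and runtime envs are modeled as plain dicts.

```python
from typing import Optional


def resolve_old(parent: dict, child: Optional[dict]) -> dict:
    """Previous rule: merge env_vars, let the child override other fields."""
    child = child or {}
    merged = {**parent, **child}
    env_vars = {**parent.get("env_vars", {}), **child.get("env_vars", {})}
    if env_vars:
        merged["env_vars"] = env_vars
    return merged


def resolve_new(parent: dict, child: Optional[dict]) -> dict:
    """New rule: inherit the parent only when no runtime env is given."""
    return dict(parent) if child is None else dict(child)


parent = {"env_vars": {"A": "1"}, "py_modules": ["parent_mod"]}
child = {"env_vars": {"B": "2"}}

old = resolve_old(parent, child)      # env_vars merged, py_modules kept
new = resolve_new(parent, child)      # exactly the child's runtime env
inherited = resolve_new(parent, None) # no option given: parent's runtime env
```

Under the new rule the final runtime env is always either the parent's or the one passed explicitly, so it can be known before the job runs.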
2022-02-21 18:13:22 +08:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Archit Kulkarni
8c12e30f11
[Doc] Add actor max restarts default value to fault tolerance doc (#22481) 2022-02-18 17:48:22 -06:00
Max Pumperla
9482f03134
[docs] RLlib concepts consolidation, user guide, RL conf prep (#22496) 2022-02-18 09:35:20 -08:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command.  As such a Job can have zero or multiple Ray drivers.  This means we should add a new snapshot entry corresponding to new jobs.  We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Ian Rodney
c9a4b17f99
[YAMLs] Fix comments about autoscaler round-robining (#22002) 2022-02-17 13:59:05 -08:00
Sven Mika
e03606f0b3
[RLlib] Bandit documentation enhancements. (#22427) 2022-02-17 13:25:50 +01:00
Qing Wang
7c45d1a366
[doc][Java] Add doc page for java concurrency group. (#21600)
Add document page for Java concurrency group.

Co-authored-by: Kai Yang <kfstorm@outlook.com>
2022-02-16 17:57:03 +08:00
Simon Mo
495221e7d2
[Doc] Update Serve logo for tune user guide (#22369)
We have deprecated the old logo.
2022-02-15 12:10:08 -06:00
Hao Chen
78597d3089
[train] Minor fixes on Ray Train user guide doc (#22379)
Fixes some typos and format issues.
2022-02-15 10:09:27 -08:00
Jun Gong
b729a9390f
[RLlib] Add example commands for using setup-dev.py with RLlib for improved dev setup stability and developer experience. (#22380) 2022-02-15 12:00:36 +01:00
Jun Gong
6f5afcbce9
[RLlib] Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239) 2022-02-15 09:09:24 +01:00
matthewdeng
8f9e0d7f6b
[train] add TorchTensorboardProfilerCallback (#22345)
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.

```
  | File "ray_sgd_training.py", line 18, in <module>
  | from ray import train
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
  | from ray.train.callbacks import TrainingCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
  | from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
  | from torch.profiler import profile
  | ModuleNotFoundError: No module named 'torch.profiler'
```

A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:

```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
    "Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
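
The deferred import check in item 2 follows a common pattern for optional dependencies, sketched below. This is a simplified illustration, not the code in `ray.train.torch`: `require_module` and `LazyProfiler` are hypothetical names, and the sketch only checks that a top-level module is present, not its version.

```python
import importlib.util


def require_module(name: str, hint: str) -> None:
    """Raise ImportError with an install hint if a top-level module is missing."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(hint)


class LazyProfiler:
    """Hypothetical stand-in for a class with an optional dependency."""

    def __init__(self, module_name: str = "torch"):
        # The check runs at construction time, so importing the module that
        # defines this class never fails just because the dependency is absent.
        require_module(
            module_name,
            f"LazyProfiler requires {module_name!r}. "
            f"Run `pip install {module_name}` to use it.",
        )
        self.module_name = module_name
```

Users who never construct the class never hit the check, which is what keeps the base package importable without the optional dependency.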
2022-02-14 16:16:55 -08:00
Archit Kulkarni
0e350c0074
[runtime env] [Doc] Add two ways of installing dependencies: cluster launcher, and runtime env (#20780)
We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments. 

This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two.

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-02-14 16:03:48 -06:00
Clark Zinzow
53c4c7b1be
[Datasets] Expose TableRow as public API; minimize copies/type conversions on row-based ops. (#22305)
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
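
As a rough illustration of item 2, a class deriving from `collections.abc.Mapping` only needs the three abstract methods and gets the rest of the read-only mapping interface from mixins. The `Row` class below is a hypothetical stand-in, not the actual `TableRow` implementation.

```python
from collections.abc import Mapping


class Row(Mapping):
    """Hypothetical read-only row backed by a dict."""

    def __init__(self, columns, values):
        self._data = dict(zip(columns, values))

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


row = Row(["id", "name"], [1, "a"])

# keys(), items(), values(), get(), `in`, and == are mixin methods
# provided by Mapping on top of the three methods defined above.
as_dict = dict(row)
has_id = "id" in row
missing = row.get("missing")
```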
2022-02-14 12:56:17 -08:00
Alex Wu
276ff2b7ed
[docs][autoscaler] Add maintainers for node providers (#22237)
This PR adds documentation for the maintainers of the various node providers.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-12 11:31:32 -08:00
Max Pumperla
d594b668bb
[docs] [tune] hyperopt notebook (#22315) 2022-02-12 02:46:03 -08:00
Edward Oakes
49b3e6c53c
[serve] Support user-provided health check via def check_health(self) method (#22178) 2022-02-11 12:53:37 -06:00
Archit Kulkarni
a65f35b867
[Doc] [Jobs] Add ray dashboard docs to jobs doc (#22222)
To use Jobs on a remote cluster, you need to set up port forwarding.  When using the cluster launcher, the `ray dashboard` command provides this automatically.  This PR adds a how-to to the docs for this feature.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-02-11 11:01:37 -06:00
Balaji Veeramani
abad268549
Comment fmt: off annotations (#21984)
Code formatting is disabled in several modules with the explanation
> [The module] ignores yapf because yapf doesn't allow comments right after code blocks,
but we put comments right after code blocks to prevent large white spaces
in the documentation.

Since we no longer use YAPF, it may be possible to re-enable code formatting on 
these modules. I've added "FIXME" comments requesting developers to check
whether code formatter appeasements are still necessary.
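
For reference, an annotated exemption might look like the sketch below. The `# fmt: off` / `# fmt: on` markers are the ones Black honors; the table contents and the exact FIXME wording are made up for illustration.

```python
# fmt: off
# FIXME: This exemption was added to appease an older formatter.
# Check whether it is still necessary.
COLUMN_WIDTHS = {
    "name":    20,
    "status":  10,
    "message": 60,
}
# fmt: on
```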
2022-02-09 22:12:11 -08:00
Sven Mika
c73e0597fa
[RLlib] Discussion 2022: Fix batch_mode="complete_episodes" documentation inaccuracy. (#22074) 2022-02-10 02:57:27 +01:00
SangBin Cho
e5cab878b8
[Core] Disable runtime env logs (#22198)
Disable runtime env logs streamed to the driver by default and improve the documentation.
2022-02-09 14:43:25 -08:00
Archit Kulkarni
54b2e143e4
[Doc] [Jobs] Add size limit and recommendations for working_dir (#22219)
Previously it wasn't obvious which working_dir option was recommended, and the size limit for local working_dir didn't appear on the Jobs page. (The user would have had to go to the runtime_env API reference to see the size limit.) This PR makes this information more prominent.
2022-02-09 13:56:02 -06:00
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Jiao
293e45c527
[Doc] [Serve] Fix README's quick_start and add to test suite (#22228) 2022-02-09 11:49:47 -08:00
Alex Wu
c9a419ac76
[Autoscaler] Remove staroid node provider (#22236)
The Staroid node provider has been abandoned and unmaintained for quite some time now. Because there are no active maintainers, the original contributors cannot be reached, and there is no clear interest, we are no longer officially endorsing or supporting the node provider.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-09 09:18:18 -08:00
xwjiang2010
323511b716
[tune] Single wait refactor. (#21852)
This is a down-scoped change. For the full overview of the Tune control loop, see [`Tune control loop refactoring`](https://docs.google.com/document/d/1RDsW7SVzwMPZfA0WLOPA4YTqbRyXIHGYmBenJk33HaE/edit#heading=h.2za3bbxbs5gn).

1. Previously there were separate waits on pg ready and other events. As a result, there were quite a few timing tweaks that were inefficient and hard to understand and unit test. This PR consolidates them into a single wait that is handled by TrialRunner in each step.
- A few event types are introduced. They map to scenarios as follows:
  * PG_READY --> Should place a trial onto it. If somehow there is no trial to be placed there, the pg will be put in _ready momentarily. This is because resources have historically been conceptualized as a pull-based model.
  * NO_RUNNING_TRIALS_TIME_OUT --> possibly the insufficient-resources case
  * TRAINING_RESULT
  * SAVING_RESULT
  * RESTORING_RESULT
  * YIELD --> This just means that training is simply taking very long. We need to punt back to the main loop to print out status info etc.

2. Previously TrialCleanup was not very efficient and could race between Trainable.stop() and `return_placement_group`. This PR streamlines the Trial cleanup process by explicitly letting Trainable.stop() finish, followed by `return_placement_group(pg)`. Note that graceful shutdown is needed in cases like `pause_trial`, where checkpointing to memory needs to be given time to happen before the actor is gone.

3. Quite a few env variables (timing tweaks) are removed, which I consider OK to do without a deprecation cycle.
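
The single-wait idea from item 1 can be sketched as one blocking call per step followed by a dispatch on the event type. The event names come from the list above; everything else (the `Queue`-based wait, the `wait_for_event` and `step` helpers, the returned strings) is an assumption made for illustration and does not mirror TrialRunner's actual code.

```python
from enum import Enum, auto
from queue import Empty, Queue


class EventType(Enum):
    PG_READY = auto()
    NO_RUNNING_TRIALS_TIME_OUT = auto()
    TRAINING_RESULT = auto()
    SAVING_RESULT = auto()
    RESTORING_RESULT = auto()
    YIELD = auto()


def wait_for_event(events: Queue, timeout: float) -> EventType:
    """Single wait point: block for the next event, or yield on timeout."""
    try:
        return events.get(timeout=timeout)
    except Empty:
        return EventType.YIELD


def step(events: Queue, timeout: float = 0.01) -> str:
    """One control-loop step: wait once, then dispatch on the event type."""
    event = wait_for_event(events, timeout)
    if event is EventType.PG_READY:
        return "start trial on placement group"
    if event is EventType.NO_RUNNING_TRIALS_TIME_OUT:
        return "warn about insufficient resources"
    if event is EventType.YIELD:
        return "print status"
    return f"process {event.name.lower()}"
```

With one wait per step, the loop has a single place where timing matters, which is easier to reason about and to unit test than several independent waits.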
2022-02-09 15:31:17 +00:00
Balaji Veeramani
31ed9e5d02
[CI] Replace YAPF disables with Black disables (#21982) 2022-02-08 16:29:25 -08:00
Jules S. Damji
6b7d995e64
Added a hands-on self-contained MLflow/Ray Serve deployment example (#22192) 2022-02-08 12:10:53 -08:00
Guyang Song
36ba514f9c
[Doc] Fix bad doc and recover doc of c++ api (#22213) 2022-02-08 19:04:37 +08:00
Guyang Song
9f77090c1c
[Doc] Fix bad links of dask and mars in ray-libraries.rst (#22210) 2022-02-08 19:02:49 +08:00
Max Pumperla
5cc9355303
[Docs] Tune docs overhaul (first part) (#22112)
Continuing docs overhaul, tune now has:

- [x] better landing page
- [x] a getting started guide
- [x] user guide was cut down, partially merged with FAQ, and partially integrated with tutorials
- [x] the new user guide contains guides to tune features and practical integrations
- [x] we rewrote some of the feature guides for clarity 
- [x] we got rid of sphinx-gallery for this sub-project (only data and core left), as it looks bad and is unnecessarily complicated anyway (plus, makes the build slower)
- [x] sphinx-gallery examples are now moved to markdown notebooks, as started in #22030.
- [x] Examples are tested in the new framework, of course.

There's still a lot one can do, but this is already getting too large. Will follow up with more fine-tuning next week.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-02-07 15:47:03 +00:00
Clark Zinzow
fb0d6e6b0b
[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067) 2022-02-05 16:59:34 -08:00
Jules S. Damji
c5c5e01b5d
[Doc] [Serve] Fixed minor typo and removed extra ',' (#22101) 2022-02-04 14:51:38 -08:00
Archit Kulkarni
d7be4e1d3c
[doc] [runtime env] Add note that referencing local files in requirements.txt is not supported (#22095) 2022-02-04 15:32:19 -06:00
matthewdeng
014a9959f1
Revert "[train] add TorchTensorboardProfilerCallback (#21864)" (#22117)
This reverts commit f064306de9.
2022-02-04 08:54:16 -08:00
Clark Zinzow
743ce65da8
[Dask-on-Ray] Add support for Dask annotations. (#22057) 2022-02-03 22:15:38 -08:00
matthewdeng
f064306de9
[train] add TorchTensorboardProfilerCallback (#21864)
Implement a TorchTensorboardProfilerCallback and corresponding TorchWorkerProfiler to support distributed PyTorch Profiler With TensorBoard integration.
2022-02-03 19:28:12 -08:00
Max Pumperla
092598774a
[Docs] Executable notebook tutorial (#22030)
We're introducing the usage of [MyST Notebooks](https://myst-nb.readthedocs.io/en/latest/index.html) here and demonstrating how it works by rewriting (and extending) the RLlib Serve tutorial. Benefits:

- [x] Write notebooks in markdown. Can be converted into other formats e.g. with `jupytext`
- [x] Tutorials like this have a binderhub link added to the top nav (launch button).
- [x] Notebooks get executed when docs are built, so it's impossible to have stale docs.
- [x] But locally those builds are cached so that you don't have to wait too long.
- [x] The notebook cell outputs can be shown, hidden or removed.  In particular, we can now avoid adding expected code output as comments in our scripts (which might get outdated).

We're also clarifying  #22022. 

Old tutorial: [here](https://docs.ray.io/en/latest/serve/tutorials/rllib.html)
New tutorial (preview): [here](https://ray--22030.org.readthedocs.build/en/22030/serve/tutorials/rllib.html)

Co-authored-by: simon-mo <simon.mo@hey.com>
2022-02-03 08:13:04 +00:00
Archit Kulkarni
78f882dbbc
[runtime env] Local uri caching for working_dir, py_modules and conda (#20273)
Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them. In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simply mark it as unused in the cache. When the cache exceeds its size limit (default 10 GB), it deletes unused URIs until the cache is back under the size limit or there are no more unused URIs.
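
The eviction rule described above can be sketched as follows. This is a minimal illustration under assumptions, not the class added in this PR: `URICache`, its method names, and the `delete_fn` callback are hypothetical, and sizes are tracked as plain integers.

```python
class URICache:
    """Hypothetical size-bounded cache that only evicts unused URIs."""

    def __init__(self, max_total_size_bytes, delete_fn=lambda uri: None):
        self.max_total_size_bytes = max_total_size_bytes
        self.delete_fn = delete_fn  # called to remove the local files
        self.sizes = {}             # uri -> size in bytes
        self.unused = set()         # URIs whose reference count is zero

    def add(self, uri, size_bytes):
        """Register a URI that is currently in use."""
        self.sizes[uri] = size_bytes
        self.unused.discard(uri)
        self._evict()

    def mark_used(self, uri):
        self.unused.discard(uri)

    def mark_unused(self, uri):
        """Keep the files, but make them eligible for eviction."""
        if uri in self.sizes:
            self.unused.add(uri)
            self._evict()

    def total_size(self):
        return sum(self.sizes.values())

    def _evict(self):
        # Delete unused URIs until the cache fits or none are left.
        while self.unused and self.total_size() > self.max_total_size_bytes:
            uri = self.unused.pop()
            self.delete_fn(uri)
            del self.sizes[uri]


deleted = []
cache = URICache(max_total_size_bytes=10, delete_fn=deleted.append)
cache.add("a", 6)
cache.add("b", 6)       # over the limit, but nothing is unused yet
cache.mark_unused("a")  # "a" is evicted to get back under the limit
cache.mark_unused("b")  # under the limit, so "b" stays cached
```

URIs that are still referenced are never deleted, so the cache can temporarily exceed its limit when everything in it is in use.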

Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit

- Adds unit tests for caching and integration tests for working_dir caching
2022-02-02 14:53:03 -06:00
Eric Liang
3d449d4f71
[docs] Clean up long titles in TOC (#22016) 2022-02-01 22:56:49 -08:00
Balaji Veeramani
6441335f5e
[Doc] Correct information about code style (#21985) 2022-02-01 10:37:21 -08:00
SangBin Cho
2db71f72cc
[Doc] Remove the legacy doc (#21996) 2022-01-31 15:26:19 -08:00
Kai Yang
2038cc96c6
Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988)" (#21661)" (#21894)
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the Arrow block format implementation for now. They'll be implemented in a later PR.
2022-01-31 12:09:51 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Junwen Yao
eb8adc6105
[train] add a utility function to turn off TF autosharding (#21887)
This PR adds a utility function to turn off TF autosharding as a temporary solution.

Closes #19324.
2022-01-28 16:09:06 -08:00