Commit graph

2580 commits

Author SHA1 Message Date
Sihan Wang
8ecd928c34
[Serve] Make the checkpoint and recover only from GCS (#26753) 2022-07-25 14:24:53 -07:00
Jules S. Damji
193e824bc1
[AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937)
Co-authored-by: Jules Damji <jules@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-25 14:21:29 -07:00
Jiao
5315f1e643
[AIR] Enable other notebooks previously marked with # REGRESSION (#26896)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-25 13:40:21 -07:00
matthewdeng
df638b3f0f
[Datasets] Automatically cast tensor columns when building Pandas blocks. (#26924)
This PR just applies the changes from the following PRs:

- [Datasets] Automatically cast tensor columns when building Pandas blocks. #26684
  - reverted by Revert "[Datasets] Automatically cast tensor columns when building Pandas blocks." #26921
- [AIR - Datasets] Fix TensorDtype construction from string and fix example. #26904

This fixes the test failures introduced in the originally reverted PRs.
2022-07-25 12:12:10 -07:00
Jiao
bf1d9971f1
[setup-dev] Add flag to skip symlink certain folders (#26899) 2022-07-25 10:21:20 -07:00
matthewdeng
3ea80f6aa1
[data] set iter_batches default batch_size (#26955)
Why are these changes needed?
Resubmitting #26869.

This PR was reverted due to failing tests; however, those failures were actually due to a dependency: #26950
2022-07-25 08:34:25 -07:00
Siyuan (Ryans) Zhuang
4a1ad3e87a
[Workflow] Support "retry_exceptions" of Ray tasks (#26913)
* support 'retry_exceptions'

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* add test

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* add doc

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* fix

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* typo

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-24 20:50:11 -07:00
Eric Liang
1ac2a872e7
[docs] Editing pass over Dataset docs (#26935) 2022-07-24 19:48:29 -07:00
Kai Fricke
803c094534
[air/tuner/docs] Update docs for Tuner() API 2b: Tune examples (ipynb) (#26884)
This PR updates the Ray AIR/Tune ipynb examples to use the Tuner() API instead of tune.run().

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Signed-off-by: Kai Fricke <coding@kaifricke.com>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2022-07-24 18:53:57 +01:00
Eric Liang
008eecfbff
[docs] Update the AIR data ingest guide (#26909) 2022-07-24 09:59:29 -07:00
Kai Fricke
8fe439998e
[air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930)
Signed-off-by: Kai Fricke <coding@kaifricke.com>

Why are these changes needed?
Splitting up #26884: This PR includes changes to use Tuner() instead of tune.run() for most docs files (rst and py), and a change to move reuse_actors to the TuneConfig.
2022-07-24 07:45:24 -07:00
Christy Bergman
e9503dbe2b
[RLlib] Push suggested changes from #25652 docs wording Parametric Models Action Masking. (#26793) 2022-07-24 15:36:55 +02:00
Eric Liang
d692a55018
[data] Make lazy mode non-experimental (#26934) 2022-07-23 21:28:31 -07:00
matthewdeng
bcec60d898
Revert "[data] set iter_batches default batch_size #26869 " (#26938)
This reverts commit b048c6f659.
2022-07-23 17:46:45 -07:00
matthewdeng
b048c6f659
[data] set iter_batches default batch_size #26869
Why are these changes needed?
Consumers (e.g. Train) may expect generated batches to be of the same size. Prior to this change, each batch defaulted to one block, and blocks may differ in size.

Changes
- Set the default batch_size to 256. This was chosen as a sensible default for training workloads and is intentionally different from the existing default batch_size value for Dataset.map_batches.
- Update docs for Dataset.iter_batches, Dataset.map_batches, and DatasetPipeline.iter_batches to be consistent.
- Update tests and examples to explicitly pass in batch_size=None where they intentionally test block iteration; other tests cover explicit batch sizes.
2022-07-23 13:44:53 -07:00
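The difference between block iteration and fixed-size batching described in this commit can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray's implementation: blocks are modeled as plain lists, the function below is a hypothetical stand-in for `Dataset.iter_batches`, and the smaller final batch is an assumption of the sketch.

```python
from typing import Iterable, Iterator, List, Optional


def iter_batches(blocks: Iterable[List[int]],
                 batch_size: Optional[int] = 256) -> Iterator[List[int]]:
    """Yield rows from `blocks` in batches.

    Hypothetical stand-in, not Ray's code: blocks are plain lists.
    batch_size=None yields each block unchanged (block iteration).
    """
    if batch_size is None:
        for block in blocks:
            yield list(block)
        return

    buffer: List[int] = []
    for block in blocks:
        buffer.extend(block)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:
        # Assumption of this sketch: the final batch may be smaller.
        yield buffer


# Three blocks of uneven size: 300, 100, and 200 rows.
blocks = [list(range(300)), list(range(100)), list(range(200))]
default_sizes = [len(b) for b in iter_batches(blocks)]
block_sizes = [len(b) for b in iter_batches(blocks, batch_size=None)]
print(default_sizes)  # [256, 256, 88]
print(block_sizes)    # [300, 100, 200]
```

With the default, every batch except possibly the last has the same size regardless of how rows are split across blocks; passing batch_size=None keeps the earlier one-batch-per-block behavior.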
Stephanie Wang
55a0f7bb2d
[core] ray.init defaults to an existing Ray instance if there is one (#26678)
ray.init() currently starts a new Ray instance even if one already exists, which is confusing for new users moving from local development to a cluster. With this PR, when no address is specified, ray.init() first tries to find an existing Ray cluster that was created through `ray start`. If none is found, it starts a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the address of the started cluster is written to a file called `/tmp/ray/ray_current_cluster` (this was already the case), and the file is deleted on `ray stop`. For ray.init() and ray.init(address="auto"), we now check this local file first for an existing cluster address. If the file is empty, we autodetect any running cluster (legacy behavior) if address="auto", or start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we create a new local Ray instance, even if one already exists. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```

Note that the dashboard URL is now printed by `ray.init()` rather than when the dashboard is first started.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-07-23 11:27:22 -07:00
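The resolution order described in this commit can be sketched as a small decision function. This is an illustrative sketch only, not Ray's internal code: `resolve_address`, its parameters, and the returned action names are hypothetical, the cluster file is passed in as a string rather than read from disk, and the error raised when address="auto" finds nothing is an assumption of the sketch.

```python
from typing import Callable, Optional, Tuple


def resolve_address(address: Optional[str],
                    cluster_file_contents: str,
                    autodetect: Callable[[], Optional[str]]
                    ) -> Tuple[str, Optional[str]]:
    """Return (action, address) following the order in the commit message.

    Hypothetical helper. `cluster_file_contents` stands in for the text of
    /tmp/ray/ray_current_cluster ("" if the file is missing or empty).
    """
    if address == "local":
        # Always start a fresh local instance, even if a cluster exists.
        return ("start_local", None)
    if address in (None, "auto"):
        recorded = cluster_file_contents.strip()
        if recorded:
            # `ray start` recorded a cluster address: connect to it.
            return ("connect", recorded)
        if address == "auto":
            # Legacy behavior: look for any running cluster.
            found = autodetect()
            if found is None:
                # Assumption of this sketch: fail when nothing is found.
                raise ConnectionError("no running Ray cluster found")
            return ("connect", found)
        return ("start_local", None)
    # Any other value is treated as an explicit cluster address.
    return ("connect", address)


def no_cluster() -> Optional[str]:
    return None


# "10.0.0.5:6379" is a made-up example address.
with_file = resolve_address(None, "10.0.0.5:6379\n", no_cluster)
without_file = resolve_address(None, "", no_cluster)
forced_local = resolve_address("local", "10.0.0.5:6379\n", no_cluster)
```

Here `with_file` resolves to connecting to the recorded address, while `without_file` and `forced_local` both resolve to starting a new local instance.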
Eric Liang
63a6c1dfac
[docs] Cleanup the Datasets key concept docs (#26908)
Clean up the Datasets key concept doc so that it is suitable for beginner-level users, and improve the diagrams.
2022-07-22 23:30:54 -07:00
Kai Fricke
1f32cb95db
[air/tune] Add top-level imports for Tuner, TuneConfig, move CheckpointConfig (#26882) 2022-07-22 20:17:06 -07:00
Eric Liang
36c46e9686
[docs] Improve AIR table of contents titles (#26858) 2022-07-22 17:17:49 -07:00
Kai Fricke
77ba30d34e
[tune] Docs for custom command based syncer (awscli / gsutil) (#26879)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-07-22 15:28:53 -07:00
Siyuan (Ryans) Zhuang
4b50ef6a28
[Workflow] Rename the argument of "workflow.get_output" (#26876)
* rename get_output

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* update doc

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-22 12:06:19 -07:00
Clark Zinzow
a29baf93c8
[Datasets] Add .iter_torch_batches() and .iter_tf_batches() APIs. (#26689)
This PR adds .iter_torch_batches() and .iter_tf_batches() convenience APIs, which take care of ML framework tensor conversion, use the narrow tensor waist for the .iter_batches() call ("numpy" format), and unify batch formats around two options: a single tensor for simple/pure-tensor/single-column datasets, and a dictionary of tensors for multi-column datasets.
2022-07-22 10:09:36 -07:00
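The two unified batch formats mentioned in this commit can be illustrated with a small stand-in. This is a sketch under stated assumptions: `unify_batch` is a hypothetical helper, not part of Ray, and plain Python lists stand in for framework tensors.

```python
from typing import Dict, List, Union

Column = List[float]  # stand-in for a framework tensor


def unify_batch(columns: Dict[str, Column]) -> Union[Column, Dict[str, Column]]:
    """Hypothetical helper: a single-column batch becomes one "tensor";
    a multi-column batch stays a dictionary of "tensors"."""
    if len(columns) == 1:
        # Single-column dataset: return the sole column directly.
        return next(iter(columns.values()))
    # Multi-column dataset: keep the column-name-to-tensor mapping.
    return dict(columns)


single = unify_batch({"value": [1.0, 2.0, 3.0]})
multi = unify_batch({"x": [1.0, 2.0], "y": [0.0, 1.0]})
```

In this sketch `single` is the bare list `[1.0, 2.0, 3.0]`, while `multi` remains a dictionary keyed by column name.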
Eric Liang
9272bcbbca
[docs] Add ecosystem map to AIR guide (#26859) 2022-07-21 19:06:47 -07:00
matthewdeng
14e2b2548c
[air] update remaining dict scaling_configs (#26856) 2022-07-21 18:55:21 -07:00
Jiao
db027d86af
[P0][AIR] Fix train to serve notebooks (#26821)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2022-07-21 18:04:13 -07:00
Sihan Wang
27f1532a15
[Serve] Promote graceful shutdown and health check (#26682) 2022-07-21 17:37:10 -05:00
Jules S. Damji
6db2536971
[RayAIR] Minor tweaks to the why ray air for clarity (#26680) 2022-07-21 10:21:26 -07:00
Balaji Veeramani
ac1d21027d
[AIR] Add framework-specific checkpoints (#26777) 2022-07-20 19:33:27 -07:00
Richard Liaw
9f0d35b97c
[air/docs] add tensorflow benchmarks into table (#26800) 2022-07-20 17:12:40 -07:00
Eric Liang
d6f29eb9ca
[docs] Mark pipelined prediction as experimental for now (#26792) 2022-07-20 15:31:19 -07:00
xwjiang2010
e7957f4a3e
[air] update offline/online rl example and enable them. (#26786) 2022-07-20 14:06:03 -07:00
Siyuan (Ryans) Zhuang
0063d94166
[Core] Make "GetTimeoutError" a subclass of "TimeoutError" (#26771)
I am surprised by the fact that `GetTimeoutError` is not a subclass of `TimeoutError`, which is counter-intuitive and may discourage users from trying the timeout feature in `ray.get`, because you have to "guess" the correct error type. For most people, I believe the first error type in their mind would be `TimeoutError`.

This PR fixes this.
2022-07-20 14:37:39 -05:00
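The effect of this change on calling code can be sketched with stand-in classes. The classes and the `get_with_timeout` function below are hypothetical illustrations defined locally, not imports from Ray; they only show why inheriting from the built-in `TimeoutError` lets callers catch the error without knowing the library-specific type.

```python
class RayError(Exception):
    """Stand-in for a library-specific base error (hypothetical here)."""


class GetTimeoutError(RayError, TimeoutError):
    """Stand-in: also inherits from the built-in TimeoutError."""


def get_with_timeout() -> None:
    # Stand-in for a blocking call that exceeds its timeout.
    raise GetTimeoutError("get timed out")


try:
    get_with_timeout()
    caught = None
except TimeoutError as exc:
    # The built-in type is enough to catch the library-specific error.
    caught = type(exc).__name__
```

After this runs, `caught` holds the name of the library-specific class, even though the `except` clause names only the built-in `TimeoutError`.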
tomsunelite
d915529e9e
Add doc for custom lifetime of java actor (#26706)
Custom lifetimes for Java actors are already supported, but the related documentation has not been updated.

Co-authored-by: sunkunjian1 <sunkunjian1@jd.com>
2022-07-20 22:19:44 +08:00
Tao Wang
4f2747f12a
[Core][C++ worker] Add GetNamespace api (#26509) 2022-07-20 11:17:14 +08:00
Tao Wang
cd521ed132
[Doc][namespaces][C++ worker]add document for c++ worker namespace and specifying namespace while creating/getting named actors (#26498)
Namespaces have been supported in the C++ worker since https://github.com/ray-project/ray/pull/26327. This PR documents their usage and also strengthens the Java and Python documents, for example by explaining how to specify a namespace when creating named actors.

- [x] add doc for basic c++ worker namespace usage
- [x] add explanation for specifying namespace while creating named actors, in Python, Java and C++
2022-07-20 10:58:41 +08:00
Dmitri Gekhtman
fdd5c53bfd
[KubeRay] Documentation structure and skeleton (#26589)
Adds outline and structure for new KubeRay-based Ray-on-Kubernetes docs.
2022-07-19 13:28:04 -07:00
Richard Liaw
6563c2762d
[air] add pytorch benchmark number (#26719) 2022-07-19 09:51:13 -07:00
Richard Liaw
7e62e1187c
[air/benchmark] Torch benchmarks for 4x4 (#26692)
Add benchmark data for 4x4 GPU setup.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 17:06:37 +01:00
Siyuan (Ryans) Zhuang
5b937167d3
[Workflow] Fix typo in workflow event doc (#26686)
Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-18 23:26:50 -07:00
Siyuan (Ryans) Zhuang
eb4ed49c1f
[Workflow] Unify the semantics of max_retries of workflow task and Ray task (#26350)
* workflow task retry

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* move and enhance tests

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* use "max_retries" of Ray task

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* add test for disabling lineage reconstruction in workflow

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-18 23:25:44 -07:00
Sumanth Ratna
759966781f
[air] Allow users to use instances of ScalingConfig (#25712)
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-18 15:46:58 -07:00
matthewdeng
6670708010
[air] add placement group max CPU to data benchmark (#26649)
Set experimental `_max_cpu_fraction_per_node` to prevent deadlock.

This should technically be a no-op with the SPREAD strategy.
2022-07-18 10:34:40 -07:00
Chen Shen
b20f5f51df
[Air][Data] Don't promote locality_hints for split (#26647)
Why are these changes needed?
Since locality_hints is an experimental feature, we stop promoting it in the docs and don't enable it in AIR. See #26641 for more context.
2022-07-17 22:18:30 -07:00
Jiao
98a07920d3
[AIR][CUJ] Make distributing training benchmark at silver tier (#26640) 2022-07-17 22:07:09 -07:00
Jules S. Damji
55368402ee
added summary why and when to use bulk vs streaming data ingest (#26637) 2022-07-17 18:46:58 -07:00
Eric Liang
12825fc5aa
[air] Add a warning if no CPUs are reserved for dataset execution (#26643) 2022-07-17 16:33:51 -07:00
Clark Zinzow
864af14f41
[Datasets] [Local Shuffle - 1/N] Add local shuffling option. (#26094)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matt@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-17 16:21:14 -07:00
Eric Liang
400330e9c0
[air] Add _max_cpu_fraction_per_node to ScalingConfig and documentation (#26634) 2022-07-16 21:55:51 -07:00
Amog Kamsetty
3a345a470c
[AIR/Docs] Add Predictor Docs (#25833) 2022-07-16 21:14:21 -07:00
Jiao
77e2ef2eb6
[AIR] Update Torch benchmarks with documentation (#26631)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-16 17:58:21 -07:00