Commit graph

2344 commits

Author SHA1 Message Date
Richard Liaw
6563c2762d
[air] add pytorch benchmark number (#26719) 2022-07-19 09:51:13 -07:00
Richard Liaw
7e62e1187c
[air/benchmark] Torch benchmarks for 4x4 (#26692)
Add benchmark data for 4x4 GPU setup.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 17:06:37 +01:00
Siyuan (Ryans) Zhuang
5b937167d3
[Workflow] Fix typo in workflow event doc (#26686)
Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-18 23:26:50 -07:00
Siyuan (Ryans) Zhuang
eb4ed49c1f
[Workflow] Unify the semantics of max_retries of workflow task and Ray task (#26350)
* workflow task retry

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* move and enhance tests

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* use "max_retries" of Ray task

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>

* add test for disabling lineage reconstruction in workflow

Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-18 23:25:44 -07:00
Sumanth Ratna
759966781f
[air] Allow users to use instances of ScalingConfig (#25712)
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-18 15:46:58 -07:00
matthewdeng
6670708010
[air] add placement group max CPU to data benchmark (#26649)
Set experimental `_max_cpu_fraction_per_node` to prevent deadlock.

This should technically be a no-op with the SPREAD strategy.
2022-07-18 10:34:40 -07:00
Chen Shen
b20f5f51df
[Air][Data] Don't promote locality_hints for split (#26647)
Why are these changes needed?
Since locality_hints is an experimental feature, we stop promoting it in doc and don't enable it in AIR. See #26641 for more context
2022-07-17 22:18:30 -07:00
Jiao
98a07920d3
[AIR][CUJ] Make distributing training benchmark at silver tier (#26640) 2022-07-17 22:07:09 -07:00
Jules S. Damji
55368402ee
added summary why and when to use bulk vs streaming data ingest (#26637) 2022-07-17 18:46:58 -07:00
Eric Liang
12825fc5aa
[air] Add a warning if no CPUs are reserved for dataset execution (#26643) 2022-07-17 16:33:51 -07:00
Clark Zinzow
864af14f41
[Datasets] [Local Shuffle - 1/N] Add local shuffling option. (#26094)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matt@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-17 16:21:14 -07:00
Eric Liang
400330e9c0
[air] Add _max_cpu_fraction_per_node to ScalingConfig and documentation (#26634) 2022-07-16 21:55:51 -07:00
Amog Kamsetty
3a345a470c
[AIR/Docs] Add Predictor Docs (#25833) 2022-07-16 21:14:21 -07:00
Jiao
77e2ef2eb6
[AIR] Update Torch benchmarks with documentation (#26631)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-16 17:58:21 -07:00
Eric Liang
0855bcb77e
[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633) 2022-07-16 17:37:06 -07:00
M Waleed Kadous
7c32993c15
[core/docs]Add a new section under Ray Core called Ray Gotchas (#26624)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-16 16:53:01 -07:00
Antoni Baum
fb6f3cf708
[AIR/Docs] Small improvements to Train user guide (#26577)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-07-16 16:51:17 -07:00
Eric Liang
6217138eb0
[docs] Move AIR benchmarks to top level (#26632) 2022-07-16 15:34:31 -07:00
Philipp Moritz
081bbfbff1
[Examples] Test OCR example in documentation tests (#26482)
Make sure the OCR example is tested in documentation after we discovered that example notebooks are not tested in CI.

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
2022-07-16 10:51:28 -07:00
Richard Liaw
799311b2f7
[air/docs] update examples to remove pandas again (#26598) 2022-07-16 08:40:44 -07:00
Balaji Veeramani
34cf1f17ea
[Datasets] Add ImageFolderDatasource (#24641)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-15 22:43:23 -07:00
matthewdeng
e3a096f412
[air] add bulk ingest benchmarks (#26618) 2022-07-15 22:01:23 -07:00
Richard Liaw
5ad4e75831
[air] Add initial benchmark section (#26608) 2022-07-15 15:33:48 -07:00
Jiao
647e12b6c7
[AIR] Fix convert_existing_pytorch_code_to_ray_air notebook (#26523) 2022-07-14 14:30:55 -07:00
Tim Gates
e42dc7943e
docs: Fix a few typos (#26556)
There are small typos in:
- doc/source/data/faq.rst
- python/ray/serve/replica.py

Fixes:
- Should read `successfully` rather than `succssifully`.
- Should read `pseudo` rather than `psuedo`.
2022-07-14 12:38:33 -07:00
Jiajun Yao
60dd77a2d3
Enable usage stats collection for ray.init iff nightly wheels (#26461)
For nightly wheels, we want to collect usage stats for local clusters started via ray.init() as well.
2022-07-14 12:14:01 -07:00
Amog Kamsetty
6595bd6e2d
[AIR] Introduce better scoring API for BatchPredictor (#26451)
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>

As discussed offline, allow configurability for feature columns and keep columns in BatchPredictor for better scoring UX on test datasets.
2022-07-14 11:26:12 -07:00
Richard Liaw
a0ce3c111b
[air/data] Concatenator preprocessor (#26526) 2022-07-14 10:26:14 -07:00
Eric Liang
5f18c67ba3
Fix LINT (#26554)
Signed-off-by: Eric Liang <ekhliang@gmail.com>
2022-07-13 23:28:02 -07:00
Jiao
15dbc0362a
[AIR][Docs] Fix torch_image_example (#26453) 2022-07-13 21:59:24 -07:00
Scott Cheng
1bc44c13fb
Update Python3.10 in docs (#26463)
Make it clear to users that ray supports Python 3.10
2022-07-13 20:08:56 -07:00
Eric Liang
31c8c908f9
[docs] Improve AIR API ref organization (#26530) 2022-07-13 18:05:17 -07:00
Sihan Wang
b606169cb5
[Serve] Promote autoscaling feature (#26393)
1. get rid of the private attribute
2. fix unit test
3. docs and workflows
2022-07-13 14:38:38 -05:00
Antoni Baum
cc7115f6a2
[Tune/CI] Fix tune-sklearn notebook example (#26470)
Fixes the tune-sklearn notebook example as found in #26410

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-13 18:14:36 +01:00
Antoni Baum
5ed10ef921
[AIR/CI] Fix Hugging Face notebook example (#26475) 2022-07-13 09:16:42 -07:00
Antoni Baum
ddb5572040
[Tune/CI] Fix Hyperopt notebook example (#26469)
Fixes failing hyperopt notebook in CI (as found in #26410). The cause was a mismatch between keys in points to evaluate and the search space - now, an informative exception will be raised.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-13 16:50:11 +01:00
Antoni Baum
9b2cd29511
[CI] Install Horovod in doc tests to fix notebook (#26476)
Fixes the Horovod notebook example as found in #26410 by installing Horovod in doc tests jobs.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-13 16:27:20 +01:00
Antoni Baum
67a7ffa6b4
[Tune/CI] Fix BOHB notebook example (#26473)
Fixes the BOHB notebook example as found in #26410

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-13 10:35:38 +01:00
Avnish Narayan
5df66b917d
[Lint Check] Remove broken link (#26505)
The paper is not available anymore.
2022-07-13 10:30:20 +01:00
Antoni Baum
e48d381926
[Tune/CI] Fix Tune-Pytorch-CIFAR notebook example (#26474)
Fixes the Tune-Pytorch-CIFAR notebook example as found in #26410

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-13 10:28:30 +01:00
Christy Bergman
7c925fe99f
[RLlib; docs] Re-organize algorithms so TOC matches README. (#26339) 2022-07-13 10:46:36 +02:00
Eric Liang
9de1add073
[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)
This PR defaults the parallelism of Dataset reads to `-1`. The parallelism is determined according to the following rule in this case:
- The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group compared to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimated CPUs is less than 8, it is set to 8.
- The parallelism is set to the estimated number of CPUs multiplied by 2.
- The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until the blocks are < 512MiB in size.

These rules fix two common user problems:
1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
2. Overly large block sizes leading to OOMs when processing a single block.

TODO:
- [x] Unit tests
- [x] Docs update

Supercedes part of: https://github.com/ray-project/ray/pull/25708

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-07-12 21:08:49 -07:00
Pamphile Roy
53ecc28f9f
[docs] Install ray from conda-forge instead of PyPi when using conda (#25296) 2022-07-12 16:59:44 -07:00
Eric Liang
4c04c8d92c
[doc] Rename toc entry for libraries back to "Ray Libraries" (#26485) 2022-07-12 14:23:36 -07:00
Rohan Potdar
09ce4711fd
[RLlib]: Move OPE to evaluation config (#25911) 2022-07-12 11:04:34 -07:00
Richard Liaw
92efc85b3b
[air/docs] checkpoints (#25901) 2022-07-11 20:40:23 -07:00
Richard Liaw
1abe908c22
[air/docs] improve consistency of getting started (#26247) 2022-07-11 20:16:37 -07:00
Richard Liaw
191921f4ec
[docs] Fix pytest and add stacklevel (#26340) 2022-07-11 19:43:37 -07:00
Antoni Baum
65ea710e30
[Docs] Update Train user guide to use the new APIs (#26091) 2022-07-11 15:10:10 -07:00
Kai Fricke
753f5feaf4
[tune] Remove TrialCheckpoint class (#25406)
The old user-facing TrialCheckpoint class has been deprecated in favor of `ray.ml.Checkpoint` and will be removed with this PR.

The main change in this PR is to delete the old `TrialCheckpoint` class and replace remaining API calls (e.g. `checkpoint.local_path`) with the correct AIR equivalents.

One issue that comes up is that with Ray client usage, checkpoint directories are not available on the local node (the client). Thus, we can't construct `Checkpoint` objects easily. (Previously, the TrialCheckpoint object held a reference to the location, even if it is not locally available). There are ongoing discussions on how to resolve this in the future. For now, we print an error when such a checkpoint is requested.

Depends on #25805

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-11 20:08:10 +01:00