18 commits

Author SHA1 Message Date
Eric Liang
f7ae8923f6
[docs] Reorganize the tensor data support docs; general editing (#26952)
Why are these changes needed?
Editing pass over the tensor support docs for clarity:

Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
2022-08-01 17:31:41 -07:00
Eric Liang
1ac2a872e7
[docs] Editing pass over Dataset docs (#26935) 2022-07-24 19:48:29 -07:00
Eric Liang
63a6c1dfac
[docs] Cleanup the Datasets key concept docs (#26908)
Clean up the Datasets key concept doc so that it is suitable for beginner-level users, and improve the diagrams.
2022-07-22 23:30:54 -07:00
Eric Liang
12825fc5aa
[air] Add a warning if no CPUs are reserved for dataset execution (#26643) 2022-07-17 16:33:51 -07:00
Eric Liang
400330e9c0
[air] Add _max_cpu_fraction_per_node to ScalingConfig and documentation (#26634) 2022-07-16 21:55:51 -07:00
Eric Liang
9de1add073
[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)
This PR defaults the parallelism of Dataset reads to `-1`. The parallelism is determined according to the following rule in this case:
- The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group relative to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimated number of CPUs is less than 8, it is set to 8.
- The parallelism is set to the estimated number of CPUs multiplied by 2.
- The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until the blocks are < 512MiB in size.
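
The rule above can be sketched roughly as follows. This is an illustrative approximation of the behavior described in this commit message, not the code from the PR: the function name `autodetect_parallelism`, its parameters, and the representation of the placement group as a simple fraction of the cluster are all assumptions made for the example.

```python
from typing import Optional

# Values taken from the commit message above.
TARGET_MAX_BLOCK_SIZE = 512 * 1024 * 1024  # 512 MiB
MIN_ESTIMATED_CPUS = 8


def autodetect_parallelism(
    cluster_cpus: int,
    estimated_data_size_bytes: int,
    placement_group_fraction: Optional[float] = None,
) -> int:
    """Hypothetical helper mirroring the three rules in the commit message."""
    # Rule 1: estimate available CPUs, scaling by the placement group's
    # share of the cluster if one is in use, with a floor of 8.
    cpus = cluster_cpus
    if placement_group_fraction is not None:
        cpus = int(cluster_cpus * placement_group_fraction)
    cpus = max(cpus, MIN_ESTIMATED_CPUS)

    # Rule 2: start from twice the estimated CPU count.
    parallelism = 2 * cpus

    # Rule 3: raise parallelism until each block is strictly smaller
    # than the target block size.
    min_blocks = estimated_data_size_bytes // TARGET_MAX_BLOCK_SIZE + 1
    return max(parallelism, min_blocks)
```

For instance, under these assumptions a 4-CPU cluster reading a small file would get a parallelism of 16 (the 8-CPU floor times 2), while the same cluster reading 100 GiB would get 201 so that every block stays under 512 MiB.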

These rules fix two common user problems:
1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
2. Overly large block sizes leading to OOMs when processing a single block.

TODO:
- [x] Unit tests
- [x] Docs update

Supersedes part of: https://github.com/ray-project/ray/pull/25708

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-07-12 21:08:49 -07:00
Myeongju Kim
a1a78077ca
Fix a broken link in Ray Dataset doc (#25927)
Co-authored-by: Myeong Kim <myeongki@amazon.com>
2022-06-20 13:17:46 -07:00
Clark Zinzow
1701b923bc
[Datasets] [Tensor Story - 2/2] Add "numpy" batch format for batch mapping and batch consumption. (#24870)
This PR adds a NumPy "numpy" batch format for batch transformations and batch consumption that works with all block types. See #24811.
2022-06-17 16:01:02 -07:00
Stephanie Wang
473a962d89
[Datasets] [Docs] Add docs about fault tolerance in Datasets (#25371)
Adds description of fault tolerance guarantees for Datasets.

Related issue number

Closes #24856.
2022-06-02 15:53:50 -07:00
Kai Fricke
6fe91885b0
[docs/lint] Fix reference to dataset_tune (#25402) 2022-06-02 11:40:26 +01:00
Eric Liang
51b295ad74
[docs] Improve Tune + Datasets documentation (#25389) 2022-06-01 21:52:32 -07:00
Eric Liang
71717e59c4
[data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262) 2022-06-01 13:50:46 -07:00
Clark Zinzow
2c8fac369a
Note that explicit resource allocation is experimental, fix typos (#25038) 2022-05-20 11:36:08 -07:00
Clark Zinzow
0b6505e8c6
[Datasets] Miscellaneous GA docs P0s. (#24891)
This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:

- Documents Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode, and expands our execution model docs in general.
2022-05-18 16:17:48 -07:00
Jian Xiao
6d93e9f0f5
Cleanup the DatasetPipeline references in Getting Started; rename Exchanging to Accessing (#23786) 2022-04-12 17:10:14 -07:00
Eric Liang
5a0b7a7ee0
Document Dataset pipeline stage fusion (#22737) 2022-03-01 14:38:09 -08:00
Clark Zinzow
fb0d6e6b0b
[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067) 2022-02-05 16:59:34 -08:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention: `Ray Data` is used whenever the project is referred to.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00