Commit graph

8 commits

Author SHA1 Message Date
Eric Liang
f7ae8923f6
[docs] Reorganize the tensor data support docs; general editing (#26952)
Why are these changes needed?
Editing pass over the tensor support docs for clarity:

Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
2022-08-01 17:31:41 -07:00
Eric Liang
1ac2a872e7
[docs] Editing pass over Dataset docs (#26935) 2022-07-24 19:48:29 -07:00
Eric Liang
63a6c1dfac
[docs] Cleanup the Datasets key concept docs (#26908)
Clean up the Datasets key concept doc to be suitable for consumption by a beginner level user and improving the diagrams.
2022-07-22 23:30:54 -07:00
Eric Liang
9de1add073
[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)
This PR defaults the parallelism of Dataset reads to `-1`. The parallelism is determined according to the following rule in this case:
- The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group compared to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimated CPUs is less than 8, it is set to 8.
- The parallelism is set to the estimated number of CPUs multiplied by 2.
- The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until the blocks are < 512MiB in size.

These rules fix two common user problems:
1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
2. Overly large block sizes leading to OOMs when processing a single block.

TODO:
- [x] Unit tests
- [x] Docs update

Supercedes part of: https://github.com/ray-project/ray/pull/25708

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-07-12 21:08:49 -07:00
Cheng Su
4e674b6ad3
[Datasets] Update docs for drop_columns and fix typos (#26317)
We added drop_columns() API to datasets in #26200, so updating documentation here to use the new API - doc/source/data/examples/nyc_taxi_basic_processing.ipynb. In addition, fixing some minor typos after proofreading the datasets documentation.
2022-07-07 17:17:33 -07:00
Jian Xiao
9dd30d5f77
Proofread the some datasets docs (#25068)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-05-22 12:11:51 -07:00
Stephanie Wang
2931a23760
[doc] Add docs for push-based shuffle in Datasets (#24486)
Adds recommendations, example, and brief benchmark results for push-based shuffle in Datasets.
2022-05-05 14:59:33 -07:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention in `Ray Data` whenever is is referred to the project.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00