Commit graph

25 commits

Author SHA1 Message Date
Cheng Su
bc5d8d9176
[AIR] Replace references of to_tf with iter_tf_batches (#27672) 2022-08-09 16:00:02 -07:00
Clark Zinzow
3b151c581e
[Datasets] Delay expensive tensor extension type import until Parquet reading. (#27653)
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.

I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
2022-08-08 17:06:25 -07:00
Cheng Su
aeb2346804
[AIR] Replace references of to_torch with iter_torch_batches (#27574) 2022-08-07 20:14:12 -07:00
Eric Liang
f7ae8923f6
[docs] Reorganize the tensor data support docs; general editing (#26952)
Why are these changes needed?
Editing pass over the tensor support docs for clarity:

Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
2022-08-01 17:31:41 -07:00
matthewdeng
3ea80f6aa1
[data] set iter_batches default batch_size (#26955)
Why are these changes needed?
Resubmitting #26869.

This PR was reverted due to failing tests; however, those failures were actually due to a dependency: #26950
2022-07-25 08:34:25 -07:00
Kai Fricke
8fe439998e
[air/tuner/docs] Update docs for Tuner() API 1: RSTs, docs, move reuse_actors (#26930)
Signed-off-by: Kai Fricke coding@kaifricke.com

Why are these changes needed?
Splitting up #26884: This PR includes changes to use Tuner() instead of tune.run() for most docs files (rst and py), and a change to move reuse_actors to the TuneConfig
2022-07-24 07:45:24 -07:00
Eric Liang
d692a55018
[data] Make lazy mode non-experimental (#26934) 2022-07-23 21:28:31 -07:00
matthewdeng
bcec60d898
Revert "[data] set iter_batches default batch_size #26869 " (#26938)
This reverts commit b048c6f659.
2022-07-23 17:46:45 -07:00
matthewdeng
b048c6f659
[data] set iter_batches default batch_size #26869
Why are these changes needed?
Consumers (e.g. Train) may expect generated batches to be of the same size. Prior to this change, the default behavior would be for each batch to be one block, which may be of different sizes.

Changes
Set default batch_size to 256. This was chosen to be a sensible default for training workloads, which is intentionally different from the existing default batch_size value for Dataset.map_batches.
Update docs for Dataset.iter_batches, Dataset.map_batches, and DatasetPipeline.iter_batches to be consistent.
Updated tests and examples to explicitly pass in batch_size=None as these tests were intentionally testing block iteration, and there are other tests that test explicit batch sizes.
2022-07-23 13:44:53 -07:00
Eric Liang
63a6c1dfac
[docs] Cleanup the Datasets key concept docs (#26908)
Clean up the Datasets key concept doc to be suitable for consumption by a beginner level user and improving the diagrams.
2022-07-22 23:30:54 -07:00
Chen Shen
b20f5f51df
[Air][Data] Don't promote locality_hints for split (#26647)
Why are these changes needed?
Since locality_hints is an experimental feature, we stop promoting it in doc and don't enable it in AIR. See #26641 for more context
2022-07-17 22:18:30 -07:00
Eric Liang
400330e9c0
[air] Add _max_cpu_fraction_per_node to ScalingConfig and documentation (#26634) 2022-07-16 21:55:51 -07:00
Cheng Su
4e674b6ad3
[Datasets] Update docs for drop_columns and fix typos (#26317)
We added drop_columns() API to datasets in #26200, so updating documentation here to use the new API - doc/source/data/examples/nyc_taxi_basic_processing.ipynb. In addition, fixing some minor typos after proofreading the datasets documentation.
2022-07-07 17:17:33 -07:00
Clark Zinzow
1701b923bc
[Datasets] [Tensor Story - 2/2] Add "numpy" batch format for batch mapping and batch consumption. (#24870)
This PR adds a NumPy "numpy" batch format for batch transformations and batch consumption that works with all block types. See #24811.
2022-06-17 16:01:02 -07:00
Jian Xiao
50c854b1ad
Fix hyperlink in rst doc (#25427)
Hyperlink not working

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-06-08 13:46:23 -07:00
Jian Xiao
6589a4f8cb
[Datasets][UX Assessment] Add a section on how to write UDFs in Datasets (#25338)
The Datasets UX assessment showed that users had difficulties in writing UDFs: what's input/output types, how to write the function etc.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-06-02 20:00:50 -07:00
Eric Liang
51b295ad74
[docs] Improve Tune + Datasets documentation (#25389) 2022-06-01 21:52:32 -07:00
Jian Xiao
ad842ec9ab
Revamp the Transforming Datasets user guide (#25033) 2022-05-20 19:25:06 -07:00
Jian Xiao
e5838c4700
Fix range_arrow(), which is replaced by range_table() (#25036) 2022-05-20 19:24:49 -07:00
Jian Xiao
44fd7fd1d0
Revamp the Saving Datasets user guide (#24987) 2022-05-19 15:40:12 -07:00
Clark Zinzow
399334d53c
[Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)
This PR overhauls the "Accessing Datasets", adding proper coverage of each data consuming methods, including the ML framework exchange APIs (to_torch() and to_tf()).
2022-05-19 12:50:00 -07:00
Clark Zinzow
0b6505e8c6
[Datasets] Miscellaneous GA docs P0s. (#24891)
This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:

- Documents Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode, and expands our execution model docs in general.
2022-05-18 16:17:48 -07:00
Jian Xiao
9fe4dba4ad
Revamp the Getting Started page for Dataset (#24860)
This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.

What are the changes:
- Focus on basic/core features that are bread-and-butter for users, leave the advanced features out
- Focus on high level introduction, leave the detailed spec out (e.g. what are possible batch_types for map_batches() API)
- Use more realistic (yet still simple) data example that's familiar to people (IRIS dataset in this case)
- Use the same data example throughout to make it context-switch free
- Use runnable code rather than faked
- Reference to the code from doc, instead of inlining them in the doc

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-18 13:46:23 -07:00
Clark Zinzow
4444150c29
[Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)
This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. In order to keep the page from getting too long and keeping it easy to navigate, tabbed views are used heavily.
2022-05-17 16:23:42 -07:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00