Commit graph

160 commits

Each entry below lists: Author · SHA1 · Message · Date
Jian Xiao
9fe4dba4ad
Revamp the Getting Started page for Dataset (#24860)
This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.

What are the changes:
- Focus on the basic/core features that are bread and butter for users; leave the advanced features out
- Focus on a high-level introduction; leave the detailed spec out (e.g. the possible batch types for the map_batches() API)
- Use a more realistic (yet still simple) data example that's familiar to people (the IRIS dataset in this case)
- Use the same data example throughout to make it context-switch free
- Use runnable code rather than fake code (see the sketch below)
- Reference the code from the doc instead of inlining it in the doc
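A hedged sketch of the kind of runnable example the revamped page uses; the file path and column name here are hypothetical stand-ins, not the doc's actual snippet:

```python
import pandas as pd
import ray

# Hypothetical local copy of the IRIS dataset; the doc references its own hosted copy.
ds = ray.data.read_csv("iris.csv")

# Filter each pandas batch; batch_format="pandas" keeps batches as DataFrames.
def keep_long_sepals(batch: pd.DataFrame) -> pd.DataFrame:
    return batch[batch["sepal.length"] > 5.0]

ds = ds.map_batches(keep_long_sepals, batch_format="pandas")
```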

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-18 13:46:23 -07:00
Clark Zinzow
26ea82d3a6
[Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346) 2022-05-17 20:57:42 -07:00
Clark Zinzow
4444150c29
[Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)
This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. To keep the page from getting too long and to keep it easy to navigate, tabbed views are used heavily.
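A brief sketch of a few of the creation APIs the guide covers; the CSV path is a hypothetical placeholder:

```python
import pandas as pd
import ray

ds1 = ray.data.range(1000)                                  # synthetic integer dataset
ds2 = ray.data.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))  # from an in-memory DataFrame
ds3 = ray.data.read_csv("path/to/data.csv")                 # from files on local or cloud storage
```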
2022-05-17 16:23:42 -07:00
Clark Zinzow
ea635aecd2
[Datasets] Support tensor columns in to_tf and to_torch. (#24752)
This PR adds support for tensor columns in the to_tf() and to_torch() APIs.

For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.

For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.

In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
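A minimal sketch of consuming a tensor column via to_torch(), assuming the TensorArray extension type is importable from ray.data.extensions; column names are illustrative:

```python
import numpy as np
import pandas as pd
import ray
from ray.data.extensions import TensorArray

# A DataFrame with a tensor feature column and a scalar label column.
df = pd.DataFrame({
    "image": TensorArray(np.zeros((4, 2, 2), dtype=np.float32)),
    "label": [0, 1, 0, 1],
})
ds = ray.data.from_pandas(df)

# Iterating yields (features, labels) Torch tensor batches.
for features, labels in ds.to_torch(label_column="label", batch_size=2):
    pass
```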
2022-05-17 01:11:00 -07:00
Clark Zinzow
ef870e936c
[Datasets] Change range_arrow() API to range_table() (#24704)
This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.
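A quick before/after sketch of the rename:

```python
import ray

# Previously: ds = ray.data.range_arrow(1000)
ds = ray.data.range_table(1000)
ds.take(1)  # records like {"value": 0}; the Arrow backing is an implementation detail
```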
2022-05-17 01:09:45 -07:00
Chen Shen
cc21979998
Revert "[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354)" (#24785)
This reverts commit e2ee2140f9.
2022-05-13 11:18:30 -07:00
Chen Shen
9b1154dce4
fix inter (#24761) 2022-05-13 08:18:22 -07:00
Patrick Ames
e2ee2140f9
[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354)
API doc updates for #23179 and #24094. All data docs related to #23179 should be up-to-date once this PR and #24203 are merged.
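A hedged sketch of the bulk read path, assuming it is exposed as ray.data.read_parquet_bulk() taking a list of file paths; the paths here are placeholders:

```python
import ray

paths = ["s3://bucket/part-000.parquet", "s3://bucket/part-001.parquet"]
# Skips the upfront per-file metadata resolution done by ray.data.read_parquet().
ds = ray.data.read_parquet_bulk(paths)
```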
2022-05-12 10:19:33 -07:00
Zhe Zhang
909d463552
[docs] Fix import error in Ray Data "getting started" (#24424)
We did `import pandas as pd`, but then referred to it as `pandas`.
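The gist of the fix, as a minimal sketch (not the exact doc snippet):

```python
import pandas as pd
import ray

# Before: the doc called pandas.DataFrame(...) despite importing pandas as pd,
# which raises NameError. Using the alias consistently fixes it.
ds = ray.data.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))
```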
2022-05-10 15:46:15 -07:00
Antoni Baum
04e16f70a3
[Datasets] [Docs] Add a warning about from_huggingface (#24608)
Adds a warning to docs about the intended use of from_huggingface.
2022-05-10 13:08:25 -07:00
Chen Shen
f1f8ad6ca3
[Doc][Data] fix big-data-ingestion broken links (#24631)
The links were broken; fixed them.
2022-05-10 09:04:41 -07:00
Antoni Baum
668049492c
[Datasets] Add from_huggingface for Hugging Face datasets integration (#24464)
Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.
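A minimal usage sketch, assuming a standard Hugging Face dataset load; the dataset name is illustrative:

```python
import ray
from datasets import load_dataset  # Hugging Face datasets

hf_ds = load_dataset("imdb", split="train")
# Conversion is cheap since the Hugging Face dataset is already Arrow-backed.
ray_ds = ray.data.from_huggingface(hf_ds)
```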
2022-05-06 13:09:28 -07:00
Stephanie Wang
2931a23760
[doc] Add docs for push-based shuffle in Datasets (#24486)
Adds recommendations, example, and brief benchmark results for push-based shuffle in Datasets.
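A minimal sketch, assuming push-based shuffle is toggled via the RAY_DATASET_PUSH_BASED_SHUFFLE environment variable described in the docs this PR adds:

```python
# Run with: RAY_DATASET_PUSH_BASED_SHUFFLE=1 python shuffle_example.py
import ray

ds = ray.data.range(10_000_000)
ds = ds.random_shuffle()  # uses the push-based shuffle path when the env var is set
```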
2022-05-05 14:59:33 -07:00
Balaji Veeramani
2190f7ff25
[Datasets] Add SimpleTensorFlowDatasource (#24022)
This PR makes it easier to use TensorFlow datasets with Ray Datasets.
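A hedged sketch patterned on the datasource's docstring-style usage, assuming a dataset_factory keyword and single-task parallelism:

```python
import ray
import tensorflow_datasets as tfds
from ray.data.datasource import SimpleTensorFlowDatasource

def dataset_factory():
    # Built on the read task; returns a tf.data.Dataset.
    return tfds.load("cifar10", split="train").take(8)

ds = ray.data.read_datasource(
    SimpleTensorFlowDatasource(),
    parallelism=1,
    dataset_factory=dataset_factory,
)
```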
2022-04-29 12:15:30 -07:00
Shawn
43ed78f6fd
[Datasets] Integrate Mars-on-Ray with Datasets; improve docs and add tests (#23402)
Add Mars-on-Ray + Datasets integration; improve Mars-on-Ray docs and add tests.
2022-04-29 09:43:52 -07:00
Balaji Veeramani
2fdea6e24f
[Datasets] Add SimpleTorchDatasource (#23926)
It's difficult to use torchvision datasets with Ray ML. This PR makes it easier to use Torch datasets with Ray Data.
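A hedged sketch of the analogous Torch usage, assuming the same dataset_factory pattern:

```python
import ray
import torchvision
from ray.data.datasource import SimpleTorchDatasource

def dataset_factory():
    # Built lazily on the read task.
    return torchvision.datasets.MNIST("./data", train=True, download=True)

ds = ray.data.read_datasource(
    SimpleTorchDatasource(), parallelism=1, dataset_factory=dataset_factory
)
```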
2022-04-28 11:56:45 -07:00
matthewdeng
cc08c01ade
[ml] add more preprocessors (#23904)
Adding some more common preprocessors (a usage sketch follows the list):
* MaxAbsScaler
* RobustScaler
* PowerTransformer
* Normalizer
* FeatureHasher
* Tokenizer
* HashingVectorizer
* CountVectorizer
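A usage sketch for one of these, assuming the preprocessors live under ray.ml.preprocessors as in the linked API docs of this era (they later moved under Ray AIR):

```python
import pandas as pd
import ray
from ray.ml.preprocessors import MaxAbsScaler  # import path assumed for this era

ds = ray.data.from_pandas(pd.DataFrame({"value": [-2.0, 1.0, 4.0]}))
scaler = MaxAbsScaler(columns=["value"])
scaled = scaler.fit_transform(ds)  # divides each value by the column's max absolute value
```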

API docs: https://ray--23904.org.readthedocs.build/en/23904/ray-air/getting-started.html

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-04-25 21:12:59 +01:00
Jian Xiao
57f620bd05
[Datasets] Add missing public APIs to Datasets API docs (#23935) 2022-04-16 11:57:38 -07:00
Clark Zinzow
983ef1f2a7
[Datasets] Make from_numpy() more user-friendly. (#23871)
`ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).
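A sketch of the new split, assuming both entry points accept lists:

```python
import numpy as np
import ray

arr = np.arange(12).reshape(3, 4)
ds = ray.data.from_numpy([arr])                     # concrete ndarrays (new behavior)
ds_refs = ray.data.from_numpy_refs([ray.put(arr)])  # object refs (the old from_numpy)
```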
2022-04-12 18:37:59 -07:00
Jian Xiao
6d93e9f0f5
Cleanup the DatasetPipeline references in Getting Started; rename Exchanging to Accessing (#23786) 2022-04-12 17:10:14 -07:00
Eric Liang
858d607b19
[data] Fix small doc issues (#23813) 2022-04-09 12:09:08 -07:00
Jian Xiao
f737731a5e
Remove dataset pipeline from the Getting Started page (#23756)
1. Dataset pipeline is an advanced usage of Ray Dataset, which should not be jammed into the Getting Started page
2. We already have a separate/dedicated page called Pipelining Compute to cover the same content
2022-04-07 12:52:04 -07:00
Philipp Moritz
886cc4d674
Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
Jian Xiao
0b1a2a44c0
[Dataset GA doc] Decompose the monolith of Getting Started page (and get them under User Guide) (#23311)
Improve the Dataset documentation for GA.
2022-03-18 11:25:43 -07:00
Archit Kulkarni
76bb5396c7
[Doc] [jobs] Add links to Job Submission and improve doc (#23209)
- Adds links to Job Submission from existing library tutorials where `ray submit` is used.  When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 12:52:13 -05:00
Eric Liang
08dc31e747
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
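A minimal sketch of the API as described above; get_async() is assumed to return a future that ray.get() resolves:

```python
import ray

ds = ray.data.range_table(1000)  # records like {"value": i}
rad = ds.to_random_access_dataset("value", num_workers=4)

record = ray.get(rad.get_async(42))  # single-record lookup via binary search
records = rad.multiget([7, 11, 13])  # batched lookups
```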
2022-03-17 15:01:12 -07:00
Jian Xiao
8c9e3f6c2e
Move the third-party data integrations (non-Dataset stuff) out of the user guides, which are for Dataset (#23162)
Improve documentation of Ray Dataset.
2022-03-17 11:27:40 -07:00
Eric Liang
678d23fe42
Remove beta label from Datasets (#23220) 2022-03-15 23:05:59 -07:00
Jian Xiao
10435d2d8f
Update dask version for Ray 1.12.0 (#23197) 2022-03-15 19:22:19 -07:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823) 2022-03-10 11:39:44 -08:00
Eric Liang
52491c87e2
Make a pass fixing Dataset API issues (#22886) 2022-03-08 13:07:55 -08:00
Eric Liang
5a0b7a7ee0
Document Dataset pipeline stage fusion (#22737) 2022-03-01 14:38:09 -08:00
Eric Liang
e228544d39
Undo revert of windowing dataset by bytes (#22735) 2022-03-01 12:24:04 -08:00
SangBin Cho
ba4f1423c7
Revert "Support creating a DatasetPipeline windowed by bytes (#22577)" (#22695)
This reverts commit b5b4460932.
2022-02-28 11:56:12 -08:00
Eric Liang
b5b4460932
Support creating a DatasetPipeline windowed by bytes (#22577) 2022-02-25 23:31:10 -08:00
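A hedged sketch of the bytes-based windowing added in #22577, assuming a bytes_per_window keyword on Dataset.window():

```python
import ray

ds = ray.data.range(100_000)
# Window the pipeline by approximate in-memory size rather than block count.
pipe = ds.window(bytes_per_window=512 * 1024 * 1024)
```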
Eric Liang
533a0440a6
Improve actor pool support in Datasets (#22574) 2022-02-24 12:01:36 -08:00
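A hedged sketch of running a callable-class UDF on an autoscaling actor pool (#22574), assuming ActorPoolStrategy(min, max) is exposed under ray.data:

```python
import ray
from ray.data import ActorPoolStrategy

class AddOne:
    def __call__(self, batch):
        return [x + 1 for x in batch]

ds = ray.data.range(1000)
# The UDF runs in a pool of 2-8 actors instead of stateless tasks.
ds = ds.map_batches(AddOne, compute=ActorPoolStrategy(2, 8))
```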
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Clark Zinzow
53c4c7b1be
[Datasets] Expose TableRow as public API; minimize copies/type conversions on row-based ops. (#22305)
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
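A small sketch of the Mapping-style access this enables (repr details elided):

```python
import ray

ds = ray.data.range_table(8)  # Arrow-backed, so rows come back as TableRows
row = ds.take(1)[0]

# Mapping interface: __getitem__, __iter__, __len__ plus the usual mixins.
print(row["value"], len(row), list(row.items()))
```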
2022-02-14 12:56:17 -08:00
Balaji Veeramani
31ed9e5d02
[CI] Replace YAPF disables with Black disables (#21982) 2022-02-08 16:29:25 -08:00
Clark Zinzow
fb0d6e6b0b
[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067) 2022-02-05 16:59:34 -08:00
Clark Zinzow
743ce65da8
[Dask-on-Ray] Add support for Dask annotations. (#22057) 2022-02-03 22:15:38 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Clark Zinzow
09fab70991
[Datasets] [Docs] Fix bug in Datasets locality-aware splitting example (#21937)
Fixes bug in Datasets locality-aware splitting example.
2022-01-27 14:46:04 -08:00
mwtian
559eefd06f
[Doc] update dask version for Ray 1.11.0 (#21933)
This is needed for release 1.11.0.
2022-01-27 13:15:01 -08:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention of `Ray Data` whenever the project is referred to.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00
Max Pumperla
b34099e764
[docs] landing page (fixes #21750) (#21859)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-01-26 17:14:25 -08:00
Clark Zinzow
411bb308dc
[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889) 2022-01-26 12:05:27 -08:00
Max Pumperla
f9b71a8bf6
[docs] new structure (#21776)
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way:

- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
2022-01-21 15:42:05 -08:00
xwjiang2010
9af8f11191
Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)
This reverts commit 38e46c9fb3.
2022-01-20 15:30:56 -08:00