95 commits

Author SHA1 Message Date
Balaji Veeramani
2fdea6e24f
[Datasets] Add SimpleTorchDatasource (#23926)
It's difficult to use torchvision datasets with Ray ML. This PR makes it easier to use Torch datasets with Ray Data.
2022-04-28 11:56:45 -07:00
matthewdeng
cc08c01ade
[ml] add more preprocessors (#23904)
Adding some more common preprocessors:
* MaxAbsScaler
* RobustScaler
* PowerTransformer
* Normalizer
* FeatureHasher
* Tokenizer
* HashingVectorizer
* CountVectorizer

API docs: https://ray--23904.org.readthedocs.build/en/23904/ray-air/getting-started.html

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-04-25 21:12:59 +01:00
Jian Xiao
57f620bd05
[Datasets] Add missing public APIs to Datasets API docs (#23935) 2022-04-16 11:57:38 -07:00
Clark Zinzow
983ef1f2a7
[Datasets] Make from_numpy() more user-friendly. (#23871)
`ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).
2022-04-12 18:37:59 -07:00
Jian Xiao
6d93e9f0f5
Cleanup the DatasetPipeline references in Getting Started; rename Exchanging to Accessing (#23786) 2022-04-12 17:10:14 -07:00
Eric Liang
858d607b19
[data] Fix small doc issues (#23813) 2022-04-09 12:09:08 -07:00
Jian Xiao
f737731a5e
Remove dataset pipeline from the Getting Started page (#23756)
1. Dataset pipeline is an advanced usage of Ray Dataset, which should not be jammed into the Getting Started page
2. We already have a separate/dedicated page called Pipelining Compute to cover the same content
2022-04-07 12:52:04 -07:00
Philipp Moritz
886cc4d674
Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
Jian Xiao
0b1a2a44c0
[Dataset GA doc] Decompose the monolith of Getting Started page (and get them under User Guide) (#23311)
Improve the Dataset documentation for GA.
2022-03-18 11:25:43 -07:00
Archit Kulkarni
76bb5396c7
[Doc] [jobs] Add links to Job Submission and improve doc (#23209)
- Adds links to Job Submission from existing library tutorials where `ray submit` is used.  When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 12:52:13 -05:00
Eric Liang
08dc31e747
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
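The lookup scheme described in #22749 above (partition by sort key, then binary search within a partition) can be illustrated with a small single-process sketch. This is not Ray's implementation: the class `SortedPartitionIndex` and its methods are hypothetical names chosen for illustration, and plain Python lists stand in for worker actors and data blocks.

```python
import bisect


class SortedPartitionIndex:
    """Toy stand-in for a dataset partitioned by a sort key (hypothetical)."""

    def __init__(self, records, key, num_partitions):
        rows = sorted(records, key=lambda r: r[key])
        # Ceiling division so every row lands in some partition.
        size = max(1, -(-len(rows) // num_partitions))
        self.partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
        # Smallest key of each partition, used to route a lookup.
        self.lower_bounds = [p[0][key] for p in self.partitions]
        # Sorted keys per partition, used for the in-partition search.
        self.partition_keys = [[r[key] for r in p] for p in self.partitions]

    def get(self, value):
        # Step 1: binary search over partition boundaries.
        idx = bisect.bisect_right(self.lower_bounds, value) - 1
        if idx < 0:
            return None
        # Step 2: binary search inside the chosen partition.
        keys = self.partition_keys[idx]
        pos = bisect.bisect_left(keys, value)
        if pos < len(keys) and keys[pos] == value:
            return self.partitions[idx][pos]
        return None

    def multiget(self, values):
        # Batched lookup; here it is simply a loop over get().
        return [self.get(v) for v in values]


index = SortedPartitionIndex(
    [{"id": i, "square": i * i} for i in range(10)], key="id", num_partitions=3
)
found = index.get(5)
missing = index.get(42)
batch = index.multiget([0, 9])
```

In this sketch each partition would correspond to the data served by one worker, so adding partitions spreads lookups across more workers.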
Jian Xiao
8c9e3f6c2e
Move the third-party data integrations (non-Dataset stuff) out of the user guides which is for Dataset (#23162)
Improve documentation of Ray Dataset.
2022-03-17 11:27:40 -07:00
Eric Liang
678d23fe42
Remove beta label from Datasets (#23220) 2022-03-15 23:05:59 -07:00
Jian Xiao
10435d2d8f
Update dask version for Ray 1.12.0 (#23197) 2022-03-15 19:22:19 -07:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823) 2022-03-10 11:39:44 -08:00
Eric Liang
52491c87e2
Make a pass fixing Dataset API issues (#22886) 2022-03-08 13:07:55 -08:00
Eric Liang
5a0b7a7ee0
Document Dataset pipeline stage fusion (#22737) 2022-03-01 14:38:09 -08:00
Eric Liang
e228544d39
Undo revert of windowing dataset by bytes (#22735) 2022-03-01 12:24:04 -08:00
SangBin Cho
ba4f1423c7
Revert "Support creating a DatasetPipeline windowed by bytes (#22577)" (#22695)
This reverts commit b5b4460932.
2022-02-28 11:56:12 -08:00
Eric Liang
b5b4460932
Support creating a DatasetPipeline windowed by bytes (#22577) 2022-02-25 23:31:10 -08:00
Eric Liang
533a0440a6
Improve actor pool support in Datasets (#22574) 2022-02-24 12:01:36 -08:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Clark Zinzow
53c4c7b1be
[Datasets] Expose TableRow as public API; minimize copies/type conversions on row-based ops. (#22305)
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
2022-02-14 12:56:17 -08:00
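As a rough illustration of point 2 in #22305 above, the sketch below shows how deriving from `collections.abc.Mapping` supplies the remaining mapping methods once `__getitem__`, `__iter__`, and `__len__` are defined. The `Row` class and its column-oriented storage are hypothetical and are not the actual `TableRow` code.

```python
from collections.abc import Mapping


class Row(Mapping):
    """Hypothetical read-only view of one row in column-oriented data."""

    def __init__(self, columns, index):
        self._columns = columns  # dict of column name -> list of values
        self._index = index      # position of this row in every column

    def __getitem__(self, name):
        return self._columns[name][self._index]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)


columns = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
row = Row(columns, 1)

# keys(), items(), values(), get(), and `in` all come from the Mapping mixins.
as_dict = dict(row)
has_id = "id" in row
fallback = row.get("missing", 0)
```

Because `Row` is a `Mapping`, it can be passed to code that expects a dict-like object without copying the underlying columns.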
Balaji Veeramani
31ed9e5d02
[CI] Replace YAPF disables with Black disables (#21982) 2022-02-08 16:29:25 -08:00
Clark Zinzow
fb0d6e6b0b
[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067) 2022-02-05 16:59:34 -08:00
Clark Zinzow
743ce65da8
[Dask-on-Ray] Add support for Dask annotations. (#22057) 2022-02-03 22:15:38 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Clark Zinzow
09fab70991
[Datasets] [Docs] Fix bug in Datasets locality-aware splitting example (#21937)
Fixes bug in Datasets locality-aware splitting example.
2022-01-27 14:46:04 -08:00
mwtian
559eefd06f
[Doc] update dask version for Ray 1.11.0 (#21933)
This is needed for release 1.11.0.
2022-01-27 13:15:01 -08:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention: `Ray Data` is used whenever the project is referred to.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00
Max Pumperla
b34099e764
[docs] landing page (fixes #21750) (#21859)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-01-26 17:14:25 -08:00
Clark Zinzow
411bb308dc
[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889) 2022-01-26 12:05:27 -08:00
Max Pumperla
f9b71a8bf6
[docs] new structure (#21776)
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way:

- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
2022-01-21 15:42:05 -08:00
xwjiang2010
9af8f11191
Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)
This reverts commit 38e46c9fb3.
2022-01-20 15:30:56 -08:00
Max Pumperla
38e46c9fb3
[docs] Clean up doc structure (first part) (#21667) 2022-01-20 16:19:04 +01:00
Archit Kulkarni
7d74a9face
[doc] add Ray versions 1.9.1 - 1.10.0 to dask on ray compatibility table (#21360)
I updated this version compatibility table on the release branch but didn't update it on master. This was my mistake; the process is to make a PR to master and then cherry-pick that commit to the release branch.
2022-01-19 18:55:05 -08:00
Eric Liang
a69ae1d886
Add blogs to dataset materials (#21546) 2022-01-11 22:09:57 -08:00
Eric Liang
e9068c45fa
[data] Instrument most remaining dataset functions and add docs (#21412)
This PR finishes most of the stats todos for dataset. The main thing punted for future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-06 17:08:56 -08:00
Clark Zinzow
c3d68fa0c1
[Dask-on-Ray] Add Dask config helper, set task-based shuffle by default. (#21114)
Dask defaults to a disk-based shuffle even though we're using a distributed scheduler, which appears to be resulting in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well.

This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler along with changing the default shuffle to a task-based shuffle. The shuffle method can still be overridden by the user by manually specifying `df.set_index(shuffle="disk")`.
2021-12-17 13:16:37 -08:00
Eric Liang
22ccc6b300
Initial stats framework for datasets (#20867)
This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets.

Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.
2021-12-08 16:13:57 -08:00
Clark Zinzow
b872fdaaac
[Datasets] Last-mile preprocessing docs. (#20712)
Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.
2021-11-29 23:23:27 -08:00
Yi Cheng
e24cee80e8
[docs] add dask compatibility for 1.9.0 (#20707) 2021-11-24 15:00:17 -08:00
Eric Liang
163620ba94
[data] Make block splitting feature flagged off by default (#20660)
This PR feature-flags block splitting and makes it off by default. This makes it easier to debug problems potentially related to this feature. Criteria for enabling by default:
- We're confident all nightly tests pass (currently, there may be an issue with large-scale groupby with block splitting).
- We're confident lineage-based reconstruction can work with block splitting.
2021-11-23 19:46:18 -08:00
Eric Liang
65a8698e82
Raise the dataset block size limit to 2GiB (#20551)
The default block size of 500MiB seems too low for some common workloads, e.g. shuffling 500GB. This creates 1000 blocks which means 1 million intermediate shuffle objects until we implement #20500.
2021-11-18 19:36:10 -08:00
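The arithmetic behind #20551 above can be checked with a few lines. This sketch assumes that "500GB" means 500 GiB and that an all-to-all shuffle produces one intermediate object per (input block, output block) pair, which is how the message gets from 1000 blocks to 1 million objects. The helper `shuffle_objects` is a hypothetical name.

```python
MiB = 2**20
GiB = 2**30


def shuffle_objects(dataset_bytes, block_bytes):
    """Return (number of blocks, intermediate objects) for an all-to-all shuffle."""
    num_blocks = -(-dataset_bytes // block_bytes)  # ceiling division
    # Assumption: one intermediate object per (input block, output block) pair.
    return num_blocks, num_blocks**2


with_500mib_blocks = shuffle_objects(500 * GiB, 500 * MiB)
with_2gib_blocks = shuffle_objects(500 * GiB, 2 * GiB)

print(with_500mib_blocks)  # (1024, 1048576)
print(with_2gib_blocks)    # (250, 62500)
```

Under these assumptions, raising the limit from 500MiB to 2GiB cuts the block count by about 4x and the intermediate object count by about 16x.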
Amog Kamsetty
9796ae56d5
[Train][Data] Change usages of iter_datasets to iter_epochs (#20487) 2021-11-17 18:05:51 -08:00
Richard Liaw
cf357f6bce
[docs] Add a talks section for ray.data (#20444) 2021-11-16 14:30:08 -08:00
Eric Liang
460cf86858
Split blocks automatically into 500MB chunks on file read and transformation (#20235)
This PR adds support for automatic block splitting on read and map transforms, to keep block size bounded to ~500MiB. This avoids potential OOM situations where a map task may consume too much intermediate Python heap memory, or too much object store shared memory for one block.
2021-11-15 22:25:11 -08:00
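The idea of keeping block size bounded, as described in #20235 above, can be sketched with a greedy splitter over per-row size estimates. This is only an illustration under assumed inputs, not the logic Ray uses: `split_into_blocks` is a hypothetical helper, and the row sizes are made-up numbers.

```python
def split_into_blocks(row_sizes, target_bytes):
    """Group consecutive row indices into blocks of at most target_bytes.

    A single row larger than target_bytes is placed in a block of its own.
    """
    blocks, current, current_bytes = [], [], 0
    for i, size in enumerate(row_sizes):
        # Close the current block if adding this row would exceed the target.
        if current and current_bytes + size > target_bytes:
            blocks.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        blocks.append(current)
    return blocks


# Illustrative sizes in MiB with a 500 MiB target.
blocks = split_into_blocks([200, 200, 200, 600, 100], target_bytes=500)
```

Here the 600 MiB row cannot be split further by this simple scheme, so it ends up alone in an oversized block.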
Eric Liang
6102912494
Dataset doc updates (#19815) 2021-11-04 18:13:40 -07:00
Philipp Moritz
0a5942d8b0
[Documentation] Fix quotes for windows installations (#19859)
* [Documentation] Fix quotes for windows installations

* update

* formatting
2021-10-29 10:54:38 -07:00