Commit graph

61 commits

Author SHA1 Message Date
xwjiang2010
9af8f11191
Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)
This reverts commit 38e46c9fb3.
2022-01-20 15:30:56 -08:00
Max Pumperla
38e46c9fb3
[docs] Clean up doc structure (first part) (#21667) 2022-01-20 16:19:04 +01:00
Archit Kulkarni
7d74a9face
[doc] add Ray versions 1.9.1 - 1.10.0 to dask on ray compatibility table (#21360)
I updated this version compatibility table on the release branch but didn't update it on master.  This is my mistake, the process is to make a PR to master and then cherry pick that commit to the release branch.
2022-01-19 18:55:05 -08:00
Eric Liang
a69ae1d886
Add blogs to dataset materials (#21546) 2022-01-11 22:09:57 -08:00
Eric Liang
e9068c45fa
[data] Instrument most remaining dataset functions and add docs (#21412)
This PR finishes most of the stats todos for dataset. The main thing punted for future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-06 17:08:56 -08:00
Clark Zinzow
c3d68fa0c1
[Dask-on-Ray] Add Dask config helper, set task-based shuffle by default. (#21114)
Dask default's to a disk-based shuffle even thought we're using a distributed scheduler, which appears to be resulting in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well.

This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler along with changing the default shuffle to a task-based shuffle. The shuffle method can still be overridden by the user by manually specifying `df.set_index(shuffle="disk")`.
2021-12-17 13:16:37 -08:00
Eric Liang
22ccc6b300
Initial stats framework for datasets (#20867)
This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets.

Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.
2021-12-08 16:13:57 -08:00
Clark Zinzow
b872fdaaac
[Datasets] Last-mile preprocessing docs. (#20712)
Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.
2021-11-29 23:23:27 -08:00
Yi Cheng
e24cee80e8
[docs] add dask compatibility for 1.9.0 (#20707) 2021-11-24 15:00:17 -08:00
Eric Liang
163620ba94
[data] Make block splitting feature flagged off by default (#20660)
block splitting and makes it off by default. This makes it easier to debug problems potentially related to this feature. Criteria for enabling by default:
- We're confident all nightly tests pass (currently, there may be an issue with large-scale groupby with block splitting).
- We're confident lineage-based reconstruction can work with block splitting.
2021-11-23 19:46:18 -08:00
Eric Liang
65a8698e82
Raise the dataset block size limit to 2GiB (#20551)
The default block size of 500MiB seems too low for some common workloads, e.g. shuffling 500GB. This creates 1000 blocks which means 1 million intermediate shuffle objects until we implement #20500.
2021-11-18 19:36:10 -08:00
Amog Kamsetty
9796ae56d5
[Train][Data] Change usages of iter_datasets to iter_epochs (#20487) 2021-11-17 18:05:51 -08:00
Richard Liaw
cf357f6bce
[docs] Add a talks section for ray.data (#20444) 2021-11-16 14:30:08 -08:00
Eric Liang
460cf86858
Split blocks automatically into 500MB chunks on file read and transformation (#20235)
This PR adds support for automatic block splitting on read and map transforms, to keep block size bounded to ~500MiB. This avoids potential OOM situations where a map task may consume too much intermediate Python heap memory, or too much object store shared memory for one block.
2021-11-15 22:25:11 -08:00
Eric Liang
6102912494
Dataset doc updates (#19815) 2021-11-04 18:13:40 -07:00
Philipp Moritz
0a5942d8b0
[Documentation] Fix quotes for windows installations (#19859)
* [Documentation] Fix quotes for windows installations

* update

* formatting
2021-10-29 10:54:38 -07:00
Yi Cheng
68ec652be7
[gcs] New option to increase gcs grpc client threads and fix issues in hybrid scheduling (#19663)
## Why are these changes needed?

- Since broadcasting is moving to grpc, introducing the option to increase the client side thread number
- For hybrid schedule, ignore the threshold if gcs based actor scheduler is enabled

With these fixing, actor creation rate > 600actor/s vs ~ 140 actor/s

## Related issue number
2021-10-28 22:40:18 -07:00
Eric Liang
27a5b546ad
Make ArrowRow less scary (#19686) 2021-10-25 12:18:42 -07:00
Eric Liang
875d19f838
[data] Fix inconsistent naming of to_refs() methods, remove to_arrow() (#19620) 2021-10-23 12:20:23 -07:00
matthewdeng
b3b739266e
[docs] add dask compatibility for 1.8.0 (#19578) 2021-10-21 07:26:07 -07:00
Jiajun Yao
4fc5b11c68
Simple block dataset groupBy (#19435) 2021-10-19 19:53:13 -07:00
matthewdeng
4674c78050
[Train] Rename Ray SGD v2 to Ray Train (#19436) 2021-10-18 22:27:46 -07:00
Eric Liang
13d4ad6100
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00
Eric Liang
430a5f4a21
[doc] Bump dataset to beta for 1.8 and add backlink to SGD (#19332) 2021-10-12 18:32:29 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration (#17626)
* wip

* wip

* wip

* draft

* disable tf autosharding

* wip

* wip

* wip

* wip

* add example

* wip

* wip

* wip

* use dataset.split

* add unit tests

* add linear example

* concatenate tensors and fix example

* WIP tune example

* add tensorflow example

* wip

* random_shuffle_each_window

* fault tolerance test

* GPU, examples, CI

* formatting

* fix

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* wip

* type hints

* wip

* update user guide

* fix

* fix immediate issues

* update example

* update

* fix tune gpu test

* fix resources for smoke test - 1 CPU for dataset tasks

* update tests, docs, examples

* Apply suggestions from code review

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* address comments

* add warning

* fix tests

* minor doc updates

* update example in doc

* configure tests

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* Update python/ray/data/dataset.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docstring

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Eric Liang
0ab6749602
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00
Chen Shen
c740aae54c
[Core][Dataset] adding example for large scale data ingestion (#18998) 2021-10-11 15:37:09 -07:00
Eric Liang
86cbe3e833
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00
Jiajun Yao
7ccf737f97
Add compatible dask version for ray 1.6.0 and 1.7.0 (#19080) 2021-10-05 10:23:06 +09:00
Eric Liang
032a420ee6
Rename Dataset.pipeline to Dataset.window (#19050) 2021-10-01 19:55:29 -07:00
Clark Zinzow
d22f838795
[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. (#18992) 2021-10-01 13:08:25 -07:00
Alex Wu
5709c6501b
[dataset][usability] Dataset dependencies (#18346) 2021-09-29 17:29:31 -07:00
Eric Liang
caf34a452c
Unify ArrowTensorType tables and Tensor blocks (#18867) 2021-09-27 16:24:09 -07:00
Eric Liang
4d2065352b
Increase dataset read parallelism by default (#18420) 2021-09-09 15:07:49 -07:00
Clark Zinzow
b30c41759d
[Datasets] Adds tensor column support (tensors-in-tables) via Pandas/Arrow extension types/arrays. (#18301) 2021-09-08 10:09:01 -07:00
Eric Liang
cbdafa0b63
[doc] Fix various workflow doc bugs (#18357) 2021-09-06 01:39:08 -07:00
Eric Liang
7dcae690b9
Mark datasets as still in alpha for now (#18321) 2021-09-02 17:07:33 -07:00
Wesley Gifford
6133a561e9
Dataset from modin (#18122) 2021-08-31 11:19:35 -07:00
Eric Liang
95b5ad12ba
Initial version of workflow documentation (#18138) 2021-08-27 16:20:48 -07:00
Clark Zinzow
c0598de82a
[Datasets] Port write APIs to use file-based datasources. (#18135) 2021-08-27 15:24:54 -07:00
Clark Zinzow
aee7ba2510
[Datasets] Add from_numpy() and to_numpy() APIs (#18146) 2021-08-27 13:33:11 -07:00
Eric Liang
e1f69ceb5e
Add documentation for DatasetPipeline.from_iterable (#18106) 2021-08-25 22:31:23 -07:00
Eric Liang
71b3183038
Add implicit init note to Ray docs & dataset version note (#17751) 2021-08-11 13:13:22 -07:00
Eric Liang
d4f9d3620e
Move ray.data out of experimental (#17560) 2021-08-04 13:31:10 -07:00
Chris K. W
a33cbec12a
[client][docs] update docs for new client support in init (#17333)
* start

* check formatting

* undo changes from base branch

* Client builder API docs

* indent

* 8

* minor fixes

* absolute path to runtime env docs

* fix runtime_env link

* Update worker.init docs

* drop clientbuilder docs, link to 1.4.1 docs instead. Specify local:// behavior when address passed

* add debug info for ray.init("local")

* local:// attaches a driver directly

* update ray.init return wording

* remote init.connect() from example

* drop local:// docs, add section on when to use ray client

* link to 1.4.1 docs in code example instead of mentioning clientbuilder

* fix backticks, doc mentions of ray.util.connect

* remove ray.util.connect mentions from examples and comments

* update tune example

* wording

* localhost:<port> also works if you're on the head node

* add quotes

* drop mentions of ray client from ray.init docstring

* local->remote

* fix section ref

* update ray start output

* fix section link

* try to fix doc again

* fix link wording

* drop local:// from docs and special handling from code

* update ray start message

* lint

* doc lint

* remove local:// codepath

* remove 'internal_config'

* Update doc/source/cluster/ray-client.rst

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* doc suggestion

* Update doc/source/cluster/ray-client.rst

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-08-04 05:31:44 +03:00
Eric Liang
748cbbb23d
[hotfix] Parquet S3 reads broken due to pyarrow.lib.ArrowInvalid: S3 subsystem not initialized (#17492) 2021-08-02 11:48:48 -07:00
Eric Liang
e812691909
Support top-level tensor values in dataset (#17439) 2021-08-01 22:45:21 -07:00
Eric Liang
cd13059691
[dataset] Implement random_shuffle() and split(equal=True) (#17448) 2021-07-30 09:51:21 -07:00
Eric Liang
7ed62ea0ad
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00
Jiao
9b6be6f1c8
update dask compatibility for 1.5.0 (#17302)
* update dask compatibility for 1.5.0

* change to right file

* add pip install pytest

Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-07-23 17:31:42 -07:00