I updated this version compatibility table on the release branch but didn't update it on master. This was my mistake; the correct process is to make a PR to master and then cherry-pick that commit to the release branch.
This PR finishes most of the stats TODOs for Datasets. The main thing deferred to future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Dask defaults to a disk-based shuffle even though we're using a distributed scheduler, which appears to result in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well.
This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler and changes the default shuffle to the task-based shuffle. The user can still override the shuffle method per-call, e.g. via `df.set_index(shuffle="disk")`.
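A rough usage sketch (the global, non-context-manager call form shown here is an assumption based on this description):

```python
import ray
import dask.dataframe as dd
import pandas as pd

from ray.util.dask import enable_dask_on_ray

ray.init()

# Set Dask-on-Ray as the default scheduler and switch the default
# shuffle from the disk-based to the task-based algorithm.
enable_dask_on_ray()

df = dd.from_pandas(
    pd.DataFrame({"a": range(100), "b": range(100)}), npartitions=4
)

# set_index triggers a shuffle; it now uses the task-based shuffle by
# default, while shuffle="disk" would override it for this call.
print(df.set_index("a").compute())
```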
This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets.
Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.
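A minimal sketch of the intended usage (exact output format aside):

```python
import ray

# range() and map() produce instrumented blocks; execution stats such
# as CPU time are attached to each block's metadata.
ds = ray.data.range(1000).map(lambda x: x * 2)
ds.take(5)

# stats() summarizes those per-block stats for this dataset and its
# parents (held by reference, so parent datasets can still be GC'd).
print(ds.stats())
```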
This PR adds Datasets docs for last-mile preprocessing, particularly geared toward ML ingest. It adds groupby, aggregation, and random-shuffle examples to the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides preprocessing recipes for a few common transformations.
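For example, a recipe in the spirit of those docs (the snippets on the actual overview page may differ):

```python
import ray

ds = ray.data.range(100)

# Group-by plus aggregation: mean of each residue class mod 3.
print(ds.groupby(lambda x: x % 3).mean().take())

# Global random shuffle, e.g. re-shuffling between epochs of ML ingest.
shuffled = ds.random_shuffle()
```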
This PR gates block splitting behind a feature flag and makes it off by default, which makes it easier to debug problems potentially related to this feature (see the opt-in sketch after the list below). Criteria for enabling it by default:
- We're confident all nightly tests pass (currently, there may be an issue with large-scale groupby with block splitting).
- We're confident lineage-based reconstruction can work with block splitting.
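A minimal opt-in sketch while the flag is off by default (the `block_splitting_enabled` attribute name on `DatasetContext` is an assumption; check your Ray version for the exact field):

```python
import ray
from ray.data.context import DatasetContext

# Opt into block splitting for this session; the flag name is assumed.
ctx = DatasetContext.get_current()
ctx.block_splitting_enabled = True

# Subsequent reads and map transforms may now split oversized blocks.
ds = ray.data.range(1_000_000)
```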
The default block size of 500MiB seems too low for some common workloads, e.g. shuffling 500GB. That creates 1000 blocks, which means ~1 million (1000 × 1000) intermediate shuffle objects until we implement #20500.
This PR adds support for automatic block splitting on read and map transforms, to keep block size bounded to ~500MiB. This avoids potential OOM situations where a map task consumes too much intermediate Python heap memory or too much object store shared memory for a single block.
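A hedged sketch of tuning that bound (the `target_max_block_size` attribute name is an assumption, and the S3 path is a placeholder):

```python
import ray
from ray.data.context import DatasetContext

# Raise the target block size from the ~500MiB default so a large
# shuffle produces fewer blocks and fewer intermediate objects.
ctx = DatasetContext.get_current()
ctx.target_max_block_size = 2 * 1024**3  # 2GiB; attribute name assumed

# Reads and map transforms split their outputs to stay near the target.
ds = ray.data.read_parquet("s3://example-bucket/path")  # placeholder path
ds = ds.map_batches(lambda batch: batch)
```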
## Why are these changes needed?
- Since broadcasting is moving to gRPC, introduce an option to increase the number of client-side threads.
- For hybrid scheduling, ignore the threshold if the GCS-based actor scheduler is enabled.
With these fixes, the actor creation rate goes from ~140 actors/s to >600 actors/s.
## Related issue number
* start
* check formatting
* undo changes from base branch
* Client builder API docs
* indent
* 8
* minor fixes
* absolute path to runtime env docs
* fix runtime_env link
* Update worker.init docs
* drop clientbuilder docs, link to 1.4.1 docs instead. Specify local:// behavior when address passed
* add debug info for ray.init("local")
* local:// attaches a driver directly
* update ray.init return wording
* remove init.connect() from example
* drop local:// docs, add section on when to use ray client
* link to 1.4.1 docs in code example instead of mentioning clientbuilder
* fix backticks, doc mentions of ray.util.connect
* remove ray.util.connect mentions from examples and comments
* update tune example
* wording
* localhost:<port> also works if you're on the head node
* add quotes
* drop mentions of ray client from ray.init docstring
* local->remote
* fix section ref
* update ray start output
* fix section link
* try to fix doc again
* fix link wording
* drop local:// from docs and special handling from code
* update ray start message
* lint
* doc lint
* remove local:// codepath
* remove 'internal_config'
* Update doc/source/cluster/ray-client.rst
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
* doc suggestion
* Update doc/source/cluster/ray-client.rst
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>