hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Max Pumperla	29d94a2211	[docs] sphinx gallery removal, migrate to ipynb (#22467 )	2022-02-19 01:19:07 -08:00
Clark Zinzow	53c4c7b1be	[Datasets] Expose `TableRow` as public API; minimize copies/type conversions on row-based ops. (#22305 ) This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made: 1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions. 2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.	2022-02-14 12:56:17 -08:00
Balaji Veeramani	31ed9e5d02	[CI] Replace YAPF disables with Black disables (#21982 )	2022-02-08 16:29:25 -08:00
Clark Zinzow	fb0d6e6b0b	[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067 )	2022-02-05 16:59:34 -08:00
Clark Zinzow	743ce65da8	[Dask-on-Ray] Add support for Dask annotations. (#22057 )	2022-02-03 22:15:38 -08:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
Clark Zinzow	09fab70991	[Datasets] [Docs] Fix bug in Datasets locality-aware splitting example (#21937 ) Fixes bug in Datasets locality-aware splitting example.	2022-01-27 14:46:04 -08:00
mwtian	559eefd06f	[Doc] update dask version for Ray 1.11.0 (#21933 ) This is needed for release 1.11.0.	2022-01-27 13:15:01 -08:00
Max Pumperla	4dd221f848	[Docs] Ray Data docs target state (#21931 ) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention in `Ray Data` whenever is is referred to the project. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.	2022-01-27 13:14:36 -08:00
Max Pumperla	b34099e764	[docs] landing page (fixes #21750 ) (#21859 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-01-26 17:14:25 -08:00
Clark Zinzow	411bb308dc	[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889 )	2022-01-26 12:05:27 -08:00
Max Pumperla	f9b71a8bf6	[docs] new structure (#21776 ) This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way: - [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign. - [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).	2022-01-21 15:42:05 -08:00
xwjiang2010	9af8f11191	Revert "[docs] Clean up doc structure (first part) (#21667 )" (#21763 ) This reverts commit `38e46c9fb3`.	2022-01-20 15:30:56 -08:00
Max Pumperla	38e46c9fb3	[docs] Clean up doc structure (first part) (#21667 )	2022-01-20 16:19:04 +01:00
Archit Kulkarni	7d74a9face	[doc] add Ray versions 1.9.1 - 1.10.0 to dask on ray compatibility table (#21360 ) I updated this version compatibility table on the release branch but didn't update it on master. This is my mistake, the process is to make a PR to master and then cherry pick that commit to the release branch.	2022-01-19 18:55:05 -08:00
Eric Liang	a69ae1d886	Add blogs to dataset materials (#21546 )	2022-01-11 22:09:57 -08:00
Eric Liang	e9068c45fa	[data] Instrument most remaining dataset functions and add docs (#21412 ) This PR finishes most of the stats todos for dataset. The main thing punted for future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed. Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>	2022-01-06 17:08:56 -08:00
Clark Zinzow	c3d68fa0c1	[Dask-on-Ray] Add Dask config helper, set task-based shuffle by default. (#21114 ) Dask default's to a disk-based shuffle even thought we're using a distributed scheduler, which appears to be resulting in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well. This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler along with changing the default shuffle to a task-based shuffle. The shuffle method can still be overridden by the user by manually specifying `df.set_index(shuffle="disk")`.	2021-12-17 13:16:37 -08:00
Eric Liang	22ccc6b300	Initial stats framework for datasets (#20867 ) This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets. Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.	2021-12-08 16:13:57 -08:00
Clark Zinzow	b872fdaaac	[Datasets] Last-mile preprocessing docs. (#20712 ) Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.	2021-11-29 23:23:27 -08:00
Yi Cheng	e24cee80e8	[docs] add dask compatibility for 1.9.0 (#20707 )	2021-11-24 15:00:17 -08:00
Eric Liang	163620ba94	[data] Make block splitting feature flagged off by default (#20660 ) block splitting and makes it off by default. This makes it easier to debug problems potentially related to this feature. Criteria for enabling by default: - We're confident all nightly tests pass (currently, there may be an issue with large-scale groupby with block splitting). - We're confident lineage-based reconstruction can work with block splitting.	2021-11-23 19:46:18 -08:00
Eric Liang	65a8698e82	Raise the dataset block size limit to 2GiB (#20551 ) The default block size of 500MiB seems too low for some common workloads, e.g. shuffling 500GB. This creates 1000 blocks which means 1 million intermediate shuffle objects until we implement #20500.	2021-11-18 19:36:10 -08:00
Amog Kamsetty	9796ae56d5	[Train][Data] Change usages of `iter_datasets` to `iter_epochs` (#20487 )	2021-11-17 18:05:51 -08:00
Richard Liaw	cf357f6bce	[docs] Add a talks section for ray.data (#20444 )	2021-11-16 14:30:08 -08:00
Eric Liang	460cf86858	Split blocks automatically into 500MB chunks on file read and transformation (#20235 ) This PR adds support for automatic block splitting on read and map transforms, to keep block size bounded to ~500MiB. This avoids potential OOM situations where a map task may consume too much intermediate Python heap memory, or too much object store shared memory for one block.	2021-11-15 22:25:11 -08:00
Eric Liang	6102912494	Dataset doc updates (#19815 )	2021-11-04 18:13:40 -07:00
Philipp Moritz	0a5942d8b0	[Documentation] Fix quotes for windows installations (#19859 ) * [Documentation] Fix quotes for windows installations * update * formatting	2021-10-29 10:54:38 -07:00
Yi Cheng	68ec652be7	[gcs] New option to increase gcs grpc client threads and fix issues in hybrid scheduling (#19663 ) ## Why are these changes needed? - Since broadcasting is moving to grpc, introducing the option to increase the client side thread number - For hybrid schedule, ignore the threshold if gcs based actor scheduler is enabled With these fixing, actor creation rate > 600actor/s vs ~ 140 actor/s ## Related issue number	2021-10-28 22:40:18 -07:00
Eric Liang	27a5b546ad	Make ArrowRow less scary (#19686 )	2021-10-25 12:18:42 -07:00
Eric Liang	875d19f838	[data] Fix inconsistent naming of to_refs() methods, remove to_arrow() (#19620 )	2021-10-23 12:20:23 -07:00
matthewdeng	b3b739266e	[docs] add dask compatibility for 1.8.0 (#19578 )	2021-10-21 07:26:07 -07:00
Jiajun Yao	4fc5b11c68	Simple block dataset groupBy (#19435 )	2021-10-19 19:53:13 -07:00
matthewdeng	4674c78050	[Train] Rename Ray SGD v2 to Ray Train (#19436 )	2021-10-18 22:27:46 -07:00
Eric Liang	13d4ad6100	[data] Preserve epoch by default when using rewindow() (#19359 )	2021-10-14 09:17:36 -07:00
Eric Liang	430a5f4a21	[doc] Bump dataset to beta for 1.8 and add backlink to SGD (#19332 )	2021-10-12 18:32:29 -07:00
Amog Kamsetty	f6f2435b91	[SGD] Sgd v2 Dataset Integration (#17626 ) * wip * wip * wip * draft * disable tf autosharding * wip * wip * wip * wip * add example * wip * wip * wip * use dataset.split * add unit tests * add linear example * concatenate tensors and fix example * WIP tune example * add tensorflow example * wip * random_shuffle_each_window * fault tolerance test * GPU, examples, CI * formatting * fix * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * wip * type hints * wip * update user guide * fix * fix immediate issues * update example * update * fix tune gpu test * fix resources for smoke test - 1 CPU for dataset tasks * update tests, docs, examples * Apply suggestions from code review Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * address comments * add warning * fix tests * minor doc updates * update example in doc * configure tests * Update doc/source/raysgd/v2/user_guide.rst Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * Update python/ray/data/dataset.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * fix docstring Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>	2021-10-12 14:03:10 -07:00
Eric Liang	0ab6749602	Support iter_epochs for Datasets (#19217 )	2021-10-12 11:05:00 -07:00
Chen Shen	c740aae54c	[Core][Dataset] adding example for large scale data ingestion (#18998 )	2021-10-11 15:37:09 -07:00
Eric Liang	86cbe3e833	[data] Add support for repeating and re-windowing a DatasetPipeline (#19091 )	2021-10-06 20:13:43 -07:00
Jiajun Yao	7ccf737f97	Add compatible dask version for ray 1.6.0 and 1.7.0 (#19080 )	2021-10-05 10:23:06 +09:00
Eric Liang	032a420ee6	Rename Dataset.pipeline to Dataset.window (#19050 )	2021-10-01 19:55:29 -07:00
Clark Zinzow	d22f838795	[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. (#18992 )	2021-10-01 13:08:25 -07:00
Alex Wu	5709c6501b	[dataset][usability] Dataset dependencies (#18346 )	2021-09-29 17:29:31 -07:00
Eric Liang	caf34a452c	Unify ArrowTensorType tables and Tensor blocks (#18867 )	2021-09-27 16:24:09 -07:00
Eric Liang	4d2065352b	Increase dataset read parallelism by default (#18420 )	2021-09-09 15:07:49 -07:00
Clark Zinzow	b30c41759d	[Datasets] Adds tensor column support (tensors-in-tables) via Pandas/Arrow extension types/arrays. (#18301 )	2021-09-08 10:09:01 -07:00
Eric Liang	cbdafa0b63	[doc] Fix various workflow doc bugs (#18357 )	2021-09-06 01:39:08 -07:00
Eric Liang	7dcae690b9	Mark datasets as still in alpha for now (#18321 )	2021-09-02 17:07:33 -07:00
Wesley Gifford	6133a561e9	Dataset from modin (#18122 )	2021-08-31 11:19:35 -07:00

1 2

73 commits