
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-08 19:41:38 -05:00

Author	SHA1	Message	Date
Clark Zinzow	26ea82d3a6	[Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346 )	2022-05-17 20:57:42 -07:00
Antoni Baum	668049492c	[Datasets] Add `from_huggingface` for Hugging Face datasets integration (#24464 ) Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.	2022-05-06 13:09:28 -07:00
Shawn	43ed78f6fd	[Datasets] Integrate Mars-on-Ray with Datasets; improve docs and add tests (#23402 ) Add Mars-on-Ray + Datasets integration; improve Mars-on-Ray docs and add tests.	2022-04-29 09:43:52 -07:00
Jian Xiao	f737731a5e	Remove dataset pipeline from the Getting Started page (#23756) 1. Dataset pipeline is an advanced usage of Ray Dataset that should not be jammed into the Getting Started page 2. We already have a separate, dedicated page called Pipelining Compute that covers the same content	2022-04-07 12:52:04 -07:00
Eric Liang	015181ab9a	Add random access support for Datasets (experimental feature) (#22749) This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset. RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via `get_async()`, and ~10000 records / second via `multiget()`. Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.	2022-03-17 15:01:12 -07:00
Eric Liang	678d23fe42	Remove beta label from Datasets (#23220 )	2022-03-15 23:05:59 -07:00
Max Pumperla	11c40e363d	[docs] external promo content (#22823 )	2022-03-10 11:39:44 -08:00
Clark Zinzow	fb0d6e6b0b	[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067 )	2022-02-05 16:59:34 -08:00
Max Pumperla	4dd221f848	[Docs] Ray Data docs target state (#21931) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention, `Ray Data`, whenever the project is referred to. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.	2022-01-27 13:14:36 -08:00
Clark Zinzow	411bb308dc	[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889 )	2022-01-26 12:05:27 -08:00
xwjiang2010	9af8f11191	Revert "[docs] Clean up doc structure (first part) (#21667 )" (#21763 ) This reverts commit `38e46c9fb3`.	2022-01-20 15:30:56 -08:00
Max Pumperla	38e46c9fb3	[docs] Clean up doc structure (first part) (#21667 )	2022-01-20 16:19:04 +01:00
Eric Liang	a69ae1d886	Add blogs to dataset materials (#21546 )	2022-01-11 22:09:57 -08:00
Clark Zinzow	b872fdaaac	[Datasets] Last-mile preprocessing docs. (#20712 ) Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This gives groupby, aggregations, and random shuffling examples in the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides some preprocessing recipes for a few common transformations.	2021-11-29 23:23:27 -08:00
Richard Liaw	cf357f6bce	[docs] Add a talks section for ray.data (#20444 )	2021-11-16 14:30:08 -08:00
Eric Liang	6102912494	Dataset doc updates (#19815 )	2021-11-04 18:13:40 -07:00
Philipp Moritz	0a5942d8b0	[Documentation] Fix quotes for windows installations (#19859 ) * [Documentation] Fix quotes for windows installations * update * formatting	2021-10-29 10:54:38 -07:00
Eric Liang	27a5b546ad	Make ArrowRow less scary (#19686 )	2021-10-25 12:18:42 -07:00
Eric Liang	875d19f838	[data] Fix inconsistent naming of to_refs() methods, remove to_arrow() (#19620 )	2021-10-23 12:20:23 -07:00
matthewdeng	4674c78050	[Train] Rename Ray SGD v2 to Ray Train (#19436 )	2021-10-18 22:27:46 -07:00
Eric Liang	430a5f4a21	[doc] Bump dataset to beta for 1.8 and add backlink to SGD (#19332 )	2021-10-12 18:32:29 -07:00
Clark Zinzow	d22f838795	[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. (#18992 )	2021-10-01 13:08:25 -07:00
Alex Wu	5709c6501b	[dataset][usability] Dataset dependencies (#18346 )	2021-09-29 17:29:31 -07:00
Eric Liang	caf34a452c	Unify ArrowTensorType tables and Tensor blocks (#18867 )	2021-09-27 16:24:09 -07:00
Eric Liang	4d2065352b	Increase dataset read parallelism by default (#18420 )	2021-09-09 15:07:49 -07:00
Clark Zinzow	b30c41759d	[Datasets] Adds tensor column support (tensors-in-tables) via Pandas/Arrow extension types/arrays. (#18301 )	2021-09-08 10:09:01 -07:00
Eric Liang	cbdafa0b63	[doc] Fix various workflow doc bugs (#18357 )	2021-09-06 01:39:08 -07:00
Eric Liang	7dcae690b9	Mark datasets as still in alpha for now (#18321 )	2021-09-02 17:07:33 -07:00
Wesley Gifford	6133a561e9	Dataset from modin (#18122 )	2021-08-31 11:19:35 -07:00
Eric Liang	95b5ad12ba	Initial version of workflow documentation (#18138 )	2021-08-27 16:20:48 -07:00
Clark Zinzow	aee7ba2510	[Datasets] Add from_numpy() and to_numpy() APIs (#18146 )	2021-08-27 13:33:11 -07:00
Eric Liang	71b3183038	Add implicit init note to Ray docs & dataset version note (#17751 )	2021-08-11 13:13:22 -07:00
Eric Liang	d4f9d3620e	Move ray.data out of experimental (#17560 )	2021-08-04 13:31:10 -07:00
Eric Liang	748cbbb23d	[hotfix] Parquet S3 reads broken due to pyarrow.lib.ArrowInvalid: S3 subsystem not initialized (#17492 )	2021-08-02 11:48:48 -07:00
Eric Liang	e812691909	Support top-level tensor values in dataset (#17439 )	2021-08-01 22:45:21 -07:00
Eric Liang	7ed62ea0ad	Initial implementation of Dataset pipelining and docs (#17309 )	2021-07-28 21:12:01 -07:00
Clark Zinzow	b5194ca9f9	Add imports to docs examples to make the code more runnable. (#17240 )	2021-07-21 11:18:45 -07:00
Eric Liang	fabba96fad	Re-merge large function def, skipping test failing on Windows (#17191 )	2021-07-19 18:03:26 -07:00
architkulkarni	4069686e0f	Revert "Improve error message for oversized function (#17133 )" (#17184 ) This reverts commit `3e53619d64`.	2021-07-19 09:28:33 -07:00
Eric Liang	3e53619d64	Improve error message for oversized function (#17133 )	2021-07-17 11:04:05 -07:00
Eric Liang	94f17ec099	[RFC] API stability annotations (#17100 )	2021-07-16 17:09:20 -07:00
Eric Liang	26a286655b	Add link to datasets preview docs	2021-07-16 12:31:52 -07:00
Eric Liang	f03b43c532	[dataset] Support callable classes to simplify state initialization (#17136 )	2021-07-15 23:06:14 -07:00
Eric Liang	3d764d7b4b	[data] Fix the ObjectRef type in the dataset docs (#17111 ) * fix reft * remove exp * fix	2021-07-15 09:50:37 -07:00
Eric Liang	38bddc3f2b	First cut at dataset documentation (#16956 )	2021-07-14 23:27:13 -07:00

45 commits
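
Commit 015181ab9a above describes random access as partitioning the dataset by a sort key and locating records via binary search. The following is a minimal single-process sketch of that lookup idea only, not Ray's implementation: the class `SortedBlockIndex` and the helper `partition_sorted` are hypothetical names, plain Python lists stand in for the sorted data blocks, and there are no worker actors or zero-copy reads. It assumes `num_blocks` is at least 1.

```python
from bisect import bisect_left


def partition_sorted(records, key, num_blocks):
    """Sort records by `key` and split them into contiguous blocks."""
    ordered = sorted(records, key=lambda r: r[key])
    # Ceiling division so that at most `num_blocks` blocks are produced.
    size = max(1, -(-len(ordered) // num_blocks))
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


class SortedBlockIndex:
    """Hypothetical stand-in for a dataset partitioned by a sort key."""

    def __init__(self, records, key, num_blocks):
        self.key = key
        self.blocks = partition_sorted(records, key, num_blocks)
        # Largest key in each block; used to pick the block to search.
        self.upper_bounds = [block[-1][key] for block in self.blocks]
        # Per-block key lists, precomputed for binary search.
        self.block_keys = [[r[key] for r in block] for block in self.blocks]

    def get(self, value):
        """Return the record whose key equals `value`, or None."""
        # First binary search: the first block whose max key >= value.
        block_idx = bisect_left(self.upper_bounds, value)
        if block_idx == len(self.blocks):
            return None
        # Second binary search: the position inside that block.
        keys = self.block_keys[block_idx]
        pos = bisect_left(keys, value)
        if pos < len(keys) and keys[pos] == value:
            return self.blocks[block_idx][pos]
        return None

    def multiget(self, values):
        """Look up several keys in one call."""
        return [self.get(v) for v in values]


index = SortedBlockIndex(
    [{"id": i, "value": i * i} for i in range(10)], key="id", num_blocks=3
)
print(index.get(7))
print(index.multiget([0, 9, -1]))
```

Each lookup costs two binary searches, one over the block boundaries and one inside the chosen block, so it stays logarithmic in the number of records regardless of how many blocks the data is split into.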