hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Clark Zinzow	526e12074a	[Datasets] Make it clear that `read_parquet()` does not support multiple directories. (#25747 ) Unfortunately, ray.data.read_parquet() doesn't work with multiple directories since it uses Arrow's Dataset abstraction under-the-hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html This PR makes this clear in the docs, and as a driveby, adds ray.data.read_parquet_bulk() to the API docs.	2022-06-15 13:19:39 -07:00
Balaji Veeramani	50c31b8466	[Data] Add partitioning classes to Data API reference (#24203 )	2022-05-23 09:34:41 -07:00
Clark Zinzow	ef870e936c	[Datasets] Change `range_arrow()` API to `range_table()` (#24704 ) This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.	2022-05-17 01:09:45 -07:00
Chen Shen	cc21979998	Revert "[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354 )" (#24785 ) This reverts commit `e2ee2140f9`.	2022-05-13 11:18:30 -07:00
Chen Shen	9b1154dce4	fix inter (#24761 )	2022-05-13 08:18:22 -07:00
Patrick Ames	e2ee2140f9	[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354 ) API doc updates for #23179 and #24094. All data docs related to #23179 should be up-to-date once this PR and #24203 are merged.	2022-05-12 10:19:33 -07:00
Antoni Baum	668049492c	[Datasets] Add `from_huggingface` for Hugging Face datasets integration (#24464 ) Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.	2022-05-06 13:09:28 -07:00
Balaji Veeramani	2190f7ff25	[Datasets] Add SimpleTensorFlowDatasource (#24022 ) This PR makes it easier to use TensorFlow datasets with Ray Datasets.	2022-04-29 12:15:30 -07:00
Balaji Veeramani	2fdea6e24f	[Datasets] Add `SimpleTorchDatasource` (#23926 ) It's difficult to use torchvision datasets with Ray ML. This PR makes it easier to use Torch datasets with Ray Data.	2022-04-28 11:56:45 -07:00
matthewdeng	cc08c01ade	[ml] add more preprocessors (#23904 ) Adding some more common preprocessors: * MaxAbsScaler * RobustScaler * PowerTransformer * Normalizer * FeatureHasher * Tokenizer * HashingVectorizer * CountVectorizer API docs: https://ray--23904.org.readthedocs.build/en/23904/ray-air/getting-started.html Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-04-25 21:12:59 +01:00
Jian Xiao	57f620bd05	[Datasets] Add missing public APIs to Datasets API docs (#23935 )	2022-04-16 11:57:38 -07:00
Clark Zinzow	983ef1f2a7	[Datasets] Make `from_numpy()` more user-friendly. (#23871 ) `ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).	2022-04-12 18:37:59 -07:00
Eric Liang	015181ab9a	Add random access support for Datasets (experimental feature) (#22749 ) This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset. RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``. Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.	2022-03-17 15:01:12 -07:00
Clark Zinzow	53c4c7b1be	[Datasets] Expose `TableRow` as public API; minimize copies/type conversions on row-based ops. (#22305 ) This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made: 1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions. 2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.	2022-02-14 12:56:17 -08:00
Clark Zinzow	fb0d6e6b0b	[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067 )	2022-02-05 16:59:34 -08:00
Max Pumperla	4dd221f848	[Docs] Ray Data docs target state (#21931 ) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent use of the name `Ray Data` whenever the project is referred to. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.	2022-01-27 13:14:36 -08:00
Max Pumperla	b34099e764	[docs] landing page (fixes #21750 ) (#21859 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-01-26 17:14:25 -08:00
Max Pumperla	f9b71a8bf6	[docs] new structure (#21776 ) This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way: - [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign. - [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).	2022-01-21 15:42:05 -08:00
xwjiang2010	9af8f11191	Revert "[docs] Clean up doc structure (first part) (#21667 )" (#21763 ) This reverts commit `38e46c9fb3`.	2022-01-20 15:30:56 -08:00
Max Pumperla	38e46c9fb3	[docs] Clean up doc structure (first part) (#21667 )	2022-01-20 16:19:04 +01:00
Jiajun Yao	4fc5b11c68	Simple block dataset groupBy (#19435 )	2021-10-19 19:53:13 -07:00
Amog Kamsetty	f6f2435b91	[SGD] Sgd v2 Dataset Integration (#17626 ) * wip * wip * wip * draft * disable tf autosharding * wip * wip * wip * wip * add example * wip * wip * wip * use dataset.split * add unit tests * add linear example * concatenate tensors and fix example * WIP tune example * add tensorflow example * wip * random_shuffle_each_window * fault tolerance test * GPU, examples, CI * formatting * fix * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * wip * type hints * wip * update user guide * fix * fix immediate issues * update example * update * fix tune gpu test * fix resources for smoke test - 1 CPU for dataset tasks * update tests, docs, examples * Apply suggestions from code review Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * address comments * add warning * fix tests * minor doc updates * update example in doc * configure tests * Update doc/source/raysgd/v2/user_guide.rst Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * Update python/ray/data/dataset.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * fix docstring Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>	2021-10-12 14:03:10 -07:00
Clark Zinzow	d22f838795	[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. (#18992 )	2021-10-01 13:08:25 -07:00
Clark Zinzow	b30c41759d	[Datasets] Adds tensor column support (tensors-in-tables) via Pandas/Arrow extension types/arrays. (#18301 )	2021-09-08 10:09:01 -07:00
Clark Zinzow	c0598de82a	[Datasets] Port write APIs to use file-based datasources. (#18135 )	2021-08-27 15:24:54 -07:00
Clark Zinzow	aee7ba2510	[Datasets] Add from_numpy() and to_numpy() APIs (#18146 )	2021-08-27 13:33:11 -07:00
Eric Liang	d4f9d3620e	Move ray.data out of experimental (#17560 )	2021-08-04 13:31:10 -07:00
Eric Liang	e812691909	Support top-level tensor values in dataset (#17439 )	2021-08-01 22:45:21 -07:00
Eric Liang	cd13059691	[dataset] Implement random_shuffle() and split(equal=True) (#17448 )	2021-07-30 09:51:21 -07:00
Eric Liang	7ed62ea0ad	Initial implementation of Dataset pipelining and docs (#17309 )	2021-07-28 21:12:01 -07:00
Eric Liang	3d764d7b4b	[data] Fix the ObjectRef type in the dataset docs (#17111 ) * fix reft * remove exp * fix	2021-07-15 09:50:37 -07:00
Eric Liang	38bddc3f2b	First cut at dataset documentation (#16956 )	2021-07-14 23:27:13 -07:00

32 commits
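
The `TableRow` commit above (53c4c7b1be) notes that deriving from `collections.abc.Mapping` means only `__getitem__`, `__iter__`, and `__len__` have to be implemented. A minimal sketch of that pattern in plain Python might look like the following. `SimpleRow` is a hypothetical stand-in, not Ray's `TableRow`, and it assumes the row is backed by an ordinary dict rather than an Arrow or Pandas row.

```python
from collections.abc import Mapping


class SimpleRow(Mapping):
    """Hypothetical read-only row view; not Ray's TableRow implementation."""

    def __init__(self, columns):
        # Assumption for illustration: the row data lives in a plain dict.
        self._columns = dict(columns)

    # The three abstract methods that Mapping requires.
    def __getitem__(self, key):
        return self._columns[key]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)


row = SimpleRow({"id": 1, "name": "a"})

# Mapping's mixin methods (keys, items, values, get, __contains__, __eq__)
# are inherited without any extra code in SimpleRow.
as_dict = dict(row)
has_id = "id" in row
missing = row.get("missing")
```

Because `SimpleRow` defines no `__setitem__`, the row stays read-only, while code that expects a mapping (for example `dict(row)`) still works.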