hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Jian Xiao	9fe4dba4ad	Revamp the Getting Started page for Dataset (#24860 ) This is part of the Dataset GA doc fix effort to update/improve the documentation. This PR revamps the Getting Started page. What are the changes: - Focus on basic/core features that are bread-and-butter for users, leave the advanced features out - Focus on high level introduction, leave the detailed spec out (e.g. what are possible batch_types for map_batches() API) - Use more realistic (yet still simple) data example that's familiar to people (IRIS dataset in this case) - Use the same data example throughout to make it context-switch free - Use runnable code rather than faked - Reference to the code from doc, instead of inlining them in the doc Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-05-18 13:46:23 -07:00
Clark Zinzow	26ea82d3a6	[Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346 )	2022-05-17 20:57:42 -07:00
Clark Zinzow	4444150c29	[Datasets] Overhaul of "Creating Datasets" feature guide. (#24831 ) This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. In order to keep the page from getting too long and keeping it easy to navigate, tabbed views are used heavily.	2022-05-17 16:23:42 -07:00
Clark Zinzow	ea635aecd2	[Datasets] Support tensor columns in `to_tf` and `to_torch`. (#24752 ) This PR adds support for tensor columns in the to_tf() and to_torch() APIs. For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor. For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series. In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.	2022-05-17 01:11:00 -07:00
Clark Zinzow	ef870e936c	[Datasets] Change `range_arrow()` API to `range_table()` (#24704 ) This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.	2022-05-17 01:09:45 -07:00
Chen Shen	cc21979998	Revert "[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354 )" (#24785 ) This reverts commit `e2ee2140f9`.	2022-05-13 11:18:30 -07:00
Chen Shen	9b1154dce4	fix inter (#24761 )	2022-05-13 08:18:22 -07:00
Patrick Ames	e2ee2140f9	[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354 ) API doc updates for #23179 and #24094. All data docs related to #23179 should be up-to-date once this PR and #24203 are merged.	2022-05-12 10:19:33 -07:00
Zhe Zhang	909d463552	[docs] Fix import error in Ray Data "getting started" (#24424 ) We did `import pandas as pd` but here we are using it as `pandas`	2022-05-10 15:46:15 -07:00
Antoni Baum	04e16f70a3	[Datasets] [Docs] Add a warning about from_huggingface (#24608 ) Adds a warning to docs about the intended use of from_huggingface.	2022-05-10 13:08:25 -07:00
Chen Shen	f1f8ad6ca3	[Doc][Data] fix big-data-ingestion broken links (#24631 ) The links were broken. Fixed it.	2022-05-10 09:04:41 -07:00
Antoni Baum	668049492c	[Datasets] Add `from_huggingface` for Hugging Face datasets integration (#24464 ) Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.	2022-05-06 13:09:28 -07:00
Stephanie Wang	2931a23760	[doc] Add docs for push-based shuffle in Datasets (#24486 ) Adds recommendations, example, and brief benchmark results for push-based shuffle in Datasets.	2022-05-05 14:59:33 -07:00
Balaji Veeramani	2190f7ff25	[Datasets] Add SimpleTensorFlowDatasource (#24022 ) This PR makes it easier to use TensorFlow datasets with Ray Datasets.	2022-04-29 12:15:30 -07:00
Shawn	43ed78f6fd	[Datasets] Integrate Mars-on-Ray with Datasets; improve docs and add tests (#23402 ) Add Mars-on-Ray + Datasets integration; improve Mars-on-Ray docs and add tests.	2022-04-29 09:43:52 -07:00
Balaji Veeramani	2fdea6e24f	[Datasets] Add `SimpleTorchDatasource` (#23926 ) It's difficult to use torchvision datasets with Ray ML. This PR makes it easier to use Torch datasets with Ray Data.	2022-04-28 11:56:45 -07:00
matthewdeng	cc08c01ade	[ml] add more preprocessors (#23904 ) Adding some more common preprocessors: * MaxAbsScaler * RobustScaler * PowerTransformer * Normalizer * FeatureHasher * Tokenizer * HashingVectorizer * CountVectorizer API docs: https://ray--23904.org.readthedocs.build/en/23904/ray-air/getting-started.html Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-04-25 21:12:59 +01:00
Jian Xiao	57f620bd05	[Datasets] Add missing public APIs to Datasets API docs (#23935 )	2022-04-16 11:57:38 -07:00
Clark Zinzow	983ef1f2a7	[Datasets] Make `from_numpy()` more user-friendly. (#23871 ) `ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).	2022-04-12 18:37:59 -07:00
Jian Xiao	6d93e9f0f5	Cleanup the DatasetPipeline references in Getting Started; rename Exchanging to Accessing (#23786 )	2022-04-12 17:10:14 -07:00
Eric Liang	858d607b19	[data] Fix small doc issues (#23813 )	2022-04-09 12:09:08 -07:00
Jian Xiao	f737731a5e	Remove dataset pipeline from the Getting Started page (#23756 ) 1. Dataset pipeline is advanced usage of Ray Dataset, which should not jam into the Getting Started page 2. We already have a separate/dedicated page called Pipelining Compute to cover the same content	2022-04-07 12:52:04 -07:00
Philipp Moritz	886cc4d674	Fix broken links in documentation and put linkcheck linter in place on CI (#23340 )	2022-03-18 21:02:52 -07:00
Jian Xiao	0b1a2a44c0	[Dataset GA doc] Decompose the monolith of Getting Started page (and get them under User Guide) (#23311 ) Improve the Dataset documentation for GA.	2022-03-18 11:25:43 -07:00
Archit Kulkarni	76bb5396c7	[Doc] [jobs] Add links to Job Submission and improve doc (#23209 ) - Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested. - Adds docstrings for the Jobs SDK, which automatically show up in the API reference - Improve the Job Submission main page - Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-18 12:52:13 -05:00
Eric Liang	08dc31e747	[minor] Fix incorrect link to ray core user guide (#23316 )	2022-03-17 20:58:56 -07:00
Eric Liang	015181ab9a	Add random access support for Datasets (experimental feature) (#22749 ) This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset. RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``. Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.	2022-03-17 15:01:12 -07:00
Jian Xiao	8c9e3f6c2e	Move the third-party data integrations (non-Dataset stuff) out of the user guides which is for Dataset (#23162 ) Improve documentation of Ray Dataset.	2022-03-17 11:27:40 -07:00
Eric Liang	678d23fe42	Remove beta label from Datasets (#23220 )	2022-03-15 23:05:59 -07:00
Jian Xiao	10435d2d8f	Update dask version for Ray 1.12.0 (#23197 )	2022-03-15 19:22:19 -07:00
Max Pumperla	11c40e363d	[docs] external promo content (#22823 )	2022-03-10 11:39:44 -08:00
Eric Liang	52491c87e2	Make a pass fixing Dataset API issues (#22886 )	2022-03-08 13:07:55 -08:00
Eric Liang	5a0b7a7ee0	Document Dataset pipeline stage fusion (#22737 )	2022-03-01 14:38:09 -08:00
Eric Liang	e228544d39	Undo revert of windowing dataset by bytes (#22735 )	2022-03-01 12:24:04 -08:00
SangBin Cho	ba4f1423c7	Revert "Support creating a DatasetPipeline windowed by bytes (#22577 )" (#22695 ) This reverts commit `b5b4460932`.	2022-02-28 11:56:12 -08:00
Eric Liang	b5b4460932	Support creating a DatasetPipeline windowed by bytes (#22577 )	2022-02-25 23:31:10 -08:00
Eric Liang	533a0440a6	Improve actor pool support in Datasets (#22574 )	2022-02-24 12:01:36 -08:00
Max Pumperla	29d94a2211	[docs] sphinx gallery removal, migrate to ipynb (#22467 )	2022-02-19 01:19:07 -08:00
Clark Zinzow	53c4c7b1be	[Datasets] Expose `TableRow` as public API; minimize copies/type conversions on row-based ops. (#22305 ) This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made: 1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions. 2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.	2022-02-14 12:56:17 -08:00
Balaji Veeramani	31ed9e5d02	[CI] Replace YAPF disables with Black disables (#21982 )	2022-02-08 16:29:25 -08:00
Clark Zinzow	fb0d6e6b0b	[Datasets] [Docs] Datasets library branding + positioning tweaks (#22067 )	2022-02-05 16:59:34 -08:00
Clark Zinzow	743ce65da8	[Dask-on-Ray] Add support for Dask annotations. (#22057 )	2022-02-03 22:15:38 -08:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
Clark Zinzow	09fab70991	[Datasets] [Docs] Fix bug in Datasets locality-aware splitting example (#21937 ) Fixes bug in Datasets locality-aware splitting example.	2022-01-27 14:46:04 -08:00
mwtian	559eefd06f	[Doc] update dask version for Ray 1.11.0 (#21933 ) This is needed for release 1.11.0.	2022-01-27 13:15:01 -08:00
Max Pumperla	4dd221f848	[Docs] Ray Data docs target state (#21931 ) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention, `Ray Data`, whenever the project is referred to. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.	2022-01-27 13:14:36 -08:00
Max Pumperla	b34099e764	[docs] landing page (fixes #21750 ) (#21859 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-01-26 17:14:25 -08:00
Clark Zinzow	411bb308dc	[Datasets] [Docs] Add API docs links to I/O compatibility matrix (#21889 )	2022-01-26 12:05:27 -08:00
Max Pumperla	f9b71a8bf6	[docs] new structure (#21776 ) This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way: - [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign. - [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).	2022-01-21 15:42:05 -08:00
xwjiang2010	9af8f11191	Revert "[docs] Clean up doc structure (first part) (#21667 )" (#21763 ) This reverts commit `38e46c9fb3`.	2022-01-20 15:30:56 -08:00
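
Commit 53c4c7b1be above notes that `TableRow` derives from `collections.abc.Mapping`, so only `__getitem__`, `__iter__`, and `__len__` need to be implemented and the remaining mapping methods come from mixins. The sketch below illustrates that general pattern with the standard library only. `RowSketch`, its constructor, and the column names are hypothetical and are not Ray's actual `TableRow` implementation.

```python
from collections.abc import Mapping


class RowSketch(Mapping):
    """Hypothetical read-only row; an illustration, not Ray's TableRow."""

    def __init__(self, columns, values):
        # Store the row as a plain dict keyed by column name.
        self._data = dict(zip(columns, values))

    # The three abstract methods that Mapping requires:
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


# Hypothetical example row.
row = RowSketch(["sepal.length", "variety"], [5.1, "Setosa"])

# These methods are supplied by the Mapping mixins, not written by hand.
has_variety = "variety" in row        # __contains__
missing = row.get("petal.width")      # get() with a default of None
as_dict = dict(row.items())           # items()
```

Because the class satisfies the `Mapping` interface, code that expects a read-only mapping (for example `dict(row)` or `isinstance(row, Mapping)` checks) should accept it without conversion.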

160 commits