Unfortunately, ray.data.read_parquet() doesn't work with multiple directories, since it uses Arrow's Dataset abstraction under the hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html
This PR makes this clear in the docs and, as a drive-by, adds ray.data.read_parquet_bulk() to the API docs.
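A minimal sketch of the distinction (the bucket paths below are hypothetical):

```python
import ray

# Works: a single directory (or an explicit list of files).
ds = ray.data.read_parquet("s3://my-bucket/parquet/2022/")

# Not supported: multiple directories, since the underlying Arrow Dataset
# abstraction only accepts a single directory source.
# ds = ray.data.read_parquet(["s3://my-bucket/parquet/2021/",
#                             "s3://my-bucket/parquet/2022/"])

# One workaround: enumerate the files yourself and read them with
# read_parquet_bulk(), which reads each file path directly.
files = [
    "s3://my-bucket/parquet/2021/part-0.parquet",
    "s3://my-bucket/parquet/2022/part-0.parquet",
]
ds = ray.data.read_parquet_bulk(files)
```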
This exposes a low-cost way to perform a pseudo-global shuffle.
For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during iteration of the dataset, in which single nodes must (1) send a lot of data over the network, and (2) perform lots of disk reads if the dataset has been spilled to disk.
This allows the workload to be spread across the nodes on which the dataset blocks reside.
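For illustration, a rough sketch of the intended usage, assuming this refers to block-order randomization via Dataset.randomize_block_order() (the API is not named in this description):

```python
import ray

ds = ray.data.range(100_000, parallelism=200)

# Randomizing only the order of blocks is cheap: no rows move between nodes,
# but downstream iteration no longer walks contiguous, colocated blocks, so
# network and disk reads are spread across the cluster.
ds = ds.randomize_block_order(seed=42)

for batch in ds.iter_batches(batch_size=1024):
    pass  # consumers now see blocks in a randomized order
```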
Unreverts #24812, skipping the memory-releasing tests that are already flaky. We have a separate issue tracking un-skipping these memory-releasing tests once we find a more reliable way to test them.
* Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)"
This reverts commit fb2933a78f.
* Skip shuffle memory release test.
The Datasets UX assessment showed that users had difficulty writing UDFs: what the input/output types are, how to write the function, etc.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
This PR makes several improvements to the Datasets' tensor story. See the issues for each item for more details.
- Automatically infer tensor blocks (single-column tables representing a single tensor) when returning NumPy ndarrays from map_batches(), map(), and flat_map().
- Automatically infer tensor columns when building tabular blocks in general.
- Fix shuffling and sorting for tensor columns.
This should improve the UX/efficiency of the following:
- Working with pure-tensor datasets in general.
- Mapping tensor UDFs over pure-tensor datasets, providing a better foundation for tensor-native preprocessing for end-users and AIR (see the sketch after this list).
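A minimal sketch of the resulting UX, assuming the batch_format="numpy" option of map_batches():

```python
import ray

# A pure-tensor dataset: 1000 records, each a 2x2 ndarray.
ds = ray.data.range_tensor(1000, shape=(2, 2))

# Returning an ndarray from the UDF is automatically stored as a tensor block;
# no manual wrapping in a single-column table is needed.
ds = ds.map_batches(lambda batch: batch * 2, batch_format="numpy")

print(ds.take(2))
```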
This PR adds a FAQ to Datasets docs.
Docs preview: https://ray--24932.org.readthedocs.build/en/24932/
## Checks
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset.
The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and dummy batch inference, for tabular (Parquet) data.
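For reference, a condensed sketch of the flow the example walks through (the path and column name here are illustrative, not the exact ones used in the docs):

```python
import ray

ds = ray.data.read_parquet("s3://my-bucket/nyc-taxi/2019/")  # data reading

ds.show(3)  # basic inspection

# Simple transformation on pandas batches, e.g. dropping zero-distance trips.
ds = ds.map_batches(lambda df: df[df["trip_distance"] > 0], batch_format="pandas")

ds = ds.random_shuffle()  # global shuffle before ingestion
shards = ds.split(2)      # e.g. one shard per dummy trainer
```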
This PR overhauls the "Accessing Datasets" feature guide, adding proper coverage of each data consumption method, including the ML framework exchange APIs (to_torch() and to_tf()).
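A small sketch of the consumption surface the guide covers (the DataFrame here is just a stand-in):

```python
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"x": range(1000), "y": range(1000)}))

# Native consumption.
print(ds.take(5))
for batch in ds.iter_batches(batch_size=128, batch_format="pandas"):
    ...

# ML framework exchange; to_tf() is analogous but also requires an output
# signature in the Ray versions this targets.
torch_ds = ds.to_torch(label_column="y", batch_size=128)
```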
This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:
- Documents Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode, and expands our execution model docs in general.
This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.
What are the changes:
- Focus on the basic/core features that are bread-and-butter for users, and leave the advanced features out
- Focus on a high-level introduction, and leave the detailed spec out (e.g. the possible batch_types for the map_batches() API)
- Use a more realistic (yet still simple) data example that's familiar to people (the Iris dataset in this case)
- Use the same data example throughout to make it context-switch free
- Use runnable code rather than fake code
- Reference the code from the doc instead of inlining it
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting the features and quirks of the individual APIs, data modalities, storage backends, etc. To keep the page from getting too long and to keep it easy to navigate, tabbed views are used heavily.
This PR adds support for tensor columns in the to_tf() and to_torch() APIs.
For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.
For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.
In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
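A rough sketch of the target UX, using the TensorArray extension type to build the tensor column (the column names are made up):

```python
import numpy as np
import pandas as pd
import ray
from ray.data.extensions import TensorArray

# Tabular dataset with a tensor feature column plus a scalar label column.
df = pd.DataFrame({
    "image": TensorArray(np.random.rand(8, 2, 2)),
    "label": list(range(8)),
})
ds = ray.data.from_pandas(df)

# The tensor column is converted to a contiguous NumPy array before becoming
# a Torch tensor; to_tf() handles the analogous conversion for TensorFlow.
torch_ds = ds.to_torch(label_column="label", batch_size=4)
features, labels = next(iter(torch_ds))
print(features.shape, labels.shape)
```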
Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.
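A minimal sketch (the dataset name is just an example):

```python
import ray
from datasets import load_dataset  # Hugging Face `datasets` library

hf_ds = load_dataset("imdb", split="train")

# The Hugging Face Dataset's underlying Arrow table backs the Ray Dataset.
ray_ds = ray.data.from_huggingface(hf_ds)
print(ray_ds.count())
```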
`ray.data.from_numpy()` currently expects to be given a list of ndarray futures, instead of handling concrete ndarrays, as expected (and as allowed by other `from_*` APIs, e.g. `from_pandas`). This PR renames the existing `from_numpy` API to `from_numpy_refs`, and exposes `ray.data.from_numpy`, which takes concrete ndarrays (not object references).
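A quick sketch of the two APIs side by side:

```python
import numpy as np
import ray

arr = np.arange(16).reshape(4, 4)

# New: from_numpy() takes concrete ndarrays, like the other from_* APIs.
ds = ray.data.from_numpy([arr])

# Old behavior, under an explicit name: from_numpy_refs() takes object refs.
ref = ray.put(arr)
ds_from_refs = ray.data.from_numpy_refs([ref])
```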
1. Dataset pipelines are an advanced usage of Ray Datasets and should not be jammed into the Getting Started page
2. We already have a separate/dedicated page called Pipelining Compute to cover the same content
- Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference (a usage sketch follows this list)
- Improves the Job Submission main page
- Adds a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>