Commit graph

148 commits

Author SHA1 Message Date
Eric Liang
12825fc5aa
[air] Add a warning if no CPUs are reserved for dataset execution (#26643) 2022-07-17 16:33:51 -07:00
Eric Liang
400330e9c0
[air] Add _max_cpu_fraction_per_node to ScalingConfig and documentation (#26634) 2022-07-16 21:55:51 -07:00
Philipp Moritz
081bbfbff1
[Examples] Test OCR example in documentation tests (#26482)
Make sure the OCR example is tested in documentation after we discovered that example notebooks are not tested in CI.

Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
2022-07-16 10:51:28 -07:00
Balaji Veeramani
34cf1f17ea
[Datasets] Add ImageFolderDatasource (#24641)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-15 22:43:23 -07:00
Tim Gates
e42dc7943e
docs: Fix a few typos (#26556)
There are small typos in:
- doc/source/data/faq.rst
- python/ray/serve/replica.py

Fixes:
- Should read `successfully` rather than `succssifully`.
- Should read `pseudo` rather than `psuedo`.
2022-07-14 12:38:33 -07:00
Eric Liang
9de1add073
[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)
This PR defaults the parallelism of Dataset reads to `-1`. In this case, the parallelism is determined according to the following rules (a sketch follows this entry):
- The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group relative to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimated CPU count is less than 8, it is set to 8.
- The parallelism is set to the estimated number of CPUs multiplied by 2.
- The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512 MiB), the parallelism is increased until the blocks are smaller than 512 MiB.

These rules fix two common user problems:
1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
2. Overly large block sizes leading to OOMs when processing a single block.

TODO:
- [x] Unit tests
- [x] Docs update

Supersedes part of: https://github.com/ray-project/ray/pull/25708

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-07-12 21:08:49 -07:00
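A minimal sketch of the rules described above, using illustrative names rather than Ray internals:

```python
import math

TARGET_BLOCK_SIZE = 512 * 1024 * 1024  # 512 MiB target block size
MIN_CPUS = 8

def autodetect_parallelism(estimated_cpus: int, estimated_data_size_bytes: int) -> int:
    # Fall back to a floor of 8 when the CPU estimate is smaller.
    cpus = max(estimated_cpus, MIN_CPUS)
    # Start with 2 read tasks per available CPU.
    parallelism = cpus * 2
    # Increase parallelism until each in-memory block fits under the 512 MiB target.
    min_blocks = math.ceil(estimated_data_size_bytes / TARGET_BLOCK_SIZE)
    return max(parallelism, min_blocks)
```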
Richard Liaw
5892a76a44
[air/tune] Documentation testing fixes (#26409) 2022-07-09 19:47:21 -07:00
ej
636105e8e2
[Docs] [Serve] Use a consistent landing page style (#26029) 2022-07-08 11:58:21 -07:00
Cheng Su
4e674b6ad3
[Datasets] Update docs for drop_columns and fix typos (#26317)
We added the drop_columns() API to Datasets in #26200, so this updates the documentation (doc/source/data/examples/nyc_taxi_basic_processing.ipynb) to use the new API. In addition, it fixes some minor typos found while proofreading the Datasets documentation.
2022-07-07 17:17:33 -07:00
Philipp Moritz
1ba8c8cc67
[Examples] OCR Ray Datasets example (#25930)
This is a simple example that shows how to do OCR with Ray Datasets. It includes:

- How to upload and download the dataset to and from S3
- How to run OCR on the dataset with tesseract
- How to use actors to keep around and re-use a spaCy context for doing NLP on the data (a sketch of this actor pattern follows this entry)

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-07-06 13:11:26 -07:00
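A minimal sketch of the actor pattern mentioned above: a Ray actor loads a spaCy pipeline once and reuses it across calls. The model name and method are illustrative, not taken from the example itself.

```python
import ray
import spacy

@ray.remote
class NLPWorker:
    # Actor that loads a spaCy pipeline once and reuses it for every request.
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")

    def extract_entities(self, text: str):
        return [(ent.text, ent.label_) for ent in self.nlp(text).ents]

worker = NLPWorker.remote()
print(ray.get(worker.extract_entities.remote("Ray was created at UC Berkeley.")))
```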
Myeongju Kim
a1a78077ca
Fix a broken link in Ray Dataset doc (#25927)
Co-authored-by: Myeong Kim <myeongki@amazon.com>
2022-06-20 13:17:46 -07:00
Clark Zinzow
1701b923bc
[Datasets] [Tensor Story - 2/2] Add "numpy" batch format for batch mapping and batch consumption. (#24870)
This PR adds a "numpy" batch format for batch transformations and batch consumption that works with all block types (a sketch follows this entry). See #24811.
2022-06-17 16:01:02 -07:00
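A minimal usage sketch, assuming a tensor dataset so that each batch arrives as a single ndarray:

```python
import ray

# 1000 rows of 2x2 tensors; with batch_format="numpy" the UDF sees an ndarray
# batch and can stay in NumPy end to end.
ds = ray.data.range_tensor(1000, shape=(2, 2))
ds = ds.map_batches(lambda batch: batch * 2, batch_format="numpy")
```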
Chen Shen
8e7e89a178
[Data] fix broken link (#25867)
update the broken spark link.
2022-06-16 14:01:38 -07:00
Clark Zinzow
526e12074a
[Datasets] Make it clear that read_parquet() does not support multiple directories. (#25747)
Unfortunately, ray.data.read_parquet() doesn't work with multiple directories, since it uses Arrow's Dataset abstraction under the hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html

This PR makes this clear in the docs and, as a drive-by, adds ray.data.read_parquet_bulk() to the API docs. (A sketch of one possible workaround follows this entry.)
2022-06-15 13:19:39 -07:00
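One possible workaround (not from the PR itself): read each directory separately and union the resulting datasets. The paths here are hypothetical.

```python
import ray

# read_parquet() accepts a single directory, so read each directory on its own
# and union the results into one Dataset.
ds = ray.data.read_parquet("s3://bucket/dir1")
for path in ["s3://bucket/dir2", "s3://bucket/dir3"]:
    ds = ds.union(ray.data.read_parquet(path))
```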
matthewdeng
ba0a2a022a
[datasets] add Dataset.randomize_block_order (#25568)
This exposes a low-cost way to perform a pseudo global shuffle.

For extremely large datasets that span multiple nodes, contiguous blocks will often be colocated on the same node. This leads to hot spots during iteration of the dataset in which single nodes (1) must send a lot of data over the network, and (2) perform lots of disk reads if the dataset is spilled to disk.

Randomizing the block order allows the workload to be spread across the nodes holding the dataset's blocks (a usage sketch follows this entry).
2022-06-08 18:39:15 -07:00
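A minimal usage sketch of the new API:

```python
import ray

ds = ray.data.range(1000)
# Shuffle the ordering of blocks without moving rows between blocks -- much
# cheaper than a full random_shuffle(), at block rather than row granularity.
ds = ds.randomize_block_order()
```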
Jian Xiao
50c854b1ad
Fix hyperlink in rst doc (#25427)
Hyperlink not working

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-06-08 13:46:23 -07:00
Clark Zinzow
9dc0bb3d5e
[Datasets] Unrevert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#25031)" (#25531)
Unreverts #24812, skipping the memory releasing tests that are already flaky. We have a separate issue tracking the unskipping of these memory releasing tests, once we find a more reliable way to test them.

* Revert "Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)"

This reverts commit fb2933a78f.

* Skip shuffle memory release test.
2022-06-08 10:33:25 -07:00
Jian Xiao
6589a4f8cb
[Datasets][UX Assessment] Add a section on how to write UDFs in Datasets (#25338)
The Datasets UX assessment showed that users had difficulties writing UDFs: what the input/output types are, how to write the function, etc. (A minimal UDF sketch follows this entry.)

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-06-02 20:00:50 -07:00
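A minimal batch-UDF sketch, assuming the pandas batch format (column names are illustrative):

```python
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"value": range(10)}))

# A batch UDF takes a pandas DataFrame batch and returns a DataFrame.
def add_doubled(batch: pd.DataFrame) -> pd.DataFrame:
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(add_doubled, batch_format="pandas")
print(ds.take(3))
```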
Stephanie Wang
473a962d89
[Datasets] [Docs] Add docs about fault tolerance in Datasets (#25371)
Adds description of fault tolerance guarantees for Datasets.

Related issue number

Closes #24856.
2022-06-02 15:53:50 -07:00
Kai Fricke
6fe91885b0
[docs/lint] Fix reference to dataset_tune (#25402) 2022-06-02 11:40:26 +01:00
Eric Liang
51b295ad74
[docs] Improve Tune + Datasets documentation (#25389) 2022-06-01 21:52:32 -07:00
Eric Liang
71717e59c4
[data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262) 2022-06-01 13:50:46 -07:00
Eric Liang
5545bc5f45
[data] Fix pipeline pre-repeat caching, and improve the documentation (#25265)
Currently, the canonical way to cache a pipeline and repeat it, ds.fully_executed().repeat(), crashes. This adds a test and fixes the docs and stats printing (a usage sketch follows this entry).
2022-05-31 16:01:00 -07:00
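A minimal sketch of the pattern in question: execute and cache the dataset once, then repeat the cached result across epochs instead of recomputing the transformation each time.

```python
import ray

ds = ray.data.range(1000).map(lambda x: x * 2)
# fully_executed() materializes all blocks; repeat(5) turns the cached result
# into a 5-epoch DatasetPipeline.
pipe = ds.fully_executed().repeat(5)
```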
mwtian
fb2933a78f
Revert "Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)" (#25057)
Reverts #25031

It still looks to be somewhat flaky.
2022-05-25 19:43:22 -07:00
Zhe Zhang
873c44d984
[Docs] Add "Examples" block to Ray Data landing page, and consistently use bold font (#24994) 2022-05-23 21:22:00 -07:00
Balaji Veeramani
50c31b8466
[Data] Add partitioning classes to Data API reference (#24203) 2022-05-23 09:34:41 -07:00
Jian Xiao
9dd30d5f77
Proofread some of the Datasets docs (#25068)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-05-22 12:11:51 -07:00
Jian Xiao
ad842ec9ab
Revamp the Transforming Datasets user guide (#25033) 2022-05-20 19:25:06 -07:00
Jian Xiao
e5838c4700
Fix range_arrow(), which is replaced by range_table() (#25036) 2022-05-20 19:24:49 -07:00
Clark Zinzow
9ea5a8ec4b
Revert "Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets."" (#25031)
Fixes the check ingest utility to handle non-Pandas native batches.
2022-05-20 11:47:29 -07:00
Clark Zinzow
2c8fac369a
Note that explicit resource allocation is experimental, fix typos (#25038) 2022-05-20 11:36:08 -07:00
Kai Fricke
fbfb134b8c
Revert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#24812)" (#25017)
This reverts commit 841f7c81ff.

Reverts #24812

Broke e.g. ML tests: https://buildkite.com/ray-project/ray-builders-branch/builds/7667#55e7473e-f6a8-4d72-a875-cd68acf8b0c4
2022-05-20 15:37:40 +01:00
Clark Zinzow
841f7c81ff
[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#24812)
This PR makes several improvements to the Datasets' tensor story. See the issues for each item for more details.

- Automatically infer tensor blocks (single-column tables representing a single tensor) when returning NumPy ndarrays from map_batches(), map(), and flat_map().
- Automatically infer tensor columns when building tabular blocks in general.
- Fix shuffling and sorting for tensor columns

This should improve the UX/efficiency of the following:

- Working with pure-tensor datasets in general.
- Mapping tensor UDFs over pure-tensor datasets, providing a better foundation for tensor-native preprocessing for end users and AIR (a sketch follows this entry).
2022-05-19 22:40:04 -07:00
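A minimal sketch of the tensor-block inference described above, assuming the default batch format of the time:

```python
import numpy as np
import ray

ds = ray.data.range(32)

# Returning a NumPy ndarray from a batch UDF is inferred as a tensor block,
# so a pure-tensor dataset can be built without manual table wrapping.
tensor_ds = ds.map_batches(lambda batch: np.ones((len(batch), 4, 4)))
```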
Clark Zinzow
822cbc420b
[Datasets] Add FAQ to Datasets docs. (#24932)
This PR adds a FAQ to Datasets docs.

Docs preview: https://ray--24932.org.readthedocs.build/en/24932/

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-19 15:44:22 -07:00
Jian Xiao
44fd7fd1d0
Revamp the Saving Datasets user guide (#24987) 2022-05-19 15:40:12 -07:00
Clark Zinzow
6c0a457d7a
[Datasets] Add basic e2e Datasets example on NYC taxi dataset (#24874)
This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset.

The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and doing dummy batch inference, for tabular (Parquet) data.
2022-05-19 12:54:25 -07:00
Clark Zinzow
399334d53c
[Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)
This PR overhauls the "Accessing Datasets" feature guide, adding proper coverage of each data-consuming method, including the ML framework exchange APIs (to_torch() and to_tf()). (A usage sketch follows this entry.)
2022-05-19 12:50:00 -07:00
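A minimal sketch of the exchange APIs, assuming an in-memory dataset and PyTorch installed (column names are illustrative):

```python
import pandas as pd
import ray

ds = ray.data.from_pandas(
    pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7, 4.6], "label": [0, 0, 1, 1]})
)

# to_torch() yields a torch IterableDataset of (features, label) batches.
torch_ds = ds.to_torch(label_column="label", batch_size=2)
for features, label in torch_ds:
    print(features.shape, label.shape)
```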
Clark Zinzow
0b6505e8c6
[Datasets] Miscellaneous GA docs P0s. (#24891)
This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:

- Documents Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode, and expands our execution model docs in general.
2022-05-18 16:17:48 -07:00
Jian Xiao
9fe4dba4ad
Revamp the Getting Started page for Dataset (#24860)
This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.

What are the changes:
- Focus on basic/core features that are bread-and-butter for users, leave the advanced features out
- Focus on high level introduction, leave the detailed spec out (e.g. what are possible batch_types for map_batches() API)
- Use a more realistic (yet still simple) data example that's familiar to people (the Iris dataset in this case)
- Use the same data example throughout to make it context-switch free
- Use runnable code rather than faked code
- Reference the code from the doc instead of inlining it in the doc

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-18 13:46:23 -07:00
Clark Zinzow
26ea82d3a6
[Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346) 2022-05-17 20:57:42 -07:00
Clark Zinzow
4444150c29
[Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)
This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. In order to keep the page from getting too long and keeping it easy to navigate, tabbed views are used heavily.
2022-05-17 16:23:42 -07:00
Clark Zinzow
ea635aecd2
[Datasets] Support tensor columns in to_tf and to_torch. (#24752)
This PR adds support for tensor columns in the to_tf() and to_torch() APIs.

For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.

For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.

In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
2022-05-17 01:11:00 -07:00
Clark Zinzow
ef870e936c
[Datasets] Change range_arrow() API to range_table() (#24704)
This PR renames ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail (a usage sketch follows this entry).
2022-05-17 01:09:45 -07:00
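A minimal usage sketch of the renamed API:

```python
import ray

# range_table() replaces range_arrow(); rows are single-column table records.
ds = ray.data.range_table(1000)
print(ds.take(3))  # e.g. [{'value': 0}, {'value': 1}, {'value': 2}]
```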
Chen Shen
cc21979998
Revert "[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354)" (#24785)
This reverts commit e2ee2140f9.
2022-05-13 11:18:30 -07:00
Chen Shen
9b1154dce4
fix inter (#24761) 2022-05-13 08:18:22 -07:00
Patrick Ames
e2ee2140f9
[Datasets] Add documentation for bulk parquet read API and file metadata providers. (#24354)
API doc updates for #23179 and #24094. All data docs related to #23179 should be up-to-date once this PR and #24203 are merged.
2022-05-12 10:19:33 -07:00
Zhe Zhang
909d463552
[docs] Fix import error in Ray Data "getting started" (#24424)
We did `import pandas as pd`, but the code used it as `pandas`.
2022-05-10 15:46:15 -07:00
Antoni Baum
04e16f70a3
[Datasets] [Docs] Add a warning about from_huggingface (#24608)
Adds a warning to docs about the intended use of from_huggingface.
2022-05-10 13:08:25 -07:00
Chen Shen
f1f8ad6ca3
[Doc][Data] fix big-data-ingestion broken links (#24631)
The links were broken. Fixed them.
2022-05-10 09:04:41 -07:00
Antoni Baum
668049492c
[Datasets] Add from_huggingface for Hugging Face datasets integration (#24464)
Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial (a usage sketch follows this entry).
2022-05-06 13:09:28 -07:00
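A minimal usage sketch, assuming the Hugging Face datasets library is installed (the dataset name is illustrative):

```python
import ray
from datasets import load_dataset

# The Hugging Face dataset is backed by an Arrow table, so converting it to a
# Ray Dataset is cheap.
hf_ds = load_dataset("imdb", split="train")
ray_ds = ray.data.from_huggingface(hf_ds)
```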