This PR overhauls the "Accessing Datasets" guide, adding proper coverage of each data-consuming method, including the ML framework exchange APIs (to_torch() and to_tf()).
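For context, a minimal sketch of the exchange APIs the guide covers; the dataset contents and parameters here are illustrative, not lifted from the PR, and the to_tf() path is analogous:

```python
import ray

# Illustrative dataset; the guide's actual examples may differ.
ds = ray.data.from_items([{"x": float(i), "y": float(2 * i)} for i in range(100)])

# Exchange with Torch: yields an iterable of (features, label) tensor batches.
torch_ds = ds.to_torch(label_column="y", batch_size=16)
for features, label in torch_ds:
    pass  # feed into a training loop
```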
This PR moves all exception classes from the runtime module to the api module, aiming to eliminate confusion about Ray exceptions. After this change, Ray users no longer need to touch the runtime module when programming against the API.
Note that this should be merged onto 2.0.
This PR knocks out a few miscellaneous GA docs P0s tracked in our docs tracker. Namely:
- Documents the Datasets resource allocation model.
- De-emphasizes global/windowed shuffling.
- Documents lazy execution mode (sketched below), and expands our execution model docs in general.
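A rough illustration of the lazy mode being documented; this assumes the `lazy()` method as described in the Datasets docs of this era, so treat it as a sketch:

```python
import ray

# Eager by default: each op executes immediately.
ds = ray.data.range(1000)

# Lazy mode: ops are recorded and only run when results are consumed,
# enabling stage fusion across the pipeline.
lazy_ds = ds.lazy().map(lambda x: x * 2).filter(lambda x: x % 4 == 0)
result = lazy_ds.take(5)  # triggers execution
```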
This is part of the Dataset GA doc fix effort to update/improve the documentation.
This PR revamps the Getting Started page.
What are the changes:
- Focus on the basic/core features that are bread and butter for users; leave the advanced features out
- Focus on a high-level introduction; leave the detailed spec out (e.g., the possible batch_types for the map_batches() API)
- Use a more realistic (yet still simple) data example that's familiar to people (the Iris dataset in this case)
- Use the same data example throughout so there's no context switching
- Use runnable code rather than fake code (see the sketch after this list)
- Reference the code from the doc instead of inlining it
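A sketch in the spirit of the revamped page; the `example://` path and the column name are assumptions, not lifted from the PR:

```python
import ray

# Load the Iris dataset (path is hypothetical).
ds = ray.data.read_csv("example://iris.csv")

# A basic, runnable transform: normalize one feature column per batch.
def normalize(batch):
    batch["sepal.length"] = batch["sepal.length"] / batch["sepal.length"].max()
    return batch

ds = ds.map_batches(normalize, batch_format="pandas")
print(ds.take(3))
```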
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This example simply doesn't run as is. We can bring it back later if it makes sense, but it's not clear what the variables used there, like `actor`, are. Fixes #21328
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
As per discussion on Slack, this should avoid having old versions of the docs indexed by search engines; see readthedocs/readthedocs.org#2430 for reference.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. To keep the page from getting too long and to keep it easy to navigate, tabbed views are used heavily.
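For flavor, a few of the creation APIs the guide now covers; the paths are placeholders:

```python
import ray

ds_synthetic = ray.data.range(1000)                                         # synthetic data
ds_items = ray.data.from_items([{"a": i, "b": str(i)} for i in range(10)])  # in-memory Python objects
ds_parquet = ray.data.read_parquet("s3://my-bucket/data/")                  # storage backend (placeholder path)
```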
Currently, we are not running doc notebooks in CI due to a Bazel misconfiguration: we are using `glob` in a top-level package to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by:
* Running the `py_test_run_all_notebooks` macro inside the relevant subpackages
* Editing the `test_myst_doc.py` script to allow recursive search for the target file, letting it deal with mismatches between the `name` and `data` arguments in `py_test_run_all_notebooks`
* Setting the `allow_empty=False` flag inside the `glob` calls in our macros so that this kind of oversight is caught early (see the sketch below)
* Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files
This PR also adds a GPU runner for doc tests, allowing one of our examples to pass and setting up the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.
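A Starlark sketch of the `allow_empty=False` fix (Python-like syntax; the macro arguments and glob pattern are illustrative, not copied from the PR):

```python
# BUILD (Starlark) - inside the relevant subpackage, so glob can see the notebooks.
# allow_empty=False makes Bazel fail loudly if the pattern matches nothing,
# instead of silently producing an empty test list.
py_test_run_all_notebooks(
    size = "medium",
    include = glob(["*.ipynb"], allow_empty = False),
    exclude = [],
)
```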
This PR adds support for tensor columns in the to_tf() and to_torch() APIs.
For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor.
For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series.
In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
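A minimal sketch of the newly supported path; the shapes and batch size are arbitrary:

```python
import ray

# range_tensor creates a dataset with a single tensor column.
ds = ray.data.range_tensor(32, shape=(4, 4))

# With this PR, the tensor column converts cleanly to Torch tensors.
torch_ds = ds.to_torch(batch_size=8)
for features, label in torch_ds:
    pass  # features is a (8, 4, 4) tensor batch; label is None (no label_column given)
```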
Adds a Dataset.split_proportionately method that allows the user to split a dataset using proportions. This is a very common use case, e.g. for train-test splitting. The implementation is a thin wrapper over Dataset.split_at_indices.
Additionally, this PR adds a ray.ml.train_test_split function intended to provide a familiar API to ML practitioners.
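A hedged usage sketch of both additions; the proportions here are arbitrary:

```python
import ray
from ray.ml import train_test_split  # location as described in this PR

ds = ray.data.range(100)

# Proportions must sum to less than 1; the remainder forms the last split.
train, val, test = ds.split_proportionately([0.7, 0.2])  # 70/20/10

# sklearn-style convenience wrapper built on top of it.
train_ds, test_ds = train_test_split(ds, test_size=0.25)
```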
Updates the landing page to match the format and content of Tune's. The main content changes: added some shorter quickstarts and sharpened up the messaging in our "Why choose Serve?" section.
I also moved all of the `doc_code` into one directory and added a bazel target that should run all of the examples added there. Split into a separate PR: https://github.com/ray-project/ray/pull/24736.
Why are these changes needed?
The current documentation code in "Message passing using Ray Queue" can be enhanced to better demonstrate the message queue.
It creates 10 tasks but only 2 consumers, and each consumer consumes one task and then exits. Therefore, the output is a bit vague:
(consumer pid=1022727) got work 0
(consumer pid=1022595) got work 1
So I made the consumers keep working until the queue is empty (a sketch follows below). The output now shows consumers 0 and 1 working in parallel:
(consumer pid=1030876) consumer 0 got work 0
(consumer pid=1030876) consumer 0 got work 1
(consumer pid=1030876) consumer 0 got work 3
(consumer pid=1030876) consumer 0 got work 5
(consumer pid=1030876) consumer 0 got work 7
(consumer pid=1030876) consumer 0 got work 9
(consumer pid=1030949) consumer 1 got work 2
(consumer pid=1030949) consumer 1 got work 4
(consumer pid=1030949) consumer 1 got work 6
(consumer pid=1030949) consumer 1 got work 8
P.S. This also fixes a typo in the doc.
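A sketch of the revised consumer, close to (but not verbatim from) the updated doc example:

```python
import ray
from ray.util.queue import Queue, Empty

queue = Queue()

@ray.remote
def consumer(id, queue):
    # Keep pulling work until the queue is drained, instead of exiting
    # after a single item.
    try:
        while True:
            item = queue.get(block=True, timeout=1)
            print(f"consumer {id} got work {item}")
    except Empty:
        pass

for i in range(10):
    queue.put(i)

ray.get([consumer.remote(id, queue) for id in range(2)])
```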
This is a notebook showing how to tune an xgboost model and analyze the results.
Also adds a `get_dataframe()` method to `ResultsGrid` to fetch the trial results.
Depends on #24483 for toctree.
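Hypothetical usage of the new method; the surrounding Tune call is an assumption, not taken from the PR:

```python
# results is the ResultsGrid returned by a Tune run (assumed API).
df = results.get_dataframe()  # one row of metrics/config per trial
print(df.head())
```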
This is a small update for the structure of the docs about building Ray from source.
My idea was to isolate the steps that are shared from the steps required per platform/system, and to consolidate the instructions for cloning with git, installing, the directory structure, etc.
I'm still figuring out the build steps (installing the docs dependencies on an M1), but I wanted to start the draft right away.