This PR finishes most of the stats TODOs for Datasets. The main item punted to future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Update the Ray Lightning API docs to reflect new changes in ray lightning master.
Making this quick change to fix CI and unblock the release, but will follow up on a proper fix for this.
Closes #21426
This PR adds documentation for Workflow Metadata, support for which was recently added in https://github.com/ray-project/ray/pull/19372.
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Dask defaults to a disk-based shuffle even though we're using a distributed scheduler, which appears to result in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, and the Dask-on-Ray scheduler should probably do the same.
This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler along with changing the default shuffle to a task-based shuffle. The shuffle method can still be overridden by the user by manually specifying `df.set_index(shuffle="disk")`.
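A minimal sketch of how the helper might be wired up (assumes Ray, Dask, and pandas are installed and a local cluster is available; the toy DataFrame is made up):

```python
# Hedged sketch: enable Dask-on-Ray as the default scheduler, which also
# switches the default shuffle to the task-based algorithm.
import ray
import pandas as pd
import dask.dataframe as dd
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()

ddf = dd.from_pandas(pd.DataFrame({"a": range(100), "b": range(100)}), npartitions=10)
# set_index triggers a shuffle; it now defaults to task-based, so no shared
# filesystem across nodes is required.
ddf = ddf.set_index("a")
print(ddf.head())
```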
* updating Azure autoscaler versions and backwards compatibility, and moving to azure-identity based authentication
* adding Azure SDK requirements for tests
* updating Azure test requirements and adding a wrapper function for Azure SDK function resolution
* adding a docstring to get_azure_sdk_function
Co-authored-by: Scott Graham <scgraham@microsoft.com>
Added hyperparameters to the concepts section, since it's important to explain what they are, and added diagrams to help the reader visualize the difference between model parameters and hyperparameters.
Signed-off-by: Jules S.Damji <jules@anyscale.com>
Co-authored-by: Jules S.Damji <jules@anyscale.com>
Why are these changes needed?
The code snippet referenced a Python function that was not defined, so the snippet as written won't run. All complete or self-contained code in our docs should run.
The changes: add the previously undefined function, iterate over a list of random large arrays to show the difference between local and distributed sort execution times, and print the results.
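A hedged reconstruction of the fixed snippet's shape (the function name, chunking strategy, and array sizes here are illustrative, not the exact doc contents):

```python
# Sketch: define the previously missing function, then time local vs.
# distributed sorting over several random large arrays.
import time
import numpy as np
import ray

ray.init()

@ray.remote
def sort_chunk(chunk):
    return np.sort(chunk)

def distributed_sort(arr, num_chunks=4):
    chunks = np.array_split(arr, num_chunks)
    sorted_chunks = ray.get([sort_chunk.remote(c) for c in chunks])
    # Merge the sorted chunks (simple concatenate-and-sort for illustration).
    return np.sort(np.concatenate(sorted_chunks))

for size in [1_000_000, 5_000_000, 10_000_000]:
    arr = np.random.rand(size)

    start = time.time()
    np.sort(arr)
    print(f"local sort,       n={size}: {time.time() - start:.3f}s")

    start = time.time()
    distributed_sort(arr)
    print(f"distributed sort, n={size}: {time.time() - start:.3f}s")
```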
Closes #20960
This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets.
Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.
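As a hedged sketch of the intended usage (exact output format aside), stats become available after a dataset executes:

```python
# Minimal sketch: run an instrumented op, then inspect the stats summary.
import ray

ds = ray.data.range(1000).map(lambda x: x * 2)
ds.take(5)  # trigger execution so per-block execution stats get recorded
print(ds.stats())  # summary built from stats attached to block metadata
```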
Right now in Ray, a lot of gRPC-related edge cases are not tested. This PR is a simple first step toward giving developers a way to delay gRPC requests. It can be used for manual testing and also for e2e tests, since it supports delays for specific gRPC methods.
To use this feature, simply set the OS environment variable `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"`.
This means `method1` will be delayed by 10-20us, `method2` by 20-30us, and all remaining methods by 200us.
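A minimal sketch of setting the variable from a test script (assumes the Ray processes started from this environment inherit and honor it; the method names are placeholders):

```python
# Hedged sketch: set the delay spec before Ray's processes start so they
# inherit it from the environment.
import os

os.environ["RAY_TESTING_ASIO_DELAY_US"] = "method1=10:20,method2=20:30,*=200:200"

import ray

ray.init()
# ... exercise the gRPC code paths under test; method1/method2 calls are
# delayed by 10-20us / 20-30us, everything else by 200us ...
ray.shutdown()
```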
So I have an AMD machine with many cores and 32GB of memory. When I do `pip install -e .`, my machine crashes because Bazel tries to use all the cores but quickly runs out of memory. There seems to be no native way to use environment variables to tell Bazel to limit its resource consumption, but there is a `--local_cpu_resources` command-line option.
This PR exposes that option to `pip install` via an environment variable. I also went through setup.py and documented all the environment variables I could find.
Datasets docs for last-mile preprocessing, particularly geared towards ML ingest. This adds groupby, aggregation, and random shuffling examples to the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides preprocessing recipes for a few common transformations.
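A hedged sketch of the style of recipe added (column names and data are made up):

```python
# Minimal sketch of groupby/aggregation and random shuffling with Datasets.
import ray

ds = ray.data.from_items([{"group": i % 3, "value": float(i)} for i in range(100)])

# Groupby + aggregation.
counts = ds.groupby("group").count()
print(counts.take())

# Global random shuffle, e.g. before feeding an ML trainer.
shuffled = ds.random_shuffle()
```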
This PR adds support for publishing and subscribing to logs in Python via GCS pubsub. It also refactors the Python threaded subscriber to support subscribing and calling `close()` from multiple threads.
We could also move the tests and logging support to another PR, but that would make the purpose of the refactoring seem less obvious.
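For context, a generic sketch of the thread-safety pattern described (this is not Ray's actual GCS subscriber code, just the shape of the close-from-any-thread behavior):

```python
# Generic sketch: close() can race with pollers blocked on other threads.
import threading
import queue

class ThreadedSubscriber:
    _CLOSE = object()  # sentinel used to unblock and terminate pollers

    def __init__(self):
        self._closed = threading.Event()
        self._queue = queue.Queue()

    def on_message(self, msg):
        # Stand-in for the receive path feeding the subscriber.
        self._queue.put(msg)

    def poll(self):
        # Safe from multiple threads; returns None once the subscriber closes.
        if self._closed.is_set():
            return None
        msg = self._queue.get()
        if msg is self._CLOSE:
            self._queue.put(self._CLOSE)  # re-post so other pollers wake too
            return None
        return msg

    def close(self):
        # Idempotent; callable from any thread, even while poll() is blocked.
        if not self._closed.is_set():
            self._closed.set()
            self._queue.put(self._CLOSE)
```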
This PR makes block splitting off by default. This makes it easier to debug problems potentially related to this feature. Criteria for enabling it by default:
- We're confident all nightly tests pass (currently, there may be an issue with large-scale groupby with block splitting).
- We're confident lineage-based reconstruction can work with block splitting.
This PR introduces a TrialCheckpoint class, which is returned e.g. by ExperimentAnalysis.best_checkpoint. The class enables easy access to cloud storage locations (rather than just local directories, as before). It also comes with utilities to download, upload, and save trial checkpoints to local and cloud targets.
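A hedged usage sketch based on the description above (method names follow the download/upload/save utilities mentioned; exact signatures may differ):

```python
# Sketch: access the best checkpoint as a TrialCheckpoint object.
import os
from ray import tune

def trainable(config, checkpoint_dir=None):
    with tune.checkpoint_dir(step=0) as d:
        with open(os.path.join(d, "model.txt"), "w") as f:
            f.write("weights")
    tune.report(score=config["x"])

analysis = tune.run(
    trainable,
    config={"x": tune.grid_search([1, 2])},
    metric="score",
    mode="max",
)

ckpt = analysis.best_checkpoint                # TrialCheckpoint, not a bare path
ckpt.download(local_path="/tmp/best_ckpt")     # fetch from cloud storage if any
ckpt.upload(cloud_path="s3://my-bucket/ckpt")  # hypothetical bucket
ckpt.save("/tmp/best_ckpt_copy")               # save to a local or cloud target
```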
Addresses follow-up comments from https://github.com/ray-project/ray/pull/19863:
- Add short "Concepts" section
- Add more section headings to break up the text
- Add "Workflow: Local Files" example
- Add "Workflow: Library development" example