I updated this version compatibility table on the release branch but didn't update it on master. This was my mistake; the correct process is to make a PR to master and then cherry-pick that commit to the release branch.
This PR finishes most of the stats TODOs for Datasets. The main thing deferred to future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Dask defaults to a disk-based shuffle even though we're using a distributed scheduler, which appears to result in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well.
This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler and changes the default shuffle to the task-based shuffle. The user can still override the shuffle method per-call, e.g. via `df.set_index(shuffle="disk")`.
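A rough usage sketch (the global, non-context-manager call form shown here is an assumption based on this description):

```python
import ray
import dask.dataframe as dd
import pandas as pd

from ray.util.dask import enable_dask_on_ray

ray.init()

# Set Dask-on-Ray as the default scheduler and switch the default
# shuffle from the disk-based to the task-based algorithm.
enable_dask_on_ray()

df = dd.from_pandas(
    pd.DataFrame({"a": range(100), "b": range(100)}), npartitions=4
)

# set_index triggers a shuffle; it now uses the task-based shuffle by
# default, while shuffle="disk" would override it for this call.
print(df.set_index("a").compute())
```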
This adds an initial Dataset.stats() framework for debugging dataset performance. At a high level, execution stats for tasks (e.g., CPU time) are attached to block metadata objects. Datasets have stats objects that hold references to these stats and parent dataset stats (this avoids stats holding references to parent datasets, allowing them to be gc'ed). Similarly, DatasetPipelines hold stats from recently computed datasets.
Currently only basic ops like map / map_batches are instrumented. TODO placeholders are left for future PRs.
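A minimal sketch of the intended usage (exact output format aside):

```python
import ray

# range() and map() produce instrumented blocks; execution stats such
# as CPU time are attached to each block's metadata.
ds = ray.data.range(1000).map(lambda x: x * 2)
ds.take(5)

# stats() summarizes those per-block stats for this dataset and its
# parents (held by reference, so parent datasets can still be GC'd).
print(ds.stats())
```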
This PR adds Datasets docs for last-mile preprocessing, particularly geared toward ML ingest. It adds groupby, aggregation, and random-shuffle examples to the overview page (not present previously), adds some concreteness to our last-mile preprocessing positioning, and provides preprocessing recipes for a few common transformations.
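For example, a recipe in the spirit of those docs (the snippets on the actual overview page may differ):

```python
import ray

ds = ray.data.range(100)

# Group-by plus aggregation: mean of each residue class mod 3.
print(ds.groupby(lambda x: x % 3).mean().take())

# Global random shuffle, e.g. re-shuffling between epochs of ML ingest.
shuffled = ds.random_shuffle()
```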
This PR gates block splitting behind a feature flag and makes it off by default, which makes it easier to debug problems potentially related to this feature (see the opt-in sketch after the list below). Criteria for enabling it by default:
- We're confident all nightly tests pass (currently, there may be an issue with large-scale groupby with block splitting).
- We're confident lineage-based reconstruction can work with block splitting.
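A minimal opt-in sketch while the flag is off by default (the `block_splitting_enabled` attribute name on `DatasetContext` is an assumption; check your Ray version for the exact field):

```python
import ray
from ray.data.context import DatasetContext

# Opt into block splitting for this session; the flag name is assumed.
ctx = DatasetContext.get_current()
ctx.block_splitting_enabled = True

# Subsequent reads and map transforms may now split oversized blocks.
ds = ray.data.range(1_000_000)
```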
The default block size of 500MiB seems too low for some common workloads, e.g. shuffling 500GB. That creates 1000 blocks, which means ~1 million (1000 × 1000) intermediate shuffle objects until we implement #20500.
This PR adds support for automatic block splitting on read and map transforms, to keep block size bounded to ~500MiB. This avoids potential OOM situations where a map task consumes too much intermediate Python heap memory or too much object store shared memory for a single block.
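A hedged sketch of tuning that bound (the `target_max_block_size` attribute name is an assumption, and the S3 path is a placeholder):

```python
import ray
from ray.data.context import DatasetContext

# Raise the target block size from the ~500MiB default so a large
# shuffle produces fewer blocks and fewer intermediate objects.
ctx = DatasetContext.get_current()
ctx.target_max_block_size = 2 * 1024**3  # 2GiB; attribute name assumed

# Reads and map transforms split their outputs to stay near the target.
ds = ray.data.read_parquet("s3://example-bucket/path")  # placeholder path
ds = ds.map_batches(lambda batch: batch)
```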
## Why are these changes needed?
- Since broadcasting is moving to gRPC, introduce an option to increase the number of client-side threads.
- For hybrid scheduling, ignore the threshold if the GCS-based actor scheduler is enabled.
With these fixes, the actor creation rate goes from ~140 actors/s to >600 actors/s.
## Related issue number
* start
* check formatting
* undo changes from base branch
* Client builder API docs
* indent
* 8
* minor fixes
* absolute path to runtime env docs
* fix runtime_env link
* Update worker.init docs
* drop clientbuilder docs, link to 1.4.1 docs instead. Specify local:// behavior when address passed
* add debug info for ray.init("local")
* local:// attaches a driver directly
* update ray.init return wording
* remove init.connect() from example
* drop local:// docs, add section on when to use ray client
* link to 1.4.1 docs in code example instead of mentioning clientbuilder
* fix backticks, doc mentions of ray.util.connect
* remove ray.util.connect mentions from examples and comments
* update tune example
* wording
* localhost:<port> also works if you're on the head node
* add quotes
* drop mentions of ray client from ray.init docstring
* local->remote
* fix section ref
* update ray start output
* fix section link
* try to fix doc again
* fix link wording
* drop local:// from docs and special handling from code
* update ray start message
* lint
* doc lint
* remove local:// codepath
* remove 'internal_config'
* Update doc/source/cluster/ray-client.rst
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
* doc suggestion
* Update doc/source/cluster/ray-client.rst
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>