This PR revamps and aligns the README and Ray intro doc page:
New "What is Ray" diagram that introduces AIR vs Ray core (diagram TBD finalized, this is the working placeholder)
Update the description of Ray
Link out to the user guides for key libraries and key concepts
Remove old / broken links, as well as the inline library descriptions from the README
Support a GPU column for the new dashboard
Have the first node be expanded by default
Signed-off-by: Alan Guo <aguo@anyscale.com>
Fixes #13889
Addresses comment from #26996
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
This is an important feature to prevent feature-set regressions when users migrate from 1.0 to 2.0.
Update API references to beta. Needed as we are going to beta in 2.0.
I left out RL/Scikit-Learn/HuggingFace.
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [air] fix xgboost_benchmark script by passing in args (#27146)
* [tune/docs] Update custom syncer example (#27252)
There is a small bug in the docs example for custom command-based syncers. This PR fixes it and adds a test covering the change.
Signed-off-by: Kai Fricke <kai@anyscale.com>
* [tune/release] Do not use spot instances in k8s tests (#27250)
Spot instances are not being booted up, so let's go without them.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Editing pass over the tensor support docs for clarity:
Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
Replace more occurrences of tune.run() with Tuner.fit() in examples/docstrings (see the sketch below).
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
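A minimal sketch of the replacement pattern, assuming the Ray 2.0 Tune API; the toy objective and search space are illustrative only:

```python
from ray import tune
from ray.tune import Tuner

def objective(config):
    # Toy objective; report a single metric back to Tune.
    tune.report(score=config["x"] ** 2)

# Old style being phased out in examples/docstrings:
# analysis = tune.run(objective, config={"x": tune.grid_search([1, 2, 3])})

# New style:
tuner = Tuner(objective, param_space={"x": tune.grid_search([1, 2, 3])})
results = tuner.fit()
```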
We need to check the time after acquiring the lock to ensure correctness. Otherwise, a caller might wait on the lock while the heartbeat is updated in the meantime, and then act on a stale timestamp.
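A minimal Python sketch of the check-after-lock pattern; the class and method names here are hypothetical (the real code is C++ in the GCS):

```python
import threading
import time

class HeartbeatTracker:
    """Hypothetical illustration of reading the clock only under the lock."""

    def __init__(self, timeout_s: float):
        self._lock = threading.Lock()
        self._last_heartbeat = time.monotonic()
        self._timeout_s = timeout_s

    def report_heartbeat(self) -> None:
        with self._lock:
            self._last_heartbeat = time.monotonic()

    def is_dead(self) -> bool:
        with self._lock:
            # Read the clock only after the lock is held, so the timestamp and
            # the heartbeat state are observed consistently. A clock value
            # captured before waiting on the lock can be older than a heartbeat
            # recorded while we waited, giving an inconsistent comparison.
            now = time.monotonic()
            return (now - self._last_heartbeat) > self._timeout_s
```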
Why are these changes needed?
Also:
Add validation to make sure multi-GPU and micro-batching are not used together (see the sketch after this list).
Update A2C learning test to hit the microbatching branch.
Minor comment updates.
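A hypothetical sketch of the kind of validation described above, assuming config keys named microbatch_size and num_gpus; the actual check in RLlib's A2C config may differ:

```python
def validate_a2c_config(config: dict) -> None:
    # Micro-batching accumulates gradients on a single device, so it cannot
    # be combined with multi-GPU data-parallel updates in the same config.
    if config.get("microbatch_size") is not None and config.get("num_gpus", 0) > 1:
        raise ValueError(
            "A2C: `microbatch_size` cannot be used together with multi-GPU "
            "training; set `num_gpus <= 1` or disable micro-batching."
        )
```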
Currently running into an issue:
Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Following up on #27098, this PR renames the baseworker mixin and declutters training output by logging only for rank 0 actors (see the sketch below).
Signed-off-by: Kai Fricke <kai@anyscale.com>
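A minimal sketch of the decluttering pattern, with a hypothetical helper name (the renamed mixin in Ray Train is not reproduced here):

```python
import logging

logger = logging.getLogger(__name__)

def log_on_rank_zero(world_rank: int, msg: str) -> None:
    # Hypothetical helper: only the rank-0 worker emits the message, so N
    # distributed workers do not print N copies of the same progress line.
    if world_rank == 0:
        logger.info(msg)
```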
The heartbeat manager starts its own thread to run its background task, and that task shares the same data structure used within HandleReportHeartbeat (heartbeats_). Both should therefore run in the same thread. This PR achieves that by running HandleReportHeartbeat within the io_service thread.
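A rough Python analog of the fix, using asyncio in place of the C++ io_service; names are illustrative only:

```python
import asyncio
import threading

class HeartbeatManager:
    """Illustrative analog: all access to `heartbeats` happens on one thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.heartbeats = {}  # touched only from the loop's thread
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def handle_report_heartbeat(self, node_id: str) -> None:
        # Called from RPC threads: post the update onto the manager's loop
        # instead of mutating `heartbeats` directly, so the background task
        # and this handler never touch the shared dict from different threads.
        self.loop.call_soon_threadsafe(self.heartbeats.__setitem__, node_id, True)
```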
This change cuts off support for deprecated schema fields. It intentionally breaks backwards compatibility with old configs that set a global min_workers or use the head_node, worker_nodes, autoscaling_mode, initial_workers, target_utilization_fraction, or default_worker_node_type fields (see the sketch below).
Co-authored-by: Alex <alex@anyscale.com>
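A hypothetical sketch of the stricter behavior as a simple top-level check; field names come from the list above, and the actual autoscaler validation may differ:

```python
REMOVED_TOP_LEVEL_FIELDS = {
    "min_workers",  # the global form; per-node-type min_workers remains valid
    "head_node",
    "worker_nodes",
    "autoscaling_mode",
    "initial_workers",
    "target_utilization_fraction",
    "default_worker_node_type",
}

def reject_deprecated_fields(config: dict) -> None:
    # Fail fast instead of silently migrating legacy fields.
    removed = REMOVED_TOP_LEVEL_FIELDS & config.keys()
    if removed:
        raise ValueError(
            f"Cluster config uses removed fields: {sorted(removed)}. "
            "Please migrate to the available_node_types-based schema."
        )
```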
Fix for an unintentional backwards-compatibility breakage from #25902.
The job submit API should still accept job_id as a parameter (see the sketch below).
Signed-off-by: Alan Guo <aguo@anyscale.com>
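A minimal sketch of the behavior being restored, assuming the Python JobSubmissionClient; the exact parameter set is an assumption, so check the 2.0 API reference:

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# `job_id` should still be accepted for backwards compatibility, even though
# newer code may prefer a different identifier parameter.
client.submit_job(
    entrypoint="python my_script.py",
    job_id="my-legacy-job-id",
)
```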
fb54679 introduced a bug by calling ray.put in the remote _split_single_block. This changes ownership from the driver to the worker that runs _split_single_block, which breaks the dataset's lineage requirement and failed the chaos test.
To fix the issue we need to ensure the split block refs are created by the driver, which we can achieve by creating the block_refs as part of the function's return values.
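A minimal sketch of the ownership pattern, using a toy split function rather than the actual _split_single_block implementation:

```python
import ray

@ray.remote(num_returns=2)
def split_single_block(block):
    mid = len(block) // 2
    # Do NOT call ray.put() here: refs created inside the task are owned by
    # the worker, which breaks lineage reconstruction if that worker dies.
    # Returning the halves instead makes the resulting ObjectRefs owned by
    # the caller (the driver).
    return block[:mid], block[mid:]

ray.init()
left_ref, right_ref = split_single_block.remote(list(range(10)))
print(ray.get([left_ref, right_ref]))
```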