Commit graph

13734 commits

Author SHA1 Message Date
Eric Liang
ce0ca572b9
[docs] Change data tagline to "Distributed Data Preprocessing" (#27434) (#27478) 2022-08-03 17:27:35 -07:00
Eric Liang
b10cf9027c
[docs] Update colors and styling of ray diagrams (#27474) (#27476) 2022-08-03 16:52:16 -07:00
Eric Liang
c5d51fca25
[docs] Improve the AIR introductory page (#27347) (#27472) 2022-08-03 16:05:03 -07:00
shrekris-anyscale
ba76e4b1a4
[Serve] Make serve.run() and deployment.bind() beta APIs (#27433) 2022-08-03 16:02:43 -07:00
Eric Liang
ca7d3285a7
[docs] Revamp README and Ray intro doc page (#27405) (#27458)
This PR revamps and aligns the README and Ray intro doc page:

New "What is Ray" diagram that introduces AIR vs. Ray Core (the diagram is still to be finalized; this is a working placeholder)
Update the description of Ray
Link out to the user guides for key libraries and key concepts
Remove old / broken links, as well as the inline library descriptions from the README
2022-08-03 14:48:33 -07:00
Alan Guo
8ad147864e
bump jobs version after making a backwards-incompatible change (#27281) (#27316)
The backwards-incompatible change was #25902.

2.0.0 cherry-pick, but not an rc0 blocker.

Signed-off-by: Alan Guo <aguo@anyscale.com>
2022-08-03 14:18:25 -07:00
Jiajun Yao
fea259593b
[Cherry Pick] Support placement_group=None in PlacementGroupSchedulingStrategy (#27370) (#27416) 2022-08-03 12:14:51 -07:00
Alan Guo
2ad2cb259d
Add GPU info to new dashboard (#27074) (#27399)
Support a GPU column for the new dashboard

Expand the first node by default

Signed-off-by: Alan Guo <aguo@anyscale.com>

Fixes #13889

Addresses a comment from #26996
2022-08-03 11:54:53 -07:00
Simon Mo
f0d7ce9080
[Serve] [Pick] Fix Graph Repeated Invocation (#27417) (#27420) 2022-08-03 10:25:18 -07:00
Jimmy Yao
365446265b
[ray 2.0 release] fix the release test of ray lightning master (#27395) 2022-08-03 09:40:57 -07:00
Simon Mo
de0c70714f
[Serve] ServeHandle detects ActorError and drops replicas from the target group (#26685) 2022-08-02 23:09:00 -07:00
Yi Cheng
c50b9ac2fa
[workflow] Change step to task in workflow (#27330) (#27403)
We recently deprecated step in workflow. This PR wraps up the migration and replaces step with task throughout workflow to reflect the recent changes.
2022-08-02 17:22:30 -07:00
Simon Mo
64794c88ee
[Serve] Support Multiple DAG Entrypoints in DAGDriver (#26573) (#27349)
Co-authored-by: Sihan Wang <sihanwang41@gmail.com>
This is an important feature to prevent a regression in the feature set when users migrate from 1.0 to 2.0.
2022-08-02 17:17:17 -07:00
Eric Liang
ccc5f44513
[air] Update to beta (#27393) (#27407)
Update API references to beta. Needed as we are going to beta in 2.0.

I left out RL/Scikit-Learn/HuggingFace.

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-02 17:12:54 -07:00
Eric Liang
61b2a26035
[air] Fix BatchPredictor.predict_pipelined not working with GPU stage (#27232) (#27398) 2022-08-02 15:40:44 -07:00
Kai Fricke
0d4d4e14a9
[release/tune/2.0.0] Fix k8s release test + node-to-node syncing (#27365)
* [air] fix xgboost_benchmark script by passing in args (#27146)

* [tune/docs] Update custom syncer example (#27252)

There is a small bug in the docs example for custom command-based syncers. This PR fixes it and adds a test to cover these changes.

Signed-off-by: Kai Fricke <kai@anyscale.com>

* [tune/release] Do not use spot instances in k8s tests (#27250)

Spot instances are not being booted up, so let's go without them.

Signed-off-by: Kai Fricke <kai@anyscale.com>

Co-authored-by: matthewdeng <matt@anyscale.com>
2022-08-02 14:46:12 -07:00
Eric Liang
b1f933a17d
[docs] Reorganize the tensor data support docs; general editing (#26952) (#27355)
Editing pass over the tensor support docs for clarity:

Make heavy use of tabbed guides to condense the content
Rewrite examples to be more organized around creating vs reading tensors
Use doc_code for testing
2022-08-02 12:39:57 -07:00
Kai Fricke
0fa2806554
[docs/2.0.0] Fix Tune custom syncer example (#27253)
Co-authored-by: matthewdeng <matt@anyscale.com>
2022-08-01 20:29:04 -07:00
Siyuan (Ryans) Zhuang
990a4534af
[Workflow] Cleanup workflow docs (#27217)
Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-08-01 18:04:33 -07:00
xwjiang2010
18ec3afdc6
[ air ] clean up some more tune.run (#27117) (#27321)
More replacements of tune.run() with Tuner.fit() in examples and docstrings

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-08-01 17:29:42 -07:00
matthewdeng
1a62a8f855 [tune] pin pymoo (#27311)
Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-07-31 01:16:31 -07:00
Yi Cheng
a3f428e330
[ray2.0][ci] Fix test_gcs_ha_e2e.py (#27263) (#27280)
This PR fixes the broken test. The test failed because it was not installing the latest wheel.
2022-07-30 00:12:24 -07:00
Yi Cheng
9391008bc0
[ci] Deflake gcs_heartbeat_test on Windows. (#27275) (#27294)
We need to check the time after acquiring the lock to ensure correctness. Otherwise, the test might block waiting for the lock, and by the time it acquires the lock the heartbeat has already been updated.
2022-07-30 00:10:53 -07:00
scv119
c0fd69a33b Revert "[autoscaler] Remove deprecated fields from schema (#27040) (#27200)"
This reverts commit cd1ba2da80.
2022-07-29 15:37:52 -07:00
Jun Gong
24976ef23a
[RLlib] Revert 41c9ef70. (#27243) (#27270)
Why are these changes needed?
Also:
Add validation to make sure multi-GPU and micro-batching are not used together.
Update A2C learning test to hit the microbatching branch.
Minor comment updates.
2022-07-29 12:02:23 -07:00
Guyang Song
860fe6ccdd
[Hotfix] Fix the failure of C++ tests (#27249) (#27260)
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
2022-07-29 11:59:42 -07:00
xwjiang2010
4bf33efd5c
[air] Add annotation for Tune module. (#27060) (#27210)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>

As a follow-up to #27060.
2022-07-29 11:11:45 -07:00
Eric Liang
59f7a82821
Fix logger initialization (#27238) (#27264)
Cherry-pick #27238 into 2.0 branch.
2022-07-29 11:09:48 -07:00
Jiao
f07f2d8621 [AIR][Data] Fix nyc_taxi_basic_processing notebook (#26983) 2022-07-29 10:24:25 -07:00
matthewdeng
86718071fe
[tune] Increase volume size for long running pbt failure (#27163) (#27247)
Currently running into an issue:

Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than  snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-29 01:16:40 -07:00
Kai Fricke
c680837289
[air/train/release/2.0.0] Rename BaseWorkerMixin, only log info torch loop for rank 0 (#27228)
Following up from #27098, this PR renames the baseworker mixin and declutters training output by only logging for rank 0 actors.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-29 09:04:02 +01:00
Jian Xiao
0916807fb1 Fix the ray version to doc version mapping (#27191)
Why are these changes needed?
The mapping does not work when the Ray version is something like "2.0.0rc0".
2022-07-28 23:36:52 -07:00
Chen Shen
8b7ccf8502 [CI][hotfix] remove no-index
--no-index prevents pip from installing packages from PyPI. This breaks CI because pip fails to find grpcio==1.43.0, which is missing from the cache.
2022-07-28 23:31:24 -07:00
SangBin Cho
1b1787a9aa [Test] Try fixing a flaky gcs heartbeat manager test. (#27096)
The heartbeat manager starts its own thread to run its background task, and that task shares the same data structure (heartbeats_) used within HandleReportHeartbeat. Therefore, both methods should run on the same thread. This PR achieves that by running HandleReportHeartbeat within the io_service thread.
2022-07-28 22:42:49 -07:00
Chen Shen
96f9b9506f Revert "Allow grpcio >= 1.48 (#26765)" (#27244)
This reverts commit 6acd0a4c9b.
2022-07-28 22:37:41 -07:00
Siyuan (Ryans) Zhuang
f371e17a7f
Fix flaky workflow events CI test by extending timeout (#27231)
Signed-off-by: Siyuan Zhuang <suquark@gmail.com>
2022-07-28 19:04:34 -07:00
Jimmy Yao
2a0a086ffa
[hot fix] Cherry pick/hot fix 0728 ray lightning (#27225)
unblock linter
2022-07-28 17:55:26 -07:00
Alex Wu
cd1ba2da80
[autoscaler] Remove deprecated fields from schema (#27040) (#27200)
This change cuts off support for deprecated schema fields. It intentionally breaks backwards compatibility with old configs that set a global min_workers or that use the head_node, worker_nodes, autoscaling_mode, initial_workers, target_utilization_fraction, or default_worker_node_type fields.

Co-authored-by: Alex <alex@anyscale.com>
2022-07-28 17:09:43 -07:00
Clark Zinzow
f7b46b3ecc
[AIR - Datasets] Fix AIR release tests dealing with tensor columns. (#27221) (#27224)
This PR fixes some AIR release tests that deal with tensor columns.
2022-07-28 16:40:48 -07:00
Guyang Song
950939c7dc
[hotfix] Fix the failure of java test (#27183) (#27192)
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
2022-07-29 07:28:07 +08:00
Yi Cheng
f50729b2ed
[ci] Move test_storage to large test because of windows timeout. #27212 (#27230)
The test actually passes on Windows, but it needs more than 300s. Move it to the large test suite.

Signed-off-by: Yi Cheng <chengyidna@gmail.com>
2022-07-28 15:15:32 -07:00
Kai Fricke
55e9e44a87
[tune/release/2.0.0] Gracefully fail in lstat lookup (#27226) 2022-07-28 15:15:16 -07:00
Kai Fricke
abca0ba165
[tune/release/2.0.0] Fix tune_cloud_aws_durable_upload_rllib_* release tests (#27180) 2022-07-28 15:14:49 -07:00
Kai Fricke
24ed249d7c
[air] fix xgboost_benchmark script by passing in args (#27146) (#27158)
Co-authored-by: matthewdeng <matt@anyscale.com>
2022-07-28 15:05:31 -07:00
Alan Guo
6014087505
[Dashboard] Fix node rows not being removed correctly when using filters (#27205) (#27223)
Cherry pick of #27205
2022-07-28 14:43:39 -07:00
Alan Guo
adedfdb0ba
Add back job_id to submit_job API to maintain backwards-compatibility (#27110) (#27202)
Fix for an unintentional backwards-compatibility breakage introduced in #25902:
the job submit API should still accept job_id as a parameter.

Signed-off-by: Alan Guo <aguo@anyscale.com>
2022-07-28 14:27:48 -07:00
Kai Fricke
dc0b445323
[rllib/release/2.0.0] Fix rllib connect test (#27162)
Why are these changes needed?
Follow-up to #27155; this lets the connect test pass.
2022-07-28 14:23:23 -07:00
Clark Zinzow
22ca30cd92
[Cherry-pick] [AIR - Datasets] Hide tensor extension from UDFs. (#27196) 2022-07-28 13:59:19 -07:00
Chen Shen
94cb7aca29
[Data][Split] Fix split ownership (#27149) (#27195)
fb54679 introduced a bug by calling ray.put in the remote _split_single_block. This changes the ownership from the driver to the worker that runs _split_single_block, which breaks Dataset's lineage requirement and caused the chaos test to fail.

To fix the issue, we need to ensure the split block refs are created by the driver, which we achieve by creating the block_refs as part of the function's return values.
2022-07-28 13:02:32 -07:00
Simon Mo
13c3400117
[Serve] Remove release tests for checkpoint_path (#27194) (#27206)
Cherry-pick commit 8beb887 to address #27189.
2022-07-28 13:00:03 -07:00