Commit graph

4844 commits

Author SHA1 Message Date
Yi Cheng
9136bb95d9
[workflow] Allow function without __module__ and __qualname__ (#17804) 2021-08-14 11:18:07 -07:00
Hasan Genc
6957ce66f6
Revert "Shutdown clusters when large number of nodes (#17642)" (#17836)
This reverts commit a33dc75105.
2021-08-14 04:57:22 +03:00
Thomas Desrosiers
3e48df89f7
[Client] Fix mismatched debug log ID formats (#17597) 2021-08-13 13:28:20 -07:00
Amog Kamsetty
9f5dc5ec9f
[Docker] Downgrade to CUDA 11.0 (#17806) 2021-08-13 20:39:06 +02:00
architkulkarni
fcac416933
[Serve] [Dashboard] Add start times and replica tags to cluster snapshot (#17749) 2021-08-13 09:49:12 -07:00
Eric Liang
7ec52ca311
Make the namespace argument explicit instead of implicit in actor names (#17758) 2021-08-13 09:24:13 -07:00
Hasan Genc
a33dc75105
Shutdown clusters when large number of nodes (#17642)
* Allow clusters with over 1000 nodes to be shut down

* Add unit-test for terminating large number of nodes on AWS

* Fix lint

* Add max_terminate_nodes to the NodeProvider abstract class, and refactor terminate_nodes to reduce repetition

* lint

* Update comment

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* lint

* lint

* Unit test previously required internet access. This commit removes that requirement.

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-08-13 17:09:19 +03:00
Kai Fricke
96b620bc01
[docker] Pin matplotlib, fix docker build (#17819) 2021-08-13 14:59:50 +01:00
xwjiang2010
0be9f06ab6
[tune] Output insufficient resources warning msg when trials are in pending for extended amount of time. (#17533) 2021-08-13 01:37:56 -07:00
Hao Zhang
61de23cbae
[Collective] silent the pygloo warning as it is not commonly used (#17792) 2021-08-13 00:47:45 -07:00
qicosmos
a2a1c46c83
[C++ Worker]Fix for mac (#17633)
* linkopts shared

* replace gflags with absl flags

* fix

* add test option

* fix

* add cpp worker to mac ci

* fix

* support empty redis password;mod arc argv

* add encoding

* test

* ignore example test on mac

* support mac

* fix

* fix and update doc

* fix

* fix run.sh

* fix init

* fix typo

* fix run.sh

* fix lint

Co-authored-by: 久龙 <guyang.sgy@antfin.com>
2021-08-13 12:22:37 +08:00
Simon Mo
242a5d1a8d
[Serve] Add support for root_url (#17765) 2021-08-12 17:54:53 -07:00
Simon Mo
22b030d79f
[Serve] Remove serve.start(http_*) arguments (#17762) 2021-08-12 17:50:12 -07:00
Eric Liang
7fc62a1529
Support dataset union (#17793) 2021-08-12 14:01:40 -07:00
Chen Shen
9565fa549e
[Core][RFC] limit the total number of inlined bytes in task request rpc
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-08-12 13:55:54 -07:00
Simon Mo
6879293b6b
[CI] Mark some tests exclusive (#17650) 2021-08-12 10:28:03 -07:00
SangBin Cho
8fd7e025be
Skip raylet kill windows #17682 (#17683)
* Try fixing it?

* Done

* skip raylet signal
2021-08-12 09:35:44 -07:00
matthewdeng
55680a1f9e
[SGD] v2 initial checkpoint functionality (#17632)
* [SGD] initial checkpoint functionality

* remove thread implementation and merge with fetch_next_result

* Update comment

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* address comments

* add additional tests

* fix imports

* load most recently saved checkpoint

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-08-12 08:52:04 -07:00
Clark Zinzow
d6eeb5dc70
[Datasets] Add local and S3 filesystem test coverage for file-based datasources. (#17158) 2021-08-12 08:39:31 -07:00
architkulkarni
00f6b30684
[Serve] [Dashboard] Support nondetached and multiple Serve instances in cluster snapshot (#17747) 2021-08-11 22:26:54 -05:00
Eric Liang
ce171f10a1
Remove legacy plasma unlimited and pull manager pinning flag (#17753) 2021-08-11 20:19:12 -07:00
Clark Zinzow
623db7c47b
[Datasets] Add support for reading partitioned Parquet datasets. (#17716) 2021-08-11 15:55:49 -07:00
Jiao
3c64a1a3c1
Add micro benchmark to releaser repo (#17727) 2021-08-11 15:15:33 -07:00
architkulkarni
9a70e83e90
[hotfix] pin tensorflow==2.5.1 (#17760)
* pin tensorflow==1.5.1

* Update python/requirements.txt

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-11 15:15:22 -07:00
Yi Cheng
aa96e59faf
[workflow] Examples of function chaining (#17715) 2021-08-11 13:15:51 -07:00
Yi Cheng
02e79f3fe5
Revert "[Observability] Export useful metrics (#17578)" (#17752)
This reverts commit bd4db53df2.
2021-08-11 12:21:50 -07:00
Jiao
e38db5875b
Add serve external kv store (#17622) 2021-08-11 12:06:14 -07:00
Amog Kamsetty
ed24bae644
[SGD] Fail if num_workers is not greater than 0 (#17723) 2021-08-11 10:05:19 -07:00
Ian Rodney
97f7ae5e06
[Cluster Launcher] Allow attach/exec on uninitialized head node (#17688) 2021-08-11 09:43:23 -07:00
chenk008
f0fc26960d
[sgd] Wait for placement_group deletion when shutdown worker_group (#17698)
* fix

* fix ut

* delete sleep

* fix according to comment

* fix according to comment

* use pg in test_resize

* fix
2021-08-11 08:47:49 -07:00
J K Terry
48e32555c8
[rllib] Update PettingZoo dependency versions (#17702)
* update pettingzoo dependency versions

* pettingzoo version

* fix tests
2021-08-11 01:19:19 -07:00
Shantanu
abc593561c
[client] fix ClientRemoteMethod error message (#17726)
Co-authored-by: hauntsaninja <>
2021-08-11 00:43:17 -07:00
Yi Cheng
bd4db53df2
[Observability] Export useful metrics (#17578)
* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* checkpoint

* up

* up

* up

* up

* fix

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* add comments

* up

* up

* up

* up

* add tests
2021-08-10 17:14:42 -07:00
SongGuyang
63c15d7ced
[core] make 'PopWorker' to be an async function (#17202)
* make 'PopWorker' to be an async function

* pop worker async works

* fix

* address comments

* bugfix

* fix cluster_task_manager_test

* fix

* bugfix of detached actor

* address comments

* fix

* address comments

* fix aioredis

* Revert "fix aioredis"

This reverts commit 041b983eac95b105ab0e853e84c4cf2647008431.

* bug fix

* fix

* fix test_step_resources test

* format

* add unit test

* fix

* add test case PopWorkerStatus

* address commit

* fix lint

* address comments

* add python test

* address comments

* make an independent function

* Update test_basic_3.py

Co-authored-by: Hao Chen <chenh1024@gmail.com>
2021-08-10 17:03:17 -07:00
xwjiang2010
932f038644
[tune] Type hint TrialExecutor. Use Abstract Base Class. (#17584) 2021-08-10 14:17:22 -07:00
Clark Zinzow
78d23434e6
[Datasets] Fix write_json so roundtrip writing + reading works. (#17691)
* Write out dataset blocks as newline-delimited JSON.

* Add roundtrip JSON reading + writing test.

* Formatting.
2021-08-10 13:24:33 -07:00
SangBin Cho
705a7192b3
Unflake multi node 3 (#17694) 2021-08-10 13:16:52 -07:00
SangBin Cho
6160c06c69
[Core] Fix a bug where get_actor crashes gcs if the actor is already killed. (#17670)
* Fix a bug where get_actor crashes gcs if the actor is already killed.

* Test the restart code path.

* Add an additional test

* Add a comment

* addressed code review.
2021-08-10 09:58:09 -07:00
yuduber
446ee1ad24
[autoscaler] Support Peloton node provider (#17312)
* modify updater to make it work with uber peloton node provider

* working solution for using NodeID as unique ID in peloton node provider but need run ray.init

* working solution of using resource cmd to pass in node_id

* cleanup

* cleanup 2

* removed updater.py change to make sure of the disable_node_updaters flag

* add newline to end of updater.py to undo all the change

* undo change in autoscaler.py

* use use_node_id_as_ip as field name in monitor

* lint

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* fix-for-monitor-without-autoscaler

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-08-10 12:48:11 -04:00
Antoni Baum
13f39b2cb7
[SGD] v2 JSON logger callback & callback groundwork (#17619)
* finish session

* finish

* formatting

* tests

* wip

* remove pdb

* remove import

* add tests

* raise from None

* Address comments

* Exception

* remove from None

* fix test

* address comments

* SGDv2 JSON Logging Callback

* Revert testing change

* Prefix autofilled metrics

* Move env var check to local

* Fix exception

* Improve docstrings, default filename

* Add unit test

* Implement feedback

* SGDLoggingCallback to SGDSingleFileLoggingCallback

* Use env_integer

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-08-09 21:18:46 -07:00
Yi Cheng
473740b739
[gcs] Fix actor killing hang due to race condition (#17634)
* Revert "Revert "[gcs] Fix actor killing race condition (#17456)" (#17599)"

This reverts commit 381ffdb6d0.

* update

* format

* up
2021-08-09 21:11:26 -07:00
SangBin Cho
d05571af2d
Fix a progress bar issue and add it to the nightly (#17627)
* Fix a progress bar issue and add it to the nightly

* Trial

* in progress

* Fix issues.
2021-08-09 19:31:47 -07:00
Dmitri Gekhtman
c1b9f921a6
[autoscaler] Add option of returning node metadata from non_terminated_nodes. (#17273)
* Optional return from nonterminated

* format

* terminated node signature

* add return
2021-08-09 20:23:31 -04:00
Ian Rodney
6475fe1b82
[Autoscaler][Docker] Warn if a file is passed in Docker File Mounts (#16515)
Co-authored-by: Ian Rodney <ilr@anyscale.com>
2021-08-09 15:13:58 -07:00
Richard Liaw
bde14f2de6
[tune] add developer/stability annotations (#17442)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-09 14:50:59 -07:00
Dmitri Gekhtman
07a42a5bdb
resolve (#17495) 2021-08-09 17:33:16 -04:00
Siyuan (Ryans) Zhuang
68e884ee43
[workflow] Test fault tolerance with storage (#17641)
* new test

* update storage

* enhance test

* fix s3
2021-08-09 11:19:14 -07:00
wanxing
8312628c30
Remove unused Spill function (#17607) 2021-08-09 10:10:03 -07:00
Simon Mo
7a0b8982f3
[serve] Return Client on serve.start() when connecting (#17552) 2021-08-09 10:55:05 -05:00
architkulkarni
bbcb06d45b
[doc] [runtime_env] Remove "experimental" label, add beta stability annotation (#17651) 2021-08-09 10:54:28 -05:00