Commit graph

4798 commits

Author SHA1 Message Date
Siyuan (Ryans) Zhuang
68e884ee43
[workflow] Test fault tolerance with storage (#17641)
* new test

* update storage

* enhance test

* fix s3
2021-08-09 11:19:14 -07:00
wanxing
8312628c30
Remove unused Spill function (#17607) 2021-08-09 10:10:03 -07:00
Simon Mo
7a0b8982f3
[serve] Return Client on serve.start() when connecting (#17552) 2021-08-09 10:55:05 -05:00
architkulkarni
bbcb06d45b
[doc] [runtime_env] Remove "experimental" label, add beta stability annotation (#17651) 2021-08-09 10:54:28 -05:00
Tao Wang
5990b60f8b
[Core]Cache named actor in local in case of getting them from GCS frequently. (#17339)
* [Core]Cache named actor in local in case of getting them from GCS frequently

* lint

* fix nullptr

* typo

* add namespace to cache

* lint

* lock, reference and others

* lint

* fix comments and add test

* lint

* lint

* optimize test

* add necessary fields in pub for caching

* add removing test

* fix test
2021-08-09 14:01:57 +08:00
SangBin Cho
1bcab9a7bb
[Object Spilling] Better error message for nightly test debugging (#17645)
* Fix

* Addressed code review.

* Addressed code review.
2021-08-08 20:44:49 -07:00
Hao Chen
0858f0e4f2
Change core worker C++ namespace to ray::core (#17610) 2021-08-08 23:34:25 +08:00
Simon Mo
c315596ed2
[Buildkite] Migrate macOS wheel builds (#16913) 2021-08-07 21:54:34 -07:00
Qing Wang
4cc34588db
[Core] Support ConcurrentGroup part1 (#16795)
* Core change and Java change.

* Fix void call.

* Address comments and fix cases.

* Fix asyncio
2021-08-07 22:41:33 +08:00
architkulkarni
f4c70be7f7
[Serve] Add replica tag to request counter and error counter (#17613) 2021-08-06 15:35:34 -07:00
architkulkarni
6d975b821b
[Serve] [Dashboard] Initial PR for exporting Serve data to cluster snapshot (#17489) 2021-08-06 15:03:29 -07:00
Edward Oakes
57b190c987
[serve] Remove logic to automatically infer conda env name (#17639) 2021-08-06 13:27:23 -05:00
Amog Kamsetty
f0cca063ad
[SGD v2] Reduce time for HF smoke test (#17623)
* reduce

* switch back model

* Update python/ray/util/sgd/v2/BUILD
2021-08-05 21:04:34 -07:00
Stephanie Wang
a06d71477f
[core] Do not spill back tasks blocked on args to blocked nodes (#17550)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-08-05 20:43:32 -07:00
Amog Kamsetty
add6ceb3ec
[Dependencies] Fix missing dependency UX (#17420)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-05 20:18:42 -07:00
Amog Kamsetty
14b02c3341
Add ray.data symlink to setup-dev.py (#17624) 2021-08-05 19:51:15 -07:00
Chen Shen
0fd3f761b9
[ci][rfc] build debug wheels and run python test on debug build (#17399)
* enable debug mode

* add

* upload debug wheels

* upload debug wheels

* add

* fix bug

* add dbg

* Update python/setup.py

Co-authored-by: Simon Mo <simon.mo@hey.com>

* skip windows

Co-authored-by: Simon Mo <simon.mo@hey.com>
2021-08-05 17:58:19 -07:00
SangBin Cho
8bc9286296
Remove an unused profile event code from object manager. (#17529)
* Remove an unused profile event code from object manager.

* Addressed code review.

* Temporarily skip a test

* lint
2021-08-05 17:13:16 -07:00
SangBin Cho
d59d6ad653
[RFC][Usability] Improve general Ray stacktrace including adding Actor repr (#17389)
* 1. Added a label to the stack trace. 2. Remove ray code from user stacktrace. Improve stacktrace message.

* Add a test to the build

* Fix the issue

* Addressed code review.

* Addressed code review and debugging

* fix

* Try fixing tests.

* Fixed the issue.

* Fixed a bug for real. Tests need to be re-written

* Try one test.

* Formatting

* Addressed code review.

* Addressed the last code review.
2021-08-05 17:12:24 -07:00
SangBin Cho
99b26b476d
Fix flaky windows reconstruction test (#17564) 2021-08-05 17:10:54 -07:00
Amog Kamsetty
e4cf26ea6e
[SGD] v2 Prototype sgd.report() implementation (#17536)
* finish session

* finish

* formatting

* tests

* wip

* remove pdb

* remove import

* add tests

* raise from None

* Address comments

* Exception

* remove from None

* fix test

* address comments

* Update python/ray/util/sgd/v2/constants.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* add tests for session

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-08-05 16:03:21 -07:00
SangBin Cho
381ffdb6d0
Revert "[gcs] Fix actor killing race condition (#17456)" (#17599)
This reverts commit 521457b51b.
2021-08-05 15:54:03 -07:00
Edward Oakes
839ceba6db
[serve] Replace "backend" with "deployment" in metrics & logging (#17434) 2021-08-05 17:37:21 -05:00
architkulkarni
e84ae6caa5
[Core] [runtime env] Avoid spurious worker startup (#17422) 2021-08-05 15:46:23 -05:00
SangBin Cho
667851f0ad
Prototype done. (#17603) 2021-08-05 13:32:44 -07:00
Eric Liang
8ff3fce4ba
Add a warning if the number of queued tasks to an actor exceeds 5k (#17581) 2021-08-05 12:03:48 -07:00
Amog Kamsetty
be238e159d
[Tune] Update docs for with_parameters (#17441)
* with_parameters_doc

* update docstring

* address comments
2021-08-05 08:48:34 -07:00
architkulkarni
3ae5229b44
[core] Skip adding "script directory" to workers' sys.path when in interactive shell (#17556) 2021-08-05 10:05:19 -05:00
Siyuan (Ryans) Zhuang
ffe5b45cc1
[workflow] Enable test (#17585) 2021-08-04 21:18:50 -07:00
matthewdeng
1eca6ac154
[SGD] v2 alpha: Tensorflow Backend (#17532)
* [SGD] Implement Tensorflow Backend

* address comments

* address comments

* format
2021-08-04 16:49:50 -07:00
Eric Liang
6db63990af
Don't capture child tasks in placement groups by default (#17527) 2021-08-04 16:09:45 -07:00
Chen Shen
53a0c74413
[nightly-test] fix non_streaming_shuffle_1tb_5000_partitions 2021-08-04 16:06:53 -07:00
Eric Liang
d4f9d3620e
Move ray.data out of experimental (#17560) 2021-08-04 13:31:10 -07:00
architkulkarni
63708468df
[runtime env] [Doc] Runtime env doc and messaging improvements (#17547) 2021-08-04 12:28:42 -07:00
Siyuan (Ryans) Zhuang
e3c09b0af1
[Workflow] Fix nested virtual actor (#17565)
* fix nested actor

* fix nested actor serialization

* one more example

* update exception message
2021-08-04 10:46:45 -07:00
SangBin Cho
3d13781e67
[Test] Unflake raylet signal test (#17563) 2021-08-04 10:38:59 -07:00
Yi Cheng
521457b51b
[gcs] Fix actor killing race condition (#17456) 2021-08-04 10:37:56 -07:00
Eric Liang
cb48f3a712
Be more conservative in warning about too many workers (#17531) 2021-08-03 22:30:18 -07:00
Chris K. W
a33cbec12a
[client][docs] update docs for new client support in init (#17333)
* start

* check formatting

* undo changes from base branch

* Client builder API docs

* indent

* 8

* minor fixes

* absolute path to runtime env docs

* fix runtime_env link

* Update worker.init docs

* drop clientbuilder docs, link to 1.4.1 docs instead. Specify local:// behavior when address passed

* add debug info for ray.init("local")

* local:// attaches a driver directly

* update ray.init return wording

* remove init.connect() from example

* drop local:// docs, add section on when to use ray client

* link to 1.4.1 docs in code example instead of mentioning clientbuilder

* fix backticks, doc mentions of ray.util.connect

* remove ray.util.connect mentions from examples and comments

* update tune example

* wording

* localhost:<port> also works if you're on the head node

* add quotes

* drop mentions of ray client from ray.init docstring

* local->remote

* fix section ref

* update ray start output

* fix section link

* try to fix doc again

* fix link wording

* drop local:// from docs and special handling from code

* update ray start message

* lint

* doc lint

* remove local:// codepath

* remove 'internal_config'

* Update doc/source/cluster/ray-client.rst

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* doc suggestion

* Update doc/source/cluster/ray-client.rst

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-08-04 05:31:44 +03:00
James Mishra
6240d22060
Validate Redis addresses before making the client (#17481) 2021-08-03 16:56:53 -07:00
Siyuan (Ryans) Zhuang
bef519b373
[Workflow] Simplify storage and bug fix (#17453)
* simplify storage

* bug fix

* use a key-value like naming

* update workflow API

* fix s3

* add test
2021-08-03 16:38:54 -07:00
Ian Rodney
f3acae6eb6
[Autoscaler] Sync Files before Starting Docker (#17361) 2021-08-03 13:25:08 -07:00
Alex Wu
8efa6be913
[Dataset] Fix reading parquet from gcs (#17528)
* .

* .

* comments

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-08-03 10:10:42 -07:00
Sasha Sobol
5dbbaf7261
[autoscaler] Enforce per-node-type max workers (#17352)
* Enforce per-node-type max workers

* type annotation

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* cleanup. comments. type annotations

* additional type annotation

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* additional cleanup. comments. type annotations

* _get_nodes_needed_for_request_resources to use FrozenSet

* comments

* whitespace

* [Placement Group] Fix resource index assignment between with bundle index and without bundle index pg (#17318)

* [serve] Add Ray API stability annotations (#17295)

* Support streaming output of runtime env setup to logger/driver (#17306)

* [SGD] v2 prototype: ``WorkerGroup`` implementation (#17330)

* wip

* formatting

* increase timeouts

* address comments

* comments

* fix

* address comments

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* address comments

* formatting

* fix

* avoid race condition

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* [RLlib] Discussion 3001: Fix comment on internal state shape (must be [B x S=state dim]). (#17341)

* [autoscaler] GCP TPU VM autoscaler (#17278)

* [Rllib] set self._allow_unknown_config (#17335)

Co-authored-by: Sven Mika <sven@anyscale.io>

* [RLlib] Discussion 2294: Custom vector env example and fix. (#16083)

* [docs] Link broken in Tune's page (#17394) (#17407)

* [Serve] Fix response_model for class based view routes as well (#17376)

* [serve] Fix single deployment nightly test (#17368)

* [RLlib] SAC tuple observation space fix (#17356)

* Support schema on read for csv/json (#17354)

* [RLlib] New and changed version of parametric actions cartpole example + small suggested update in policy_client.py (#15664)

* [gcs] Fix GCS related issues: ByteSizeLong and redis connection (#17373)

* [runtime_env] Gracefully fail tasks when an environment fails to be set up (#17249)

* [docs] update docs with pip requirements (#17317)

* removed nodes_to_keep. cleanup

* formatting

* +comment

* treat max_workers=0 as 0 workers (as opposed to unlimited)

* fix wrong comment

* warning for inconsistent config

* terminate nodes with no matching node type right away

* quotes

* special handling for head node when enforcing max_workers per type. tests. cleanup

* cleanup comments and prints

* comments

* cleanup. removed special handling of head node.

* adding an explicit non-None check in schedule_node_termination

* raise the exception

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
Co-authored-by: DK.Pino <loushang.ls@antfin.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Rohan138 <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: amavilla <takashi.tameshige.jj@hitachi.com>
Co-authored-by: Jiao <sophchess@gmail.com>
Co-authored-by: Julius Frost <33183774+juliusfrost@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: kk-55 <63732956+kk-55@users.noreply.github.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
2021-08-03 11:31:32 -04:00
Antoni Baum
c40555c82b
[tune] Add define-by-run support to OptunaSearcher (#17464) 2021-08-03 16:11:58 +01:00
Antoni Baum
df2fce9ab6
[tune] Allow to pass searcher/scheduler string names to tune.run (#17517) 2021-08-03 09:28:03 +01:00
Eric Liang
f9552765cb
Avoid re-exporting same function repeatedly in dataset (#17522) 2021-08-02 18:15:25 -07:00
SangBin Cho
f1ccadbb27
Skip flaky windows object spilling tests (#17510) 2021-08-02 15:53:07 -07:00
matthewdeng
e89195bfb9
[SGD] add SGDv2 Trainer prototype implementation (#17440)
* wip

* formatting

* increase timeouts

* wip

* address comments

* comments

* fix

* address comments

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/sgd/v2/worker_group.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* address comments

* formatting

* fix

* wip

* finish

* fix

* formatting

* remove reporting

* split TorchBackend

* fix tests

* address comments

* add file

* more fixes

* remove default value

* update run method doc

* add comment

* minor doc fixes

* lint

* add args to BaseWorker.execute

* address comments

* remove extra parentheses

* properly instantiate backend

* fix some of the tests

* fix torch setup

* fix type hint

* [SGD] add SGDv2 Trainer prototype implementation

* add fashion mnist test

* add HuggingFace example

* format

* formatting

* address comment

* address comments

* update comment

* Update python/ray/util/sgd/v2/examples/transformers/cluster.yaml

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* update huggingface transformers

* update hugging face transformers

* fix shutdown on worker failure

* Update python/requirements/tune/requirements_tune.txt

* Update python/requirements/tune/requirements_tune.txt

* Update python/requirements/tune/requirements_tune.txt

* Update python/requirements/tune/requirements_tune.txt

* address comment and fix test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2021-08-02 15:27:42 -07:00
Eric Liang
748cbbb23d
[hotfix] Parquet S3 reads broken due to pyarrow.lib.ArrowInvalid: S3 subsystem not initialized (#17492) 2021-08-02 11:48:48 -07:00