Commit graph

9841 commits

Author SHA1 Message Date
Chen Shen
1ed5f622c2
[Core] QuickExit CoreWorker when GetCoreWorker is called after shutdown 2021-10-06 15:07:57 -07:00
Edward Oakes
0f915820e1
[serve] Rename backend_worker -> replica (#19150) 2021-10-06 16:39:17 -05:00
Chris K. W
d1517c33ab
[client] deflake test_object_ref_cleanup (#19153) 2021-10-06 14:06:43 -07:00
Kai Fricke
9f77cd8d28
[tune] Deflake PBT Async test (#19135) 2021-10-06 12:24:22 -07:00
Edward Oakes
9316a9977f
[serve] Support kwargs to deployment constructor (#19023) 2021-10-06 14:16:23 -05:00
Frank Luan
77d0a08c38
[docker] Fix missing space in docker.py warning (#19128) 2021-10-06 12:09:26 -07:00
Ian Rodney
8cab8d3ae9
[Datasets] Clean Up docs around pipelining -> windowing rename (#19142) 2021-10-06 11:07:55 -07:00
Chris K. W
db1105fa83
[client] Skip test_valid_actor_state tests on windows (#19114)
* skip test_wrapped_actor_creation on windows

* rerun windows ci

* mark test_valid_actor_state_2 as flaky

* mark test_valid_actor_state

* rerun
2021-10-06 09:17:59 -07:00
Simon Mo
4beba3f727
[Doc] Document existing runtime env's container support (#19076) 2021-10-06 10:25:57 -05:00
architkulkarni
281fcaa91a
[Serve] [Doc] Add note about serving multiple deployments defined by the same class (#19118) 2021-10-06 10:24:42 -05:00
Kai Fricke
234b015b42
[ci] Clean wheels directory before build, validate wheel commit strings (#19097) 2021-10-06 13:48:24 +01:00
Sven Mika
1f0646f658
[RLlib] Issue 18418: SAC w/ dict space not working. (#19101) 2021-10-06 09:05:50 +02:00
Eric Liang
f8a91c7fad
Revert "[Lint] run clang-tidy in scripts/format.h, update clang-tidy rules (#19055)" (#19119)
This reverts commit 5d9e3a0121.
2021-10-05 16:33:12 -07:00
Eric Liang
0702974f21
Add CODEOWNERS for format.sh script (#19121) 2021-10-05 16:31:08 -07:00
Amog Kamsetty
db0483a29a
[SGD] SGD Namespace Consistency (#19048)
* wip

* update

* add callbacks

* fix

* fix

* update

* add

* address comments
2021-10-05 15:56:42 -07:00
Philipp Moritz
53f1d5de61
Fix C++17 support on some windows machines (#19088) 2021-10-05 15:15:59 -07:00
Matti Picus
63dd22c7c2
add msvcp140.dll to the wheel on windows (#19062)
* add msvcp140.dll to the wheel on windows

* fixes from review

* be more verbose

* Update setup.py

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-10-05 15:12:46 -07:00
mwtian
5d9e3a0121
[Lint] run clang-tidy in scripts/format.h, update clang-tidy rules (#19055) 2021-10-05 14:03:27 -07:00
Stephanie Wang
545db13800
[core] Assign tasks to the first available worker (#18167)
* Convert worker pool to queue

* Start up to backlog size more workers

* fixes

* Prestart workers according to num available CPUs

* lint

* x

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* dedicated workers

* Fix tests

* x

* fix

* asan

* asan

* Workers can only exec tasks with same job ID

* size_t for runtime env hash, fix unit tests

* include job ID in runtime env hash, remove from worker registration msg

* x

* conflict

* debug

* Schedule and dispatch periodically, skip if no new tasks

* Update src/ray/common/task/task_spec.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-05 13:45:50 -07:00
Yi Cheng
ecf7b86585
[workflow] Avoid running workflow step multiple times. (#19090)
When a workflow recovers, it tries to reconstruct the DAG. However, reconstruction is step-scoped, which means that if a workflow is passed to multiple steps, it is executed multiple times, which breaks the exactly-once semantics.

For ObjectRef this is fine, since it is cached in the serialization context, but Workflow inputs need a similar mechanism.

This logic lives in the workflow layer rather than the serialization layer because the deduplication happens at the application layer.

Issue #18997 involves race conditions and is related to this one: multiple steps try to issue writes to virtual actors at the same time, which is not currently allowed and can lead to a race condition.
2021-10-05 13:43:27 -07:00
Kai Fricke
42116badba
[ci/release] Check test result alerts after test finished (#19105) 2021-10-05 21:27:27 +01:00
Kai Fricke
957f9e9d99
[client] Undo PySpark's monkey patching of namedtuples for PickleStub (#19034) 2021-10-05 10:43:50 -07:00
matthewdeng
3fbe135a24
[docs] add modin_xgboost and dask_xgboost notebook tutorials (#18775)
* Add xgboost-dask golden notebook

* [examples] add modin-xgboost Jupyter notebook

* Add xgboost dask gn

* update modin notebook to sphinx-gallery compatible python file

* fix build file

* fix test

* fix test

* Add modin notebook anyscale connect test

* Add missing file

* add dask_xgboost notebook

* Add the new modin golden notebook to CI

* fix lint and filter out tests with py37

* Update release/golden_notebook_tests_new/golden_notebook_tests.yaml

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Add dask, wait for cluster client, remove pytest

* Replace folder

* Fix

* Update dask_xgboost_app_config.yaml

* Update modin_xgboost_app_config.yaml

* comment on filtered out tests

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2021-10-05 09:17:33 -07:00
Chen Shen
1efcf5c3d5
[Core][CoreWorker ThreadSafety 1/n] Ensure global_worker_ is protected by mutex #19073 2021-10-05 05:32:28 -07:00
Yi Cheng
2cff293810
fix (#19094) 2021-10-05 01:53:05 -07:00
Yi Cheng
1eecb7d80b
up (#19092) 2021-10-04 23:54:31 -07:00
Yi Cheng
056c3af699
[core] Update placement group retry implementation (#18842)
* exp backoff

* up

* format

* up

* up

* up

* up

* up

* format

* fix

* up

* format

* adjust ordering

* up

* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"

This reverts commit 2e99fb215f.

* up

* update

* format

* up

* format

* fix

* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""

This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.

* up

* format

* fix lint

* up

* up

* up

* up

* check

* add test1

* format

* up

* add test

* up

* up

* up

* fix

* up

* up

* up

* add test

* format

* up

* up

* fix lint

* format

* fix

* format

* fix

* up
2021-10-04 21:31:56 -07:00
Dmitri Gekhtman
beaba4782a
[k8s][doc] Fix service name in K8s static deployment example (#19065) 2021-10-04 20:23:54 -05:00
Jiajun Yao
7ccf737f97
Add compatible dask version for ray 1.6.0 and 1.7.0 (#19080) 2021-10-05 10:23:06 +09:00
Jiajun Yao
1b286640c6
Add release logs for 1.7.0 (#18931) 2021-10-04 14:02:39 -07:00
Jiajun Yao
3cb2b3e23a
Fix test_single_node json report (#19075) 2021-10-04 13:05:32 -07:00
SangBin Cho
83cb992d5b
Revert pull retry (#19068)
* Revert "[Object manager] fix comments"

This reverts commit 56debfc063.

* Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955)"

This reverts commit d12e35ce53.

* Fix a lint issue
2021-10-04 11:20:43 -07:00
SangBin Cho
7fcf1bf57e
[Dashboard] Refine the dashboard restart logic. (#18973)
* in progress

* Refine the dashboard agent retry logic

* refine

* done

* lint
2021-10-04 05:01:51 -07:00
Sven Mika
b4300dd532
[RLlib] Issue 18812: Torch multi-GPU stats not protected against race conditions. (#18937) 2021-10-04 13:29:00 +02:00
Sven Mika
73f5c4039b
[RLlib] Fix flakey test_a3c, test_maml, test_apex_dqn. (#19035) 2021-10-04 13:23:51 +02:00
Jiajun Yao
7588bfd315
[Lint] Add flake8-bugbear (#19053)
* Add flake8-bugbear

* Add flake8-bugbear
2021-10-03 23:24:11 -07:00
Jiajun Yao
2b44e9a3e1
Increase disk for long running tests (#19064) 2021-10-03 22:52:44 -07:00
Jiajun Yao
b8ef4f0a34
[CI] Add a retry helper to e2e.py (#19045) 2021-10-02 09:54:41 -07:00
Siyuan (Ryans) Zhuang
28d905dcb0
[Workflow] Move arguments into workflow step context (#19003)
* refactor

* improve documentation

* fix comments

* Use dataclass for workflow context

* update docs
2021-10-01 23:48:57 -07:00
Eric Liang
032a420ee6
Rename Dataset.pipeline to Dataset.window (#19050) 2021-10-01 19:55:29 -07:00
Kai Fricke
3dc176c42e
[ci/tune] Add SGD and Tune GPU pipeline step to CI (#18469)
* [ci/tune] Add Tune GPU pipeline step to CI

* cont.

* add sgd gpu tests

* format yaml, fix imports

* install horovod; fix line wrapping

* set GPU per worker to 0.5

* fix import

* move test to 4gpu machine

* fix lint

* lint

* set visible devices

* pull in tf gpu fix

* Fix Tune GPU pipeline step

* nit

* Disable GPU tests until we have some

* Re-add empty rllib tests

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-10-01 18:34:05 -07:00
Simon Mo
9b2a368c8c
[Runtime Env] Implement basic runtime env plugin mechanism (#19044) 2021-10-01 17:22:54 -07:00
Edward Oakes
cac6f9d75c
skip test on windows (#19047) 2021-10-01 15:56:37 -07:00
Ian Rodney
a4ebe2697c
[Autoscaler] Improve assert_called (#19036)
* improvements

* fix invocations

* improve not_has_call
2021-10-01 14:08:31 -07:00
Clark Zinzow
d22f838795
[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. (#18992) 2021-10-01 13:08:25 -07:00
Frank Luan
f885060efa
Disable distributed sort test on Windows (#19041)
* [WIP] Sorting benchmark

* Separate num_mappers and num_reducers

* Add tests

* Fix tests

* Tracing

* Separate num_mappers and num_reducers

* Two-stage reduce

* Back pressure to avoid excessive spilling

* Make merger_concurrency an option

* Fix tests

* Tweaks

* Remote writers

* Format

* WIP

* Address comments

* Fix tests and address comments

* Lint

* Fix mount points for testing

* Simplify code path

* Address comments

* Disable distributed sort test on Windows
2021-10-01 12:17:28 -07:00
mwtian
56debfc063
[Object manager] fix comments 2021-10-01 11:42:07 -07:00
Stephanie Wang
c052395f4e
[core] Remove "plasma promotion" for serialized ObjectRefs 2021-10-01 10:39:55 -07:00
architkulkarni
b0a5564f4e
[Serve] Integrate metrics with minimal autoscaling algorithm and add e2e test (#18793) 2021-10-01 10:21:12 -07:00
Antoni Baum
cc3199b814
[docs] Provide information about resource deadlocks, early stopping in Tune docs (#18947)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-10-01 13:52:47 +01:00
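
The deduplication idea described in the message of commit ecf7b86585 above (a workflow passed to multiple steps should run only once) can be illustrated with a minimal sketch. This is not Ray's implementation: `Step`, `run`, and the `cache` dictionary are hypothetical names invented for illustration, and plain Python memoization by step ID stands in for whatever the workflow layer actually does.

```python
from typing import Any, Callable, Dict, List, Optional


class Step:
    """Hypothetical stand-in for a workflow step (not Ray's Workflow class)."""

    def __init__(
        self,
        step_id: str,
        func: Callable[..., Any],
        inputs: Optional[List["Step"]] = None,
    ) -> None:
        self.step_id = step_id
        self.func = func
        self.inputs = inputs or []


def run(step: Step, cache: Dict[str, Any]) -> Any:
    """Run `step` after its inputs, executing each step ID at most once."""
    if step.step_id in cache:
        # Already executed while resolving another consumer: reuse the result.
        return cache[step.step_id]
    args = [run(dep, cache) for dep in step.inputs]
    result = step.func(*args)
    cache[step.step_id] = result
    return result


calls: List[str] = []


def load() -> int:
    # Record each execution so duplicate runs would be visible.
    calls.append("load")
    return 10


# One upstream step feeds two downstream steps (a diamond-shaped DAG).
shared = Step("load", load)
left = Step("left", lambda x: x + 1, [shared])
right = Step("right", lambda x: x * 2, [shared])
root = Step("root", lambda a, b: a + b, [left, right])

total = run(root, {})
```

Without the cache lookup, `load` would run once per consumer (twice here); with it, `calls` contains a single entry even though both `left` and `right` depend on `shared`.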