Edward Oakes
9316a9977f
[serve] Support kwargs to deployment constructor ( #19023 )
2021-10-06 14:16:23 -05:00
Frank Luan
77d0a08c38
[docker] Fix missing space in docker.py warning ( #19128 )
2021-10-06 12:09:26 -07:00
Ian Rodney
8cab8d3ae9
[Datasets] Clean Up docs around pipelining -> windowing rename ( #19142 )
2021-10-06 11:07:55 -07:00
Chris K. W
db1105fa83
[client] Skip test_valid_actor_state tests on windows ( #19114 )
* skip test_wrapped_actor_creation on windows
* rerun windows ci
* mark test_valid_actor_state_2 as flaky
* mark test_valid_actor_state
* rerun
2021-10-06 09:17:59 -07:00
Amog Kamsetty
db0483a29a
[SGD] SGD Namespace Consistency ( #19048 )
* wip
* update
* add callbacks
* fix
* fix
* update
* add
* address comments
2021-10-05 15:56:42 -07:00
Matti Picus
63dd22c7c2
add msvcp140.dll to the wheel on windows ( #19062 )
* add msvcp140.dll to the wheel on windows
* fixes from review
* be more verbose
* Update setup.py
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-10-05 15:12:46 -07:00
Stephanie Wang
545db13800
[core] Assign tasks to the first available worker ( #18167 )
* Convert worker pool to queue
* Start up to backlog size more workers
* fixes
* Prestart workers according to num available CPUs
* lint
* x
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* dedicated workers
* Fix tests
* x
* fix
* asan
* asan
* Workers can only exec tasks with same job ID
* size_t for runtime env hash, fix unit tests
* include job ID in runtime env hash, remove from worker registration msg
* x
* conflict
* debug
* Schedule and dispatch periodically, skip if no new tasks
* Update src/ray/common/task/task_spec.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/scheduling/cluster_task_manager.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-05 13:45:50 -07:00
Yi Cheng
ecf7b86585
[workflow] Avoid running workflow step multiple times. ( #19090 )
When a workflow recovers, it tries to reconstruct the DAG. However, reconstruction is step-scoped, which means that if a workflow is passed to multiple steps, it is executed multiple times, which breaks the exactly-once semantics.
For ObjectRef this is fine, since it is cached with the serialization context, but a similar mechanism is needed for Workflow inputs.
This logic is placed in the workflow layer instead of the serialization layer because the deduplication happens at the application layer.
Issue #18997 has race conditions and is related to this one. The cause is that multiple steps try to issue writes to virtual actors at the same time, which is not allowed right now and can lead to a race condition.
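A minimal sketch of the deduplication idea described above, not the actual change in this commit. It assumes a hypothetical `Workflow` class and `run` helper (neither is Ray's API) and memoizes results by workflow ID, so a workflow shared by several steps executes only once:

```python
from typing import Any, Callable, Dict, Tuple


class Workflow:
    """Hypothetical stand-in for a workflow node: an ID, a function, and inputs."""

    def __init__(
        self,
        workflow_id: str,
        func: Callable[..., Any],
        inputs: Tuple["Workflow", ...] = (),
    ):
        self.workflow_id = workflow_id
        self.func = func
        self.inputs = inputs


def run(workflow: Workflow, cache: Dict[str, Any]) -> Any:
    """Execute a workflow, running each distinct upstream workflow at most once."""
    # A workflow that was already executed is served from the cache.
    if workflow.workflow_id in cache:
        return cache[workflow.workflow_id]
    # Resolve inputs first; shared inputs hit the cache on the second visit.
    args = [run(dep, cache) for dep in workflow.inputs]
    result = workflow.func(*args)
    cache[workflow.workflow_id] = result
    return result


calls = []  # records how many times the shared workflow really runs


def load() -> int:
    calls.append("load")
    return 10


# "load" is passed to two different steps, as in the scenario above.
shared = Workflow("load", load)
left = Workflow("left", lambda x: x + 1, (shared,))
right = Workflow("right", lambda x: x * 2, (shared,))
root = Workflow("root", lambda a, b: a + b, (left, right))

total = run(root, cache={})
```

Without the cache lookup, `load` would run once per consuming step. Keying the cache on the workflow ID keeps the deduplication at the application level rather than in serialization.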
2021-10-05 13:43:27 -07:00
Kai Fricke
957f9e9d99
[client] Undo PySpark's monkey patching of namedtuples for PickleStub ( #19034 )
2021-10-05 10:43:50 -07:00
SangBin Cho
83cb992d5b
Revert pull retry ( #19068 )
* Revert "[Object manager] fix comments"
This reverts commit 56debfc063.
* Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955 )"
This reverts commit d12e35ce53.
* Fix a lint issue
2021-10-04 11:20:43 -07:00
SangBin Cho
7fcf1bf57e
[Dashboard] Refine the dashboard restart logic. ( #18973 )
* in progress
* Refine the dashboard agent retry logic
* refine
* done
* lint
2021-10-04 05:01:51 -07:00
Jiajun Yao
7588bfd315
[Lint] Add flake8-bugbear ( #19053 )
* Add flake8-bugbear
* Add flake8-bugbear
2021-10-03 23:24:11 -07:00
Siyuan (Ryans) Zhuang
28d905dcb0
[Workflow] Move arguments into workflow step context ( #19003 )
* refactor
* improve documentation
* fix comments
* Use dataclass for workflow context
* update docs
2021-10-01 23:48:57 -07:00
Eric Liang
032a420ee6
Rename Dataset.pipeline to Dataset.window ( #19050 )
2021-10-01 19:55:29 -07:00
Kai Fricke
3dc176c42e
[ci/tune] Add SGD and Tune GPU pipeline step to CI ( #18469 )
* [ci/tune] Add Tune GPU pipeline step to CI
* cont.
* add sgd gpu tests
* format yaml, fix imports
* install horovod; fix line wrapping
* set GPU per worker to 0.5
* fix import
* move test to 4gpu machine
* fix lint
* lint
* set visible devices
* pull in tf gpu fix
* Fix Tune GPU pipeline step
* nit
* Disable GPU tests until we have some
* Re-add empty rllib tests
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-10-01 18:34:05 -07:00
Simon Mo
9b2a368c8c
[Runtime Env] Implement basic runtime env plugin mechanism ( #19044 )
2021-10-01 17:22:54 -07:00
Edward Oakes
cac6f9d75c
skip test on windows ( #19047 )
2021-10-01 15:56:37 -07:00
Ian Rodney
a4ebe2697c
[Autoscaler] Improve assert_called ( #19036 )
* improvements
* fix invocations
* improve not_has_call
2021-10-01 14:08:31 -07:00
Clark Zinzow
d22f838795
[Datasets] Delineate between ref and raw APIs for the Pandas/Arrow integrations. ( #18992 )
2021-10-01 13:08:25 -07:00
Frank Luan
f885060efa
Disable distributed sort test on Windows ( #19041 )
* [WIP] Sorting benchmark
* Separate num_mappers and num_reducers
* Add tests
* Fix tests
* Tracing
* Separate num_mappers and num_reducers
* Two-stage reduce
* Back pressure to avoid excessive spilling
* Make merger_concurrency an option
* Fix tests
* Tweaks
* Remote writers
* Format
* WIP
* Address comments
* Fix tests and address comments
* Lint
* Fix mount points for testing
* Simplify code path
* Address comments
* Disable distributed sort test on Windows
2021-10-01 12:17:28 -07:00
mwtian
56debfc063
[Object manager] fix comments
2021-10-01 11:42:07 -07:00
Stephanie Wang
c052395f4e
[core] Remove "plasma promotion" for serialized ObjectRefs
2021-10-01 10:39:55 -07:00
architkulkarni
b0a5564f4e
[Serve] Integrate metrics with minimal autoscaling algorithm and add e2e test ( #18793 )
2021-10-01 10:21:12 -07:00
mwtian
49a57aa477
[Scheduling] Report resource demand for infeasible 1-CPU tasks ( #19000 )
2021-09-30 22:03:02 -07:00
Jiajun Yao
d64872dd67
Fix python mutable default argument anti-pattern ( #19028 )
2021-10-01 13:05:02 +09:00
Edward Oakes
8e5d48d668
[runtime_env] Remove deprecated override_environment_variables and worker_env fields ( #18213 )
2021-09-30 18:55:24 -05:00
Jiajun Yao
81b052f222
[core] Fix port collision between metrics agent port and metrics export port ( #19016 )
2021-09-30 16:15:42 -07:00
Ian Rodney
02d1f659ba
[Workflows] Use RAY_ADDRESS in Tests ( #19012 )
2021-09-30 13:05:51 -07:00
Chris K. W
61d058fe66
[client] skip test_wrapped_actor_creation on windows ( #19013 )
* skip test_wrapped_actor_creation on windows
* rerun windows ci
2021-09-30 13:04:43 -07:00
Frank Luan
732af42ae9
[Sort benchmark] Two-stage reduce ( #17055 )
* [WIP] Sorting benchmark
* Separate num_mappers and num_reducers
* Add tests
* Fix tests
* Tracing
* Separate num_mappers and num_reducers
* Two-stage reduce
* Back pressure to avoid excessive spilling
* Make merger_concurrency an option
* Fix tests
* Tweaks
* Remote writers
* Format
* WIP
* Address comments
* Fix tests and address comments
* Lint
* Fix mount points for testing
* Simplify code path
* Address comments
2021-09-30 12:39:11 -07:00
architkulkarni
0f0b161ea1
Revert "Revert "[Serve] [doc] Improve runtime env doc"" ( #18943 )
* Revert "Revert "[Serve] [doc] Improve runtime env doc (#18782 )" (#18935 )"
This reverts commit e4f4c79252.
2021-09-30 13:28:44 -05:00
Clark Zinzow
e384a6c91f
(TaskPool) Cancel all transformation tasks when one task fails or when SIGINT is received. ( #18991 )
2021-09-30 10:56:30 -07:00
gjoliver
e61f2c72d7
Upgrade bazel version to 4.2.1 ( #18996 )
2021-09-30 10:50:54 -07:00
mwtian
d12e35ce53
[Object manager] don't abort entire pull request on race condition in concurrent chunk receive ( #18955 )
2021-09-30 10:19:54 -07:00
Simon Mo
910553c3bb
[Core] Add private method to retrieve current task queue length ( #18964 )
2021-09-30 09:20:04 -07:00
Amog Kamsetty
98ac3f601c
[SGD] v1 to v2 Migration Guide ( #18887 )
* wip
* add guide
* fix test
* address comments
* add to docs
* fix
* remove markdown
* add warning to all pages
* formatting
* fix
* links
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* address comments
* address comments
* fix
* address comments
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-30 09:15:21 -07:00
architkulkarni
bf6e50813c
[runtime env] Parse local pip/conda requirements files locally upon task/actor definition ( #18988 )
2021-09-30 09:47:15 -05:00
Chris K. W
291fd36dee
_ray_trace_ctx fix follow-up ( #18950 )
* sanity check
* add test case
* fix assert
* refactor
* check kwargs instead of _kwargs
* format
2021-09-29 23:53:04 -07:00
Clark Zinzow
74b5d3d8f7
[Datasets] Minimize truncation on balanced splits. ( #18953 )
* Minimize truncation on balanced splits.
* Refactor into subroutines.
* Feedback and fixes.
2021-09-29 21:57:08 -07:00
Alex Wu
5709c6501b
[dataset][usability] Dataset dependencies ( #18346 )
2021-09-29 17:29:31 -07:00
Clark Zinzow
73a6cda812
Handle empty datasets properly in most Dataset transformations. ( #18983 )
2021-09-29 17:27:03 -07:00
Eric Liang
aa985e1a9c
Fix false positive error message from autoscaler events ( #18981 )
2021-09-29 15:51:18 -07:00
Antoni Baum
573c66a755
[GCP] Update GCP TPU config ( #18634 )
* [autoscaler] Update GCP TPU config
* Preemptible by default
* Remove libtpu link from head node
* Workaround
2021-09-29 12:41:26 -07:00
Jiajun Yao
ed9118393c
Listen to 127.0.0.1 by default on mac osx ( #18904 )
2021-09-29 11:40:19 -07:00
Eric Liang
3665c99896
Deflake test_failure_2.py::test_warning_for_infeasible_zero_cpu_actor
2021-09-29 11:39:16 -07:00
Dmitri Gekhtman
944309c017
Revert "[nightly] Deflaky nightly test many_nodes_actor_test ( #18582 )" ( #18954 )
* Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582 )"
This reverts commit fc6a739e4b.
* move to large test
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-09-29 11:02:14 -04:00
Chong-Li
42744f29ee
[GCS] Make Gcs-based actor scheduler's bookkeeping consistent ( #18546 )
* Make Gcs-based scheduler's bookkeeping consistent
* Remove this from lambda function
* Fix lambda function
* Trigger SchedulePendingActors
* Test for acquiring/releasing resources
* Reorganize structure
* Avoid overloading post
* Fix gcs_actor_manager_test
* Fix post counter and rename some func
* Fix unique_ptr
* Fix unique_ptr
* Fix book lint error
* Lint
Co-authored-by: Chong-Li <lc300133@antgroup.com>
2021-09-29 05:53:34 -07:00
matthewdeng
91a5f67261
[SGD] add share_cuda_visible_devices config flag ( #18958 )
2021-09-29 00:21:46 -07:00
Eric Liang
4d763d3ffd
Increase metrics fetch timeout in autoscaler for large clusters
2021-09-28 15:24:44 -07:00
Edward Oakes
73b8936aa8
[runtime_env] Unify rpc::RuntimeEnv with serialized_runtime_env field ( #18641 )
2021-09-28 15:13:15 -05:00