architkulkarni
36c26578a7
[runtime env] [test] Add nightly test to verify Ray wheel URLs are valid ( #17938 )
2021-08-19 15:48:37 -07:00
Chen Shen
a16a25852a
[Core] fix event race condition ( #17947 )
2021-08-19 14:20:34 -07:00
matthewdeng
d081ee9d87
[SGD v2] Save checkpoints to disk ( #17807 )
...
* [SGD] save checkpoints to disk
* fix test; add logs
* rename log_dir to logdir for consistency with tune
* address comments: add run level directories, add CheckpointConfig
* check for empty strings
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
* address comments - refactor CheckpointStrategy, remove run_dir and checkpoint_dir configurability
* fix Trainer docs
* Update python/ray/util/sgd/v2/checkpoint.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* remove construct_path_with_default
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-19 14:18:51 -07:00
Sven Mika
a2d96c513a
[RLlib] Expand machine for nightly multi-gpu learning tests. ( #17955 )
2021-08-19 22:27:30 +02:00
Eric Liang
238941f857
Ray workflow comparison examples + add to tests ( #17880 )
2021-08-19 12:19:08 -07:00
architkulkarni
5ed3f0ce35
[Serve] [Dashboard] Add end times and DELETED state for endpoints ( #17898 )
2021-08-19 11:10:42 -05:00
Kai Fricke
21d90a0e9a
Increase disk for serve tests ( #17606 )
2021-08-19 17:51:19 +02:00
Kai Fricke
651aae76b9
[release] Ask for configuration in buildkite ( #17948 )
2021-08-19 17:51:05 +02:00
Alex Wu
318ba6fae0
Revert "[RLlib] Add example script for how to have n remote (parallel) envs with inference happening on "main" (possibly GPU) node. ( #17410 )" ( #17951 )
...
This reverts commit 8fc16b9a18.
2021-08-19 07:55:10 -07:00
Kai Fricke
622f724f61
Update release process ( #17888 )
2021-08-19 13:34:51 +02:00
souravraha
f5fcb3c576
Fixes bug #17424 . ( #17437 )
2021-08-19 12:23:36 +02:00
Sven Mika
8fc16b9a18
[RLlib] Add example script for how to have n remote (parallel) envs with inference happening on "main" (possibly GPU) node. ( #17410 )
2021-08-19 12:14:50 +02:00
Kai Fricke
0eee355d2e
Terminate session instead of stop ( #17946 )
2021-08-19 10:26:59 +02:00
Alex Wu
497446063c
[hotfix] Fix test owners lint ( #17945 )
...
Co-authored-by: Alex <alex@anyscale.com>
2021-08-18 23:41:58 -07:00
Chong-Li
5e22257cec
[GCS] Fix: GCS Based Actor Scheduler ( #17944 )
2021-08-18 23:40:35 -07:00
Clark Zinzow
d958457d07
[Core] Second pass at privatizing APIs. ( #17885 )
...
* gcs_utils
* resource_spec
* profiling
* ray_perf and ray_cluster_perf
* test_utils
2021-08-18 20:56:33 -07:00
architkulkarni
4c6a695dab
[Doc] Runtime env docstring fix monospace formatting ( #17929 )
2021-08-18 20:53:41 -07:00
Simon Mo
b573864928
[CI] Add test owners ( #17893 )
2021-08-18 18:38:31 -07:00
Eric Liang
a9073d16f4
Revert "[Core] Unified worker initiators ( #17401 )" ( #17935 )
...
This reverts commit c3764ffd7d.
2021-08-18 18:06:24 -07:00
Chen Shen
89d83228f6
[Core][Plasma-store] add stats-collector that eagerly collect stats
2021-08-18 13:47:50 -07:00
Chong-Li
a9b4545502
[GCS] GCS Based Actor Scheduler ( #16580 )
2021-08-18 13:44:59 -07:00
Clark Zinzow
e2c7706f76
Add support for an app config override to the release test script, allowing better integration with compile-on-product. ( #17913 )
2021-08-18 13:35:27 -07:00
Yi Cheng
ddc2e59af5
[workflow] Simplify the workflow storage layer ( #17883 )
2021-08-18 13:26:50 -07:00
Kai Fricke
bf3eaa9264
[RLlib] Dreamer fixes and reinstate Dreamer test. ( #17821 )
...
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-08-18 18:47:08 +02:00
architkulkarni
6e8ff30de4
[Doc] [runtime env] Add note to install ray[default] ( #17869 )
2021-08-18 10:57:45 -05:00
Simon Mo
8fe970f4e7
[Buildkite] Cleanup test wheel environment ( #17912 )
...
The macOS builders are shared and reused across commits.
@clarkzinzow found a bug where the installed version of the wheel
is not the one in the PR. This should fix it.
https://buildkite.com/ray-project/ray-builders-pr/builds/11628#be6c5fd6-14a2-449c-8f35-e3382a6ee647
2021-08-18 08:32:35 -07:00
Sven Mika
a428f10ebe
[RLlib] Add multi-GPU learning tests to nightly. ( #17778 )
2021-08-18 17:21:01 +02:00
architkulkarni
7e109a3266
[hotfix] [runtime env] change MacOS wheel URL from 10_13 to 10_15 ( #17902 )
2021-08-18 09:16:09 +02:00
Holden Karau
b9dae93bfa
Add ephemeral-storage: 1Gi requests but no limits. ( #17854 )
...
* Add ephemeral-storage: 1Gi requests but no limits. This is useful when scheduling in a storage-constrained environment, since Ray assumes it has ephemeral storage to use.
* Add ephemeral-storage: 1Gi to deploy/charts/ray/templates/operator_cluster_scoped.yaml and deploy/charts/ray/templates/operator_namespaced.yaml
2021-08-17 21:10:39 -04:00
Eric Liang
5536c5fff6
Add namespace argument to Ray client get actor call ( #17878 )
2021-08-17 16:41:18 -07:00
Richard Liaw
c2c855b38b
Add codeowners for setup.py ( #17884 )
...
* add-czar
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
* setup
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-17 16:29:32 -07:00
Chen Shen
880797d5c2
[Core][Test] Add ubsan support for C++ tests ( #17812 )
...
* support ubsan
* update
2021-08-17 10:22:03 -07:00
SangBin Cho
4971e13941
[Build] Asan wheel test ( #17685 )
...
* in progress
* ASAN tests.
* d
* in progress
* in progress without the asan wheel
* Support the asan wheel.
* Support the asan wheels
* Not build a binary for asan
* Fix issues
* Remove a wrong build
* Separate out asan wheel build
* Try preparing more deps.
* ip
* Try different version
* done
* d
* Trial
* Another try
* Another try
* skip cpp build to see what happens
* add more deps
* ip
* abc
* Try next
* completed
* try
* Try without static libasan
* dbg
* Try static link
* Fix issues
* abc
2021-08-17 10:21:41 -07:00
Sven Mika
f18213712f
[RLlib] Redo: "fix self play example scripts" PR (17566) ( #17895 )
...
* wip.
* wip.
* wip.
* wip.
* wip.
* wip.
* wip.
* wip.
* wip.
2021-08-17 09:13:35 -07:00
Antoni Baum
2b7d907762
Print description in --help ( #17871 )
2021-08-17 17:29:01 +02:00
Hasan Genc
adc0c47b4f
Shutdown clusters on AWS with >1000 nodes ( #17841 )
...
* Revert "Revert "Shutdown clusters when large number of nodes (#17642 )" (#17836 )"
This reverts commit 6957ce66f6.
* Update unit test and fix terminate_nodes
2021-08-17 16:26:10 +03:00
Chris Bamford
58a73821fb
[RLlib] IMPALA sample throughput calculation and full queue slowdown fixes ( #17822 )
2021-08-17 14:01:41 +02:00
chenk008
c3764ffd7d
[Core] Unified worker initiators ( #17401 )
...
* use setup_worker as starter
* use setup_worker as starter
* add java test
* fix
* fix
* lint
* sleep in ci
* sleep in ci
* fix ut
* fix
* fix
* fix
* fix
* fix
* fix
* change test size
* test
* fix
* fix
* fix ut
* restore sgd test
* change test size
* fix merge conflict
* restore cpp worker flag
* fix
* fix
* add worker-languange in setup_runtime_env.py
* lint
* fix java command
Co-authored-by: root <chenk008>
2021-08-17 19:37:26 +08:00
simonsays1980
7b33dc21dc
[RLlib] Fix update model view requirements from init state for bare-metal policies with custom view-reqs. ( #17867 )
...
* Changed '_update_model_view_requirements_from_init_state()' to adopt the 'shift' in view_requirements from a user-defined policy that inherits directly from Policy.
* Added a slightly modified version of Sven's suggestion. This way, any user-defined attributes of the state's ViewRequirement are preserved.
* The code in _update_model_view_requirements_from_init_state() had changed and was no longer identical to my locally installed version. In the new version, the view_requirements from the model and the policy are merged, and a loop runs over the merged collection. The code should now run against the present version.
* Apply suggestions from code review
2021-08-17 11:49:24 +02:00
gjoliver
1dbe7fc26a
[RLlib] Config dict should use true instead of True in docs/examples. ( #17889 )
2021-08-17 11:46:10 +02:00
Guyang Song
8227e24424
[event] event framework integration in raylet, gcs server and core worker ( #17671 )
2021-08-17 11:21:23 +08:00
Hao Chen
ddb0dc8ad2
Fix client_test_enabled ( #17699 )
...
* Fix client_test_enabled
* fix
* trigger CI
2021-08-17 10:59:50 +08:00
Chen Shen
a9757a86b3
[Core] Fix nested ref count bug: add NestedIds to reference_counter once a task returns ( #17802 )
...
* add nested reference
* fix bug
2021-08-16 19:02:26 -07:00
Alex Wu
dde8250744
Better error message on docker wheel build ( #17881 )
...
* Better error message
* Apply suggestions from code review
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-08-16 18:07:10 -07:00
Ian Rodney
8fe7111a7b
[Client] Bump Proto Version ( #17879 )
2021-08-16 17:08:36 -07:00
Yi Cheng
03a82d733a
Revert "Revert "Export useful metrics"" ( #17755 )
...
* Revert "Revert "[Observability] Export useful metrics (#17578 )" (#17752 )"
This reverts commit 02e79f3fe5.
* Update metric.h
* up
* up
* Update server_call.h
* Update test_metrics_agent.py
* up
* fix comment
2021-08-16 17:05:56 -07:00
Navneet Nandan
35d86ebfee
Added support to use tolerations for head and worker nodes ( #17608 )
...
* Added support to use tolerations for head and worker nodes
* removed the imagePullSecret configuration
* Update comments
* minor comment change
* add back rayproject/ray:nightly comment
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-08-16 17:06:15 -04:00
Thomas Lecat
c02f91fa2d
[RLlib] Ape-X doesn't take the value of prioritized_replay into account ( #17541 )
2021-08-16 22:18:08 +02:00
Stefan Schneider
eab9c25856
[RLlib] Better example scripts: Description --no-tune and --local-mode CLI options (autoregressive_action_dist.py) ( #17705 )
2021-08-16 22:08:13 +02:00
Sven Mika
f3bbe4ea44
[RLlib] Test cases/BUILD cleanup; split "everything else" (longest running one rn) tests in 2. ( #17640 )
2021-08-16 22:01:01 +02:00