Commit graph

10305 commits

Author SHA1 Message Date
Sven Mika
6f85af435f
[RLlib] POC: PGTrainer class that works by sub-classing, not trainer_template.py. (#20055) 2021-11-11 12:16:20 +01:00
xwjiang2010
883fbd003c
[CI; Tune] Split Tune tests and examples (#20210)
* Split Tune tests and examples part 1 into tests and examples separate.

* fix typo.

* fix typo.

* Add docs.
2021-11-11 10:50:51 +01:00
Will Drevo
2fdb1c46c7
[RLlib; Documentation] Added atari pip installs to Pong-v0 example. (#20225)
* Added imports to Pongv0 example

* Added comment

* Apply suggestions from code review

Co-authored-by: will <will@anyscale.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
2021-11-11 09:08:02 +01:00
Siyuan (Ryans) Zhuang
8adcca54e8
[tune] Fix type error (#19872) 2021-11-10 21:35:38 -08:00
SangBin Cho
b2acfd6ff4
[Test] Change the frequency of many nodes actor test (#20232) 2021-11-10 21:12:22 -08:00
Yi Cheng
e54d3117a4
[gcs] Update all redis kv usage in python except function table (#20014)
## Why are these changes needed?
This is part of the Redis removal project. This PR removes all direct usage of Redis except for the function table.
The function table will be migrated in the next PR.

## Related issue number
#19443
2021-11-10 20:24:53 -08:00
Tobias Kaymak
893f57591d
[serve] Add Google Cloud Storage as a backend (#20104) 2021-11-10 19:45:19 -08:00
SangBin Cho
5985c1902d
Add code owner to the symbol export (#20237) 2021-11-10 19:12:45 -08:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
Edward Oakes
082a4af3e6
[serve] Remove lingering backend/endpoint wording in docs (#20229) 2021-11-10 16:49:29 -08:00
gjoliver
b6b4aaa632
[Release] Fix stress_tests (#20233) 2021-11-10 16:05:46 -08:00
liuyang-my
efca009258
[Serve] Make Java Replica Extendable (#19463) 2021-11-10 15:05:37 -08:00
Edward Oakes
81f036d078
[job submission] Move job_manager to dashboard module, common parts to common.py (#20209) 2021-11-10 14:14:55 -08:00
Sven Mika
ebd56b57db
[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215) 2021-11-10 22:20:06 +01:00
Amog Kamsetty
f164f3a8b5
[Release] Increase Placement Group timeout (#20224) 2021-11-10 13:02:38 -08:00
Alex Wu
d85f7f3bfa
[windows][ci] Skip test_multinode_failures_2.py (typo) (#20206) 2021-11-10 12:05:45 -08:00
xwjiang2010
2fbbecf1e4
[release] Define worker node type even if no worker node is needed. (#20223) 2021-11-10 11:19:09 -08:00
architkulkarni
923131ba37
[runtime env] Enable reference counting for URIs for actors (#20165) 2021-11-10 10:52:03 -08:00
matthewdeng
790e22f9ad
[tune] move force_on_current_node to ml_utils (#20211) 2021-11-10 10:21:24 -08:00
Sven Mika
143d23a278
[RLlib] Issue 20062: Action inference examples missing (#20144) 2021-11-10 18:49:06 +01:00
DK.Pino
20f126896e
[Placement Group] [Test] Add fractional resources test for placement group (#20185)
* add fractional resources test

* lint
2021-11-10 07:25:49 -08:00
Kai Fricke
4e3e213549
[tune] Allow more versatile experiment analysis loading (#20181) 2021-11-10 11:46:27 +00:00
DK.Pino
2c41936a39
[Placement Group] [Test] Fix pg.ready hang forever when gcs restarting (#20063)
* fixed

* lint

* fix comment

* revert previous fix code
2021-11-10 00:53:42 -08:00
Tao Wang
507bd9186b
[Core] Make conversion between ray/grpc status more specific (#20047)
* [Core] Make conversion between ray/grpc status more specific

* per comments

* lint

* per comments

* use ABORT instead of UNKNOWN, add some tests

* lint

* lint
2021-11-10 00:48:05 -08:00
SangBin Cho
3bae6b94b3
[test] Fix flaky chaos_test.py (#20202)
* Fix

* fix lint
2021-11-10 00:23:55 -08:00
Edward Oakes
5475bb054c
[job submission] Redirect stdout + stderr to a single log file (#20208) 2021-11-09 22:34:12 -08:00
Jiajun Yao
5ffa0bb01f
Listen on 127.0.0.1 if node ip is 127.0.0.1 (#20190) 2021-11-09 20:24:05 -08:00
Sungho Joo
dc51af798c
[RLlib] Minor fix on json encoding during worker sampling (#20134)
* import custom json encoder from util and improve encoder default function

* linting
2021-11-09 16:46:41 -08:00
matthewdeng
33af739bf2
[train] add placement group support (#20091)
* [train] add placement group support

* fix additional resources

* fix tests

* add comment to add_workers
2021-11-09 16:36:07 -08:00
Edward Oakes
f6399e3389
[job submission] Remove jobs intermediate directory for logs (#20192) 2021-11-09 16:20:40 -08:00
Kim Pevey
82a5bf68fa
[Docs] Add note for multi-node on Windows (#20184)
* add note for multi-node on Windows

* update message

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-11-09 16:02:01 -08:00
Guyang Song
6bb0659010
[core][bugfix] fix print ref count for an erased iterator (#20138) 2021-11-09 14:50:25 -08:00
Edward Oakes
39b3eb9763
[serve] Don't halt main control loop due to exceptions in snapshot logic (#20151) 2021-11-09 14:46:15 -08:00
Simon Mo
215f47bc53
[CI] Move Serve nightly tests to a separate suite (#20194)
So we can run them via separate cronjobs
2021-11-09 13:22:50 -08:00
Zyiqin-Miranda
333d0b43fd
[autoscaler] AWS Autoscaler CloudWatch Integration (#18619) 2021-11-09 11:48:55 -08:00
Kim Pevey
bbacb6d828
[Docs] Streaming MapReduce: Remove breaking city (#20187) 2021-11-09 11:15:57 -08:00
SangBin Cho
90fd38c64a
[Test] Large scale threaded actor workload (#20105)
* Done

* Addressed code review.

* lint

* Update release/nightly_tests/stress_tests/test_threaded_actors.py

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2021-11-09 02:28:48 -08:00
SangBin Cho
b0550aa440
[Core] Fix the named actor get or create race condition (#20126)
* Fix done.

* Fixed.

* clean up

* Done
2021-11-09 02:27:54 -08:00
Tao Wang
60df705b4e
[Cpp] Get the next job id globally instead of selecting one at random (#20102)
## Why are these changes needed?

## Related issue number
Final part of #13984

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-09 15:46:57 +08:00
Edward Oakes
c04e5af1eb
[job submission] Rename log files to job-driver-{job_id}.{out,err} (#20170) 2021-11-08 23:10:56 -08:00
Edward Oakes
50f2cf8a74
[job submission] Allow passing job_id, return DOES_NOT_EXIST when applicable (#20164) 2021-11-08 23:10:27 -08:00
Jiao
d46caa9856
[job submission] Remove test_utils dependency (#20168)
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-11-08 23:08:43 -08:00
Lingxuan Zuo
97259e33b2
Relink grpc/absl for streaming.so (#20136)
To avoid exporting third-party library symbols globally, these absl/grpc libs have been applied in _streaming.so.

Side-effect:
Static variables might be uninitialized if core worker lib and streaming lib both use them.
2021-11-09 14:13:53 +08:00
SangBin Cho
5c4fb4dc91
[Core]Chaos testing nightly (#20059)
* Done initial stage.

* lint

* .

* Finished.

* Fix lint
2021-11-08 21:57:53 -08:00
Stephanie Wang
ffcc5935d7
[core] Evict lineage to bound memory usage (#19946)
* bound lineage

* Bound lineage in bytes

* test

* Lineage evicted error

* Lineage evicted

* lint

* test

* test

* comment

* doc

* x

* x

* x

* x
2021-11-08 21:53:40 -08:00
architkulkarni
e5e62d8991
[runtime env] Fix runtime env conda test and enable it in CI (#20121) 2021-11-08 18:33:19 -08:00
Lixin Wei
8e666ca1e9
[Core] Fix Used Memory Calculation (#20127)
* fix memory

* fix
2021-11-08 17:36:32 -08:00
Kai Fricke
9c2b8c8501
[tune] Deprecate DurableTrainable (#19880) 2021-11-08 20:56:07 +00:00
Amog Kamsetty
f8430e6eca
[CI] Pin shortuuid to fix CI (#20153) 2021-11-08 12:08:32 -08:00
gjoliver
d8a61f801f
[RLlib] Create a set of performance benchmark tests to run nightly. (#19945)
* Create a core set of algorithms tests to run nightly.

* Run release tests under tf, tf2, and torch frameworks.

* Fix

* Add eager_tracing option for tf2 framework.

* make sure core tests can run in parallel.

* cql

* Report progress while running nightly/weekly tests.

* Include SAC in nightly lineup.

* Revert changes to learning_tests

* rebrand to performance test.

* update build_pipeline.py with new performance_tests name.

* Record stats.

* bug fix, need to populate experiments dict.

* Alphabetize yaml files.

* Allow specifying frameworks. And do not run tf2 by default.

* remove some debugging code.

* fix

* Undo testing changes.

* Do not run CQL regression for now.

* LINT.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-08 18:15:13 +01:00