10315 commits

Author SHA1 Message Date
Teofilo Zosa
abf0eb53cc
Fix aiohttp 3.8.0 breaking changes (and unpin from 3.7) (#20261) 2021-11-11 15:35:20 -08:00
Michael Galarnyk
dbeb2e2f73
Add Ray Serve Blogs to Doc (#19846)
The Serving ML Models in Production blog link is in line with the latest Ray Summit talk on Ray Serve.
2021-11-11 15:10:36 -08:00
Edward Oakes
59698aa89c
[Serve] add survey link (#20230) 2021-11-11 15:10:10 -08:00
mwtian
0330852baf
[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111)
## Why are these changes needed?
This change adds a Python publisher and subscriber in `gcs_utils.py`, and a gRPC handler on GCS for publishing via GCS. Error info is migrated to use the GCS-based pubsub if the feature flag `RAY_gcs_grpc_based_pubsub=true` is set.

Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set after the Redis-less bootstrapping work.

Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.

GCS-based pubsub has gaps in handling deadlines, cancelled requests, and GCS restarts, so 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.

## Related issue number
2021-11-11 14:59:57 -08:00
Simon Mo
fca851eef5
[Serve] Change ReplicaName to use internal prefix (#20067) 2021-11-11 14:21:34 -08:00
Jiajun Yao
992ab3e098
[Release] Commit sanity check when a url is provided (#20255) 2021-11-11 13:33:58 -08:00
Jules S. Damji
71a162d8ab
Fixed code snippet to include config parameter and a minor typo (#20193)
Signed-off-by: Jules S.Damji <jules@anyscale.com>

Co-authored-by: Jules S.Damji <jules@anyscale.com>
2021-11-11 18:37:03 +00:00
Dmitri Gekhtman
8971422d8f
[autoscaler] Use drain node api in autoscaler before terminating nodes (#20013)
* wip

* Draft

* Use bytes for node id

* remove stray helm change

* fix autoscaler init arg

* don't forget to instantiate new load metrics dict

* remove extraneous diff

* Timeout, comments, function signature.

* typo

* another comment

* tweak

* docstring

* shorter timeout

* Use a better error code

* missing self

* Dedent example

* Add drain node prometheus metric.

* comment

* Update tests part 1: test_autoscaler.py

* Update tests part 2: test_resource_demand_scheduler

* lint

* Update tests part 3: test_autoscaling_policy

* Unit tests for new Prometheus metric and DrainNode error handling.

* comment

* removed unused function

* Try adding ability to mock out process termination to fake node provider

* Add integration test.

* fix

* fix

* lint

* Improve log message

* fix

* Simplify test

* Fix doc example

* remove unused dict

* Mock out process termination in a subclass

* Add doc string and comment explaining prune active ips.

* Comment: wtf is use_node_id_as_ip

* one more comment

* more explanation

* period

* tweak
2021-11-11 08:31:40 -08:00
SangBin Cho
9fd8c6648c
[Test] Fix newly added nightly tests, threaded actor + chaos testing (#20220)
* Fix nightly tests

* done

* done
2021-11-11 05:01:19 -08:00
SangBin Cho
f3e3c04469
[Nightly test] Make report False by default. (#20238)
* Make report False by default.

* fix
2021-11-11 04:58:23 -08:00
Sven Mika
6f85af435f
[RLlib] POC: PGTrainer class that works by sub-classing, not trainer_template.py. (#20055) 2021-11-11 12:16:20 +01:00
xwjiang2010
883fbd003c
[CI; Tune] Split Tune tests and examples (#20210)
* Split Tune tests and examples part 1 into tests and examples separate.

* fix typo.

* fix typo.

* Add docs.
2021-11-11 10:50:51 +01:00
Will Drevo
2fdb1c46c7
[RLlib; Documentation] Added atari pip installs to Pong-v0 example. (#20225)
* Added imports to Pongv0 example

* Added comment

* Apply suggestions from code review

Co-authored-by: will <will@anyscale.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
2021-11-11 09:08:02 +01:00
Siyuan (Ryans) Zhuang
8adcca54e8
[tune] Fix type error (#19872) 2021-11-10 21:35:38 -08:00
SangBin Cho
b2acfd6ff4
[Test] Change the frequency of many nodes actor test (#20232) 2021-11-10 21:12:22 -08:00
Yi Cheng
e54d3117a4
[gcs] Update all redis kv usage in python except function table (#20014)
## Why are these changes needed?
This is part of the Redis removal project. This PR removes all direct usage of Redis except for the function table.
The function table will be migrated in the next PR.

## Related issue number
#19443
2021-11-10 20:24:53 -08:00
Tobias Kaymak
893f57591d
[serve] Add Google Cloud Storage as a backend (#20104) 2021-11-10 19:45:19 -08:00
SangBin Cho
5985c1902d
Add code owner to the symbol export (#20237) 2021-11-10 19:12:45 -08:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
Edward Oakes
082a4af3e6
[serve] Remove lingering backend/endpoint wording in docs (#20229) 2021-11-10 16:49:29 -08:00
gjoliver
b6b4aaa632
[Release] Fix stress_tests (#20233) 2021-11-10 16:05:46 -08:00
liuyang-my
efca009258
[Serve] Make Java Replica Extendable (#19463) 2021-11-10 15:05:37 -08:00
Edward Oakes
81f036d078
[job submission] Move job_manager to dashboard module, common parts to common.py (#20209) 2021-11-10 14:14:55 -08:00
Sven Mika
ebd56b57db
[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215) 2021-11-10 22:20:06 +01:00
Amog Kamsetty
f164f3a8b5
[Release] Increase Placement Group timeout (#20224) 2021-11-10 13:02:38 -08:00
Alex Wu
d85f7f3bfa
[windows][ci] Skip test_multinode_failures_2.py (typo) (#20206) 2021-11-10 12:05:45 -08:00
xwjiang2010
2fbbecf1e4
[release] Define worker node type even if no worker node is needed. (#20223) 2021-11-10 11:19:09 -08:00
architkulkarni
923131ba37
[runtime env] Enable reference counting for URIs for actors (#20165) 2021-11-10 10:52:03 -08:00
matthewdeng
790e22f9ad
[tune] move force_on_current_node to ml_utils (#20211) 2021-11-10 10:21:24 -08:00
Sven Mika
143d23a278
[RLlib] Issue 20062: Action inference examples missing (#20144) 2021-11-10 18:49:06 +01:00
DK.Pino
20f126896e
[Placement Group] [Test] Add fractional resources test for placement group (#20185)
* add fractional resources test

* lint
2021-11-10 07:25:49 -08:00
Kai Fricke
4e3e213549
[tune] Allow more versatile experiment analysis loading (#20181) 2021-11-10 11:46:27 +00:00
DK.Pino
2c41936a39
[Placement Group] [Test] Fix pg.ready hang forever when gcs restarting (#20063)
* fixed

* lint

* fix comment

* revert previous fix code
2021-11-10 00:53:42 -08:00
Tao Wang
507bd9186b
[Core] Make conversion between ray/grpc status more specific (#20047)
* [Core] Make conversion between ray/grpc status more specific

* per comments

* lint

* per comments

* use ABORT instead of UNKNOWN, add some tests

* lint

* lint
2021-11-10 00:48:05 -08:00
SangBin Cho
3bae6b94b3
[test] Fix flaky chaos_test.py (#20202)
* Fix

* fix lint
2021-11-10 00:23:55 -08:00
Edward Oakes
5475bb054c
[job submission] Redirect stdout + stderr to a single log file (#20208) 2021-11-09 22:34:12 -08:00
Jiajun Yao
5ffa0bb01f
Listen on 127.0.0.1 if node ip is 127.0.0.1 (#20190) 2021-11-09 20:24:05 -08:00
Sungho Joo
dc51af798c
[RLlib] Minor fix on json encoding during worker sampling (#20134)
* import custom json encoder from util and improve encoder default function

* linting
2021-11-09 16:46:41 -08:00
matthewdeng
33af739bf2
[train] add placement group support (#20091)
* [train] add placement group support

* fix additional resources

* fix tests

* add comment to add_workers
2021-11-09 16:36:07 -08:00
Edward Oakes
f6399e3389
[job submission] Remove jobs intermediate directory for logs (#20192) 2021-11-09 16:20:40 -08:00
Kim Pevey
82a5bf68fa
[Docs] Add note for multi-node on Windows (#20184)
* add note for multi-node on Windows

* update message

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-11-09 16:02:01 -08:00
Guyang Song
6bb0659010
[core][bugfix] fix print ref count for an erased iterator (#20138) 2021-11-09 14:50:25 -08:00
Edward Oakes
39b3eb9763
[serve] Don't halt main control loop due to exceptions in snapshot logic (#20151) 2021-11-09 14:46:15 -08:00
Simon Mo
215f47bc53
[CI] Move Serve nightly tests to a separate suite (#20194)
So we can run them via separate cronjobs
2021-11-09 13:22:50 -08:00
Zyiqin-Miranda
333d0b43fd
[autoscaler] AWS Autoscaler CloudWatch Integration (#18619) 2021-11-09 11:48:55 -08:00
Kim Pevey
bbacb6d828
[Docs] Streaming MapReduce: Remove breaking city (#20187) 2021-11-09 11:15:57 -08:00
SangBin Cho
90fd38c64a
[Test] Large scale threaded actor workload (#20105)
* Done

* Addressed code review.

* lint

* Update release/nightly_tests/stress_tests/test_threaded_actors.py

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2021-11-09 02:28:48 -08:00
SangBin Cho
b0550aa440
[Core] Fix the named actor get or create race condition (#20126)
* Fix done.

* Fixed.

* clean up

* Done
2021-11-09 02:27:54 -08:00
Tao Wang
60df705b4e
[Cpp] Get next job id globally instead of selecting randomly (#20102)
## Why are these changes needed?

## Related issue number
Final part of #13984

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-09 15:46:57 +08:00
Edward Oakes
c04e5af1eb
[job submission] Rename log files to job-driver-{job_id}.{out,err} (#20170) 2021-11-08 23:10:56 -08:00