10333 commits

Author SHA1 Message Date
Simon Mo
b6bd4fd5f3
[Serve] Don't recover from current state checkpoint (#19998) 2021-11-12 09:02:27 -08:00
xwjiang2010
ce8504b0b2
[CI] Rebalance Tune tests a bit. (#20263) 2021-11-12 15:30:18 +00:00
xwjiang2010
5f14eb3ee4
[Tune] Remove PG caching. (#19515)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2021-11-12 14:36:04 +00:00
Sven Mika
38c456b6f4
[RLlib; Tune] Fix rllib/train.py script after tune.Experiment c'tor change. (#20283) 2021-11-12 15:25:50 +01:00
Kai Fricke
246787cdd9
Revert "[RLlib] POC: PGTrainer class that works by sub-classing, not trainer_template.py. (#20055)" (#20284)
This reverts commit 6f85af435f.
2021-11-12 13:09:43 +00:00
Kai Fricke
d88fdd6e38
[tune] refactor SyncConfig (#20155) 2021-11-12 09:36:15 +00:00
SangBin Cho
7132f91789
[Core] Reduce the frequency of retry messages (#20175)
* Reduce the frequency of retry messages

* done
2021-11-11 23:52:37 -08:00
Sven Mika
70fe25055a
[RLlib] Issue: Get single step input dict incorrect. (#20217) 2021-11-12 08:38:51 +01:00
Edward Oakes
ee4e4f4036
[runtime_env] Support specifying the runtime_resources directory for testing (#20257) 2021-11-11 21:50:42 -08:00
architkulkarni
33f680095d
[Test] [runtime env] Retry wheel urls for up to 2h to give time for Mac wheels to build (#19337) 2021-11-11 21:48:35 -08:00
Edward Oakes
7c9881b73d
[serve] Fix serve_failure test (#20268) 2021-11-11 19:19:34 -08:00
Edward Oakes
eb6449b21b
[serve] Remove 5s halt from controller startup (#20262) 2021-11-11 19:18:43 -08:00
SangBin Cho
e901180a55
Do not import pytest in test util (#20252) 2021-11-12 12:09:28 +09:00
Qing Wang
7500f7d88a
Remove deprecated Java PG APIs. (#20219)
These APIs have been deprecated for 7+ months and 4+ versions, so it is time to remove them.
2021-11-12 09:29:48 +08:00
Qing Wang
5d773e75e6
Fix idle worker leak issue if it received a SIGTERM when DrainAndShutdown. (#19877)
This PR fixes an issue where a worker might be leaked if a task finished with errors.
See #19639 for more details.
2021-11-12 09:26:46 +08:00
mwtian
be29fa0302
[CI] make using gcc 9 explicit (#20147) 2021-11-11 16:12:40 -08:00
chenk008
74fa267c72
Enable worker in container CI test (#20174) 2021-11-11 16:11:06 -08:00
Edward Oakes
5ae5c1ba28
[job submission] Basic CLI prototype (#20204) 2021-11-11 15:59:13 -08:00
Teofilo Zosa
abf0eb53cc
Fix aiohttp 3.8.0 breaking changes (and unpin from 3.7) (#20261) 2021-11-11 15:35:20 -08:00
Michael Galarnyk
dbeb2e2f73
Add Ray Serve Blogs to Doc (#19846)
The Serving ML Models in Production blog link is in line with the latest Ray Summit talk on Ray Serve.
2021-11-11 15:10:36 -08:00
Edward Oakes
59698aa89c
[Serve] add survey link (#20230) 2021-11-11 15:10:10 -08:00
mwtian
0330852baf
[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111)
## Why are these changes needed?
This change adds a Python publisher and subscriber in `gcs_utils.py`, and a gRPC handler on GCS for publishing via GCS. Error info is migrated to use the GCS-based pubsub if the feature flag `RAY_gcs_grpc_based_pubsub=true` is set.

Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set after the Redis-less bootstrapping work.

Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.

GCS-based pubsub has gaps in handling deadlines, cancelled requests, and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.

## Related issue number
2021-11-11 14:59:57 -08:00
Simon Mo
fca851eef5
[Serve] Change ReplicaName to use internal prefix (#20067) 2021-11-11 14:21:34 -08:00
Jiajun Yao
992ab3e098
[Release] Commit sanity check when a url is provided (#20255) 2021-11-11 13:33:58 -08:00
Jules S. Damji
71a162d8ab
Fixed code snippet to include config parameter and a minor typo (#20193)
Signed-off-by: Jules S.Damji <jules@anyscale.com>

Co-authored-by: Jules S.Damji <jules@anyscale.com>
2021-11-11 18:37:03 +00:00
Dmitri Gekhtman
8971422d8f
[autoscaler] Use drain node api in autoscaler before terminating nodes (#20013)
* wip

* Draft

* Use bytes for node id

* remove stray helm change

* fix autoscaler init arg

* don't forget to instantiate new load metrics dict

* remove extraneous diff

* Timeout, comments, function signature.

* typo

* another comment

* tweak

* docstring

* shorter timeout

* Use a better error code

* missing self

* Dedent example

* Add drain node prometheus metric.

* comment

* Update tests part 1: test_autoscaler.py

* Update tests part 2: test_resource_demand_scheduler

* lint

* Update tests part 3: test_autoscaling_policy

* Unit tests for new Prometheus metric and DrainNode error handling.

* comment

* removed unused function

* Try adding ability to mock out process termination to fake node provider

* Add integration test.

* fix

* fix

* lint

* Improve log message

* fix

* Simplify test

* Fix doc example

* remove unused dict

* Mock out process termination in a subclass

* Add doc string and comment explaining prune active ips.

* Comment: wtf is use_node_id_as_ip

* one more comment

* more explanation

* period

* tweak
2021-11-11 08:31:40 -08:00
SangBin Cho
9fd8c6648c
[Test] Fix newly added nightly tests, threaded actor + chaos testing (#20220)
* Fix nightly tests

* done

* done
2021-11-11 05:01:19 -08:00
SangBin Cho
f3e3c04469
[Nightly test] Make report False by default. (#20238)
* Make report False by default.

* fix
2021-11-11 04:58:23 -08:00
Sven Mika
6f85af435f
[RLlib] POC: PGTrainer class that works by sub-classing, not trainer_template.py. (#20055) 2021-11-11 12:16:20 +01:00
xwjiang2010
883fbd003c
[CI; Tune] Split Tune tests and examples (#20210)
* Split Tune tests and examples part 1 into tests and examples separate.

* fix typo.

* fix typo.

* Add docs.
2021-11-11 10:50:51 +01:00
Will Drevo
2fdb1c46c7
[RLlib; Documentation] Added atari pip installs to Pong-v0 example. (#20225)
* Added imports to Pongv0 example

* Added comment

* Apply suggestions from code review

Co-authored-by: will <will@anyscale.com>
Co-authored-by: Sven Mika <sven@anyscale.io>
2021-11-11 09:08:02 +01:00
Siyuan (Ryans) Zhuang
8adcca54e8
[tune] Fix type error (#19872) 2021-11-10 21:35:38 -08:00
SangBin Cho
b2acfd6ff4
[Test] Change the frequency of many nodes actor test (#20232) 2021-11-10 21:12:22 -08:00
Yi Cheng
e54d3117a4
[gcs] Update all redis kv usage in python except function table (#20014)
## Why are these changes needed?
This is part of the Redis removal project. This PR removes all direct usage of Redis except the function table.
The function table will be migrated in the next PR.

## Related issue number
#19443
2021-11-10 20:24:53 -08:00
Tobias Kaymak
893f57591d
[serve] Add Google Cloud Storage as a backend (#20104) 2021-11-10 19:45:19 -08:00
SangBin Cho
5985c1902d
Add code owner to the symbol export (#20237) 2021-11-10 19:12:45 -08:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
Edward Oakes
082a4af3e6
[serve] Remove lingering backend/endpoint wording in docs (#20229) 2021-11-10 16:49:29 -08:00
gjoliver
b6b4aaa632
[Release] Fix stress_tests (#20233) 2021-11-10 16:05:46 -08:00
liuyang-my
efca009258
[Serve] Make Java Replica Extendable (#19463) 2021-11-10 15:05:37 -08:00
Edward Oakes
81f036d078
[job submission] Move job_manager to dashboard module, common parts to common.py (#20209) 2021-11-10 14:14:55 -08:00
Sven Mika
ebd56b57db
[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215) 2021-11-10 22:20:06 +01:00
Amog Kamsetty
f164f3a8b5
[Release] Increase Placement Group timeout (#20224) 2021-11-10 13:02:38 -08:00
Alex Wu
d85f7f3bfa
[windows][ci] Skip test_multinode_failures_2.py (typo) (#20206) 2021-11-10 12:05:45 -08:00
xwjiang2010
2fbbecf1e4
[release] Define worker node type even if no worker node is needed. (#20223) 2021-11-10 11:19:09 -08:00
architkulkarni
923131ba37
[runtime env] Enable reference counting for URIs for actors (#20165) 2021-11-10 10:52:03 -08:00
matthewdeng
790e22f9ad
[tune] move force_on_current_node to ml_utils (#20211) 2021-11-10 10:21:24 -08:00
Sven Mika
143d23a278
[RLlib] Issue 20062: Action inference examples missing (#20144) 2021-11-10 18:49:06 +01:00
DK.Pino
20f126896e
[Placement Group] [Test] Add fractional resources test for placement group (#20185)
* add fractional resources test

* lint
2021-11-10 07:25:49 -08:00
Kai Fricke
4e3e213549
[tune] Allow more versatile experiment analysis loading (#20181) 2021-11-10 11:46:27 +00:00