10924 commits

Author SHA1 Message Date
Gagandeep Singh
d392f97331
Unskipped tests in serve: test_controller_recovery.py (#21450) 2022-01-13 01:09:59 -08:00
Yi Cheng
a6e76c2803
[nightly] Disable bootstrapping from gcs (#21570)
Right now, the testing infra doesn't support running Ray without Redis. Disable it for now so that we can still test the rest of the functionality.
2022-01-12 23:02:42 -08:00
Ruoyun Huang
a36b7a9908
[doc]Update doc for profiling using the correct VARs (#21561)
Based on code here: https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L702

Also verified that the ENV vars, as is, make "ray start" crash.
2022-01-12 23:01:51 -08:00
SangBin Cho
f5fdbeb594
Refactor event tracker out of asio class (#21215)
This refactors the event tracker to be decoupled from the asio class.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-01-12 22:43:31 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default. This PR also fixes several tests that failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
Clark Zinzow
7a1aaac86c
[Core] Small comment/docstring fixes in cluster task manager header. (#21539) 2022-01-12 19:35:38 -08:00
Max Pumperla
703c161034
[doc] Fix sklearn doc error, introduce MyST markdown parser (#21527) 2022-01-12 15:17:28 -08:00
Max Pumperla
54dd2d0644
[docs] remove old site (#21528) 2022-01-12 15:13:53 -08:00
Gagandeep Singh
13f20e5e1e
[Serve] Unskipped tests in test_pipeline.py (#21484) 2022-01-12 13:56:50 -08:00
Sven Mika
188324c5c7
[RLlib] Issue 21552: unsquash_action and clip_action (when None) cause wrong actions computed by Trainer.compute_single_action. (#21553) 2022-01-12 18:56:51 +01:00
Guyang Song
0627f841b2
[runtime env][observability]print debug string for runtime env uri reference table (#21309)
The debug log looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529305-89b01151-7d76-4fda-89ed-0e13802207b3.png)

The debug state looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529369-60222b99-595a-441d-8fe6-fb3e6ae13ac2.png)
2022-01-12 08:33:53 +00:00
Jiajun Yao
25035152bc
Fix SchedulingClassInfo.running_tasks memory leak (#21535)
In some cases, the task that's added to the `running_tasks` is never removed and introduces wait time for all the following tasks due to worker cap. One such case is lease request cancellation: the request is cancelled after `PopWorker` is called and the task is never removed from `running_tasks`.
2022-01-11 23:13:27 -08:00
Sven Mika
95d1476494
[RLlib; github] Update RLlib codeowners. (#21453)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-01-11 22:25:33 -08:00
Eric Liang
a69ae1d886
Add blogs to dataset materials (#21546) 2022-01-11 22:09:57 -08:00
Kai Fricke
c1d4c22351
[ci/multinode] Follow-up fix for resource popping (#21543)
The previous change in #21531 unfortunately added the fix at the wrong location. This PR corrects this.
2022-01-11 17:47:42 -08:00
Antoni Baum
6ba4513777
[tune] Load experiment into searcher (#21506)
This PR adds a new method to the Searcher class, add_evaluated_trials. This method wraps around add_evaluated_point and allows the user to pass a Trial, list of Trials or ExperimentAnalysis to load into the searcher. Furthermore, this PR updates the HEBO version to the latest and removes outdated documentation, and adds add_evaluated_point methods to Dragonfly and SkOpt searchers.
2022-01-11 15:58:20 -08:00
Matti Picus
ec6a33b736
[tune] fixes to allow tune/tests/test_commands.py to run on windows (#21342)
tune does not run smoothly on Windows. This cleans up some blockers:
- use cross-platform shutil.get_terminal_size instead of Popen(stty)
- somehow Trainer.workers is None at the end of test_commands.py, so the cleanup command was erroring. The error was not fatal, but was printed in the logs.
- if run locally, the log files are all written to the same location, so the rsync-based syncing solution is not needed. This is the real fix for issue #20747
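The first bullet can be illustrated with a minimal sketch. `shutil.get_terminal_size` is part of the Python standard library; the helper name `terminal_width` and the fallback values are assumptions made for this example and are not taken from the PR.

```python
import shutil


def terminal_width(default_columns: int = 80) -> int:
    """Return the terminal width without shelling out to `stty`.

    Hypothetical helper for illustration only.
    """
    # get_terminal_size first checks the COLUMNS/LINES environment
    # variables, then queries the terminal, and uses the fallback when
    # neither is available (for example when output is piped).
    size = shutil.get_terminal_size(fallback=(default_columns, 24))
    return size.columns


print(terminal_width())
```

Because no subprocess is involved, the same call works on Windows, Linux, and macOS.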
2022-01-11 15:57:20 -08:00
mwtian
45cddef2d3
[GCS] disable tests related to GCS restarting in GCS pubsub mode (#21534)
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add a comment about how we may fix them.

Also, suppress exceptions when sync subscribers are disconnected from GCS.

I can push changes in this PR to #21513 as well.
2022-01-11 14:14:05 -08:00
Kai Fricke
084bda87a5
[ci/multinode] Fix resource popping resulting in empty resource head nodes (#21531)
Fixes a small bug where we pop from the resources dict without making a copy, which sometimes leaves the head node with empty resources.
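A minimal sketch of this class of bug, assuming a plain dict-based node config. The function names and the `"resources"` key layout are hypothetical and only illustrate why popping from a shared dict needs a copy; they are not the code touched by the PR.

```python
import copy


def take_resources_buggy(node_config: dict) -> dict:
    # Bug: pop() mutates the caller's dict, so the "resources" entry is
    # gone the next time the same config is used.
    return node_config.pop("resources", {})


def take_resources_fixed(node_config: dict) -> dict:
    # Fix: pop from a copy so the original config stays intact.
    config = copy.deepcopy(node_config)
    return config.pop("resources", {})


buggy_config = {"resources": {"CPU": 4}}
first_buggy = take_resources_buggy(buggy_config)
second_buggy = take_resources_buggy(buggy_config)  # resources already gone

fixed_config = {"resources": {"CPU": 4}}
first_fixed = take_resources_fixed(fixed_config)
second_fixed = take_resources_fixed(fixed_config)

print(first_buggy, second_buggy)  # {'CPU': 4} {}
print(first_fixed, second_fixed)  # {'CPU': 4} {'CPU': 4}
```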
2022-01-11 13:20:58 -08:00
Yi Cheng
d2d749b6f9
[workflow] Fix test_serialization.py (#21522)
The new version of responses introduces some errors in the test. This PR pins the responses version.

It also pins moto to guard against future upstream updates.
2022-01-11 11:45:18 -08:00
Sven Mika
f94bd99ce4
[RLlib] Issue 21044: Improve error message for "multiagent" dict checks. (#21448) 2022-01-11 19:50:03 +01:00
mwtian
0e5de61c18
remove unnecessary test filter (#21510)
(Comment from the PR:)
If a gRPC call exceeds its timeout, the call is cancelled on the client side, but the server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped could be the long-term solution,
but its complexity, and the fact that Ray subscribers do not use deadlines in production, make it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.

Also, re-enable some additional tests in GCS HA mode.
2022-01-11 10:17:03 -08:00
Gagandeep Singh
d47b82883a
Unskipped non-cluster tests in test_actor_resources.py (#21500) 2022-01-11 09:46:03 -08:00
Gagandeep Singh
a5a8156198
Unskipped tests in test_actor_failures (#21498) 2022-01-11 09:42:12 -08:00
Gagandeep Singh
e8df34af08
Unskipped test in test_autoscaling_policy (#21497)
The test passes on my Windows Azure VM. 

P.S. Is it related to cluster tests? I am not sure.
2022-01-11 09:40:37 -08:00
SangBin Cho
097706b35d
[Internal Observability] Re-enable event stats again. (#21515)
I tried reproducing the many pg mini integration failure from this PR: https://github.com/ray-project/ray/pull/21216, but I could not reproduce it. (This was the only test that became flaky when we turned on the flag last time.)

I tried
- Run tests:test_placement_group_mini_integration 5 times instead of 3 (the default)
- Re-run the PR 3 times.

So I think it is worth trying to re-enable it.
2022-01-11 09:00:27 -08:00
Jamie Slome
a68bd2fcfd
Create SECURITY.md (#21521) 2022-01-11 08:54:51 -08:00
Qing Wang
bb647626cf
[Xlang][Java] Fix Java overrided default method cannot be invoked. (#21491)
In Xlang (Python calling Java), a Java method that overrides a `default` method of the super class cannot be invoked successfully, because we treat it as an overloaded method instead of an overridden one. This PR correctly handles the case where a method overrides a `default` method.

Before this PR, the following could not be invoked from Python -> Java.
```Java
public interface ExampleInterface {
  default String echo(String inp) {
    return inp;
  }
}
public class ExampleImpl implements ExampleInterface {
  @Override
  public String echo(String inp) {
    return inp + " echo";
  }
}
```
```python
import ray

# Invoke it in Python.
cls = ray.java_actor_class("io.ray.serve.util.ExampleImpl")
handle = cls.remote()
print(ray.get(handle.echo.remote("hi")))
```
2022-01-11 23:11:24 +08:00
Eric Liang
9ac34ecc94
Revert "[workflow] Skip saving outputs of "workflow.wait"" (#21520)
This is breaking linux://python/ray/workflow:tests/test_wait per https://flakey-tests.ray.io/
2022-01-10 20:51:42 -08:00
Kai Fricke
5a7f6e4fdd
[rfc][ci] create fake docker-compose cluster environment (#20256)
Following #18987 this PR adds a docker-compose based local multi node cluster.

The fake multinode docker comprises two parts. The docker_monitor.py script is a watch script calling docker compose up whenever the docker-compose.yaml changes. The node provider creates and updates the docker compose according to the autoscaling requirements.

This mode fully supports autoscaling and comes with test utilities to start and connect to docker-compose autoscaling environments. There's also a sample test case showing how this can be used.
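The watch-script idea described above can be sketched as a simple polling loop. This is a minimal illustration under stated assumptions: the function name `watch_file`, its parameters, and the polling strategy are hypothetical, and the actual docker_monitor.py may detect changes differently.

```python
import os
import tempfile
import time
from typing import Callable, Optional


def watch_file(path: str,
               on_change: Callable[[str], None],
               poll_interval: float = 1.0,
               max_polls: Optional[int] = None) -> None:
    """Call on_change(path) whenever the file's modification time changes."""
    last_mtime = None
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            mtime = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            mtime = None  # file not written yet; keep waiting
        if mtime is not None and mtime != last_mtime:
            on_change(path)
            last_mtime = mtime
        polls += 1
        time.sleep(poll_interval)


# Demo: a temporary file stands in for docker-compose.yaml.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as handle:
    handle.write("services: {}\n")
    demo_path = handle.name

seen = []
watch_file(demo_path, seen.append, poll_interval=0.0, max_polls=2)
os.remove(demo_path)
print(len(seen))  # 1: the first poll sees the file, the second sees no change
```

In a real monitor, `on_change` would presumably bring the compose project up to date rather than append to a list.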
2022-01-11 04:35:36 +00:00
Gagandeep Singh
4a8a8b30b0
Skipped test_reference_counting_2 and test_actor (#21507) 2022-01-10 20:34:03 -08:00
Yi Cheng
65598b3bb0
[gcs] Re-enable release tests with GCS HA (#21511)
Re-enable release tests with GCS HA mode.
2022-01-10 16:35:57 -08:00
hckuo
7955333ffd
[runtime env] allow working_dir to be a zipped package (#20826)
Check if working_dir is a zip, unzip it if so.
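The check-then-unzip step can be sketched with the standard library. The helper name `resolve_working_dir` and the choice of a temporary extraction directory are assumptions for this example; the runtime env code in the PR may name and place things differently.

```python
import os
import tempfile
import zipfile


def resolve_working_dir(working_dir: str) -> str:
    """Return a directory path, unpacking working_dir first if it is a zip."""
    if os.path.isfile(working_dir) and zipfile.is_zipfile(working_dir):
        target = tempfile.mkdtemp(prefix="working_dir_")
        with zipfile.ZipFile(working_dir) as archive:
            archive.extractall(target)
        return target
    # Already a plain directory (or something else): leave it untouched.
    return working_dir


# Demo: build a small zip, then resolve both the zip and a plain directory.
with tempfile.TemporaryDirectory() as scratch:
    archive_path = os.path.join(scratch, "package.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("main.py", "print('hello')\n")
    unpacked = resolve_working_dir(archive_path)
    unchanged = resolve_working_dir(scratch)
    scratch_path = scratch

print(os.listdir(unpacked))  # ['main.py']
```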
2022-01-10 18:29:01 -06:00
Siyuan (Ryans) Zhuang
6e568d2c02
[workflow] Skip saving outputs of "workflow.wait" (#21183) 2022-01-10 15:37:13 -08:00
Jiajun Yao
aec37d4b60
Add container utils (#21444)
- Add debug_string helper functions for common containers.
- Add map_find_or_die helper function
2022-01-10 15:29:29 -08:00
Amog Kamsetty
bcae6ba6c9
[Train] _WrappedDataLoader yield tuples (#21467)
Fixes a bug in _WrappedDataLoader where it yields a generator instead of a tuple.

Addresses https://discuss.ray.io/t/ray-train-creates-typeerror-generator-object-is-not-subscriptable/4605/10
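The difference between yielding a generator and yielding a tuple can be shown in plain Python. The functions below are hypothetical stand-ins, not the `_WrappedDataLoader` implementation; they only show why downstream code that indexes a batch fails on a generator.

```python
def wrap_batch_buggy(batch):
    # A generator expression: lazy, single-use, and not subscriptable.
    return (item for item in batch)


def wrap_batch_fixed(batch):
    # Materialize into a tuple so callers can index and unpack it.
    return tuple(item for item in batch)


batch = ([1.0, 2.0], [0, 1])

features = wrap_batch_fixed(batch)[0]  # indexing works on a tuple

try:
    wrap_batch_buggy(batch)[0]
    buggy_error = None
except TypeError as err:
    # Indexing a generator raises TypeError.
    buggy_error = err

print(features)
print(type(buggy_error).__name__)  # TypeError
```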
2022-01-10 12:40:36 -08:00
Qing Wang
57ff13461c
[Java] Use localhost instead of public ip (#21462)
Use the localhost IP address instead of the public IP to avoid security popups on macOS.
This also reverts commit e4542be0d1.
2022-01-11 02:58:22 +08:00
Zyiqin-Miranda
71fae21e8e
[autoscaler] AWS Autoscaler CloudWatch Dashboard support (#20266)
These changes add a set of improvements to enable automatic creation and update of CloudWatch dashboards when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:

1. Get rapid insights into their cluster state via CloudWatch dashboards.
2. Update their CloudWatch dashboard JSON configuration files during `ray up` execution time.

Notes:
1. This PR is a follow-up to #18619 and adds dashboard support.
2022-01-10 10:18:53 -08:00
Gagandeep Singh
6420c75fd2
Unskipped test in test_advanced_2.py (#21503) 2022-01-10 09:06:44 -08:00
Sven Mika
92f030331e
[RLlib] Initial code/comment cleanups in preparation for decentralized multi-agent learner. (#21420) 2022-01-10 11:22:55 +01:00
Sven Mika
4eaf70942d
[RLlib] Issue 21297: Ignore PPO KL-loss term completely if kl-coeff == 0.0 to avoid NaN values due to some discrete action probs==0.0 (#21456) 2022-01-10 11:22:40 +01:00
Sven Mika
35af30a446
[RLlib] Issue 21109: Action unsquashing causes inf/NaN actions for unbounded action spaces. (#21110) 2022-01-10 11:20:37 +01:00
Sven Mika
b10d5533be
[RLlib] Issue 20920 (partial solution): contrib/MADDPG + pettingzoo coop-pong-v4 not working. (#21452) 2022-01-10 11:19:40 +01:00
qicosmos
f8244a4cc0
[C++ Worker]fix uninit worker context (#21371) 2022-01-10 17:17:41 +08:00
Matti Picus
5aef1e1708
remove deprecated unittest aliases (#21455)
In a [recent review](https://discuss.python.org/t/experience-with-python-3-11-in-fedora/12911) of the Fedora team's experience porting packages to the upcoming Python 3.11, they remarked that most of the work was in removing deprecated aliases in unittest. I came across a few of these when looking at unrelated test failures; the DeprecationWarnings caught my eye. So I made a quick sweep of the code, using `git grep` to find occurrences of the deprecated aliases:

old | new
---|---
assertEquals | assertEqual
assertNotEquals | assertNotEqual
assertRaisesRegexp | assertRaisesRegex
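
The replacements in the table can be seen in a small self-contained test case. The class and test names are made up for this example; only the `unittest` assertion methods themselves are real.

```python
import unittest


class AliasMigrationExample(unittest.TestCase):
    def test_equal(self):
        # was: self.assertEquals(1 + 1, 2)
        self.assertEqual(1 + 1, 2)

    def test_not_equal(self):
        # was: self.assertNotEquals(1 + 1, 3)
        self.assertNotEqual(1 + 1, 3)

    def test_raises_regex(self):
        # was: self.assertRaisesRegexp(ValueError, "invalid literal")
        with self.assertRaisesRegex(ValueError, "invalid literal"):
            int("not a number")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AliasMigrationExample)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```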
2022-01-09 20:29:54 -08:00
Gagandeep Singh
c43d4cc028
Unskipped test in test_kv_store.py (#21451)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-01-09 14:38:55 -08:00
Sven Mika
34cee199b1
[RLlib] from remote_vector_env import ... -> from remote_base_env import ... (avoid deprecation warning). (#21460) 2022-01-08 17:13:04 +01:00
Yi Cheng
4ab059eaa1
[gcs] Fix the server standalone tests in HA mode (#21480)
CoreWorker hangs before exiting if GCS exits first, due to an incorrect destruction order. This PR fixes this by stopping the GCS client first and then joining the thread.
2022-01-07 22:54:50 -08:00
Yi Cheng
bdfba88082
[2/3][kv] Add delete by prefix support for internal kv (#21442)
Delete by prefix for internal kv is necessary to clean up the function table. This will be used to fix issue #8822
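As a rough illustration of what delete-by-prefix means for a key-value store, here is a sketch over an in-memory dict. The function name, the byte-string keys, and the key layout are all hypothetical; the internal kv in the PR is backed by GCS and has its own API.

```python
from typing import Dict


def delete_by_prefix(store: Dict[bytes, bytes], prefix: bytes) -> int:
    """Remove every key that starts with prefix; return how many were removed."""
    # Collect matching keys first so the dict is not mutated while iterating.
    doomed = [key for key in store if key.startswith(prefix)]
    for key in doomed:
        del store[key]
    return len(doomed)


# Hypothetical key layout, for illustration only.
kv = {
    b"fn:job1:a": b"...",
    b"fn:job1:b": b"...",
    b"fn:job2:a": b"...",
}
removed = delete_by_prefix(kv, b"fn:job1:")
print(removed, sorted(kv))  # 2 [b'fn:job2:a']
```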
2022-01-07 22:54:24 -08:00
mwtian
4a34233a90
[Core] allow message in deprecation annotation (#21466) 2022-01-07 21:52:31 -08:00