Commit graph

10960 commits

Author SHA1 Message Date
Jun Gong
1315293dd8
[RLlib] Fix offline RL(BC & MARWIL) weekly learning tests. (#21643) 2022-01-18 09:29:01 +01:00
mwtian
4faf3e1e31
[GCS] reenable test_client_reconnect.py for GCS HA builds (#21589)
In test_client_reconnect.py, each test case starts a Ray cluster via the client server's default_connect_handler(). The Ray cluster shuts down implicitly when start_middleman_server() ends and Python garbage-collects the client server. After turning on GCS pubsub, the time at which the client server is garbage-collected changes. Sometimes the Ray cluster from a previous test case stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (a race during worker shutdown, to be investigated separately).

This PR makes sure each test case shuts down its Ray cluster.
2022-01-17 23:08:47 -08:00
Guyang Song
c321e6e5bd
[script] support using hostname as node_ip_address (#20720) 2022-01-18 11:05:50 +08:00
Gagandeep Singh
970b7b2a4b
Unskip tests from ci.sh (#21483) 2022-01-17 15:22:57 -08:00
Rong Ma
f54282147c
[PlacementGroup] Support using any available bundle in java api (#21496)
In Python or C++, we can specify the bundle index as -1 to use any available bundle in the placement group. We should also enable this in Java to keep the API consistent across all languages.
2022-01-18 01:58:02 +08:00
Qing Wang
a5cabb324b
Remove streaming deploying process. (#21603)
1. Remove the streaming from deploying to maven central.
2. Remove related streaming stuff from setup.py.
2022-01-17 23:37:48 +08:00
Qing Wang
6f82bff7ff
[Java] Change ActorLifetime API: DEFAULT -> NON_DETACHED (#21639)
This PR changes the enum value `ActorLifetime.DEFAULT` to `ActorLifetime.NON_DETACHED`. `ActorLifetime` has not been introduced in any released version (<= 1.9.2).

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-17 18:10:12 +08:00
Qing Wang
2c3be852ab
[Java] Support defining ConcurrencyGroup statically in Java. (#20373)
This PR introduces APIs for statically defining a ConcurrencyGroup in Java.
It adds 2 annotations:
1. `@DefConcurrencyGroup`, applied to an actor class to define a concurrency group statically.
2. `@UseConcurrencyGroup`, applied to an actor method to select the concurrency group that the method runs in.

Examples are below:

```java
@DefConcurrencyGroup(name = "io", maxConcurrency = 2)
@DefConcurrencyGroup(name = "compute", maxConcurrency = 4)
private static class MyActor {
  // The return values below are placeholders so that the methods compile.
  @UseConcurrencyGroup(name = "io")
  public long f1() { return 1L; }

  @UseConcurrencyGroup(name = "io")
  public long f2() { return 2L; }

  @UseConcurrencyGroup(name = "compute")
  public long f3(int a, int b) { return a + b; }

  @UseConcurrencyGroup(name = "compute")
  public long f4() { return 4L; }
}

// The actor handle is parameterized by the actor class.
ActorHandle<MyActor> myActor = Ray.actor(MyActor::new).remote();
myActor.task(MyActor::f1).remote();
myActor.task(MyActor::f2).remote();
// f3 takes two int arguments; 3 and 5 are illustrative values.
myActor.task(MyActor::f3, 3, 5).remote();
myActor.task(MyActor::f4).remote();
```
`MyActor` has 3 concurrency groups: `io` with a max concurrency of 2, `compute` with a max concurrency of 4, and `default` with a max concurrency of 1.
`f1` and `f2` are executed in `io`; `f3` and `f4` are executed in `compute`.
2022-01-17 16:23:10 +08:00
Yi Cheng
87d852fc28
[gcs/ha] Fix some tests failed in HA mode (#21587)
This PR fixes and re-enables the following tests in HA mode:

- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api 
- //python/ray/tests:test_ray_debugger
2022-01-16 21:53:14 -08:00
jon-chuang
5f7224bd51
[C++ API] fix wrong arg handling for object references in TaskExecutor, TaskArgByReference (#21236)
Previously, reference args were handled incorrectly: the object ref itself was serialized, instead of the RayObject, and passed as the args buffer to the user function.

That's because CoreWorker is the component responsible for ensuring that all ObjectReferences are resolved and serialized into `RayObject`s at the time of the `task_execution_callback` invocation, not any component downstream of the callback. 

This resulted in the following error for large objects which are not turned into `TaskArg::value` due to being over 100KB.
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught due to lack of testing for large objects, which has now been added.
2022-01-17 12:08:15 +08:00
Simon Mo
86bbf28e4c
[CI] Fix test_get_deployment and test_runtime_env_validation (#21637) 2022-01-16 17:25:14 -08:00
Yi Cheng
927c5467eb
[gcs/function table] Change function table keys' prefix from binary to hex (#21616)
When cleaning up the function table, we use the prefix to delete the data. But right now the prefix contains binary data, which doesn't work well with Redis KEYS/SCAN because they use `*` in the pattern.

For example, when the job id increases to 41, it'll delete the keys for job 1, which leads to the new worker failing to import the function.

This PR uses the hex encoding of the job id to avoid this.
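
The effect can be illustrated with a minimal sketch. It is not the code from this PR: the `fn:` key layout and the helper names are hypothetical, and Python's `fnmatch` stands in for Redis glob matching. Job id 42 is used only because its raw byte value is `*` in this assumed layout; the real layout, and the job id that triggers the problem in Ray, may differ.

```python
import fnmatch


def binary_prefix(job_id: int) -> bytes:
    # Hypothetical key layout: a tag followed by the raw 4-byte job id.
    return b"fn:" + job_id.to_bytes(4, "big")


def hex_prefix(job_id: int) -> bytes:
    # Same layout, but the job id is hex-encoded, so it only contains 0-9a-f.
    return b"fn:" + job_id.to_bytes(4, "big").hex().encode("ascii")


def keys_matching_prefix(keys, prefix: bytes):
    # Stand-in for a glob-style scan with the pattern "<prefix>*".
    return [k for k in keys if fnmatch.fnmatchcase(k, prefix + b"*")]


binary_keys = [binary_prefix(1) + b":f", binary_prefix(42) + b":g"]
hex_keys = [hex_prefix(1) + b":f", hex_prefix(42) + b":g"]

# The raw id of job 42 ends in byte 0x2a ("*"), so its pattern also matches job 1.
print(keys_matching_prefix(binary_keys, binary_prefix(42)))
# The hex prefix contains no glob characters, so only job 42's key matches.
print(keys_matching_prefix(hex_keys, hex_prefix(42)))
```

With the binary prefix, a cleanup of one job can match another job's keys; with the hex prefix, the only wildcard in the pattern is the trailing `*`.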
2022-01-15 21:58:14 -08:00
Kai Fricke
0e9e8824e4
[ci/release] use s3 sync (#21626)
Previous changes failed because of a) permission errors and b) unzip being unavailable on remote nodes. We now use gzipped tar archives instead.

This reverts commit 42bcab27e8.
2022-01-15 17:53:19 -08:00
Kai Fricke
d84154a774
[ci/multinode] Add utilities to kill nodes in multi node testing (#21580)
Killing nodes enables advanced fault tolerance testing. This PR adds utilities and a test for this functionality in fake multinode docker mode.
2022-01-15 17:11:16 -08:00
Eric Liang
a971774820
Improve errors raised by ds.groupby() of unsupported key type (#21610) 2022-01-15 16:35:31 -08:00
Kai Yang
4a55d10bb1
[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988)
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the arrow block format implementation for now. They'll be implemented in a later PR.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-01-15 17:28:34 +08:00
Archit Kulkarni
26057c433f
[CI] pin uvicorn to 0.16.0 to fix serve (#21612) 2022-01-14 16:00:51 -08:00
Kai Fricke
42bcab27e8
Revert "[Release Test] Opt-in tests to use K8s based cloud. (#21583)" (#21605)
This reverts commit 0d5fbcc7bb.
2022-01-14 11:46:52 -08:00
Gagandeep Singh
f8bcb8aeb6
Unskipped tests in test_actor.py (#21501) 2022-01-14 08:46:46 -08:00
Jun Gong
7517aefe05
[RLlib] Bring back BC and Marwil learning tests. (#21574) 2022-01-14 14:35:32 +01:00
Jialing He
ded4128ebf
[Core] dlmalloc allocate bottom-most memory chunk failed (#21439)
Fix a dlmalloc allocation bug; details in #21310.
* fix dlmalloc bug

* make lint happy

* make lint happy

* fix by comment

* use _check_spilled_mb

* add cpp UT
2022-01-13 23:53:29 -08:00
Jiajun Yao
e0f4636477
Fix simple dataset sort generating only 1 non-empty block (#21588) 2022-01-13 23:50:24 -08:00
Richard Liaw
169e422937
[docs] Make Jobs more prominent in documentation (#21575) 2022-01-13 23:49:34 -08:00
Matti Picus
f4da0410b3
WINDOWS: unskip actor, component_failure, failure tests (#21492)
Unskip Windows tests that pass locally
2022-01-13 23:16:22 -08:00
Stephanie Wang
1df67eb977
[core] Avoid ObjectID collisions for re-executed tasks (#21395)
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.

This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
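
The idea can be sketched as follows. This is an illustration only, not Ray's actual ID derivation: the function name, the SHA-1 hashing, and the byte layout are all assumptions made for the example.

```python
import hashlib


def derive_object_id(task_id: bytes, attempt: int, index: int,
                     use_attempt: bool = True) -> str:
    # Hypothetical derivation: hash a seed plus the index of the
    # ray.put / .remote call made by the task.
    seed = task_id + (attempt.to_bytes(4, "big") if use_attempt else b"")
    return hashlib.sha1(seed + index.to_bytes(4, "big")).hexdigest()


task_id = b"task-1"

# Seeding with the task ID alone: a re-execution regenerates the same ID.
old_first = derive_object_id(task_id, 0, 1, use_attempt=False)
old_retry = derive_object_id(task_id, 1, 1, use_attempt=False)
print(old_first == old_retry)

# Seeding with task ID plus attempt number: each attempt gets distinct IDs,
# while the derivation stays deterministic within one attempt.
new_first = derive_object_id(task_id, 0, 1)
new_retry = derive_object_id(task_id, 1, 1)
print(new_first == new_retry)
```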

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-13 18:18:55 -08:00
Yi Cheng
e4ba51f25b
[core] Add GC for function table (#21509)
In Ray, functions are exported to the function table during runtime, but they are not cleaned up after use. This PR garbage collects the resource when there is no job/detached actor referencing the resource.

Ideally, the function table import/export feature should move to core, so this PR introduces a GCS function manager; for now it is used for reference counting only.
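
A reference-counting cleanup of this kind might look roughly like the sketch below. The class and method names are hypothetical and do not mirror the GCS function manager's interface; the sketch only shows entries being dropped once the last job or detached actor referencing them goes away.

```python
from collections import defaultdict


class FunctionTable:
    """Toy function table that drops a job's exports once nothing references them."""

    def __init__(self):
        self._functions = defaultdict(dict)  # job_id -> {function key: payload}
        self._refs = defaultdict(int)        # job_id -> live jobs/detached actors

    def export(self, job_id, key, payload):
        self._functions[job_id][key] = payload

    def add_ref(self, job_id):
        self._refs[job_id] += 1

    def remove_ref(self, job_id):
        self._refs[job_id] -= 1
        if self._refs[job_id] <= 0:
            # Last reference is gone: garbage collect everything for this job.
            self._refs.pop(job_id, None)
            self._functions.pop(job_id, None)

    def has(self, job_id, key):
        return key in self._functions.get(job_id, {})


table = FunctionTable()
table.add_ref("job-1")            # the job itself
table.add_ref("job-1")            # a detached actor created by the job
table.export("job-1", "f", b"pickled function")

table.remove_ref("job-1")         # the job finishes, the actor still lives
still_there = table.has("job-1", "f")
table.remove_ref("job-1")         # the detached actor exits
collected = not table.has("job-1", "f")
```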
2022-01-13 18:06:05 -08:00
Simon Mo
0d5fbcc7bb
[Release Test] Opt-in tests to use K8s based cloud. (#21583) 2022-01-13 17:20:36 -08:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turns the flag off
2022-01-13 16:12:03 -08:00
mwtian
30968a9358
[GCS] support external Redis in GCS bootstrapping mode (#21436)
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic is removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.

Also, re-enable a few tests which restart GCS in GCS bootstrapping mode, by using external Redis for KV storage.
2022-01-13 16:01:11 -08:00
Jiajun Yao
d6dbf3b8bf
[scheduler] Set default max_pending_lease_requests_per_scheduling_category to 10 (#20404) 2022-01-13 13:50:56 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
mwtian
cf6a54ca46
[CI] pin pytest-asyncio (#21579) 2022-01-13 11:35:30 -08:00
Sven Mika
3ac4daba07
[RLlib] Discussion 4351: Conv2d default filter tests and add default setting for 96x96 image obs space. (#21560) 2022-01-13 18:50:42 +01:00
Kai Fricke
a3442df584
[ci/multinode] Build multinode image with OpenSSH before running tests (#21544)
Currently we install OpenSSH on the fly in fake multinode docker testing. Instead we can speed testing up a fair bit by first building a Docker image which includes OpenSSH and then running tests with this image.
2022-01-13 08:47:04 -08:00
Avnish Narayan
c0f1202278
[RLlib] MultiAgentEnv pre-checker (#21476) 2022-01-13 11:31:22 +01:00
Sven Mika
90c6b10498
[RLlib] Decentralized multi-agent learning; PR #01 (#21421) 2022-01-13 10:52:55 +01:00
Gagandeep Singh
d392f97331
Unskipped tests in serve: test_controller_recovery.py (#21450) 2022-01-13 01:09:59 -08:00
Yi Cheng
a6e76c2803
[nightly] Disable bootstrapping from gcs (#21570)
Right now, the testing infra doesn't support running Ray without Redis. Disable it temporarily so that we can still test the rest of the functionality.
2022-01-12 23:02:42 -08:00
Ruoyun Huang
a36b7a9908
[doc]Update doc for profiling using the correct VARs (#21561)
Based on code here: https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L702

Also verified that the ENV vars as currently documented make "ray start" crash.
2022-01-12 23:01:51 -08:00
SangBin Cho
f5fdbeb594
Refactor event tracker out of asio class (#21215)
This refactors the event tracker to be decoupled from the asio class.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-01-12 22:43:31 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default. This PR also fixes several tests which failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
Clark Zinzow
7a1aaac86c
[Core] Small comment/docstring fixes in cluster task manager header. (#21539) 2022-01-12 19:35:38 -08:00
Max Pumperla
703c161034
[doc] Fix sklearn doc error, introduce MyST markdown parser (#21527) 2022-01-12 15:17:28 -08:00
Max Pumperla
54dd2d0644
[docs] remove old site (#21528) 2022-01-12 15:13:53 -08:00
Gagandeep Singh
13f20e5e1e
[Serve] Unskipped tests in test_pipeline.py (#21484) 2022-01-12 13:56:50 -08:00
Sven Mika
188324c5c7
[RLlib] Issue 21552: unsquash_action and clip_action (when None) cause wrong actions computed by Trainer.compute_single_action. (#21553) 2022-01-12 18:56:51 +01:00
Guyang Song
0627f841b2
[runtime env][observability]print debug string for runtime env uri reference table (#21309)
The debug log looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529305-89b01151-7d76-4fda-89ed-0e13802207b3.png)

The debug state looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529369-60222b99-595a-441d-8fe6-fb3e6ae13ac2.png)
2022-01-12 08:33:53 +00:00
Jiajun Yao
25035152bc
Fix SchedulingClassInfo.running_tasks memory leak (#21535)
In some cases, the task that's added to the `running_tasks` is never removed and introduces wait time for all the following tasks due to worker cap. One such case is lease request cancellation: the request is cancelled after `PopWorker` is called and the task is never removed from `running_tasks`.
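
The failure mode can be sketched with a toy model. The class below is hypothetical and far simpler than the real scheduler; it only shows why a cancellation path that skips the removal keeps a slot occupied under the worker cap.

```python
class SchedulingClass:
    """Toy model of a per-class cap on concurrently running tasks."""

    def __init__(self, cap: int):
        self.cap = cap
        self.running_tasks = set()

    def try_start(self, task_id: str) -> bool:
        # A task may only start while the class is below its cap.
        if len(self.running_tasks) >= self.cap:
            return False
        self.running_tasks.add(task_id)
        return True

    def finish(self, task_id: str) -> None:
        self.running_tasks.discard(task_id)

    def cancel(self, task_id: str) -> None:
        # Without this removal, a cancelled task would hold its slot forever
        # and delay every later task of the same class.
        self.running_tasks.discard(task_id)


sched = SchedulingClass(cap=1)
started_a = sched.try_start("a")   # takes the only slot
blocked_b = sched.try_start("b")   # rejected while "a" holds the slot
sched.cancel("a")                  # the lease request for "a" is cancelled
started_b = sched.try_start("b")   # the slot is free again
```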
2022-01-11 23:13:27 -08:00
Sven Mika
95d1476494
[RLlib; github] Update RLlib codeowners. (#21453)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-01-11 22:25:33 -08:00
Eric Liang
a69ae1d886
Add blogs to dataset materials (#21546) 2022-01-11 22:09:57 -08:00