Commit graph

304 commits

Author SHA1 Message Date
Guyang Song
cf2cb66d29
[runtime env][java] Support runtime env config in Java (#28083)
Support job level and task/actor level runtime env config eg. `setupTimeoutSeconds` and `eagerInstall`.
2022-08-26 08:37:39 +08:00
Guyang Song
06b0e715c7
[runtime env] plugin refactor [7/n]: support runtime env in C++ API (#27010)
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
2022-07-27 18:24:31 +08:00
Guyang Song
419e78180a
[runtime env] plugin refactor[6/n]: java api refactor (#26783) 2022-07-26 09:00:57 +08:00
Tao Wang
5a0ca8da10
Revert "[Test]Disable java call cpp actor case for now (#26288)" (#26462)
The hanging is caused by hiding symbols(see https://github.com/ray-project/ray/issues/26435), let's enable this test again.
2022-07-13 10:42:48 +08:00
Tao Wang
bb6c805bd7
[Java worker][Cpp worker]Support Java call Cpp Task (#26182) 2022-07-12 17:49:22 +08:00
Tao Wang
b3ba1e7ea2
[Test]Disable java call cpp actor case for now (#26288) 2022-07-06 19:53:30 +08:00
Kai Yang
ba642dd271
[Java] Make Java test more stable (#26282)
If compile Ray in debug mode,

* run `MetricsTest:: testAddHistogram` will crash with below error message:

```
BucketBoundaries::Explicit called with non-monotonic boundary list.
java: external/io_opencensus_cpp/opencensus/stats/internal/bucket_boundaries.cc:64: opencensus::stats::BucketBoundaries::Explicit(std::__debug::vector<double>)::<lambda()>: Assertion `false && "0"' failed.
```

*  run `NamespaceTest::testIsolationInTheSameNamespaces` can fail with great possibility with below error message:

```
java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at io.ray.test.NamespaceTest.lambda$testIsolationInTheSameNamespaces$2(NamespaceTest.java:39)
	at io.ray.test.NamespaceTest.testIsolation(NamespaceTest.java:116)
	at io.ray.test.NamespaceTest.testIsolationInTheSameNamespaces(NamespaceTest.java:36)
```
2022-07-05 11:18:19 +08:00
Qing Wang
2d4663d0cd
[Java] Support getCurrentNodeId API for RuntimeContext (#26147)
Add an API to get the node id of this worker, see usage:
```java
UniqueId currNodeId = Ray.getRuntimeContext().getCurrentNodeId();
```
for the requirement from Ray Serve.
2022-06-30 16:19:32 +08:00
Qing Wang
cb77209ce1
[Java] Allow to specify zero CPU as resource. (#26148)
This is aligned to the behavior of Python resources validation.
2022-06-29 22:53:00 +08:00
Tao Wang
49cafc6323
[Cpp worker][Java worker]Support Java call Cpp Actor (#25933) 2022-06-29 14:33:32 +08:00
Qing Wang
8884bfb445
[Java] Support starting named actors in different namespace. (#25995)
Allow you start actors in different namespace instead of the driver namespace.
Usage is simple:
```java
Ray.init(namespace="a");
/// Named actor a will starts in namespace `b`
ActorHandle<A> a = Ray.actor(A::new).setName("myActor", "b").remote();
```

Co-authored-by: Hao Chen <chenh1024@gmail.com>
2022-06-28 13:49:15 +08:00
Tao Wang
593a522abd
[Cpp worker]Support cpp call java actor (#25581) 2022-06-14 14:17:14 +08:00
Eric Liang
48acbf0d69
[hotfix] Revert "[runtime env] runtime env inheritance refactor (#24538)" (#25487)
This reverts commit eb2692c.

This is a temporary mitigation for #25484
2022-06-05 14:55:38 -07:00
Qing Wang
65d863d349
Revert "Revert "[Java] Remove RayRuntimeInternal class (#25016)" (#25… (#25153)
This reverts commit 804b6b11d1.
2022-05-26 14:15:51 +08:00
Kai Fricke
804b6b11d1
Revert "[Java] Remove RayRuntimeInternal class (#25016)" (#25139)
This reverts commit 4026b38b09.

Broke test_raydp_dataset
2022-05-24 13:17:47 +01:00
Qing Wang
4026b38b09
[Java] Remove RayRuntimeInternal class (#25016)
Due to we have already removed the multiple workers in one process, remove RayRuntimeInternal for purpose.
2022-05-24 09:22:48 +08:00
Guyang Song
eb2692cb32
[runtime env] runtime env inheritance refactor (#24538)
* [runtime env] runtime env inheritance refactor (#22244)

Runtime Environments is already GA in Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). And now, we already supported a [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior as follows (copied from the doc):
- The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment.
- Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent.

We think this runtime env merging logic is so complex and confusing to users because users can't know the final runtime env before the jobs are run.

Current PR tries to do a refactor and change the behavior of Runtime Environments inheritance. Here is the new behavior:
- **If there is no runtime env option when we create actor, inherit the parent runtime env.**
- **Otherwise, use the optional runtime env directly and don't do the merging.**

Add a new API named `ray.runtime_env.get_current_runtime_env()` to get the parent runtime env and modify this dict by yourself. Like:
```Actor.options(runtime_env=ray.runtime_env.get_current_runtime_env().update({"X": "Y"}))```
This new API also can be used in ray client.
2022-05-20 10:53:54 +08:00
Qing Wang
af418fb729
[Java][API CHANGE] Move exception to api module. (#24540)
This PR moves all exception classes from runtime module to api module. It's aiming to eliminate the confusion about ray exceptions. It means that Ray users don't need to touch runtime module when API programming after this PR.

Note that this should be merged onto 2.0.
2022-05-19 10:18:20 +08:00
Qing Wang
cc621ff08a
[Java][API CHANGE] Rename mode SINGLE_PROCESS to LOCAL (#24714)
for aligning to the key concept local mode, this PR renames SINGLE_PROCESS to LOCAL.
2022-05-19 10:17:24 +08:00
Qing Wang
eb29895dbb
[Core] Remove multiple core workers in one process 1/n. (#24147)
This is the 1st PR to remove the code path of multiple core workers in one process. This PR is aiming to remove the flags and APIs related to `num_workers`.
After this PR checking in, we needn't to consider the multiple core workers any longer.

The further following PRs are related to the deeper logic refactor, like eliminating the gap between core worker and core worker process,  removing the logic related to multiple workers from workerpool, gcs and etc.

**BREAK CHANGE**
This PR removes these APIs:
- Ray.wrapRunnable();
- Ray.wrapCallable();
- Ray.setAsyncContext();
- Ray.getAsyncContext();

And the following APIs are not allowed to invoke in a user-created thread in local mode:
- Ray.getRuntimeContext().getCurrentActorId();
- Ray.getRuntimeContext().getCurrentTaskId()

Note that this PR shouldn't be merged to 1.x.
2022-05-19 00:36:22 +08:00
Qing Wang
d40fa391a5
[RuntimeEnv][Java] Support runtime env jars for job. (#24725)
This PR supports specifying the  jars(or zip packages) for a job, which are used for all workers for this job.

You can specify jars or zips in the config file of your job:
```yml
ray {
  job {
    runtime-env: {
         "jars": [
            "https://my_host/a.jar",
            "https://my_host/b.jar"
         ]
    }
  }
}
```
or via system properties:
```java
System.setProperty("ray.job.runtime-env.jars.0", "https://my_host/a.jar");
System.setProperty("ray.job.runtime-env.jars.1", "https://my_host/a.jar");
Ray.init();
// all workers of this job will add a.jar and b.jar into the classpath.
```
2022-05-16 15:07:02 +08:00
Kai Yang
f5c6c7d28f
[Core] Allow failing new tasks immediately while the actor is restarting (#22818)
Currently, when an actor has `max_restarts` > 0 and has crashed, the actor will enter RESTARTING state and then ALIVE. Imagine this scenario: an online service provides HTTP service and the proxy actor receives requests, forwards them to worker actors, and replies to clients with the execution results from worker actors.

```
                                                        -> Worker A (actor)
                                                       /
                                                      /
HTTP requests -------> Proxy (actor with HTTP server) ---> Worker B (actor)
                                                      \
                                                       \
                                                        -> ...
```

For each HTTP request, the proxy picks one worker (e.g. worker A) based on some algorithm, sends the request to it, and calls `ray.get()` to wait for the result. If for some reason the picked worker crashed, Ray will restart the actor, and `ray.get()` will throw an error. The proxy may pick another worker (e.g. worker B) and re-send the request to it. This is OK.

But new requests keep coming. The proxy may pick worker A again. But because worker A is still in RESTARTING state, it's not ready to serve requests. `ray.get()` on subsequent requests sent to worker A will hang until worker A is back online (ALIVE state). The proxy won't be able to reschedule these requests to another worker because currently there's no way to know if worker A is alive or not before sending a request. We can't say worker A is not alive just based on whether `ray.get()` hangs either.

To solve this issue, we change the semantics of `max_task_retries`.

* When max_task_retries is 0 (which is the default value), if the callee actor is in the RESTARTING state, subsequently submitted tasks will fail immediately with a RayActorError. Users can catch the RayActorError and implement their own fallback strategies to improve service availability and mitigate service outages.
* When max_task_retries is not 0, subsequently submitted tasks will be queued on the caller side and we only send them to the callee when the callee actor is back to the ALIVE state.

TODO

- [x] Add test cases.
- [ ] Update docs.
- [x] API change review.
2022-05-14 10:48:47 +08:00
Qing Wang
2627c7b5bc
[Core] Use async post instead of PostBlocking for concurrency group executor. (#24293)
Aiming to:
1. addressing the bug about concurrency group, see #19593
2. improving the stability of the ray call latency perf in online applications.

we're proposing using async post instead of `PostBlocking` in threadpool.

Note that since we have already had back pressure in the caller side, I believe this change is safe to merge and it doesn't break any behavior.
2022-05-13 11:30:52 +08:00
Qing Wang
3208cfc167
[Runtime env][Java] Add unit tests for specifying jars for tasks. (#24712)
It seems that we have already supported specifying java jars for normal tasks, this PR only needs to add unit tests for that.
2022-05-13 09:46:20 +08:00
Qing Wang
259661042c
[runtime env] [java] Support jars in runtime env for Java (#24170)
This PR supports setting the jars for an actor in Ray API. The API looks like:
```java
class A {
    public boolean findClass(String className) {
      try {
        Class.forName(className);
      } catch (ClassNotFoundException e) {
        return false;
      }
      return true;
    }
}

RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
    .addJars(ImmutableList.of("https://github.com/ray-project/test_packages/raw/main/raw_resources/java-1.0-SNAPSHOT.jar"))
    .build();
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
boolean ret = actor1.task(A::findClass, "io.testpackages.Foo").remote().get();
System.out.println(ret); // true
```
2022-05-12 09:34:40 +08:00
Qing Wang
c5252c5ceb
[Java] Support parallel actor in experimental. (#21701)
For the purpose to provide an alternative option for running multiple actor instances in a Java worker process, and the eventual goal is to remove the original multi-worker-instances in one worker process implementation.  we're proposing supporting parallel actor concept in Java. This feature enables that users could define some homogeneous parallel execution instances in an actor, and all instances hold one thread as the execution backend.

### Introduction

For the following example, we define a parallel actor with  10 parallelism. The backend actor has 10 concurrency groups for the parallel executions, it also means there're 10 threads for that.

We can access the instance by the instance handle, like:
```java
ParallelActorHandle<A> actor = ParallelActor.actor(A::new).setParallelism(10).remote();
ParallelInstance<A> instance = actor.getInstance(/*index=*/ 2);
Preconditions.checkNotNull(instance);
Ray.get(instance.task(A::incr, 1000000).remote()); // print 1000000           

instance = actor.getInstance(/*index=*/ 2);
Preconditions.checkNotNull(instance);
Ray.get(instance.task(A::incr, 2000000).remote().get()); // print 3000000

instance = actor.getInstance(/*index=*/ 3);
Preconditions.checkNotNull(instance);
Ray.get(instance.task(A::incr, 2000000).remote().get()); // print 2000000
```


### Limitation
- It doesn't support concurrency group on a parallel actor yet.

Co-authored-by: Kai Yang <kfstorm@outlook.com>
2022-04-21 22:54:33 +08:00
Qing Wang
77b0015ea0
[Java] Add NO_RESTART and INFINITE_RESTART constants. (#23771) 2022-04-12 10:40:44 +08:00
Qing Wang
ef5b9b87d3
[Java] Add set runtime env api for normal task. (#23412)
This PR adds the API `setRuntimeEnv` for submitting a normal task, for the usage:
```java
RuntimeEnv runtimeEnv =
    new RuntimeEnv.Builder()
        .addEnvVar("KEY1", "A")
        .build();

/// Return `A`
Ray.task(RuntimeEnvTest::getEnvVar, "KEY1").setRuntimeEnv(runtimeEnv).remote().get();
```
2022-03-24 15:57:24 +08:00
Qing Wang
160e2ca9f8
[Java] Add per job runtime env env vars. (#23366)
1. Support setting environment variables in runtime env for a job, like:
```yaml
ray : {
  job : {
    runtime-env: {
         // Environment variables to be set on worker processes in current job.
         "env-vars": {
           // key1: "value11"
           // key2: "value22"
         }
      }
   }
}
```
It could be set by system properties before `Ray.init()` as well:
```java
  System.setProperty("ray.job.runtime-env.env-vars.KEY1", "A");
  System.setProperty("ray.job.runtime-env.env-vars.KEY2", "B");

  Ray.init();
```

2. Setting environment variables for an actor will overwrite and merge to the environment variables of job.
```java
System.setProperty("ray.job.runtime-env.env-vars.KEY1", "A");
System.setProperty("ray.job.runtime-env.env-vars.KEY2", "B");

Ray.init();
  
RuntimeEnv runtimeEnv = new RuntimeEnv.Builder().addEnvVar("KEY1", "C").build();

/// actor1 has the env vars: {"KEY1" : "C", "KEY2" : "B"}
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
/// actor2 has the env vars: {"KEY1" : "A", "KEY2" : "B"} 
ActorHandle<A> actor2 = Ray.actor(A::new).remote();
```
2022-03-23 08:00:00 +08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
Qing Wang
9572bb717f
[RuntimeEnv] Support setting actor level env vars for Java worker (#22240)
This PR supports setting actor level env vars for Java worker in runtime env.
General API looks like:
```java
RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
    .addEnvVar("KEY1", "A")
    .addEnvVar("KEY2", "B")
    .addEnvVar("KEY1", "C")  // This overwrites "KEY1" to "C"
    .build();

ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
```

If `num-java-workers-per-process` > 1, it will never reuse the worker process except they have the same runtime envs.

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-02-28 10:58:37 +08:00
Simon Mo
bfb619a127
[xlang] Allow Python to call overloaded methods with differing number of parameters (#21410) 2022-02-24 16:51:38 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Shawn
6603ad450a
[Java] print hang test case name (#21804)
* print hang test case name

* use getFullTestName
2022-01-23 23:56:44 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Support the ability to specify a default lifetime for actors which are not specified lifetime when creating. This is a job level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a non-detached actor.
```

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
Yi Cheng
3c63a8410d
[gcs/ha] Fix java related error when enable redisless ray (#21692)
This PR enables ray java to be able to run without redis. It also fixes java related tests and updated the pipeline.
2022-01-20 13:56:25 -08:00
Rong Ma
f54282147c
[PlacementGroup] Support using any available bundle in java api (#21496)
In python or C++, we can specify the bundle index as -1 to use any available bundle in the placement group. We should also enable it in Java to keep the API consistent across all languages.
2022-01-18 01:58:02 +08:00
Qing Wang
2c3be852ab
[Java] Support defining ConcurrencyGroup statically in Java. (#20373)
This PR introduces statically defining ConcurrencyGroup APIs in Java.
We introduce 2 APIs:
1. Introducing `@DefConcurrencyGroup` annotation for an actor class to define a concurrency group statically.
2. Introducing `@UseConcurrencyGroup` annotation for actor methods to define the concurrency group to be used in the method.

Examples are below:

```java
 @DefConcurrencyGroup(name = "io", maxConcurrency = 2)
  @DefConcurrencyGroup(name = "compute", maxConcurrency = 4)
  private static class MyActor {
    @UseConcurrencyGroup(name = "io")
    public long f1() { }

    @UseConcurrencyGroup(name = "io")
    public long f2() { }

    @UseConcurrencyGroup(name = "compute")
    public long f3(int a, int b) { }

    @UseConcurrencyGroup(name = "compute")
    public long f4() { }
  }

ActorHandle<> myActor = Ray.actor(MyActor::new).remote();
myActor.task(MyActor::f1).remote();
myActor.task(MyActor::f2).remote();
myActor.task(MyActor::f3).remote();
myActor.task(MyActor::f4).remote();
```
`MyActor` has 3 concurrency groups: `io` with 2 concurrency, `compute` with 4 concurrency and `default` with 1 concurrency.
f1 and f2 will be executed in `io`, f3 and f4 will be executed in `compute`.
2022-01-17 16:23:10 +08:00
Qing Wang
bb647626cf
[Xlang][Java] Fix Java overrided default method cannot be invoked. (#21491)
In Xlang(Python call Java), a Java method which overrides a `default` method of the super class is not able to be invoked successfully, due to we treat it as overloaded method instead of overrided method. This PR correctly handle it at the case it overrides a `default` method.

Before this PR, the following usage is not able to be invoked from Python -> Java.
```Java
public interface ExampleInterface {
  default String echo(String inp) {
    return inp;
  }
}
public class ExampleImpl implements ExampleInterface {
  @Override
  public String echo(String inp) {
    return inp + " echo";
  }
}
```
```python
/// Invoke it in Python.
cls = ray.java_actor_class("io.ray.serve.util.ExampleImpl")
handle = cls.remote()
print(ray.get(handle.echo.remote("hi")))
```
2022-01-11 23:11:24 +08:00
Clark Zinzow
da4cc26449
[CI] Disable Java log rotation test. (#21394) 2022-01-05 14:51:27 -08:00
Qing Wang
240e6efe21
[Java] Try to fix flaky NamespaceTest (#21370) 2022-01-05 09:01:34 +08:00
Qing Wang
340fbf53c0
[Java] Support actor handle reference counting. (#21249) 2022-01-01 10:26:22 +08:00
Qing Wang
663e14b232
[Java] Fix namespace test case. (#21280)
Since we've supported lifetime in Java, we should set the DETACHED for the detached actors in test.
2021-12-28 22:31:51 +08:00
Qing Wang
2df27a5f87
[Java] Support ActorLifetime (#21074)
We add a enum class ActorLifetime to indicate the lifetime of an actor. In this PR, we also add the necessary API to create an actor with specifying lifetime.
Currently, it has 2 values: detached and default.
2021-12-23 19:48:56 +08:00
WanXing Wang
72bd2d7e09
[Core] Support back pressure for actor tasks. (#20894)
Resubmit the PR https://github.com/ray-project/ray/pull/19936

I've figure out that the test case `//rllib:tests/test_gpus::test_gpus_in_local_mode` failed due to deadlock in local mode.
In local mode, if the user code submits another task during the executing of current task, the `CoreWorker::actor_task_mutex_` may cause deadlock.
The solution is quite simple, release the lock before executing task in local mode.

In the commit 7c2f61c76c:
1. Release the lock in local mode to fix the bug. @scv119 
2. `test_local_mode_deadlock` added to cover the case. @rkooo567 
3. Left a trivial change in `rllib/tests/test_gpus.py` to make the `RAY_CI_RLLIB_DIRECTLY_AFFECTED ` to take effect.
2021-12-13 23:56:07 -08:00
Kai Fricke
d4413299c0
Revert "[Core] Support back pressure for actor tasks (#19936)" (#20880)
This reverts commit a4495941c2.
2021-12-03 17:48:47 -08:00
WanXing Wang
a4495941c2
[Core] Support back pressure for actor tasks (#19936)
Support back pressure in core worker.
Job config added for python worker and java worker.
2021-12-02 14:41:30 -08:00
Qing Wang
cd2b83a259
[Core][ConcurrencyGroup] Fix blocking task in default group block tasks in other group. (#20525)
Why are these changes needed?
If max concurrency is 1 in default group, a blocking task executing in default group will block the following tasks in different group. See reproduction script in #20475

The issue is due to tasks executing in the default concurrent group run in the main task execution thread, and tasks in other concurrent groups will be blocked if the main task execution thread is blocked.

This PR only changes concurrent actor behavior that default group will not block other groups.

Related issue number
Fix #20475
2021-11-25 14:24:17 +08:00
Lixin Wei
a912b68375
[Java] Reenable Named Actor Test. (#20627)
We skipped testGetNonExistingNamedActor for some reason. Now this test is ready to enable. This PR reenables this test.
2021-11-22 16:25:16 +08:00
Larry
454db6902c
[Java] Add timeout parameter for Ray.get() API (#20282)
Why are these changes needed?

Add timeout(ms) param for Java ray.get. The API changes have been updated to doc ([Ray Core Walkthrough]->[Fetching Results]).

eg:
ObjectRef<Integer> objRef = Ray.put(1);
objRef.get(1000) 
Ray.get(Ray.task(MyRayApp::slowFunction).remote(), 3000)

Related issue number
#20247
2021-11-17 11:02:17 +08:00