Commit graph

2710 commits

Author SHA1 Message Date
Clark Zinzow
7a1aaac86c
[Core] Small comment/docstring fixes in cluster task manager header. (#21539) 2022-01-12 19:35:38 -08:00
Guyang Song
0627f841b2
[runtime env][observability] Print debug string for runtime env uri reference table (#21309)
The debug log looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529305-89b01151-7d76-4fda-89ed-0e13802207b3.png)

The debug state looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529369-60222b99-595a-441d-8fe6-fb3e6ae13ac2.png)
2022-01-12 08:33:53 +00:00
Jiajun Yao
25035152bc
Fix SchedulingClassInfo.running_tasks memory leak (#21535)
In some cases, a task that's added to `running_tasks` is never removed, which introduces wait time for all the following tasks due to the worker cap. One such case is lease request cancellation: the request is cancelled after `PopWorker` is called, and the task is never removed from `running_tasks`.
2022-01-11 23:13:27 -08:00
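A minimal sketch (in Python, with hypothetical names; Ray's cluster task manager is C++) of the bookkeeping invariant the commit above restores: any path that stops a task, including lease-request cancellation after `PopWorker`, must also remove it from `running_tasks`, or the worker cap blocks every later task.

```python
class SchedulingClassInfo:
    """Illustrative stand-in for the per-scheduling-class bookkeeping."""

    def __init__(self, worker_cap):
        self.worker_cap = worker_cap
        self.running_tasks = set()

    def pop_worker(self, task_id):
        # The task starts counting against the worker cap here.
        self.running_tasks.add(task_id)

    def can_dispatch(self):
        return len(self.running_tasks) < self.worker_cap

    def cancel_lease_request(self, task_id):
        # The fix: cancellation must release the cap slot too.
        self.running_tasks.discard(task_id)


info = SchedulingClassInfo(worker_cap=1)
info.pop_worker("task-1")
info.cancel_lease_request("task-1")  # without this line, the slot leaks forever
assert info.can_dispatch()
```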
SangBin Cho
097706b35d
[Internal Observability] Re-enable event stats again. (#21515)
I tried reproducing the pg mini integration failure from PR https://github.com/ray-project/ray/pull/21216, but failed. (This was the only test that became flaky when we turned on the flag last time.)

I tried
- Run tests:test_placement_group_mini_integration 5 times instead of 3 (the default)
- Re-run the PR 3 times.

So I think it is worth trying to re-enable it again.
2022-01-11 09:00:27 -08:00
Jiajun Yao
aec37d4b60
Add container utils (#21444)
- Add debug_string helper functions for common containers.
- Add `map_find_or_die` helper function.
2022-01-10 15:29:29 -08:00
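The commit above adds these helpers in C++; the following is a rough Python analogue, shown only to illustrate the intended behavior of the two helpers it names:

```python
def debug_string(container):
    # Render a container as a compact, human-readable string.
    if isinstance(container, dict):
        items = ", ".join(f"{k}: {v}" for k, v in container.items())
    else:
        items = ", ".join(str(v) for v in container)
    return "[" + items + "]"


def map_find_or_die(mapping, key):
    # Return mapping[key], failing loudly (with context) if the key is absent.
    try:
        return mapping[key]
    except KeyError:
        raise AssertionError(f"Key {key!r} not found in {debug_string(mapping)}")


print(debug_string({"a": 1, "b": 2}))   # [a: 1, b: 2]
print(map_find_or_die({"a": 1}, "a"))   # 1
```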
Yi Cheng
4ab059eaa1
[gcs] Fix the server standalone tests in HA mode (#21480)
CoreWorker hangs before exiting if GCS exits first, due to incorrect destruction ordering. This PR fixes this: it stops the GCS client first and then joins the thread.
2022-01-07 22:54:50 -08:00
Yi Cheng
bdfba88082
[2/3][kv] Add delete by prefix support for internal kv (#21442)
Delete by prefix for internal KV is necessary to clean up the function table. This will be used to fix issue #8822.
2022-01-07 22:54:24 -08:00
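A minimal sketch of the delete-by-prefix semantics, assuming string keys in a plain dict; the real internal KV is a GCS-backed C++ table, so this only illustrates how a finished job's keys could be cleaned up in one call:

```python
def delete_by_prefix(kv, prefix):
    # Collect first, then delete, so we never mutate while iterating.
    doomed = [k for k in kv if k.startswith(prefix)]
    for k in doomed:
        del kv[k]
    return len(doomed)


kv = {"job:1:fn:a": b"...", "job:1:fn:b": b"...", "job:2:fn:a": b"..."}
assert delete_by_prefix(kv, "job:1:") == 2   # job 1's function table is gone
assert list(kv) == ["job:2:fn:a"]
```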
Clark Zinzow
d6c02f46b9
Fix raylet command line arg descriptions. (#21478) 2022-01-07 21:46:36 -08:00
Yi Cheng
8fa9fddaa0
[1/3][kv] move some internal kv py logic into cpp (#21386)
This PR moves the internal KV namespace logic into C++ to reduce the logic in Python, for the following reasons:

- internal kv is used cross-language, so we have to move it to C++ so that all languages can benefit.
- for https://github.com/ray-project/ray/issues/8822 we need to delete resources when a job finishes in GCS

One extra field is also added to the delete operation so that we can delete by prefix instead of just by a single key.
2022-01-07 17:35:06 -08:00
Alex Wu
8cf4071759
[core] Nested tasks on by default (#20800)
This PR turns worker capping on by default. Note that this uncovers a couple of faulty tests, which are fixed here.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-06 15:00:03 -08:00
Jiajun Yao
48a5208645
Refactor ObjectManager wait logic to WaitManager (#21369)
- This PR moves the `ObjectManager::Wait` related logic into a separate WaitManager class.
- Fix the wait hang issue by not relying on the async object location notification, but instead checking whether the wait is complete when the local object is added; at that time the object is guaranteed to be local.
2022-01-06 10:42:31 -05:00
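A minimal sketch of the fix described above, with hypothetical names: completion is checked at the moment an object becomes local (when locality is guaranteed) rather than on an async location notification.

```python
class WaitManager:
    def __init__(self):
        self.local_objects = set()
        self.pending = []  # (object_ids, num_required, callback)

    def wait(self, object_ids, num_required, callback):
        self.pending.append((set(object_ids), num_required, callback))
        self._check_waits()

    def handle_object_local(self, object_id):
        self.local_objects.add(object_id)
        self._check_waits()  # the fix: re-check completion right here

    def _check_waits(self):
        still_pending = []
        for object_ids, num_required, callback in self.pending:
            ready = object_ids & self.local_objects
            if len(ready) >= num_required:
                callback(sorted(ready))
            else:
                still_pending.append((object_ids, num_required, callback))
        self.pending = still_pending


wm = WaitManager()
wm.wait({"obj-1", "obj-2"}, num_required=1, callback=print)
wm.handle_object_local("obj-2")  # prints ['obj-2']; the wait can no longer hang
```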
Qing Wang
132e2b2a96
[Core] Remove unused flag put_small_object_in_memory_store (#21284)
Since we have not been using the `put_small_object_in_memory_store` flag for a long time, it should be removed.
2022-01-06 14:46:58 +08:00
Qing Wang
3c68370fcf
[Core] Cache job_configs instead of ray_namespace. (#21279)
We need more than just the ray_namespace config of a job. In this PR, we cache the job_configs instead of ray_namespaces so that we can use them for other PRs (for example, PR #21249 needs the num_java_worker_pre_process item).

Also, before this PR the ray_namespaces_ cache was never cleared; we clear the cache in this PR.
2022-01-05 17:48:06 -08:00
Lixin Wei
64a2ba47d3
[Core] Rename PublisherService to SubscriberService (#20666)
`PublisherClient` is a more reasonable name than `SubscriberClient` since XClient means ‘client used to access X’, like GcsClient.

Besides, the current codebase already calls this client `publisher_client` (lines 329/333), while the actual class name is `SubscriberClient`; this is inconsistent.
a8d7897a56/src/ray/pubsub/subscriber.cc (L326-L339)
2022-01-05 05:40:45 -08:00
SangBin Cho
94af7ccc92
[Actor exception message improvement] Unify the schema + improve error messages. (#21219)
This PR handles this comment: https://github.com/ray-project/ray/pull/20903#discussion_r772635662

The PR
- Unifies the multiple actor-died errors into a single schema (runtime env and creation task exceptions cannot be unified).
- Improves each actor error message to include more metadata.
- Includes actor information in the actor death cause.
2022-01-04 23:22:57 -08:00
mwtian
70db5c5592
[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232)
After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster.

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2022-01-04 23:06:44 -08:00
Tao Wang
b9106483af
[Core]Clear the unnecessary fields before broadcasting (#20965)
Only `resource_available` and `resource_total` are used in the raylet, so let's clear the rest before broadcasting.
2022-01-03 15:56:41 -08:00
Qing Wang
340fbf53c0
[Java] Support actor handle reference counting. (#21249) 2022-01-01 10:26:22 +08:00
Jiajun Yao
9776e21842
Revert "Round robin during spread scheduling (#19968)" (#21293)
This reverts commit 60388b2834.
2021-12-30 10:33:06 +09:00
WanXing Wang
e5920dee8e
[Core]Refine StealTasks rpc. (#21258)
It seems that the `StealTasks` RPC is no different from other common RPC methods and should be implemented with the `VOID_RPC_CLIENT_METHOD` macro. We found this when merging code into our internal codebase.
2021-12-28 14:17:25 +08:00
Qing Wang
2df27a5f87
[Java] Support ActorLifetime (#21074)
We add an enum class `ActorLifetime` to indicate the lifetime of an actor. In this PR, we also add the necessary API to create an actor with a specified lifetime.
Currently, it has two values: detached and default.
2021-12-23 19:48:56 +08:00
Qing Wang
e653d47533
[Java] Shade some widely used dependencies in bazel_jar_jar rule. (#21237)
These dependencies are widely used:
- com.google.common
- com.google.protobuf
- com.google.thirdparty

We need to shade them to avoid conflicts with jars introduced by users.

In this PR, we introduce a `bazel_jar_jar` rule for doing this, and we also shade them in the Maven pom files.
2021-12-23 16:54:31 +08:00
Jiajun Yao
60388b2834
Round robin during spread scheduling (#19968) 2021-12-22 20:27:34 -08:00
SangBin Cho
99693096d6
[gRPC] Improve blocking call Placement group (#21130)
Use Sync methods with timeout for placement group RPCs
2021-12-22 17:21:56 -08:00
Yi Cheng
f62faca04c
[1/gcs] gcs ha bootstrap for raylet (#21174)
This is part of #21129

This PR tries to cover the cpp/ray part of the bootstrap, with some updates:

- remove the unused functions/tests
- some API updates

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 08:50:42 -08:00
SangBin Cho
5d3042ed9d
[Internal Observability] Record Raylet Gauge (#21049)
* Revert "[Please revert] Remove new metrics temporarily"

This reverts commit baf7846daa3d1dad50dbedac19b7afbae3e197fc.

* Addressed code review.

* [Please revert] Revert plasma stats for the next PR

* improve grammar

* Addressed code review v1.

* Addressed code review.

* Add code owner.

* Fix tests.

* Add code owner to metric_defs.cc
2021-12-21 00:34:48 -08:00
SangBin Cho
5959669a70
[Core] Remove task table. (#21188)
Remove the task table, which is not used anymore.
2021-12-20 06:22:01 -08:00
DK.Pino
33a45e55df
Revert "Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)" (#21152)
* Revert "Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)"

This reverts commit 02465a6792.

* fix flaky ut
2021-12-20 00:32:42 -08:00
SangBin Cho
02465a6792
Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)
This PR makes pg_test_2 flaky. cc @clay4444 can you re-merge it?
2021-12-17 00:13:26 -08:00
Guyang Song
32cf19a881
[runtime env] add and remove uri reference in worker pool (#20789)
Currently, the logic of uri reference in raylet is:
- For job level, add uri reference when job started and remove uri reference when job finished.
- For actor level, add and remove uri reference for detached actor only.

In this PR, the logic is optimized to:
- For job level, check if runtime env should be installed eagerly first. If true, add or remove uri reference. 
- For actor level
    * First, add a URI reference for the starting worker process, to avoid the runtime env being GC'd before the worker registers.
    * Second, add a URI reference for each worker thread of the worker process. We will remove the reference when the worker disconnects.

- Besides, we move the instance of `RuntimeEnvManager` from `node_manager` to `worker_pool`.
- Enable the test `test_actor_level_gc` and add some tests in python and worker pool test.
2021-12-16 01:00:05 -08:00
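A minimal sketch of the reference-counting scheme described above, with hypothetical names: one reference covers the window between starting the worker process and worker registration, and one reference per worker lives until disconnect.

```python
class UriReferenceTable:
    def __init__(self):
        self.refs = {}  # uri -> reference count

    def add_ref(self, uri):
        self.refs[uri] = self.refs.get(uri, 0) + 1

    def remove_ref(self, uri):
        self.refs[uri] -= 1
        if self.refs[uri] == 0:
            del self.refs[uri]
            print(f"GC runtime env {uri}")


table = UriReferenceTable()
table.add_ref("gcs://env.zip")     # worker process starting: env can't be GC'd yet
table.add_ref("gcs://env.zip")     # worker registered
table.remove_ref("gcs://env.zip")  # startup reference released
table.remove_ref("gcs://env.zip")  # worker disconnected -> env is GC'd
```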
Yi Cheng
a778741db6
[gcs] Update constructor of gcs client (#21025)
GcsClient previously accepted only Redis. To make it work without Redis, we need to be able to pass the GCS address to the GCS client as well.

In this PR, we add GCS-related info into GcsClientOptions so that we can connect to the GCS directly with the GCS address.

This PR is part of GCS bootstrap. In the following PR, we'll add functionality to set the correct GcsClientOptions based on flags.
2021-12-16 00:19:37 -08:00
DK.Pino
1edf4ab041
[Placement Group] Make placement group prepare resource rpc request batched (#20897)
This is one part of the refactor in #20715: make the prepare resource RPC requests batched per node.
2021-12-15 22:32:50 -08:00
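A minimal sketch of per-node batching, with hypothetical names: bundles are grouped by node so each node receives one prepare-resources request instead of one per bundle.

```python
from collections import defaultdict


def batch_prepare_requests(bundles):
    """bundles: list of (node_id, bundle_spec) pairs -> one request per node."""
    per_node = defaultdict(list)
    for node_id, bundle in bundles:
        per_node[node_id].append(bundle)
    return dict(per_node)


bundles = [("node-1", {"CPU": 1}), ("node-1", {"CPU": 2}), ("node-2", {"GPU": 1})]
for node_id, specs in batch_prepare_requests(bundles).items():
    print(f"PrepareBundleResources({node_id}, {specs})")  # 2 RPCs instead of 3
```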
Chen Shen
03e05df9cb
[Core] fix wrong memory size reporting #21089
The current resource reporting is wrong in OSS; revert the change. For example, it reported

InitialConfigResources: {node:172.31.45.118: 1.000000}, {object_store_memory: 468605759.960938 GiB},

for a 10GB-memory object store.
2021-12-15 10:24:35 -08:00
SangBin Cho
2878161a28
[Core] Properly implement some blocking RPCs with promise. Actor + KV store (#20849)
This PR implements gRPC timeout for various blocking RPCs.

Previously, the timeout with promise didn't work properly because the client didn't cancel the timed out RPCs. This PR will properly implement RPC timeout.

This PR supports:

- Blocking RPCs for core APIs, creating / getting / removing actor + pg.
- Internal KV ops

The global state accessor also has the infinite blocking calls which we need to fix. But fixing them requires a huge refactoring, so this will be done in a separate PR. 

Same for the placement group calls (they will be done in a separate PR)

Also, this means we can have scenarios where the client receives the DEADLINE_EXCEEDED error but the handler is still invoked. Right now, this is not handled correctly in Ray. We should start thinking about how to handle these scenarios better.
2021-12-15 06:46:43 -08:00
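A minimal Python sketch of the problem the commit above fixes (Ray's version lives in the C++ GCS RPC client): waiting on a promise with a deadline is not enough, the timed-out call must also be cancelled so it can't complete later behind the caller's back.

```python
import concurrent.futures
import time


def blocking_rpc(executor, fn, timeout_s):
    fut = executor.submit(fn)
    try:
        return fut.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        fut.cancel()  # best-effort here; a real client must cancel the gRPC call
        raise TimeoutError("RPC deadline exceeded")


with concurrent.futures.ThreadPoolExecutor() as pool:
    try:
        blocking_rpc(pool, lambda: time.sleep(1), timeout_s=0.1)
    except TimeoutError as err:
        print(err)
```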
Edward Oakes
10947c83b3
[runtime_env] Make pip installs incremental (#20341)
Uses a direct `pip install` instead of creating a conda env to make pip installs incremental to the cluster environment.

Separates the handling of `pip` and `conda` dependencies.

The new `pip` approach still works if only the base Ray is installed on the cluster and the user specifies libraries like "ray[serve]" in the `pip` field.  The mechanism is as follows:
- We don't actually want to reinstall ray via pip, since this could lead to version mismatch issues.  Instead, we want to use the Ray that's already installed in the cluster.
- So if "ray" was included by the user in the pip list, remove it
- If a library "ray[serve]" or "ray[tune, rllib]" was included in the pip list, remove it and replace it by its dependencies (e.g. "uvicorn", "requests", ..)

Co-authored-by: architkulkarni <arkulkar@gmail.com>
Co-authored-by: architkulkarni <architkulkarni@users.noreply.github.com>
2021-12-14 15:55:18 -08:00
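A minimal sketch of the pip-list rewriting described above; the extras-to-dependencies mapping below is a made-up stand-in, not Ray's real dependency list:

```python
RAY_EXTRA_DEPS = {  # hypothetical mapping for illustration only
    "serve": ["uvicorn", "requests"],
    "tune": ["pandas"],
    "rllib": ["gym"],
}


def rewrite_pip_list(pip_list):
    out = []
    for req in pip_list:
        name, _, extras = req.partition("[")
        if name.strip() != "ray":
            out.append(req)  # non-ray requirements pass through untouched
            continue
        # Drop "ray" itself; expand "ray[a, b]" into the extras' dependencies.
        for extra in extras.rstrip("]").split(","):
            if extra.strip():
                out.extend(RAY_EXTRA_DEPS.get(extra.strip(), []))
    return out


print(rewrite_pip_list(["numpy", "ray", "ray[serve]", "ray[tune, rllib]"]))
# ['numpy', 'uvicorn', 'requests', 'pandas', 'gym']
```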
newmanwang
42a108ff60
[gcs] Fix cannot reuse actor name in same namespace (#21053)
Before this PR, GcsActorManager::CreateActor() would replace an actor's namespace with the
actor's owner job's namespace, even if the actor was created by the user with a user-specified
namespace. But in named_actors_, the actor is registered under the user-specified namespace by
GcsActorManager::RegisterActor before CreateActor() is called. As a result,
GcsActorManager::DestroyActor failed to find the actor in named_actors_ by the owner job's
namespace in order to remove it, so reusing the actor name in the same namespace failed because
the same-name actor was never removed from named_actors_ by GcsActorManager::DestroyActor.

issue #20611
2021-12-14 14:20:49 -08:00
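A minimal sketch of the bug, with hypothetical names: if registration and destruction use different namespaces for the same actor, the name is never freed and cannot be reused.

```python
named_actors = {}  # (namespace, name) -> actor_id


def register_actor(namespace, name, actor_id):
    key = (namespace, name)
    if key in named_actors:
        raise ValueError(f"actor name {name!r} is already taken in {namespace!r}")
    named_actors[key] = actor_id


def destroy_actor(namespace, name):
    named_actors.pop((namespace, name), None)


register_actor("user_ns", "worker", "actor-1")
destroy_actor("job_ns", "worker")   # the bug: wrong namespace, silent no-op
try:
    register_actor("user_ns", "worker", "actor-2")  # name reuse fails
except ValueError as err:
    print(err)
destroy_actor("user_ns", "worker")  # the fix: destroy under the same namespace
register_actor("user_ns", "worker", "actor-2")      # reuse now succeeds
```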
SangBin Cho
5665b69fff
[Internal Observability] Record GCS debug stats to metrics (#20993)
Streamline all existing GCS debug state to metrics.
2021-12-14 14:19:37 -08:00
SangBin Cho
7baf62386a
[Core] Shorten the GCS dead detection to 60 seconds instead of 10 minutes. (#20900)
Currently, when a GCS RPC fails with a gRPC unavailable error because the GCS is dead, it will retry forever.

b3a9d4d87d/src/ray/rpc/gcs_server/gcs_rpc_client.h (L57)

And it takes about 10 minutes to detect the GCS server failure, meaning if GCS is dead, users will notice in 10 minutes.

This can easily cause confusion that the cluster is hanging (since users are not that patient). Also, since GCS is not fault tolerant in OSS now, 10 minutes is too long a timeout for detecting GCS death.

This PR changes the value to 60 seconds, which I believe is much more reasonable (since this is the same value as our blocking RPC call timeout).
2021-12-14 07:50:45 -08:00
WanXing Wang
72bd2d7e09
[Core] Support back pressure for actor tasks. (#20894)
Resubmit the PR https://github.com/ray-project/ray/pull/19936

I've figured out that the test case `//rllib:tests/test_gpus::test_gpus_in_local_mode` failed due to a deadlock in local mode.
In local mode, if the user code submits another task during the execution of the current task, `CoreWorker::actor_task_mutex_` may cause a deadlock.
The solution is quite simple: release the lock before executing the task in local mode.

In the commit 7c2f61c76c:
1. Release the lock in local mode to fix the bug. @scv119 
2. `test_local_mode_deadlock` added to cover the case. @rkooo567 
3. Left a trivial change in `rllib/tests/test_gpus.py` to make `RAY_CI_RLLIB_DIRECTLY_AFFECTED` take effect.
2021-12-13 23:56:07 -08:00
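A minimal Python sketch of the deadlock pattern and the fix described above, with hypothetical names: user code that submits a task from inside a running task would re-acquire the same non-reentrant mutex, so the lock must be released before executing the task.

```python
import threading


class LocalModeExecutor:
    def __init__(self):
        self.mutex = threading.Lock()  # non-reentrant, like a C++ mutex

    def submit(self, task):
        with self.mutex:
            pass        # bookkeeping happens under the lock...
        task()          # ...but user code runs with the lock released (the fix)


ex = LocalModeExecutor()
# If task() ran while self.mutex was held, the nested submit below would
# block forever on the same lock.
ex.submit(lambda: ex.submit(lambda: print("nested task ran")))
```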
Yi Cheng
30d3115c45
[gcs] print log for storage setup of gcs (#21013)
In this PR, logs are printed so that we can check which GCS storage setup the cluster is using. This is useful for debugging.
2021-12-13 14:02:45 -08:00
Matti Picus
6c6c76c3f0
Starting workers map (#20986)
PR #19014 introduced the idea of a StartupToken to uniquely identify a worker via a counter. This PR:
- Returns the Process and the StartupToken from StartWorkerProcess (previously only the Process was returned)
- Changes the starting_workers_to_tasks map to index via the StartupToken, which seems to fix the Windows failures.
- Unskips the Windows tests in test_basic_2.py
It seems once a fix to PR #18167 goes in, the starting_workers_to_tasks map will be removed, which should remove the need for the changes to StartWorkerProcess made in this PR.
2021-12-12 19:28:53 -08:00
Yi Cheng
f4e6623522
Revert "Revert "[core] Ensure failed to register worker is killed and print better log"" (#21028)
Reverts ray-project/ray#21023
Revert this one since 7fc9a9c227 has fixed the issue
2021-12-11 20:49:47 -08:00
mwtian
3028ba0f98
[Core][GCS] add feature flag for GCS bootstrapping, and flag to pass GCS address to raylet (#21003) 2021-12-10 23:48:37 -08:00
Jiajun Yao
f04ee71dc7
Fix driver lease request infinite loop when local raylet dies (#20859)
Currently, if the local lease request fails due to raylet death, direct_task_transport.cc will retry forever for the driver.

With this PR, we treat gRPC unavailable as a non-retryable error (the assumption is that local gRPC is always reliable, so a gRPC unavailable error indicates that the server is dead) and will just fail the task.

Note: this PR doesn't try to address a bigger problem: don't crash the driver when the local raylet dies. We have multiple places in the code that assume the local raylet never fails and have CHECK_STATUS_OK for that. All these places need to be changed so we can properly propagate failures to the user.
2021-12-10 18:02:59 -08:00
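A minimal sketch of the policy change, with hypothetical names: UNAVAILABLE from the *local* raylet is classified as non-retryable because the local connection is assumed reliable, so unavailability means the server is dead.

```python
RETRYABLE_CODES = {"RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"}  # illustrative


def should_retry_lease_request(grpc_code, is_local_raylet):
    if grpc_code == "UNAVAILABLE" and is_local_raylet:
        return False  # local raylet is reliable, so UNAVAILABLE means it died
    return grpc_code in RETRYABLE_CODES


assert not should_retry_lease_request("UNAVAILABLE", is_local_raylet=True)
assert should_retry_lease_request("DEADLINE_EXCEEDED", is_local_raylet=True)
```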
Qing Wang
a3bf1af10e
[core] Fix the risk of iterator invalidation issue. (#20989)
We erase elements from object_id_refs_ in the method `RemoveLocalReferenceInternal()`, which may cause an iterator invalidation issue.

Note that normally a flat map will not trigger any iterator invalidation unless it triggers a `rehash()`. But in this case, we may remove other elements (not only the one the current iterator points to), so there is still a risk.
2021-12-10 15:59:16 -08:00
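The same hazard expressed in Python (the commit's fix is in C++, but the safe shape is analogous): collect the keys to erase first, then erase, instead of mutating the map mid-iteration.

```python
object_id_refs = {"a": 0, "b": 2, "c": 0}

# Buggy shape: deleting inside `for oid in object_id_refs:` raises
# RuntimeError in Python; in a C++ flat map it can invalidate live iterators.
doomed = [oid for oid, count in object_id_refs.items() if count == 0]
for oid in doomed:
    del object_id_refs[oid]

assert object_id_refs == {"b": 2}
```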
Stephanie Wang
3a5dd9a10b
[core] Pin object if it already exists (#20447)
A worker can crash right after putting its return values into the object store. Then, the owner will receive the worker crashed error, but the return objects will still be in the remote object store. Later, if the task is retried, the worker will crash on [this line](https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L105) because the object already exists.

Another way this can happen is if a task has multiple return values, and one of those return values is transferred to another node. If the task is later re-executed on that node, the task will fail because of the same error.

This PR fixes the crash so that:
1. If an object already exists, we try to pin that copy. Ideally, we should destroy the old copy and create the new one to make sure that metadata like the owner address is in sync, but this is pretty complicated to do right now.
2. If the pinning fails, we store an OBJECT_LOST error to throw to the application.
3. On the raylet, we check whether we already have the object pinned, and only subscribe to the owner's eviction message if the object is not pinned.
4. Also fixes bugs in the analogous case for `ray.put` (previously this would hang, now the application will receive an error if a `ray.put` object already exists).
2021-12-10 15:56:43 -08:00
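A minimal sketch of the decision flow in points 1 and 2 above, with hypothetical names: on a retried task, a return object may already exist, so the existing copy is pinned, and a failure to pin becomes an OBJECT_LOST error rather than a crash.

```python
pinned = set()


def pin(object_id):
    pinned.add(object_id)


def put_return_object(store, object_id, value):
    if object_id in store:
        # Object survived an earlier worker crash: pin the existing copy
        # instead of crashing on "object already exists".
        try:
            pin(object_id)
        except RuntimeError:
            store[object_id] = "OBJECT_LOST"  # surfaced to the application
        return
    store[object_id] = value
    pin(object_id)


store = {"obj-1": 42}  # left behind by the crashed worker
put_return_object(store, "obj-1", 42)
assert "obj-1" in pinned
```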
Yi Cheng
6280bc4391
Revert "[core] Ensure failed to register worker is killed and print better log" (#21023)
`linux://python/ray/tests:test_runtime_env_complicated` looks flaky after this PR.
Reverts ray-project/ray#20964
2021-12-10 14:57:32 -08:00
Yi Cheng
2ed5b1ee07
[2/gcs-mem-kv] Use memory store client when flag is set (#20931)
This is part of the Redis removal. In this PR, if `RAY_gcs_storage=memory`, the GCS will use the in-memory table instead of the Redis table.
The config setup has to be moved into GcsServer because with the memory table it's transient.
2021-12-09 22:41:05 -08:00
mwtian
2410ec5ef0
[Core][Dashboard Pubsub 1/n] Allow a channel to have subscribers to a key and to the whole channel concurrently (#20954)
For the actor channel, GCS clients subscribe to a single actor, but the dashboard subscribes to all actors. This change makes it possible to support both concurrently.

Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together.

Also, add the basic support for dashboard reporter pubsub.
2021-12-09 15:00:38 -08:00
SangBin Cho
f4d46398f7
[Internal Observability] [Part 2] Share the same code for RecordMetrics & DebugString for cluster task manager. (#20958)
Share the same code for RecordMetrics & DebugString for cluster task manager.

Both require an almost identical (and also expensive) operation. This PR makes them share the same `UpdateState` code, which stores the stats in a struct.

Note that we don't update the state when metrics are recorded, because the debug string is called consistently anyway and the state is updated there.

Ideally, we should dynamically update the stats.
2021-12-09 14:24:33 -08:00
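A minimal sketch of the sharing described above, with hypothetical names: a single `update_state()` pass fills a stats struct that both the metrics path and the debug string read.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    num_waiting: int = 0
    num_running: int = 0


class ClusterTaskManager:
    def __init__(self):
        self.waiting, self.running = [], []
        self.stats = TaskStats()

    def update_state(self):
        # The single (expensive) pass over internal state.
        self.stats = TaskStats(len(self.waiting), len(self.running))

    def debug_string(self):
        self.update_state()
        return f"waiting={self.stats.num_waiting} running={self.stats.num_running}"

    def record_metrics(self):
        # Reads the stats refreshed by debug_string(), which is called
        # consistently anyway, instead of recomputing them here.
        return self.stats


m = ClusterTaskManager()
m.waiting.append("t1")
print(m.debug_string())  # waiting=1 running=0
```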