Commit graph

2581 commits

Author SHA1 Message Date
Yi Cheng
3c63a8410d
[gcs/ha] Fix java related error when enable redisless ray (#21692)
This PR enables Ray Java to run without Redis. It also fixes Java-related tests and updates the pipeline.
2022-01-20 13:56:25 -08:00
Eric Liang
5065156dd9
Set task retry delay to zero (#21690) 2022-01-19 23:41:35 -08:00
SangBin Cho
e3357eb9e5
[Internal Observability] Fix the event stats segfault 1/2 (#21593)
This PR is preparatory work before actually fixing a thread-safety bug within shutdown.

It does the following:
- Add better logging upon core worker shutdown.
- Improve documentation around core worker shutdown.
- Remove unnecessary pointer usage from the periodical runner for a clean destruction order.
- Remove the unnecessary `WaitForShutdown` API and combine them into a single `Shutdown` API.
2022-01-19 23:08:54 -08:00
Hao Chen
8dcc07ec9c
[Fix][Locality] ref count should remove object locations for dead nodes (#21548)
When a node dies, the reference table should remove the locations of the objects on that node. Otherwise, locality-aware scheduling will schedule tasks to the dead node.
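
A minimal sketch of the idea, assuming a toy location table that maps object ids to the set of nodes holding a copy. The class and method names (`ReferenceTable`, `on_node_dead`, `best_node`) are hypothetical and are not Ray's actual API:

```python
from collections import defaultdict


class ReferenceTable:
    """Toy location table: object id -> set of node ids holding a copy."""

    def __init__(self):
        self.locations = defaultdict(set)

    def add_location(self, object_id, node_id):
        self.locations[object_id].add(node_id)

    def on_node_dead(self, node_id):
        # Drop the dead node from every object's location set.
        for nodes in self.locations.values():
            nodes.discard(node_id)

    def best_node(self, object_ids):
        # Locality-aware choice: the node holding the most task arguments.
        counts = defaultdict(int)
        for object_id in object_ids:
            for node in self.locations.get(object_id, ()):
                counts[node] += 1
        return max(counts, key=counts.get) if counts else None
```

Without the `on_node_dead` step, `best_node` would keep returning a node that no longer exists.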
2022-01-20 11:58:52 +08:00
Wilson Wang
2626c64060
Fix monitor.py exceptions. Enable fetching GCS address from Redis with retries. (#21533)
GCS, when running as an individual component, can cause other components to fail in case of crashes. 

Here are two main cases covered in this patch:

1. monitor.py will raise an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing KV of GCS address can cause other components to fail to start.


In our patch, we fix these two issues and also increase the timeout for the Redis connection, which was too small.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-19 18:48:03 -08:00
Stephanie Wang
bab7cd6388
[core] Fix race condition between object free and duplicate creation (#21364)
An object can get created/pinned twice if the original worker fails mid-task, or when lineage reconstruction is enabled. This can cause inconsistencies in the LocalObjectManager if the second creation races with object spilling and/or object free. For example:
1. Object X gets created, then is pending spill.
2. Object X is freed by original owner because it goes out of scope.
3. Task that created X gets re-executed due to failure.
4. Task recreates X, which can now get spilled again while the original copy is also being spilled/freed.

This PR better enforces the state machine for objects managed by the LocalObjectManager. An object can be either: pinned, pending spill, or spilled. If we receive a free message from the owner, we do not delete the object metadata until all shared-memory and spilled copies of the object are removed.
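
The state machine described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the LocalObjectManager code: the class `LocalObjects` and its methods are hypothetical, and rejecting a duplicate pin while metadata still exists is an assumption, since the message does not say how the duplicate is handled.

```python
from enum import Enum


class State(Enum):
    PINNED = "pinned"
    PENDING_SPILL = "pending_spill"
    SPILLED = "spilled"


class LocalObjects:
    def __init__(self):
        self.state = {}     # object id -> State (the object metadata)
        self.freed = set()  # ids for which the owner sent a free message

    def pin(self, oid):
        # Assumption: a duplicate creation is rejected while metadata exists.
        if oid in self.state:
            return False
        self.state[oid] = State.PINNED
        return True

    def start_spill(self, oid):
        assert self.state[oid] is State.PINNED
        self.state[oid] = State.PENDING_SPILL

    def finish_spill(self, oid):
        assert self.state[oid] is State.PENDING_SPILL
        self.state[oid] = State.SPILLED
        if oid in self.freed:
            # The free arrived mid-spill; all copies can now be removed.
            self._delete(oid)

    def free(self, oid):
        self.freed.add(oid)
        # Defer metadata deletion while a spill is still in flight.
        if self.state.get(oid) is not State.PENDING_SPILL:
            self._delete(oid)

    def _delete(self, oid):
        self.state.pop(oid, None)
        self.freed.discard(oid)
```

In the four-step race listed above, the free in step 2 no longer erases the metadata, so the re-created object in step 4 cannot start a second, conflicting spill.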
2022-01-19 17:58:07 -08:00
mwtian
d3e7abb3c9
[GCS] use separate event loop for GCS pubsub (#21675)
Use a separate event loop for pubsub work, to provide some isolation from other workloads. There are no benchmark results, but the downside, if any, should not be large.
2022-01-19 17:39:50 -08:00
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixes the cpp tests and also makes ray cpp able to pass.
2022-01-19 01:26:34 -08:00
mwtian
5893a9eddb
[GCS] enable GCS pubsub by default (#21673)
Turn the flags on by default.
2022-01-18 12:04:53 -08:00
Jiajun Yao
25e62d85bd
[LOGGING][RFC] Add RAY_CHECK_OP (#21607) 2022-01-18 11:38:26 -08:00
Qing Wang
6f82bff7ff
[Java] Change ActorLifetime API: DEFAULT -> NON_DETACHED (#21639)
This PR changes the enum value `ActorLifetime.DEFAULT` to `ActorLifetime.NON_DETACHED`. `ActorLifetime` has not been introduced in any release version up to and including 1.9.2.

Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-17 18:10:12 +08:00
jon-chuang
5f7224bd51
[C++ API] fix wrong arg handling for object references in TaskExecutor, TaskArgByReference (#21236)
Previously, a ref arg was handled incorrectly: the object ref was serialized, instead of the `RayObject`, and passed as the args buffer to the user function.

That was wrong because CoreWorker is the component responsible for ensuring that all ObjectReferences are resolved and serialized into `RayObject`s at the time of the `task_execution_callback` invocation, not any component downstream of the callback.

This resulted in the following error for large objects, which are not turned into `TaskArg::value` because they are over 100KB:
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught due to lack of testing for large objects, which has now been added.
2022-01-17 12:08:15 +08:00
Yi Cheng
927c5467eb
[gcs/function table] Change function table keys' prefix from binary to hex (#21616)
When cleaning up the function table, we use the prefix to delete the data. But right now the prefix contains binary data, which does not work well with Redis keys/scan, since they treat `*` in the pattern as a wildcard.

For example, when the job id increases to 41, it'll delete the keys for job 1, which leads to the new worker failing to import the function.

This PR uses hex of job id to avoid this.
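
A small sketch of why a binary prefix is unsafe in a glob pattern. It uses Python's `fnmatch` as a stand-in for Redis glob matching, and the `fn:` key layout and one-byte job id encoding are hypothetical. The job id at which this happens in Ray depends on its real id encoding, which this sketch does not model; 42 is used only because its byte value is the ASCII code of `*`.

```python
from fnmatch import fnmatchcase


def binary_prefix(job_id: int) -> str:
    # Hypothetical layout: the raw byte of the job id is embedded in the key.
    return "fn:" + chr(job_id) + ":"


def hex_prefix(job_id: int) -> str:
    # Same layout, but the job id is written as two hex digits.
    return "fn:" + format(job_id, "02x") + ":"


def keys_deleted(keys, prefix):
    # Delete-by-pattern: everything matching "<prefix>*" is removed.
    pattern = prefix + "*"
    return [k for k in keys if fnmatchcase(k, pattern)]


binary_keys = [binary_prefix(1) + "f", binary_prefix(42) + "g"]
hex_keys = [hex_prefix(1) + "f", hex_prefix(42) + "g"]

# With a binary prefix, job 42's byte is "*", so job 1's key matches too.
print(keys_deleted(binary_keys, binary_prefix(42)))
# With a hex prefix, only job 42's own key matches.
print(keys_deleted(hex_keys, hex_prefix(42)))
```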
2022-01-15 21:58:14 -08:00
Jialing He
ded4128ebf
[Core] dlmalloc allocate bottom-most memory chunk failed (#21439)
Why are these changes needed?
Fix a dlmalloc allocation bug; details are in #21310.
* fix dlmalloc bug

* make lint happy

* make lint happy

* fix by comment

* use _check_spilled_mb

* add cpp UT
2022-01-13 23:53:29 -08:00
Stephanie Wang
1df67eb977
[core] Avoid ObjectID collisions for re-executed tasks (#21395)
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.

This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
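
As a rough illustration, assuming ids are derived by hashing a seed: the function below is hypothetical, and the SHA-1 scheme and byte layout are assumptions for the sketch, not Ray's actual id derivation.

```python
import hashlib


def object_id(task_id: bytes, attempt: int, index: int) -> str:
    """Derive a deterministic id for the index-th object a task creates."""
    h = hashlib.sha1()
    h.update(task_id)
    # Mixing in the attempt number separates re-executions of the same task.
    h.update(attempt.to_bytes(4, "little"))
    h.update(index.to_bytes(4, "little"))
    return h.hexdigest()[:16]


first_try = object_id(b"task-1", attempt=0, index=1)
retry = object_id(b"task-1", attempt=1, index=1)
```

Ids stay deterministic within one attempt, which preserves reproducibility, while a retry no longer reuses the ids of objects from the previous attempt.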

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-13 18:18:55 -08:00
Yi Cheng
e4ba51f25b
[core] Add GC for function table (#21509)
In Ray, functions are exported to the function table at runtime, but they are not cleaned up after use. This PR garbage collects the resource when no job or detached actor references it.

Ideally, we should move the function table import/export feature to core, so a gcs function manager is introduced; currently, it is used for reference counting only.
2022-01-13 18:06:05 -08:00
mwtian
30968a9358
[GCS] support external Redis in GCS bootstrapping mode (#21436)
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic is removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.

Also, re-enable a few tests that restart GCS in GCS bootstrapping mode, by using external Redis for KV storage.
2022-01-13 16:01:11 -08:00
Jiajun Yao
d6dbf3b8bf
[scheduler] Set default max_pending_lease_requests_per_scheduling_category to 10 (#20404) 2022-01-13 13:50:56 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
SangBin Cho
f5fdbeb594
Refactor event tracker out of asio class (#21215)
This refactors the event tracker to be decoupled from the asio class.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-01-12 22:43:31 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default. This PR also fixes several tests that were failing before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
Clark Zinzow
7a1aaac86c
[Core] Small comment/docstring fixes in cluster task manager header. (#21539) 2022-01-12 19:35:38 -08:00
Guyang Song
0627f841b2
[runtime env][observability]print debug string for runtime env uri reference table (#21309)
The debug log looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529305-89b01151-7d76-4fda-89ed-0e13802207b3.png)

The debug state looks like this:
![image](https://user-images.githubusercontent.com/26714159/148529369-60222b99-595a-441d-8fe6-fb3e6ae13ac2.png)
2022-01-12 08:33:53 +00:00
Jiajun Yao
25035152bc
Fix SchedulingClassInfo.running_tasks memory leak (#21535)
In some cases, the task that's added to the `running_tasks` is never removed and introduces wait time for all the following tasks due to worker cap. One such case is lease request cancellation: the request is cancelled after `PopWorker` is called and the task is never removed from `running_tasks`.
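
A toy model of the leak, assuming a per-scheduling-class cap on running tasks. `SchedulingClassInfo` here is a simplified stand-in, and `lease_worker` with its `fixed` switch is hypothetical, added only to contrast the behavior before and after the fix:

```python
class SchedulingClassInfo:
    def __init__(self, cap):
        self.cap = cap
        self.running_tasks = set()

    def can_run(self):
        # Worker cap: new tasks wait once the set is full.
        return len(self.running_tasks) < self.cap


def lease_worker(info, task_id, cancelled, fixed=True):
    # The task is recorded as running when the worker is requested.
    info.running_tasks.add(task_id)
    if cancelled:
        if fixed:
            # The fix: a cancelled request must also release its slot.
            info.running_tasks.discard(task_id)
        return False
    return True
```

With `fixed=False`, a single cancelled request permanently occupies a slot, so every later task in that class waits on the cap.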
2022-01-11 23:13:27 -08:00
SangBin Cho
097706b35d
[Internal Observability] Re-enable event stats again. (#21515)
I tried reproducing the many pg mini integration failure from this PR: https://github.com/ray-project/ray/pull/21216, but I could not. (This was the only test that became flaky when we turned on the flag last time.)

I tried:
- Running tests:test_placement_group_mini_integration 5 times instead of 3 (the default).
- Re-running the PR 3 times.

So I think it is worth trying to re-enable it.
2022-01-11 09:00:27 -08:00
Jiajun Yao
aec37d4b60
Add container utils (#21444)
- Add debug_string helper functions for common containers.
- Add map_find_or_die helper function
2022-01-10 15:29:29 -08:00
Yi Cheng
4ab059eaa1
[gcs] Fix the server standalone tests in HA mode (#21480)
CoreWorker hangs before exiting if gcs exits first, due to incorrect ordering of destruction. This PR fixes this: it stops the gcs client first and then joins the thread.
2022-01-07 22:54:50 -08:00
Yi Cheng
bdfba88082
[2/3][kv] Add delete by prefix support for internal kv (#21442)
Delete by prefix for internal kv is necessary to clean up the function table. This will be used to fix issue #8822.
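
A minimal in-memory sketch of delete-by-prefix semantics. The `InternalKV` class and the `del_by_prefix` parameter name are assumptions for illustration, not the actual internal kv interface:

```python
class InternalKV:
    """Toy key-value store with optional delete-by-prefix."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def delete(self, key: bytes, del_by_prefix: bool = False) -> int:
        # Collect matching keys first so the dict is not mutated mid-iteration.
        if del_by_prefix:
            doomed = [k for k in self._data if k.startswith(key)]
        else:
            doomed = [key] if key in self._data else []
        for k in doomed:
            del self._data[k]
        return len(doomed)
```

Cleaning up one job's functions then becomes a single call with that job's key prefix instead of one delete per key.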
2022-01-07 22:54:24 -08:00
Clark Zinzow
d6c02f46b9
Fix raylet command line arg descriptions. (#21478) 2022-01-07 21:46:36 -08:00
Yi Cheng
8fa9fddaa0
[1/3][kv] move some internal kv py logic into cpp (#21386)
This PR moves the internal kv namespace logic into cpp to reduce logic in Python, for the following reasons:

- Internal kv is used across languages, so we have to move it to cpp so that all languages can benefit.
- For https://github.com/ray-project/ray/issues/8822, we need to delete resources in gcs when a job finishes.

One extra field is also added to del so that we can delete by prefix instead of just by a single key.
2022-01-07 17:35:06 -08:00
Alex Wu
8cf4071759
[core] Nested tasks on by default (#20800)
This PR turns worker capping on by default. Note that there are a couple of faulty tests that this uncovers which are fixed here.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-06 15:00:03 -08:00
Jiajun Yao
48a5208645
Refactor ObjectManager wait logic to WaitManager (#21369)
- This PR moves the `ObjectManager::Wait` related logic to a separate WaitManager class.
- Fix the wait hang issue by no longer relying on the async object location notification. Instead, check whether the wait is complete when the local object is added; at that time the object is guaranteed to be local.
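
The second point can be sketched as follows. This is a simplified illustration, not the real WaitManager: the method names and the callback signature are hypothetical.

```python
class WaitManager:
    def __init__(self):
        self.local = set()  # ids of objects known to be local
        self.pending = []   # (object_ids, num_required, callback) tuples

    def wait(self, object_ids, num_required, callback):
        request = (list(object_ids), num_required, callback)
        if not self._try_complete(request):
            self.pending.append(request)

    def on_local_object_added(self, object_id):
        # Completion is checked here, when the object is certainly local,
        # rather than on a separate asynchronous location notification.
        self.local.add(object_id)
        self.pending = [r for r in self.pending if not self._try_complete(r)]

    def _try_complete(self, request):
        object_ids, num_required, callback = request
        ready = [o for o in object_ids if o in self.local]
        if len(ready) < num_required:
            return False
        callback(ready)
        return True
```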
2022-01-06 10:42:31 -05:00
Qing Wang
132e2b2a96
[Core] Remove unused flag put_small_object_in_memory_store (#21284)
Since we have not been using the `put_small_object_in_memory_store` flag for a long time, it should be removed.
2022-01-06 14:46:58 +08:00
Qing Wang
3c68370fcf
[Core] Cache job_configs instead of ray_namespace. (#21279)
We need more than just the ray_namespace config of a job. In this PR, we cache the job_configs instead of ray_namespaces so that other PRs can use them (for example, #21249 needs the num_java_worker_pre_process item).

Also, before this PR, the ray_namespaces_ cache was never cleared; this PR clears it.
2022-01-05 17:48:06 -08:00
Lixin Wei
64a2ba47d3
[Core] Rename PublisherService to SubscriberService (#20666)
`PublisherClient` is a more reasonable name than `SubscriberClient`, since XClient means ‘client used to access X’, like GcsClient.

Besides, the current codebase already calls this client `publisher_client` (lines 329/333) while the actual class name is `SubscriberClient`, which is inconsistent.
a8d7897a56/src/ray/pubsub/subscriber.cc (L326-L339)
2022-01-05 05:40:45 -08:00
SangBin Cho
94af7ccc92
[Actor exception message improvement] Unify the schema + improve error messages. (#21219)
This PR addresses this comment: https://github.com/ray-project/ray/pull/20903#discussion_r772635662

The PR:
- Unifies the multiple actor-died errors into a single schema (runtime env and creation task exceptions cannot be unified).
- Improves each actor error message to include more metadata.
- Includes actor information in the actor death cause.
2022-01-04 23:22:57 -08:00
mwtian
70db5c5592
[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232)
After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster.

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2022-01-04 23:06:44 -08:00
Tao Wang
b9106483af
[Core]Clear the unnecessary fields before broadcasting (#20965)
Only `resource_available` and `resource_total` are used in raylet, so let's clear the rest before broadcasting.
2022-01-03 15:56:41 -08:00
Qing Wang
340fbf53c0
[Java] Support actor handle reference counting. (#21249) 2022-01-01 10:26:22 +08:00
Jiajun Yao
9776e21842
Revert "Round robin during spread scheduling (#19968)" (#21293)
This reverts commit 60388b2834.
2021-12-30 10:33:06 +09:00
WanXing Wang
e5920dee8e
[Core]Refine StealTasks rpc. (#21258)
The `StealTasks` rpc seems no different from other common rpc methods, so it should be implemented with the `VOID_RPC_CLIENT_METHOD` macro. We found this when merging code into our internal codebase.
2021-12-28 14:17:25 +08:00
Qing Wang
2df27a5f87
[Java] Support ActorLifetime (#21074)
We add an enum class ActorLifetime to indicate the lifetime of an actor. In this PR, we also add the necessary API to create an actor with a specified lifetime.
Currently, it has 2 values: detached and default.
2021-12-23 19:48:56 +08:00
Qing Wang
e653d47533
[Java] Shade some widely used dependencies in bazel_jar_jar rule. (#21237)
These dependencies are widely used:
- com.google.common
- com.google.protobuf
- com.google.thirdparty

So we need to shade them to avoid conflicts with jars introduced by users.

In this PR, we introduce a `bazel_jar_jar` rule for doing this and also shade them in the maven pom files.
2021-12-23 16:54:31 +08:00
Jiajun Yao
60388b2834
Round robin during spread scheduling (#19968) 2021-12-22 20:27:34 -08:00
SangBin Cho
99693096d6
[gRPC] Improve blocking call Placement group (#21130)
Use Sync methods with timeout for placement group RPCs
2021-12-22 17:21:56 -08:00
Yi Cheng
f62faca04c
[1/gcs] gcs ha bootstrap for raylet (#21174)
This is part of #21129

This PR tries to cover the cpp/ray part of the bootstrap. Some updates there:

- remove the unused function/tests
- some API updates

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 08:50:42 -08:00
SangBin Cho
5d3042ed9d
[Internal Observability] Record Raylet Gauge (#21049)
* Revert "[Please revert] Remove new metrics temporarily"

This reverts commit baf7846daa3d1dad50dbedac19b7afbae3e197fc.

* Addressed code review.

* [Please revert] Revert plasma stats for the next PR

* improve grammar

* Addressed code review v1.

* Addressed code review.

* Add code owner.

* Fix tests.

* Add code owner to metric_defs.cc
2021-12-21 00:34:48 -08:00
SangBin Cho
5959669a70
[Core] Remove task table. (#21188)
Remove task table that's not used anymore.
2021-12-20 06:22:01 -08:00
DK.Pino
33a45e55df
Revert "Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)" (#21152)
* Revert "Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)"

This reverts commit 02465a6792.

* fix flakey ut
2021-12-20 00:32:42 -08:00
SangBin Cho
02465a6792
Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)
This PR makes pg_test_2 flaky. cc @clay4444 can you re-merge it?
2021-12-17 00:13:26 -08:00