This is part of the Redis removal work. In this PR, if `RAY_gcs_storage=memory` is set, GCS uses an in-memory table instead of the Redis table.
The config setup has to be moved into `GcsServer` because with the memory table the storage is transient.
For the actor channel, GCS clients subscribe to a single actor while the dashboard subscribes to all actors. This change makes supporting both possible.
Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together.
Also adds basic support for dashboard reporter pubsub.
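As a rough sketch of how one channel can serve both per-entity and all-entity subscribers (class and method names here are illustrative, not the actual Ray publisher API):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative only: a channel index that supports both per-entity
// subscribers (GCS clients watching one actor) and all-entity
// subscribers (the dashboard watching every actor).
class ChannelIndex {
 public:
  using Callback = std::function<void(const std::string &message)>;

  void SubscribeToEntity(const std::string &entity_key, Callback cb) {
    per_entity_[entity_key].push_back(std::move(cb));
  }
  void SubscribeToAll(Callback cb) { all_entities_.push_back(std::move(cb)); }

  void Publish(const std::string &entity_key, const std::string &message) {
    for (auto &cb : all_entities_) cb(message);  // dashboard-style
    auto it = per_entity_.find(entity_key);
    if (it != per_entity_.end()) {
      for (auto &cb : it->second) cb(message);   // GCS-client-style
    }
  }

 private:
  std::unordered_map<std::string, std::vector<Callback>> per_entity_;
  std::vector<Callback> all_entities_;
};
```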
Share the same code for `RecordMetrics` & `DebugString` in the cluster task manager.
Both require an almost identical (and expensive) operation. This PR makes them share the same `UpdateState` code, which stores the stats in a struct.
Note that we don't update the state when metrics are recorded, because the debug string is called consistently anyway and keeps the state updated.
Ideally, we should dynamically update the stats.
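A minimal sketch of the shared-state pattern (the stat fields and the metrics sink below are hypothetical, not the real cluster task manager interface):

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Illustrative only: both consumers read from one stats struct that a
// single, expensive UpdateState() pass fills in.
class ClusterTaskManager {
 public:
  std::string DebugString() {
    UpdateState();  // DebugString is called consistently, so refresh here.
    std::ostringstream out;
    out << "waiting: " << stats_.num_waiting_tasks
        << ", running: " << stats_.num_running_tasks;
    return out.str();
  }

  void RecordMetrics() const {
    // Reuses stats_ from the last UpdateState() call instead of
    // recomputing; see the note above about dynamic updates.
    ReportGauge("num_waiting_tasks", stats_.num_waiting_tasks);
    ReportGauge("num_running_tasks", stats_.num_running_tasks);
  }

 private:
  struct Stats {
    int64_t num_waiting_tasks = 0;
    int64_t num_running_tasks = 0;
  };
  void UpdateState() { /* walk the task queues once and fill stats_ */ }
  static void ReportGauge(const std::string &name, int64_t value) {
    // Placeholder for the actual metrics sink.
  }
  Stats stats_;
};
```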
This PR adds RecordMetrics and DebugString to all raylet components.
Some of the methods are probably empty for now. They are going to be supported in the next PR.
Before this PR, when the raylet noticed something wrong with a starting worker, it would start a new worker but not kill the old one. If the old one hangs, this leads to wasted resources.
This PR kills the failed worker if it's still alive and also prints useful logs.
Recently I have been benchmarking worker registration with workers running in containers. Currently Ray core has the `process_startup_time_ms` metric, which covers process fork time.
This PR adds a metric for the duration of worker registration.
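A hedged sketch of the measurement itself (the metric name and reporting helper below are hypothetical, not Ray's actual stats API):

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical sink; Ray's real metric definitions live elsewhere.
inline void RecordHistogram(const std::string &metric_name, int64_t value_ms) {
  // Placeholder for the actual metrics backend.
}

class WorkerRegistrationTimer {
 public:
  // Start the clock when the worker process begins registering.
  WorkerRegistrationTimer() : start_(std::chrono::steady_clock::now()) {}

  // Call once the raylet acknowledges the registration.
  void OnRegistered() const {
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start_);
    RecordHistogram("worker_register_time_ms", elapsed.count());
  }

 private:
  std::chrono::steady_clock::time_point start_;
};
```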
After moving internal KV to gRPC, there is a regression in actor launching performance. This PR moves the internal KV work from the main thread to a dedicated thread to mitigate it.
Co-authored-by: Eric Liang <ekhliang@gmail.com>
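A minimal sketch of the offloading pattern with boost::asio (assuming a dedicated io_context per service; the actual internal KV wiring differs):

```cpp
#include <boost/asio.hpp>
#include <functional>
#include <thread>
#include <utility>

// Runs posted work on its own thread so the main event loop stays free.
class DedicatedThreadRunner {
 public:
  DedicatedThreadRunner()
      : work_guard_(boost::asio::make_work_guard(io_)),
        thread_([this] { io_.run(); }) {}

  ~DedicatedThreadRunner() {
    work_guard_.reset();  // let io_.run() return once the queue drains
    thread_.join();
  }

  // Handlers posted here execute on the dedicated thread, not the caller's.
  void Post(std::function<void()> fn) {
    boost::asio::post(io_, std::move(fn));
  }

 private:
  boost::asio::io_context io_;
  boost::asio::executor_work_guard<boost::asio::io_context::executor_type>
      work_guard_;
  std::thread thread_;
};
```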
This PR centralizes all existing metrics to `metric_defs.h`.
Previously, each file relied on an implicit include of `metric_defs.h` within the stats module. After this PR, each file includes `metric_defs.h` explicitly.
This is part of the Redis removal work. This PR introduces a new mode for internal KV: memory mode.
There are two ways to address this:
- Update store client and use store client in internal kv
- Add memory table into internal kv directly.
The former is actually the better choice since it puts everything related to storage at a lower level. But it's hard to do right now: internal KV uses `hset`/`hget` while the Redis store client uses `set`/`get`, so the data would not be compatible and it would be a breaking change.
So the easier way is the second option, and that's what this PR does.
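A simplified sketch of the memory-mode table with `hset`/`hget`-style namespace semantics (the real implementation is asynchronous and callback-based):

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Two-level map mirroring redis hashes: HSET ns key value / HGET ns key.
class MemoryInternalKV {
 public:
  void Put(const std::string &ns, const std::string &key, std::string value) {
    table_[ns][key] = std::move(value);
  }

  std::optional<std::string> Get(const std::string &ns,
                                 const std::string &key) const {
    auto ns_it = table_.find(ns);
    if (ns_it == table_.end()) return std::nullopt;
    auto key_it = ns_it->second.find(key);
    if (key_it == ns_it->second.end()) return std::nullopt;
    return key_it->second;
  }

  bool Del(const std::string &ns, const std::string &key) {
    auto ns_it = table_.find(ns);
    return ns_it != table_.end() && ns_it->second.erase(key) > 0;
  }

 private:
  std::unordered_map<std::string,
                     std::unordered_map<std::string, std::string>> table_;
};
```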
Next: use the flag for store client
This PR adds a per-task/actor scheduling strategy; currently the only strategies are `PlacementGroupSchedulingStrategy` and `DefaultSchedulingStrategy`.
Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy` to specify the placement group for an actor/task. The old way will be deprecated.
We hit a crash at the `RAY_CHECK` on line 446:
d26c9e67e8/src/ray/object_manager/ownership_based_object_directory.cc (L441-L450)
We found that it's because we didn't set the `node_id` for dead nodes. If there are dead nodes and we try to `LookupRemoteConnectionInfo` on one of them, this crash happens.
This PR fixes this crash.
We can just pickle task options instead of using JSON, so we don't need to write custom `to_dict` and `from_dict` methods for complex Python option objects (e.g., `PlacementGroup`).
Right now in Ray, a lot of gRPC-related edge cases are not tested. This PR is a simple attempt to give developers a way to delay gRPC requests. It can be used for manual testing and also e2e tests, since it supports delays for specific gRPC methods.
To use this feature, simply set the OS environment variable `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"`.
This means `method1` is delayed by 10-20us, `method2` by 20-30us, and all other methods by 200us.
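A sketch of how such a spec could be parsed (parsing only; the actual implementation also hooks the delays into the event loop):

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Parses "method1=10:20,method2=20:30,*=200:200" into
// {method -> (min_delay_us, max_delay_us)}; "*" is the catch-all.
std::unordered_map<std::string, std::pair<uint64_t, uint64_t>> ParseDelaySpec(
    const std::string &spec) {
  std::unordered_map<std::string, std::pair<uint64_t, uint64_t>> delays;
  std::stringstream stream(spec);
  std::string entry;
  while (std::getline(stream, entry, ',')) {
    const auto eq = entry.find('=');
    const auto colon = entry.find(':', eq);
    if (eq == std::string::npos || colon == std::string::npos) continue;
    delays[entry.substr(0, eq)] = {
        std::stoull(entry.substr(eq + 1, colon - eq - 1)),
        std::stoull(entry.substr(colon + 1))};
  }
  return delays;
}
```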
Separate the `CoreWorkerProcess` static functions from the `CoreWorkerProcess` state. Currently the static and non-static state are mixed together, and more importantly the static state is not thread-safe. By separating them and creating a helper class, `CoreWorkerProcessImpl`, for the non-static state, we can make it thread-safe.
In a follow-up PR we will make the `CoreWorkerProcess` state thread-safe.
This PR depends on #19677; the follow-up PR is #19679.
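A rough sketch of the split (simplified; see the PR for the real interface):

```cpp
#include <memory>

// Owns all the mutable, per-process state.
class CoreWorkerProcessImpl {
 public:
  void Start() { /* initialize workers, connections, ... */ }
  void Shutdown() { /* tear everything down */ }
};

// Keeps only static entry points that forward to the single Impl instance.
class CoreWorkerProcess {
 public:
  static void Initialize() {
    instance_ = std::make_unique<CoreWorkerProcessImpl>();
    instance_->Start();
  }
  static void Shutdown() {
    if (instance_) instance_->Shutdown();
    instance_.reset();
  }

 private:
  // The only remaining static member; making access to it thread-safe is
  // the follow-up PR (#19679).
  static std::unique_ptr<CoreWorkerProcessImpl> instance_;
};

std::unique_ptr<CoreWorkerProcessImpl> CoreWorkerProcess::instance_;
```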
Object metadata are fully managed by workers now, so the related protos and logic in GCS are obsolete. Most of the logic has been removed in https://github.com/ray-project/ray/pull/19963. This PR removes some remaining obsolete protos.
This will allow us to pass protobuf-defined metadata to the error object, so we can propagate meaningful metadata (e.g., function names for `ObjectLostError`, the IP address for `ObjectLostError` within the raylet, or other useful metadata for `ActorDiedError`).
### Impl
We will allow the error object to include a "payload". The payload will be the protobuf message that includes the metadata.
```
# Prev
ACTOR_DIED (metadata) | (empty)
# New
ACTOR_DIED (metadata) | Serialized protobuf message (body)
```
Note that currently the body is a serialized msgpack that contains the serialized protobuf. This needs to be cleaned up in the future.
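As an illustration, serializing a protobuf message into the body could look like the following sketch (the message and helper names are assumptions, and as noted above the real body is additionally msgpack-wrapped):

```cpp
#include <string>

// Works for any generated protobuf message type (e.g. an error-info proto);
// SerializeToString is the standard protobuf C++ API.
template <typename ProtoMessage>
std::string MakeErrorBody(const ProtoMessage &metadata) {
  std::string body;
  metadata.SerializeToString(&body);
  return body;
}
```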
Adds a working failure test for streaming and non-streaming shuffle, without lineage reconstruction. This does a few things.
Test improvements:
- modifies AutoscalingCluster to allow passing an idle node timeout (the default is very low)
- some small improvements to the NodeKiller actor to hopefully reduce flakiness.
Shuffle fixes:
- modifies shuffle tracker to wait on futures instead of having tasks signal. During failures, tasks may never signal the tracker, so we can't rely on these to track progress.
Core fixes:
- raylet will exit immediately if it receives the Shutdown RPC with graceful=False. There was a bug here where it was supposed to exit after replying to the client, but the gRPC server went down for an unknown reason and the client reply was never sent.
- On reference deletion, the owner now publishes an additional message to subscribers that the object has been deleted. Previously, this was causing a hang in streaming shuffle because the raylets pulling an object subscribed after the object was already deleted, so they never received the error signal.
This PR adds support for publishing and subscribing to logs in Python via GCS pubsub. It also refactors the Python threaded subscriber to support subscribing and calling `close()` from multiple threads.
We could also move the tests and logging support to another PR, but that would make the purpose of the refactoring seem less obvious.
Moves `debug_state.txt` to the log directory. This will help us find `debug_state.txt` from the dashboard. See below.
Adds `debug_state_gcs.txt`. This will display GCS's debug state. GCS will also dump its debug state to the file every 10 seconds.
For periodic printing of the debug state, I made it happen every 1 minute, because every 10 seconds is usually very spammy.
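A sketch of the periodic dump with a boost::asio timer (the file name matches the PR; the scheduling and shutdown details are simplified assumptions):

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <fstream>
#include <functional>
#include <string>

// Rewrites debug_state_gcs.txt every 10 seconds, then reschedules itself.
void ScheduleDebugStateDump(boost::asio::steady_timer &timer,
                            std::function<std::string()> debug_string) {
  timer.expires_after(std::chrono::seconds(10));
  timer.async_wait([&timer, debug_string](const boost::system::error_code &ec) {
    if (ec) return;  // timer cancelled, e.g. at shutdown
    std::ofstream out("debug_state_gcs.txt", std::ofstream::trunc);
    out << debug_string();
    ScheduleDebugStateDump(timer, debug_string);
  });
}
```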
## Why are these changes needed?
`ThreadPoolManager` and `FiberStateManager` have the same functionality and logic. This PR removes the duplicate implementations.
It adds a `ConcurrencyGroupExecutor` class for that logic: `ConcurrencyGroupExecutor<FiberState>` is used as the `FiberStateManager`, and `ConcurrencyGroupExecutor<BoundedExecutor>` is used as the `ThreadPoolManager`.
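A simplified sketch of the shared template (the real class also handles default groups and task dispatch; the constructor signature here is an assumption):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// One implementation, parameterized by the executor backing each group.
template <typename ExecutorType>
class ConcurrencyGroupExecutor {
 public:
  void AddGroup(const std::string &name, int max_concurrency) {
    groups_.emplace(name, std::make_unique<ExecutorType>(max_concurrency));
  }

  ExecutorType *GetExecutor(const std::string &name) {
    auto it = groups_.find(name);
    return it == groups_.end() ? nullptr : it->second.get();
  }

 private:
  std::unordered_map<std::string, std::unique_ptr<ExecutorType>> groups_;
};

// ConcurrencyGroupExecutor<FiberState> plays the old FiberStateManager role;
// ConcurrencyGroupExecutor<BoundedExecutor> plays the ThreadPoolManager role.
```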
## Why are these changes needed?
When the Java multi-worker feature is on and workers respond to `Exit` requests from the worker pool with delays (even slower than the interval of `TryKillingIdleWorkers`), the worker pool may send additional `Exit` requests to workers before receiving replies to previous ones. This leads to a `RAY_CHECK` failure from here
60df705b4e/src/ray/raylet/worker_pool.cc (L984)
due to executing two reply callbacks in a row.
This PR fixes the bug by ensuring the worker pool only sends new `Exit` requests to a worker if there are no inflight `Exit` requests to any worker of the worker process.
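A sketch of the guard (simplified; the real bookkeeping lives in the worker pool's process table, and the names below are illustrative):

```cpp
#include <vector>

// Tracks whether an Exit request to any worker of this worker process is
// still waiting for its reply.
class WorkerProcessState {
 public:
  bool HasPendingExit() const { return pending_exits_ > 0; }
  void OnExitSent() { ++pending_exits_; }
  void OnExitReplied() { --pending_exits_; }

 private:
  int pending_exits_ = 0;
};

void TryKillingIdleWorkers(std::vector<WorkerProcessState *> &idle_processes) {
  for (auto *process : idle_processes) {
    // Skip processes with an inflight Exit so we never queue a second
    // reply callback for the same request.
    if (process->HasPendingExit()) continue;
    process->OnExitSent();
    // ... send the Exit RPC; OnExitReplied() runs in the reply callback.
  }
}
```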
This PR adds the precise reason an actor died to the `ActorTable`. The `death cause` stored in the table is propagated to the core worker through pubsub, so that the core worker can eventually raise a good error message with metadata.
## Why are these changes needed?
If max concurrency is 1 in the default group, a blocking task executing in the default group will block subsequent tasks in other groups. See the reproduction script in #20475.
The issue is that tasks in the default concurrency group run in the main task execution thread, so tasks in other concurrency groups are blocked whenever the main task execution thread is blocked.
This PR only changes concurrent actor behavior so that the default group does not block other groups.
## Related issue number
Fixes #20475
## Why are these changes needed?
In this PR, instead of passing the specific `creation_task_exception`, we pass `RayErrorInfo`. This will allow us to pass any type of error metadata to `MarkTaskReturnObjectFailed`.
This PR is basically a refactoring.
## Related issue number
https://github.com/ray-project/ray/issues/20534
This is the first step to improve `RayActorError`, which doesn't provide any information to the user.
In this first step, we re-define ambiguous / confusing APIs and code paths.
1. Change the names of APIs that expose too little information:
- MarkPendingTaskFailed -> MarkPendingTaskObjectFailed (the API was too general compared to what it does)
- PendingTaskFailed -> FailOrRetryPendingTask (the API name didn't reflect its behavior)
2. Change the names of arguments that expose too much implementation detail:
- immediately_mark_object_fail -> mark_task_object_failed (no need to specify "immediately")
3. Move msgpack serialization to a util function instead of embedding it in the task manager function.
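A sketch of the extracted util, assuming msgpack-c++ (the helper's actual name and signature in Ray may differ):

```cpp
#include <msgpack.hpp>
#include <string>

// Wraps an already-serialized payload in a msgpack envelope so callers of
// the task manager don't embed serialization details inline.
inline std::string MsgpackSerialize(const std::string &payload) {
  msgpack::sbuffer buffer;
  msgpack::pack(buffer, payload);
  return std::string(buffer.data(), buffer.size());
}
```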
## Why are these changes needed?
Before the commit (e54d3117a4), all traffic went to Redis, which is a dedicated service.
After moving to GCS, internal KV competes with GCS traffic, which sometimes makes it a bottleneck.
Before this PR, the `many_actor` tests were failing. The reason is that when a lot of actors start, GCS comes under really heavy load, and worker starts time out because their internal KV requests fail to execute in a short time.
When a worker start fails, a new worker is started even though the original one is still pending, and in the end there are a lot of workers.
There are several things here that need fixing; this is the quick fix for the issue, and it also restores the behavior we had when using Redis.
## Related issue number
Closes #20602
This reverts commit e9132ed7ca.
## Why are these changes needed?
Seems to break the Windows build.
```
(07:46:25) ERROR: BUILD.bazel:406:11: Compiling src/ray/common/task/task_spec.cc failed: (Exit 2): cl.exe failed: error executing command
```
<img width="487" alt="Screen Shot 2021-11-23 at 3 09 18 AM" src="https://user-images.githubusercontent.com/18510752/143013973-f157724c-4951-49a9-80c6-158d41aa4295.png">
Remerging #19789 with some fixes for Dask-on-Ray 1TB sort:
- Fixes a bug where the timer was not getting reset correctly
- Increased timeout to 10min just to be safe
- Changed the error to a unique exception ObjectFetchTimedOutError to improve debugging.
This exception should usually indicate a system-level bug.
This PR reverts the previous revert with the following minor changes.
Worker capping is off by default.
The cap feature flag is on for the tests that explicitly require it.
This PR is the last one needed to enable out-of-order execution. Previous PR: #20176
In this PR specifically, we added an `execute_out_of_order` option to the `.options()` call, which creates the actor with both the out-of-order submit queue and the out-of-order scheduling queue.
This PR also added @simon-mo's original case for testing.
## Why are these changes needed?
This is a series of PRs to make `CoreWorkerProcess` thread-safe and the `CoreWorker` code easy to read. [#19675, #19677, #19678, #19679]
This PR moves `CoreWorkerOptions` out of `core_worker.h`, which makes the code easier to read.
Next PR: #19677