Commit graph

2514 commits

Author SHA1 Message Date
Yi Cheng
6280bc4391
Revert "[core] Ensure failed to register worker is killed and print better log" (#21023)
`linux://python/ray/tests:test_runtime_env_complicated` looks flaky after this pr.
Reverts ray-project/ray#20964
2021-12-10 14:57:32 -08:00
Yi Cheng
2ed5b1ee07
[2/gcs-mem-kv] Use memory store client when flag is set (#20931)
This is part of redis removal. In this PR, if `RAY_gcs_storage=memory`, it'll use memory table instead of redis table.
The config setup has to be moved into GcsServer because with the memory table it's transient.
2021-12-09 22:41:05 -08:00
mwtian
2410ec5ef0
[Core][Dashboard Pubsub 1/n] Allow a channel to have subscribers to a key and to the whole channel concurrently (#20954)
For actor channel, GCS clients subscribe to a single actor but dashboard subscribes to all actors. This change makes supporting this possible.

Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together.

Also, add the basic support for dashboard reporter pubsub.
2021-12-09 15:00:38 -08:00
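The idea in #20954 above, one channel serving per-key subscribers and whole-channel subscribers at the same time, can be illustrated with a small Python sketch. This is not Ray's C++ publisher; `Channel`, `subscribe_key`, `subscribe_all`, and `publish` are hypothetical names chosen only for illustration.

```python
from collections import defaultdict


class Channel:
    """Toy channel that tracks per-key and whole-channel subscribers separately."""

    def __init__(self):
        self._key_subscribers = defaultdict(set)  # key -> subscriber ids
        self._all_subscribers = set()             # subscribers to every key

    def subscribe_key(self, subscriber_id, key):
        self._key_subscribers[key].add(subscriber_id)

    def subscribe_all(self, subscriber_id):
        self._all_subscribers.add(subscriber_id)

    def publish(self, key):
        """Return the ids that should receive a message for `key`."""
        # Union, so a subscriber registered both ways is notified only once.
        return self._key_subscribers.get(key, set()) | self._all_subscribers


channel = Channel()
channel.subscribe_key("gcs_client", "actor_1")  # a client watching one actor
channel.subscribe_all("dashboard")              # the dashboard watching all actors
recipients = channel.publish("actor_2")
```

In this sketch a message for `actor_2` reaches only the whole-channel subscriber, while a message for `actor_1` reaches both.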
SangBin Cho
f4d46398f7
[Internal Observability] [Part 2] Share the same code for RecordMetrics & DebugString for cluster task manager. (#20958)
Share the same code for RecordMetrics & DebugString for cluster task manager.

Both require almost identical (and expensive) operations. This PR makes them share the same `UpdateState` code, which stores stats in a struct.

Note that we don't update state when metrics are recorded, because the debug string is generated periodically anyway and that keeps the state up to date.

Ideally, we should dynamically update the stats.
2021-12-09 14:24:33 -08:00
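A minimal Python sketch of the pattern described in #20958 above: one expensive update pass whose cached result feeds both the debug string and the metrics. The real code is C++ in the cluster task manager; `TaskManager`, `update_state`, and the stat names here are assumptions made for illustration.

```python
class TaskManager:
    """Toy manager whose two reporting paths share one cached stats pass."""

    def __init__(self):
        self.tasks = {}  # task id -> "pending" | "running"
        self.stats = {"pending": 0, "running": 0}

    def update_state(self):
        # The single expensive pass; both reporting paths read its result.
        states = list(self.tasks.values())
        self.stats = {
            "pending": states.count("pending"),
            "running": states.count("running"),
        }

    def debug_string(self):
        # Called periodically, so this is the path that refreshes the stats.
        self.update_state()
        return f"pending={self.stats['pending']} running={self.stats['running']}"

    def record_metrics(self):
        # Reuses the stats from the last debug_string() call; no recomputation.
        return dict(self.stats)


manager = TaskManager()
manager.tasks = {"a": "pending", "b": "running", "c": "pending"}
before = manager.record_metrics()  # stale: debug_string() has not run yet
text = manager.debug_string()
after = manager.record_metrics()
```

The trade-off mirrors the note in the commit message: metrics can lag until the next debug-string pass.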
SangBin Cho
05a302b468
[Internal Observability] [Part 3] Support debug state metrics on all components. (#20957)
This PR adds RecordMetrics and DebugString to all raylet components. 

Some of the methods are empty for now. They will be supported in the next PR.
2021-12-09 14:24:15 -08:00
Yi Cheng
83c639ea76
[core] Ensure failed to register worker is killed and print better log (#20964)
Before this PR, when the raylet notices that something is wrong with a worker starting, it starts a new worker but does not kill the old one. If the old one is hanging, this wastes resources.
This PR kills the failed worker if it's still alive and also prints useful logs.
2021-12-09 12:37:39 -08:00
chenk008
8bb9bfe632
[Core]Add metrics: worker_register_time_ms (#20472)
Recently I have been benchmarking worker registration with workers running in containers. Ray core currently has the `process_startup_time_ms` metric, which measures process fork time.

This PR adds a metric for the duration of worker registration.
2021-12-09 21:25:49 +08:00
Yi Cheng
f7b0b872f9
[1/kv-regression] Put KV into a dedicated thread pool (#20922)
After moving internal kv to grpc, there is a regression in actor launching performance. This PR moves the work from the main thread to a dedicated thread for internal kv to mitigate it.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-12-09 00:21:47 -08:00
Jiajun Yao
655cc584a9
[Scheduler] Support per task/actor SpreadSchedulingStrategy (#20972)
This PR adds per task/actor SpreadSchedulingStrategy which will try to spread the tasks on a best effort basis.
2021-12-08 22:22:07 -08:00
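One possible reading of "spread on a best effort basis" from #20972 above, sketched in Python. This is not Ray's scheduler; `spread_assign`, the capacity model, and the tie-breaking rule are all assumptions for illustration.

```python
def spread_assign(tasks, node_capacity):
    """Assign each task to the feasible node that currently holds the fewest tasks."""
    load = {node: 0 for node in node_capacity}
    assignment = {}
    for task in tasks:
        feasible = [n for n in node_capacity if load[n] < node_capacity[n]]
        if not feasible:
            break  # best effort: remaining tasks stay unassigned
        # Ties go to the first node in dict order.
        target = min(feasible, key=lambda n: load[n])
        load[target] += 1
        assignment[task] = target
    return assignment


assignment = spread_assign(["t1", "t2", "t3", "t4"], {"node_a": 3, "node_b": 1})
```

Once `node_b` is full, the remaining tasks fall back to `node_a` rather than failing, which is the "best effort" part of this sketch.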
SangBin Cho
5298a9046c
[Internal Observability] [Part 1] Centralize existing metrics to metric_defs.h (#20728)
This PR centralizes all existing metrics to `metric_defs.h`. 

Previously, each file relied on an implicit import of `metric_defs.h` within the stats module. After this PR, each file explicitly imports `metric_defs.h`.
2021-12-08 14:06:05 -08:00
Yi Cheng
442b1025cd
[1/gcs-mem-kv] Memory mode for internal kv (#20881)
This is part of the Redis removal work. In this PR we introduce a new mode for internal kv: memory mode.
There are two ways to address this:
- Update store client and use store client in internal kv
- Add memory table into internal kv directly.

The former is actually the better choice since it puts everything related to storage into a lower level. But it's hard to do now: internal kv uses hset/hget and the redis store client uses set/get, so the data would not be compatible and it would be a breaking change.

So the easier way is the second option, and that is what this PR does.

Next: use the flag for store client
2021-12-08 10:40:35 -08:00
Jiajun Yao
5b168a1515
[Scheduler] Support per task/actor PlacementGroupSchedulingStrategy (#20507)
This PR adds a per task/actor scheduling strategy. Currently the only strategies are PlacementGroupSchedulingStrategy and DefaultSchedulingStrategy.

Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy` to define placement group for actor/task. The old way will be deprecated.
2021-12-07 23:11:31 -08:00
Lixin Wei
96dc10a95a
[Core] Fix Crash in ObjectDirectory (#20540)
We hit a crash in the `RAY_CHECK` at line 446:

d26c9e67e8/src/ray/object_manager/ownership_based_object_directory.cc (L441-L450)


We found that this happens because we didn't set the node_id for dead nodes. If there are dead nodes and we try to call LookupRemoteConnectionInfo on one of them, the crash occurs.

This PR fixes this crash.
2021-12-07 23:03:49 -08:00
Stephanie Wang
1b9c03adb3
[core] Remove spammy code in object directory client (#20838)
* log

* remove

* fix

* fix

* x

* x
2021-12-07 19:51:44 -08:00
Jiajun Yao
2208cf7672
[Ray Client] Pickle task options for ray client (#20930)
We can just pickle task options instead of json so that we don't need to write custom `to_dict` and `from_dict` methods for complex python option objects (e.g. PlacementGroup).
2021-12-07 17:07:19 -08:00
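The motivation in #20930 above can be seen with the standard library alone. In this sketch `types.SimpleNamespace` is a stand-in for a complex option object such as a placement group; the option names are hypothetical and this is not the Ray client's actual serialization code.

```python
import json
import pickle
from types import SimpleNamespace

# SimpleNamespace stands in for a complex Python option object.
options = {
    "num_cpus": 1,
    "placement_group": SimpleNamespace(bundles=[{"CPU": 1}]),
}

# json cannot encode the custom object without hand-written to_dict/from_dict.
try:
    json.dumps(options)
    json_ok = True
except TypeError:
    json_ok = False

# pickle round-trips the whole options dict, custom object included.
restored = pickle.loads(pickle.dumps(options))
```

The usual caveat applies: pickle should only be used between trusted endpoints, since unpickling can execute arbitrary code.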
Yi Cheng
ea1d081aac
[core] Simple chaos testing for asio (#19970)
Right now in Ray, a lot of edge cases related to gRPC are not tested. This PR is a simple attempt to give developers a way to delay gRPC requests. It can be used for manual testing and also in e2e tests, since it supports delays for specific gRPC methods.

To use this feature, simply set the environment variable `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"`

This means `method1` is delayed 10-20us and `method2` is delayed 20-30us. All other methods are delayed 200us.
2021-12-07 14:47:07 -08:00
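To make the `RAY_TESTING_ASIO_DELAY_US` format from #19970 above concrete, here is a hedged Python sketch of how such a value could be parsed and applied. The actual parsing lives in Ray's C++ code; `parse_delay_spec` and `delay_for` are hypothetical helpers, and treating both bounds as inclusive is an assumption.

```python
import random


def parse_delay_spec(spec):
    """Parse 'method=min:max,...' into {method: (min_us, max_us)}."""
    delays = {}
    for entry in spec.split(","):
        method, _, bounds = entry.partition("=")
        low, _, high = bounds.partition(":")
        delays[method.strip()] = (int(low), int(high))
    return delays


def delay_for(delays, method, rng=random):
    """Pick a delay in microseconds, falling back to the '*' entry, else 0."""
    low, high = delays.get(method, delays.get("*", (0, 0)))
    return rng.randint(low, high)  # inclusive on both ends (assumed)


delays = parse_delay_spec("method1=10:20,method2=20:30,*=200:200")
```

With this value, `delay_for(delays, "method1")` falls between 10 and 20, and any method without its own entry gets exactly 200.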
Chen Shen
b9a418352b
[Core][Refactor CoreWorker 3/n] split static function from CoreWorkerProcess (#19678)
Separate the CoreWorkerProcess static functions from CoreWorkerProcess state. Currently the static and non-static state are mixed together, and more importantly the static state is not thread safe. By separating them and creating a helper class for the non-static state, CoreWorkerProcessImpl, we can make it thread safe.

A follow-up PR will make CoreWorkerProcess state thread safe.

This PR depends on #19677. The follow-up PR is #19679.
2021-12-06 11:12:21 -08:00
Kai Fricke
d4413299c0
Revert "[Core] Support back pressure for actor tasks (#19936)" (#20880)
This reverts commit a4495941c2.
2021-12-03 17:48:47 -08:00
mwtian
c01fa39d84
[Cleanup] delete remaining protos related to ObjectLocation in GCS (#20823)
Object metadata are fully managed by workers now, so the related protos and logic in GCS are obsolete. Most of the logic has been removed in https://github.com/ray-project/ray/pull/19963. This PR removes some remaining obsolete protos.
2021-12-02 15:24:43 -08:00
WanXing Wang
a4495941c2
[Core] Support back pressure for actor tasks (#19936)
Support back pressure in core worker.
Job config added for python worker and java worker.
2021-12-02 14:41:30 -08:00
mwtian
0467bc9df5
[Core][Pubsub][Importer] GCS pubsub for function manager & importer (#20804)
This PR allows using Ray pubsub for notifying worker importers that a new function / actor class needs to be imported.
2021-12-01 10:44:50 -08:00
SangBin Cho
4b9524ed76
[Part 4] Support passing metadata to Ray error object. (#20714)
This will allow us to pass protobuf-defined metadata to the error object, so that we can propagate meaningful metadata (e.g., function names for ObjectLostError, the IP address for ObjectLostError within raylet, or other useful metadata for ActorDiedError).

### Impl
We will allow the error object to include "payload". The payload will be the protobuf message that includes metadata.
```
# Prev 
ACTOR_DIED (metadata) | (empty)

# New
ACTOR_DIED (metadata) | Serialized protobuf message (body)
```

Note that currently, the body is a serialized message pack that contains the serialized protobuf. This needs to be cleaned up in the future.
2021-11-30 21:58:07 -08:00
SangBin Cho
5e1692e8ac
[Core] Support timeout for gRPC methods. (#20734)
* Completed

* Add a test

* lint failure
2021-11-30 18:46:20 -08:00
Stephanie Wang
162cc9e6bd
Add chaos test for shuffle (#20657)
Adds a working failure test for streaming and non-streaming shuffle, without lineage reconstruction. This does a few things.

Test improvements:
- modifies AutoscalingCluster to allow passing an idle node timeout (the default is very low)
- some small improvements to the NodeKiller actor to hopefully reduce flakiness.

Shuffle fixes:
- modifies shuffle tracker to wait on futures instead of having tasks signal. During failures, tasks may never signal the tracker, so we can't rely on these to track progress.

Core fixes:
- raylet will exit immediately if it receives the Shutdown RPC with graceful=False - there was a bug here where it's supposed to exit after replying to the client, but the gRPC server goes down for an unknown reason and the client reply is never sent
- On reference deletion, the owner now publishes an additional message to subscribers that the object has been deleted. Previously, this was causing a hang in streaming shuffle because the raylets pulling an object subscribed after the object was already deleted, so they never received the error signal.
2021-11-30 15:24:09 -08:00
Jiajun Yao
e3e2739164
Exit worker when parent raylet dies (#20777)
* Exit worker when parent raylet dies

2021-11-30 10:04:11 -08:00
mwtian
a4d3898159
[Core][Pubsub][Logging 1/n] add logging support to GCS pubsub in Python (#20604)
This PR adds support for publishing and subscribing to logs in Python via GCS pubsub. It also refactors the Python threaded subscriber to support subscribing and calling `close()` from multiple threads.

We can also move tests and logging support to another PR, but that would make the purpose of the refactoring seem less obvious.
2021-11-29 11:26:01 -08:00
SangBin Cho
6649f078e5
[Internal Observability] Move debug_state.txt to the log dir + support gcs_server debug state (#20722)
- Move debug_state.txt to the log directory. This will help us find debug_state.txt from the dashboard.
- Add debug_state_gcs.txt, which displays the GCS debug state. GCS also dumps its debug state to the file every 10 seconds.
- Periodic printing of the debug state now happens every 1 minute, because every 10 seconds is usually very spammy.
2021-11-28 20:42:37 -08:00
Philipp Moritz
15a51b7c65
Use thread local random number generator (#20708)
2021-11-27 15:44:44 -08:00
Qing Wang
116bda8f05
[Core] Remove duplicated implementations of concurrency group executor. (#20467)
## Why are these changes needed?
ThreadPoolManager and FiberStateManager have the same functionality and logic. This PR aims to remove the duplicate implementations of them.

Add a ConcurrencyGroupExecutor class to do that logic. `ConcurrencyGroupExecutor<FiberState>` is used as FiberStateManager, `ConcurrencyGroupExecutor<BoundedExecutor>` is used as ThreadPoolManager.
2021-11-27 12:57:40 +08:00
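The deduplication in #20467 above, one manager class parameterized by the kind of executor, can be sketched in Python. The real `ConcurrencyGroupExecutor` is a C++ template; the constructor arguments, `get_executor`, and `shutdown` below are assumptions for illustration, with `ThreadPoolExecutor` standing in for `BoundedExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyGroupExecutor:
    """One manager class, parameterized by the executor type each group uses."""

    def __init__(self, executor_factory, groups, default_concurrency=1):
        # groups: {group name: max concurrency}
        self._default = executor_factory(default_concurrency)
        self._executors = {name: executor_factory(n) for name, n in groups.items()}

    def get_executor(self, group_name=None):
        # Unknown or missing group names fall back to the default executor.
        return self._executors.get(group_name, self._default)

    def shutdown(self):
        for executor in [self._default, *self._executors.values()]:
            executor.shutdown()


# Thread-pool flavour; a fiber-based factory could be passed in the same way.
manager = ConcurrencyGroupExecutor(
    lambda n: ThreadPoolExecutor(max_workers=n), {"io": 2, "compute": 4}
)
result = manager.get_executor("io").submit(sum, [1, 2, 3]).result()
fallback_is_default = manager.get_executor("missing") is manager.get_executor()
manager.shutdown()
```

Passing the factory in keeps the group bookkeeping in one place, which is the point of replacing two near-identical managers.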
Kai Yang
722428a657
[Core] Fix worker pool crash due to incorrect pending_exit_idle_workers_ usage (#20180)
## Why are these changes needed?

When the Java multi-worker feature is on and workers respond to `Exit` requests from the worker pool with delays (slower than the interval of `TryKillingIdleWorkers`), the worker pool may send additional `Exit` requests to workers before receiving replies to the previous ones. This leads to a `RAY_CHECK` failure here

60df705b4e/src/ray/raylet/worker_pool.cc (L984)

due to executing two reply callbacks in a row.

This PR fixes the bug by ensuring the worker pool only sends new `Exit` requests to a worker if there are no inflight `Exit` requests to any worker of the worker process.
2021-11-26 13:50:07 +08:00
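The guard described in #20180 above, sending a new `Exit` request only when no `Exit` request is in flight to any worker of the same process, might look like this in miniature. `WorkerProcess`, `try_send_exit`, and `on_exit_reply` are hypothetical names; the actual fix is in the C++ worker pool.

```python
class WorkerProcess:
    """Hypothetical model of one worker process hosting several workers."""

    def __init__(self, worker_ids):
        self.worker_ids = list(worker_ids)
        self.inflight_exit = set()  # workers with an unanswered Exit request

    def try_send_exit(self, worker_id):
        # Skip if any worker of this process still has an Exit request in flight.
        if self.inflight_exit:
            return False
        self.inflight_exit.add(worker_id)
        return True

    def on_exit_reply(self, worker_id):
        self.inflight_exit.discard(worker_id)


process = WorkerProcess(["w1", "w2"])
first = process.try_send_exit("w1")
second = process.try_send_exit("w2")  # suppressed: w1's reply is still pending
process.on_exit_reply("w1")
third = process.try_send_exit("w2")   # allowed again after the reply arrives
```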
SangBin Cho
31f378e45a
[Part 2] Improve RayActorDiedError: Store why the actor is dead to the actor table. (#20528)
This PR stores the precise reason why an actor died in `ActorTable`. The `death cause` stored in the table is propagated to the core worker through pubsub, so that the core worker can eventually raise a good error message with metadata.
2021-11-25 04:39:02 -08:00
Guyang Song
454b7bd125
Revert "Revert "[core] add runtime env info to task spec debug string (#20668)" (#20697)
## Why are these changes needed?
- fix compiling error and revert the revert.
2021-11-24 23:03:10 -08:00
Qing Wang
cd2b83a259
[Core][ConcurrencyGroup] Fix blocking task in default group block tasks in other group. (#20525)
Why are these changes needed?
If max concurrency is 1 in the default group, a blocking task executing in the default group will block subsequent tasks in other groups. See the reproduction script in #20475

The issue is that tasks in the default concurrency group run in the main task execution thread, so tasks in other concurrency groups are blocked whenever the main task execution thread is blocked.

This PR only changes concurrent actor behavior so that the default group does not block other groups.

Related issue number
Fix #20475
2021-11-25 14:24:17 +08:00
SangBin Cho
d725457c9f
[Part 3] Improve ActorDiedError message: Passing error info (#20701)
## Why are these changes needed?

In this PR, instead of passing specific "creation_task_exception", we pass RayErrorInfo. This will allow us to pass any type of error metadata to MarkTaskReturnObjectFailed. 

This PR is basically refactoring. 

## Related issue number

https://github.com/ray-project/ray/issues/20534

2021-11-24 19:29:20 -08:00
qicosmos
c6e347c06f
Clean exported symbols (#20608)
* linkopts shared

* try to remove some symbols

* revert PyInit

* remove opencensus

* add suffix

* try to remove JNI

* _JNI_On*
2021-11-24 18:23:01 -08:00
Gagandeep Singh
f22a24aca4
Replace time based seed generation with absl::BitGen and absl::Uniform (#20696)
2021-11-24 14:36:35 -08:00
Guyang Song
53630ee03b
Revert "Revert "[runtime env] redefine runtime env to protobuf"" and fix windows compiling (#20692)
- Fix windows compiling and revert https://github.com/ray-project/ray/pull/20641
- Seems the pr https://github.com/ray-project/ray/pull/20670 can solve the windows compiling issue.
2021-11-24 09:01:01 -08:00
SangBin Cho
e310f6c76f
[Part 1] Improve RayActorDeadError: Refactoring (#20458)
This is the first step to improve `RayActorError` which doesn't provide any information to the user.

In the first step, we re-define ambiguous / confusing APIs and code path. 

1. Change the name of APIs that expose too little information
- MarkPendingTaskFailed -> MarkPendingTaskObjectFailed (API too general compared to what it does)
- PendingTaskFailed  -> FailOrRetryPendingTask (API name doesn't make much sense compared to its behavior).

2. Change the name of arguments that expose too much impl detail
- immediately_mark_object_fail -> mark_task_object_failed (no need to specify "immediately")

3. Move msgpack serialization to a util function instead of embedding it to the task manager function.
2021-11-24 05:11:33 -08:00
Matti Picus
08655ab812
[Windows] only report metric reporting failure once (#20426)
2021-11-23 17:35:58 -08:00
Yi Cheng
40db73c2ff
[gcs] Fix internal kv as the bottleneck when worker starts (#20662)
## Why are these changes needed?

Before commit e54d3117a4, all traffic went to Redis, which is a dedicated service.

After moving to GCS, internal kv competes with other GCS traffic, which sometimes makes it a bottleneck.

Before this PR, the `many_actor` tests were failing. When a lot of actors start, GCS is under heavy load, and worker startup times out because its internal kv requests are not executed in time.
When a worker fails to start, a new worker is started even though the original one is still pending, so in the end there are a lot of workers.

Several things need fixing here. This PR is the quick fix, which restores the behavior we had when using Redis.

## Related issue number

Closes #20602
2021-11-23 15:13:07 -08:00
SangBin Cho
720bca8a1c
Revert "[core] add runtime env info to task spec debug string (#20631)" (#20668)
This reverts commit e9132ed7ca.

## Why are these changes needed?

Seems to break Windows build. 

```
(07:46:25) ERROR: BUILD.bazel:406:11: Compiling src/ray/common/task/task_spec.cc failed: (Exit 2): cl.exe failed: error executing command
```

2021-11-23 03:09:39 -08:00
Guyang Song
e9132ed7ca
[core] add runtime env info to task spec debug string (#20631)
2021-11-23 14:10:56 +08:00
Guyang Song
191be85057
[script][format] check copyright for .proto files (#20632)
## Why are these changes needed?
- I found that we also have a copyright header in .proto files. Add it to the copyright formatter.
2021-11-23 12:26:30 +08:00
Alex Wu
9388d28233
Revert "[runtime env] redefine runtime env to protobuf" (#20641)
Reverts #19511

Breaks windows compilation
2021-11-22 13:11:30 -08:00
Chen Shen
dc726ab6ba
[Core][Refactor CoreWorker 2/n] split CoreWorkerProcess from CoreWorker #19677
Move CoreWorkerProcess from CoreWorker into separate files. This PR depends on #19675. The follow up PR is #19678
2021-11-22 11:06:35 -08:00
Stephanie Wang
88136fa495
[core] Timeout object fetches that take too long (#20516)
Remerging #19789 with some fixes for Dask-on-Ray 1TB sort:

- Fixes a bug where the timer was not getting reset correctly
- Increased timeout to 10min just to be safe
- Changed the error to a unique exception ObjectFetchTimedOutError to improve debugging. 

This exception should usually indicate a system-level bug.
2021-11-20 16:43:56 -08:00
Guyang Song
ad56b9b432
[runtime env] redefine runtime env to protobuf (#19511)
2021-11-20 16:54:42 +08:00
Alex Wu
4cc225e9d4
Revert "Revert "[core] Nested task support via task depth + backpressure" (#20438)" (#20443)
This PR reverts the previous revert with the following minor changes.

Worker capping is off by default.
The cap feature flag is on for the tests that explicitly require it.
2021-11-19 15:22:35 -08:00
Chen Shen
77a8723bba
[Core][actor out-of-order execution 6/n] plumbing work to make it work e2e (#20177)
This PR is the last PR that enables out of order execution. Previous PR: #20176

In this PR specifically, we add an `execute_out_of_order` option to the `.options` call, which creates the actor with both out_of_order_submit_queue and out_of_order_scheduling queue.

This PR also adds @simon-mo's original case for testing.
2021-11-19 11:05:18 -08:00
Chen Shen
f0e8d66a85
[Core][Refactor CoreWorker 1/n] move CoreWorkerOptions to its own file #19675
Why are these changes needed?
This is a series of PRs to make CoreWorkerProcess thread-safe and the CoreWorker code easier to read. [#19675 #19677 #19678 #19679]

Move CoreWorkerOptions out of core_worker.h; makes the code easier to read.

Next PR: #19677
2021-11-19 09:24:30 -08:00