Commit graph

337 commits

Author SHA1 Message Date
Chen Shen
dd527da58b
[GCS][Storage Unification 4/n] Monitor for Gcs storage interface. (#23577)
Add basic monitoring for GCS Storage Interface, including the qps and latency for each operation.
2022-05-09 16:50:39 -07:00
Chong-Li
f3767131cb
[Enable gcs actor scheduler 1/n] Raylet and GCS schedulers share cluster_task_manager (#23829)
2022-05-02 21:45:23 +08:00
mwtian
d5d2ef4249
[Core] Add a utility to check GCS / Ray cluster health (#23382)
* Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seems to be user demand for exposing this as an API as well.
* Improve the error message when the address provided to Ray does not contain a port.
2022-04-18 09:58:45 -07:00
Kai Fricke
65d9a410f7
[ci] Clean up ci/ directory (refactor ci/travis) (#23866)
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.

Details:

- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
2022-04-13 18:11:30 +01:00
Jiajun Yao
95714cc281
Node affinity scheduling strategy (#23381)
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:

1. Specify node using id instead of ip since ip may not be unique for each node.
2. Support soft constraint so the task can be tolerant to node failures.

After this PR, the node-ip custom resource can be deprecated.
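
The difference between a hard and a soft node-affinity constraint can be sketched as follows (C++17). This is only an illustration under stated assumptions: `PickNode`, the `alive_nodes` set, and the "fall back to the first alive node" policy are hypothetical and are not taken from Ray's scheduler.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Hypothetical sketch: resolve a node-affinity request against the alive nodes.
// Returns the chosen node ID, or std::nullopt if the task cannot be placed.
std::optional<std::string> PickNode(const std::set<std::string> &alive_nodes,
                                    const std::string &target_node_id,
                                    bool soft) {
  assert(!target_node_id.empty() && "a node ID is required");
  // Preferred case: the requested node is alive.
  if (alive_nodes.count(target_node_id) > 0) {
    return target_node_id;
  }
  // Hard constraint (or no node left at all): fail instead of running elsewhere.
  if (!soft || alive_nodes.empty()) {
    return std::nullopt;
  }
  // Soft constraint: tolerate the node failure by falling back to another node.
  // Assumption: any alive node is acceptable; this sketch takes the first one.
  return *alive_nodes.begin();
}
```

Nodes are identified by ID rather than by IP in this sketch, matching the first benefit listed above.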
2022-04-12 21:31:26 -07:00
Yi Cheng
87dc57df26
[3][cleanup][gcs] Remove redis based pubsub. (#23520)
This PR removes redis based pubsub.
2022-03-31 00:13:55 -07:00
ZhuSenlin
e79a63db64
[GCS] [3 / n] Refactor gcs_resource_scheduler to cluster_resource_scheduler #23371
As we (@scv119 @iycheng @raulchen @Chong-Li @WangTaoTheTonic ) discussed offline, the GcsResourceScheduler on the GCS side should be unified to ClusterResourceScheduler.

There is already a big PR (#23268) to do this, but to make review easier, I will split it into two or more small PRs.
This is [3/n]:

- Move the implementation of all policies from gcs_resource_scheduler to bundle_scheduling_policy
- Delete gcs_resource_scheduler
- Refactor gcs_resource_scheduler_test to cluster_resource_scheduler_2_test

BTW: The interface inside ISchedulingPolicy should be refactored in another PR, see the discussion #23323 (comment)

To be clear:

- Scorer-related code is moved out of gcs_resource_scheduler to scorer.h/.cc, with no logic changes.
- Policy-related code is moved out of gcs_resource_scheduler to bundle_scheduling_policy.h/.cc, and a small part of the logic in "GcsResourceScheduler::Schedule" is distributed into each policy.
- Some code inside gcs_placement_group_scheduler.h/.cc is changed to adapt to the new data structures (SchedulingResult and SchedulingContext).
2022-03-30 17:39:46 -07:00
jon-chuang
54ddcedd1a
[Core] Chore: move test to right dir #23096
2022-03-30 09:29:38 -07:00
Yi Cheng
781c46ae44
[scheduling][5] Refactor resource syncer. (#23270)
## Why are these changes needed?

This PR refactors the resource syncer to decouple it from GCS and raylet. GCS and raylet will use the same module to sync data. The integration will happen in the next PR.

There are several newly introduced components:

* RaySyncer: the place where remote and local information sits. It's a coordinator layer.
* NodeState: keeps track of the local status, similar to NodeSyncConnection.
* NodeSyncConnection: keeps track of the information sent and received, and makes sure not to send information the remote node already knows.

The core protocol is that each node will send {what it has} - {what the target has} to the target.
For example, consider nodes A <-> B: A sends to B everything A has, excluding what B already has.
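
A minimal sketch of this "send only the difference" rule, assuming each node's view can be modeled as a map from a key to the newest version it has seen. The `View` type and the `ComputeDelta` and `ApplyDelta` functions are hypothetical names for illustration and are not part of the PR.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model: a node's view maps a key to the newest version it has seen.
using View = std::map<std::string, long>;

// Computes {what `local` has} - {what `remote` has}: every entry the remote
// side is missing or only holds at an older version.
View ComputeDelta(const View &local, const View &remote) {
  View delta;
  for (const auto &[key, version] : local) {
    auto it = remote.find(key);
    if (it == remote.end() || it->second < version) {
      delta.emplace(key, version);
    }
  }
  return delta;
}

// Applies a received delta to the receiver's view.
void ApplyDelta(View &remote, const View &delta) {
  for (const auto &[key, version] : delta) {
    auto it = remote.find(key);
    // A delta produced by ComputeDelta never moves a version backwards.
    assert(it == remote.end() || it->second < version);
    remote[key] = version;
  }
}
```

After B applies the delta it received from A, computing the delta from A to B again yields nothing, so no redundant data is sent.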

Whenever there is new information (from NodeState or NodeSyncConnection), it is passed to RaySyncer to be broadcast.

NodeSyncConnection is for the communication layer. It has two implementations, Client and Server:

* Server => Client: the client sends a long-polling request and the server responds every 100ms if there is data to be sent.
* Client => Server: the client checks every 100ms whether there is new data to be sent. If there is, it uses an RPC call to send the data.

Here is one example:

```mermaid
flowchart LR;
    A-->B;
    B-->C;
    B-->D;
```

It means A initializes the connection to B, and B initializes the connections to C and D.

Now C generates a message M:

1. [C] RaySyncer checks whether a new message has been generated in C and gets M
2. [C] RaySyncer pushes M to the NodeSyncConnection in the local component (B)
3. [C] ServerSyncConnection waits until B sends a long-polling request, then sends the data to B
4. [B] B receives the message from C and pushes it to its local sync connections (C, A, D)
5. [B] ClientSyncConnection of C does not push it to its local queue, since it was received on this channel.
6. [B] ClientSyncConnection of D sends this message to D
7. [B] ServerSyncConnection of A is used to send this message to A (long-polling here)
8. [B] B updates NodeState (local component) with this message M
9. [D] D's pipeline is similar to 5) (with ServerSyncConnection) and 8)
10. [A] A's pipeline is similar to 5) and 8)
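
Steps 4 to 8 on node B can be sketched as a small relay routine. This is a simplified model, not the PR's code: the `Node` struct, its `outbox` and `state` members, and `OnReceive` are hypothetical stand-ins for the per-peer NodeSyncConnection queues and for NodeState.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of one node in the sync topology.
struct Node {
  std::string id;
  // Outgoing queue per connected peer (stands in for NodeSyncConnection).
  std::map<std::string, std::vector<std::string>> outbox;
  // Messages applied locally (stands in for NodeState).
  std::vector<std::string> state;

  // Handles message `msg` that arrived from peer `from`.
  void OnReceive(const std::string &from, const std::string &msg) {
    assert(outbox.count(from) > 0 && "message must come from a connected peer");
    for (auto &[peer, queue] : outbox) {
      // Skip the channel the message came in on: that peer already has it.
      if (peer != from) {
        queue.push_back(msg);
      }
    }
    // Finally update the local state with the message.
    state.push_back(msg);
  }
};
```

With B connected to A, C, and D, a message received from C is queued for A and D only, and B's own state is updated once.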
2022-03-29 23:52:39 -07:00
Hao Chen
b7d32df8b0
Refactor scheduler data structures (#22854)
This is the first PR to refactor scheduler data structures (See #22850).

Major changes:
- Hid the implementation details in the `ResourceRequest` and `TaskResourceInstances` classes, which expose public methods such as algebra operators and comparison operators.
- Hid the differences between "predefined" and "custom" resources inside these 2 classes. Call sites can simply use the resource ID to access the resource, whether it is predefined or custom.
- The predefined_resources vector now always has the full length. So no more "resize"s are needed. 
- Removed the `ResourceCapacity` class. Now "total" and "available" resources are stored in separate fields in "NodeResources". 
- Moved helper functions for FixedPoint vectors from "cluster_resource_data.h" to "fixed_point.h"
- "ResourceID" now has static methods to get the resource ids of predefined resources, e.g. "ResourceID::CPU()". 
- Encapsulated unit-instance resource logic to "ResourceID"

Other planned changes that are not included in this PR:
- Rename ResourceRequest to ResourceSet, and move it to its own file.
- Remove the predefined vectors and always use maps.

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2022-03-29 19:44:59 +08:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR removes the enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
ZhuSenlin
125ef0e5a6
[GCS] integrate cluster_resource_manager into gcs_resource_manager and gcs_resource_scheduler (#23105)
* refactor gcs_resource_manager

* fix lint error

* fix lint error

* fix compile error

* fix test

* fix test

* fix test

* add unit test

* refactor UpdateNodeNormalTaskResources

* fix comment

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-03-16 16:27:14 -07:00
Chen Shen
5a2ebc281c
[Scheduler] separate scheduler code to its own build target (#23124)
* wip

* comments

* fix build

* fix-test

* fix format
2022-03-14 23:23:58 -07:00
Tao Wang
10c03cb126
Migrating to flat hash map [GCS&util&common] (#22932)
Next step of #19220. This PR replaces unordered_map with flat_hash_map in most GCS code and some util & common modules.
The placement group part, which exposes user interfaces in Java/Python, is excluded, as it is a bit more complicated.

Follow-up PRs will migrate the core worker, placement group, and other modules.
2022-03-11 18:35:06 +09:00
Chen Shen
bc3f7a7684
[scheduling policy 3/n][rfc] Refactor SchedulingPolicy into interface and implementations (#22907)
* scheduling policy

* update

Co-authored-by: Gagandeep Singh <gdp.1807@gmail.com>
2022-03-08 18:47:56 -08:00
Chen Shen
3e3db8e9cd
[scheduler] hide StringIDMap under BaseSchedulingID (#22722)
* add

* address comments
2022-03-01 22:50:53 -08:00
Chen Shen
dfcb0f5de5
[clean up ClusterResourceScheduler 1/n] move IsSchedulable logic into ClusterResourceManager #22711
2022-02-28 20:37:56 -08:00
Lingxuan Zuo
94caac8722
Remove exporting symbols (#22623)
To hide the symbols of third-party libraries, this pull request reuses the internal namespace, which can be imported by any external native project without side effects.

In addition, we suggest that all contributors use third-party libraries only within ray scopes/namespaces, and that only ray::internal be exported.

More details in https://github.com/ray-project/ray/pull/22526

Mobius has applied this change in https://github.com/ray-project/mobius/pull/28.

Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>
2022-02-28 09:41:10 +08:00
Stephanie Wang
634ca9afdb
[core] Cleanup handling for nondeterministic object size during transfer (#22639)
Currently object transfers assume that the object size is fixed. This is a bad assumption during failures, especially with lineage reconstruction enabled and tasks with nondeterministic outputs.

This PR cleans up the handling and hopefully guards against two cases where the object size may change during a transfer:
1. The object manager's size information does not match the object in the local plasma store (due to async notifications). --> the object manager overwrites its own information if it finds that the physical object has a different size.
2. The receiver's created buffer size does not match the sender's object size. --> the receiver destroys the previous buffer and creates a new buffer with the correct size. This might cause some transient errors but eventually object transfer should succeed.

Unfortunately I couldn't trigger this from Python because it depends on some pretty specific timing conditions. However, I did add some unit tests for case 2 (this is the majority of the PR).
2022-02-25 09:39:14 -08:00
mwtian
839bc5019f
Fix building Windows wheels (#22388) (#22391)
This fixes the Windows wheel build issue on the master and releases/1.11.0 branches. If the issue happens more often we can try to run iwyu.
2022-02-15 15:24:10 -08:00
Lingxuan Zuo
ec62d7f510
[Streaming]Farewell : remove all of streaming related from ray repo. (#21770)
New repo url is https://github.com/ray-project/mobius

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-23 17:53:41 +08:00
Yi Cheng
e4ba51f25b
[core] Add GC for function table (#21509)
In Ray, functions are exported to the function table during runtime. But it's not cleaned up after use. This PR garbage collects the resource when there is no job/detached actor referencing the resource.

Ideally, we should move the function table imports/exports feature to core, so gcs function manager is introduced, and currently, it's for reference counting only.
2022-01-13 18:06:05 -08:00
SangBin Cho
f5fdbeb594
Refactor event tracker out of asio class (#21215)
This refactors the event tracker to be decoupled from the asio class.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-01-12 22:43:31 -08:00
Jiajun Yao
aec37d4b60
Add container utils (#21444)
- Add debug_string helper functions for common containers.
- Add map_find_or_die helper function
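
For illustration, helpers of this kind could look like the sketch below (C++17). The signatures and the output format are assumptions made for this example; they are not copied from the PR, and the real helpers may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: renders a map-like container as "{k1: v1, k2: v2}".
template <typename Map>
std::string debug_string(const Map &m) {
  std::ostringstream out;
  out << "{";
  bool first = true;
  for (const auto &[key, value] : m) {
    if (!first) {
      out << ", ";
    }
    out << key << ": " << value;
    first = false;
  }
  out << "}";
  return out.str();
}

// Hypothetical helper: returns the mapped value, aborting the process if the
// key is absent (instead of silently inserting or returning a default).
template <typename Map>
const typename Map::mapped_type &map_find_or_die(const Map &m,
                                                 const typename Map::key_type &key) {
  auto it = m.find(key);
  if (it == m.end()) {
    std::cerr << "Key not found in map " << debug_string(m) << std::endl;
    std::abort();
  }
  return it->second;
}
```

Aborting on a missing key makes an unexpected lookup failure visible immediately, and the debug string puts the container contents in the error message.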
2022-01-10 15:29:29 -08:00
Jiajun Yao
48a5208645
Refactor ObjectManager wait logic to WaitManager (#21369)
- This PR moves the `ObjectManager::Wait` related logic to a separate WaitManager class.
- Fix the wait hang issue by not relying on the async object location notification. Instead, check whether the wait is complete when the local object is added; at that time the object is guaranteed to be local.
2022-01-06 10:42:31 -05:00
Qing Wang
3c68370fcf
[Core] Cache job_configs instead of ray_namespace. (#21279)
We need more than just the ray_namespace config of a job. In this PR, we cache the job_configs instead of ray_namespaces, so that other PRs can use them (for example, #21249 needs the num_java_worker_pre_process item).

Also, before this PR, ray_namespaces_ cache will not be cleared, and we clear the cache in this PR.
2022-01-05 17:48:06 -08:00
mwtian
2410ec5ef0
[Core][Dashboard Pubsub 1/n] Allow a channel to have subscribers to a key and to the whole channel concurrently (#20954)
For actor channel, GCS clients subscribe to a single actor but dashboard subscribes to all actors. This change makes supporting this possible.

Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together.

Also, add the basic support for dashboard reporter pubsub.
2021-12-09 15:00:38 -08:00
SangBin Cho
5298a9046c
[Internal Observability] [Part 1] Centralize existing metrics to metric_defs.h (#20728)
This PR centralizes all existing metrics to `metric_defs.h`. 

Previously, each file relied on an implicit import of `metric_defs.h` within the stats module. After this PR, each file explicitly imports `metric_defs.h`.
2021-12-08 14:06:05 -08:00
Yi Cheng
442b1025cd
[1/gcs-mem-kv] Memory mode for internal kv (#20881)
This is part of the Redis removal work. In this PR we introduce a new mode for internal kv: memory mode.
There are two ways to address this:
- Update store client and use store client in internal kv
- Add memory table into internal kv directly.

The former is actually the better choice, since it puts everything related to storage into a lower level. But it's pretty hard to do now: internal kv uses hset/hget and the redis store client uses set/get, so the data would not be compatible and it would be a breaking change.

So the easier way is the second option, and that is what this PR does.

Next: use the flag for store client
2021-12-08 10:40:35 -08:00
Lixin Wei
96dc10a95a
[Core] Fix Crash in ObjectDirectory (#20540)
Here we met a crash in line 446's RAY_CHECK

d26c9e67e8/src/ray/object_manager/ownership_based_object_directory.cc (L441-L450)


We found that this is because we didn't set the node_id for dead nodes. If there are dead nodes and we call LookupRemoteConnectionInfo on one of them, this crash happens.

This PR fixes this crash.
2021-12-07 23:03:49 -08:00
Yi Cheng
ea1d081aac
[core] Simple chaos testing for asio (#19970)
Right now in Ray, a lot of edge cases related to grpc are not tested. This PR is a simple attempt to give developers a way to delay grpc requests. It can be used for manual testing and also in e2e tests, since it supports delays for specific grpc methods.

To use this feature, simply set the environment variable `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"`

This means, for `method1` it'll delay 10-20us, for method2 it'll delay 20-30us. For all the rest, it'll delay 200us.
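
The format of this value can be illustrated with a small parser. This is a sketch of the syntax as described above, not the parser used in the PR: `DelayRange`, `ParseDelaySpec`, and `LookupDelay` are hypothetical names, and skipping malformed entries is an assumption of this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// (min_us, max_us) delay range for one method.
using DelayRange = std::pair<int64_t, int64_t>;

// Hypothetical parser for a spec such as "method1=10:20,*=200:200".
// Entries without '=' or ':' are skipped; non-numeric bounds make std::stoll throw.
std::map<std::string, DelayRange> ParseDelaySpec(const std::string &spec) {
  std::map<std::string, DelayRange> result;
  std::istringstream stream(spec);
  std::string entry;
  while (std::getline(stream, entry, ',')) {
    const auto eq = entry.find('=');
    const auto colon = entry.find(':', eq == std::string::npos ? 0 : eq);
    if (eq == std::string::npos || colon == std::string::npos) {
      continue;
    }
    const int64_t min_us = std::stoll(entry.substr(eq + 1, colon - eq - 1));
    const int64_t max_us = std::stoll(entry.substr(colon + 1));
    result[entry.substr(0, eq)] = {min_us, max_us};
  }
  return result;
}

// Looks up the range for `method`, falling back to the "*" wildcard entry and
// then to "no delay" if neither is configured.
DelayRange LookupDelay(const std::map<std::string, DelayRange> &delays,
                       const std::string &method) {
  auto it = delays.find(method);
  if (it == delays.end()) {
    it = delays.find("*");
  }
  return it == delays.end() ? DelayRange{0, 0} : it->second;
}
```

For the example value, `method1` resolves to the range 10 to 20, `method2` to 20 to 30, and any other method to the wildcard range 200 to 200.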
2021-12-07 14:47:07 -08:00
Qing Wang
116bda8f05
[Core] Remove duplicated implementations of concurrency group executor. (#20467)
## Why are these changes needed?
ThreadPoolManager and FiberStateManager have the same functionality and logic. This PR aims to remove the duplicate implementations of them.

Add a ConcurrencyGroupExecutor class to do that logic. `ConcurrencyGroupExecutor<FiberState>` is used as FiberStateManager, `ConcurrencyGroupExecutor<BoundedExecutor>` is used as ThreadPoolManager.
2021-11-27 12:57:40 +08:00
Gagandeep Singh
f22a24aca4
Replace time based seed generation with absl::BitGen and absl::Uniform (#20696)
2021-11-24 14:36:35 -08:00
Guyang Song
53630ee03b
Revert "Revert "[runtime env] redefine runtime env to protobuf"" and fix windows compiling (#20692)
- Fix windows compiling and revert https://github.com/ray-project/ray/pull/20641
- Seems the pr https://github.com/ray-project/ray/pull/20670 can solve the windows compiling issue.
2021-11-24 09:01:01 -08:00
Alex Wu
9388d28233
Revert "[runtime env] redefine runtime env to protobuf" (#20641)
Reverts #19511

Breaks windows compilation
2021-11-22 13:11:30 -08:00
Guyang Song
ad56b9b432
[runtime env] redefine runtime env to protobuf (#19511)
2021-11-20 16:54:42 +08:00
Chen Shen
f02b53a810
[Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue (#20150)
Why are these changes needed?
This is the third PR in the stack that supports out-of-order execution for threaded/async actors. Previous PR #20149 Next PR #20160
At a high level, threaded and async actors already do not guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for detailed discussion.

In this PR, we implement an out-of-order queue that supports out-of-order execution. Conceptually it's very simple: it sends the requests as soon as the dependency is resolved.
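
The idea can be sketched as a queue in which a request becomes sendable as soon as its own dependencies are resolved, regardless of enqueue order. The class and method names below are hypothetical and chosen for this example; the actual interface in the PR may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of an out-of-order submit queue.
class OutOfOrderSubmitQueue {
 public:
  // Adds a request with the given sequence number; its dependencies start unresolved.
  void Emplace(uint64_t seq_no) {
    const bool inserted = pending_.emplace(seq_no, false).second;
    assert(inserted && "sequence numbers must be unique");
    (void)inserted;  // Avoids an unused-variable warning when asserts are disabled.
  }

  // Marks the request's dependencies as resolved.
  void MarkDependencyResolved(uint64_t seq_no) {
    auto it = pending_.find(seq_no);
    if (it != pending_.end()) {
      it->second = true;
    }
  }

  // Removes and returns every request that is ready to send. Requests that
  // were enqueued earlier but are still waiting do not block later ones.
  std::vector<uint64_t> PopAllSendable() {
    std::vector<uint64_t> ready;
    for (auto it = pending_.begin(); it != pending_.end();) {
      if (it->second) {
        ready.push_back(it->first);
        it = pending_.erase(it);
      } else {
        ++it;
      }
    }
    return ready;
  }

 private:
  std::map<uint64_t, bool> pending_;  // seq_no -> dependencies resolved?
};
```

A sequential queue would hold back request 2 until requests 0 and 1 had been sent; here request 2 is returned by `PopAllSendable` as soon as it is marked resolved.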
2021-11-16 10:48:49 -08:00
Tao Wang
507bd9186b
[Core]Make convertion between ray/grpc status more specific (#20047)
* [Core]Make convertion between ray/grpc status more specific

* per comments

* lint

* per comments

* use ABORT instead of UNKNOWN, add some tests

* lint

* lint
2021-11-10 00:48:05 -08:00
Lingxuan Zuo
97259e33b2
Relink grpc/absl for streaming.so (#20136)
To avoid exporting third-party library symbols globally, these absl/grpc libs have been linked into _streaming.so.

Side-effect:
Static variables might be uninitialized if core worker lib and streaming lib both use them.
2021-11-09 14:13:53 +08:00
mwtian
ef4b6e4648
[Core][GCS] remove gcs object manager (#19963)
2021-11-02 16:20:53 -07:00
SangBin Cho
99b5932d06
Add a simple node failure integration test + clean up spammy logs upon node failures (#19695)
* .

* Done

* clean up

* lint

* fix a bug

* lint

* fix issue

* Remove no-op from StartRayLog

* Addressed code review.
2021-10-29 18:42:35 -04:00
Lixin Wei
56301e34b2
[Refactor] Remove ServiceBased Abstraction (#19694)
## Why are these changes needed?

Prior to this PR, we have:
```cpp
class XxxAccessor {}
class ServiceBasedXxxAccessor : public XxxAccessor{}

class GcsClient {}
class ServiceBasedGcsClient : public GcsClient{}
```

However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.

I think this abstraction is unnecessary and makes development harder (I have to modify two files every time).

This PR removes all ServiceBasedXxx and moves its implementations to the base class.

Now we only have:
```cpp
class XxxAccessor {}
class GcsClient {}
```
2021-10-29 10:16:14 -07:00
mwtian
b238297bfb
[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687)
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.

The most important logic added is:
- GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
- GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
- GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}

Other files are modified to add boilerplate and plumbing, remove dead code, and clean up.
This PR can also be reviewed commit by commit. 418f065, 3279430 are cleanups. 028939c is a pure refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to [Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher #19600. 286161f parameterized gcs_server_test to test GCS pubsub. The rest of the commits add new logic.
All new logic is behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.
The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.

Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.
2021-10-28 01:18:54 +08:00
Yi Cheng
48fb86a978
[core] Fix the spilling back failure in case of node missing (#19564)
## Why are these changes needed?
When Ray spills back, it checks through GCS whether the node exists, so there is a race condition and the raylet sometimes crashes because of it.

This PR filters out unavailable nodes when selecting a node.

## Related issue number
#19438
2021-10-22 11:22:07 -07:00
mwtian
530f2d7c5e
[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher (#19600)
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
   -  Since it was decided that we want to use typed ID and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.
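
The forwarding idea can be sketched as below. This is a simplified illustration, not the `GcsPublisher` from the PR: `PublisherWrapper`, the `Sink` callback type, and the `migrated_channels` set are hypothetical, and plain strings are used here in place of typed IDs and messages.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch: a wrapper that forwards each publish call to one of two
// backends, depending on whether the channel has been migrated.
class PublisherWrapper {
 public:
  using Sink = std::function<void(const std::string &channel, const std::string &payload)>;

  PublisherWrapper(Sink redis_based, Sink gcs_based, std::set<std::string> migrated_channels)
      : redis_based_(std::move(redis_based)),
        gcs_based_(std::move(gcs_based)),
        migrated_channels_(std::move(migrated_channels)) {
    assert(redis_based_ && gcs_based_ && "both backends must be callable");
  }

  // Callers use one interface and do not need to know which backend is active.
  void Publish(const std::string &channel, const std::string &payload) const {
    if (migrated_channels_.count(channel) > 0) {
      gcs_based_(channel, payload);
    } else {
      redis_based_(channel, payload);
    }
  }

 private:
  Sink redis_based_;
  Sink gcs_based_;
  std::set<std::string> migrated_channels_;
};
```

Because call sites depend only on the wrapper, a channel can be moved to the new backend by changing the migration set, without touching the publishing code.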

Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.

Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.

This change should make no functional difference. If this looks ok, a similar change would be made for subscribers in GCS client.

## Related issue number
2021-10-22 10:52:36 -07:00
Yi Cheng
a3dc07b1ee
[core] Fix some legacy issues (#19392)
## Why are these changes needed?
There are some issues left from previous PRs.

- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags. 

## Related issue number
2021-10-15 18:06:01 -07:00
Matti Picus
9ca34c7192
add dependencies to BUILD.bazel and update windows bazel to 4.2.1 (#19132)
* add dependencies to BUILD.bazel and update windows bazel to 4.2.1

* fixes from review
2021-10-11 10:25:19 -07:00
Yi Cheng
056c3af699
[core] Update placement group retry implementation (#18842)
* exp backoff

* up

* format

* up

* up

* up

* up

* up

* format

* fix

* up

* format

* adjust ordering

* up

* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"

This reverts commit 2e99fb215f.

* up

* update

* format

* up

* format

* fix

* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""

This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.

* up

* format

* fix lint

* up

* up

* up

* up

* check

* add test1

* format

* up

* add test

* up

* up

* up

* fix

* up

* up

* up

* add test

* format

* up

* up

* fix lint

* format

* fix

* format

* fix

* up
2021-10-04 21:31:56 -07:00
Yi Cheng
16cf719aff
[core] hot fix of build failure (#18963)
2021-09-28 20:29:28 -07:00
Yi Cheng
e3dd1e3751
Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)" (#18871)
* Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)"

This reverts commit 8dd3057644.

* up
2021-09-28 05:53:52 -07:00