Commit graph

2514 commits

Author SHA1 Message Date
Eric Liang
28d4cfb039
[RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg (#19910) 2021-11-03 01:37:06 -07:00
Yi Cheng
99034f5af5
Revert "Revert "[core] Fix wrong local resource view in raylet (#1991… (#19996)
This reverts commit f1eedb15b6.

## Why are these changes needed?
The local node should not apply node resource updates from GCS to itself, since it maintains its own local view.

## Related issue number
#19438
2021-11-03 00:11:40 -07:00
mwtian
c0eeb36209
[Core][Pubsub] Support publishing / subscribing to Actor / Job / Node info via GCS (#19903)
## Why are these changes needed?
This is the first step in migrating Redis pubsub to be GCS pubsub based. Changes include:
- Remove `SubscribeAll()` API for Actor pubsub since it is only used in tests. Supporting both `Subscribe()` and `SubscribeAll()` APIs would be too complex without much return.
- Update `Subscribe()` API to accept a done status callback.
- Implement `SubscribeAll()` / `Unsubscribe()`(from channel) API in Ray pubsub.
- Implement using Ray pubsub for Actor, Job, Node info and Node resource publishing / subscribing.

GCS changes are tested with GCS server test in GCS pubsub mode.

## Related issue number
2021-11-02 22:47:05 -07:00
Lixin Wei
a369fc97cf
[scheduler] Remove isFeasible (#19931) 2021-11-02 17:40:46 -07:00
mwtian
ef4b6e4648
[Core][GCS] remove gcs object manager (#19963) 2021-11-02 16:20:53 -07:00
Edward Oakes
14d0889fbc
[serve] Rename BackendInfo -> DeploymentInfo (#19947) 2021-11-02 17:09:15 -05:00
SangBin Cho
f1eedb15b6
Revert "[core] Fix wrong local resource view in raylet (#19911)" (#19992)
This reverts commit a907168184.

## Why are these changes needed?

This PR seems to cause a large perf regression on `placement_group_test_2.py`: the test took 128 seconds before the PR was merged and 315 seconds after.

## Related issue number
2021-11-02 14:27:05 -07:00
Kai Yang
a33466e905
[Core] Fail inflight tasks on actor restarting (#19354)
## Why are these changes needed?

If an actor failover is triggered, but the RPC connection between the caller and the crashed actor instance is not disconnected automatically, subsequent tasks to the new actor instance may not be executed. The root cause is that the sequence numbers of tasks sent to the new actor instance do not start from 0. Details can be found in #14727.

This PR fixes it by ensuring all inflight actor tasks fail immediately when actor failover is detected (via actor state notifications).
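
A minimal sketch of the idea above, not the code from this PR: a caller-side queue that fails every inflight task as soon as a restart notification arrives. `ActorTaskQueue`, `Submit`, `OnReply` and `OnActorRestarting` are hypothetical names, and resetting the counter to 0 on restart is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical caller-side queue for a single actor (not Ray's actual class).
class ActorTaskQueue {
 public:
  using Callback = std::function<void(bool ok, const std::string &error)>;

  // Records a task as inflight and returns the sequence number it is sent with.
  uint64_t Submit(Callback on_done) {
    uint64_t seq = next_seq_++;
    inflight_.emplace(seq, std::move(on_done));
    return seq;
  }

  // Called when the actor replies for the given sequence number.
  void OnReply(uint64_t seq) {
    auto it = inflight_.find(seq);
    if (it == inflight_.end()) return;  // Already failed or unknown.
    Callback cb = std::move(it->second);
    inflight_.erase(it);
    cb(true, "");
  }

  // Called when an actor state notification reports a restart.
  // All inflight tasks fail immediately. Assumption of this sketch: the
  // counter is reset so the new instance sees sequence numbers from 0.
  void OnActorRestarting() {
    std::map<uint64_t, Callback> failed;
    failed.swap(inflight_);
    next_seq_ = 0;
    for (auto &entry : failed) entry.second(false, "actor restarted");
  }

  std::size_t NumInflight() const { return inflight_.size(); }

 private:
  uint64_t next_seq_ = 0;
  std::map<uint64_t, Callback> inflight_;
};
```

Swapping the map out before invoking the callbacks keeps the queue consistent even if a callback submits a new task.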

## Related issue number

closes #14727
2021-11-02 11:03:12 +08:00
Yi Cheng
a907168184
[core] Fix wrong local resource view in raylet (#19911)
## Why are these changes needed?
When GCS broadcasts a node resource change, the raylet also applies it to the local node, which leaves the local node instance and nodes_ inconsistent.

1. The local node has used all of some pg resource
2. GCS broadcasts node resources
3. The local node now appears to have resources
4. The scheduler picks the local node
5. The local node can't schedule the task
6. Since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs
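
One way to break the loop in the steps above is for the raylet to skip broadcasts that describe its own node. The following is only a sketch under that assumption; `ClusterResourceView` and its methods are hypothetical and much simpler than the real raylet state.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified resource view (illustrative names only).
using ResourceMap = std::map<std::string, double>;

class ClusterResourceView {
 public:
  explicit ClusterResourceView(std::string local_node_id)
      : local_node_id_(std::move(local_node_id)) {}

  // Local accounting: only the node itself updates its own entry.
  void SetLocalAvailable(const ResourceMap &available) {
    nodes_[local_node_id_] = available;
  }

  // Applies a broadcast from GCS. Returns false when the update is skipped
  // because it describes the local node.
  bool OnGcsResourceUpdate(const std::string &node_id,
                           const ResourceMap &available) {
    if (node_id == local_node_id_) return false;
    nodes_[node_id] = available;
    return true;
  }

  // Returns the available amount, or 0 for an unknown node or resource.
  double Available(const std::string &node_id,
                   const std::string &resource) const {
    auto node = nodes_.find(node_id);
    if (node == nodes_.end()) return 0.0;
    auto res = node->second.find(resource);
    return res == node->second.end() ? 0.0 : res->second;
  }

 private:
  std::string local_node_id_;
  std::map<std::string, ResourceMap> nodes_;
};
```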

## Related issue number
#19438
2021-11-01 19:52:03 -07:00
Edward Oakes
ee57025be6
[serve] Rename BackendConfig -> DeploymentConfig (#19923) 2021-11-01 10:24:02 -07:00
Edward Oakes
e507b7ba6e
[serve] Rename BackendVersion -> DeploymentVersion (#19798) 2021-10-31 10:27:19 -05:00
Stephanie Wang
630a8cacb3
Revert "[core] Fail objects when pull/reconstruction hangs (#19789)" (#19904)
This reverts commit e6d60d7376.
2021-10-30 10:54:39 -07:00
chenk008
57363995f3
[runtime env] Move container related code to runtime env (#19067) 2021-10-29 16:31:11 -07:00
SangBin Cho
99b5932d06
Add a simple node failure integration test + clean up spammy logs upon node failures (#19695)
* .

* Done

* clean up

* lint

* fix a bug

* lint

* fix issue

* Remove no-op from StartRayLog

* Addressed code review.
2021-10-29 18:42:35 -04:00
SangBin Cho
f2b831f50f
[Placement Group] Fix the implicit value change from uint32_t -> uint64_t for pg scheduling retry (#19882)
* .

* done

* done
2021-10-29 12:16:53 -07:00
Philipp Moritz
0a5942d8b0
[Documentation] Fix quotes for windows installations (#19859)
* [Documentation] Fix quotes for windows installations

* update

* formatting
2021-10-29 10:54:38 -07:00
Lixin Wei
56301e34b2
[Refactor] Remove ServiceBased Abstraction (#19694)
## Why are these changes needed?

Prior to this PR, we have:
```cpp
class XxxAccessor {};
class ServiceBasedXxxAccessor : public XxxAccessor {};

class GcsClient {};
class ServiceBasedGcsClient : public GcsClient {};
```

However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.

I think this abstraction is unnecessary and makes development harder (I have to modify two files every time).

This PR removes all ServiceBasedXxx and moves its implementations to the base class.

Now we only have:
```cpp
class XxxAccessor {};
class GcsClient {};
```
2021-10-29 10:16:14 -07:00
SangBin Cho
4586ced5e4
Limit the max number of resource usage print (#19828)
* done

* done

* addressed code review

* done
2021-10-29 07:24:14 -07:00
SangBin Cho
16dcff4091
[Core/RuntimeEnv] Fix runtime environment hanging issues. (#19823)
* done

* Add a right test

* Fix unit tests

* fix issues
2021-10-29 07:01:56 -07:00
Stephanie Wang
e6d60d7376
[core] Fail objects when pull/reconstruction hangs (#19789) 2021-10-28 23:34:51 -07:00
Yi Cheng
68ec652be7
[gcs] New option to increase gcs grpc client threads and fix issues in hybrid scheduling (#19663)
## Why are these changes needed?

- Since broadcasting is moving to gRPC, introduce an option to increase the number of client-side threads
- For hybrid scheduling, ignore the threshold if the GCS-based actor scheduler is enabled

With these fixes, the actor creation rate is > 600 actors/s vs. ~140 actors/s

## Related issue number
2021-10-28 22:40:18 -07:00
Eric Liang
1ba07439fc
Reduce log level of concurrent actor creation 2021-10-28 20:44:14 -07:00
SangBin Cho
96fc875a89
[Core] Improve scheduling observability and fix wrong resource deadlock report message. (#19746) 2021-10-28 11:42:21 -07:00
SangBin Cho
39486ef08c
[Core] Fix the resource leak if custom resources don't exist. #19837
Why are these changes needed?
The current logic can cause a resource leak if AllocateTaskResourceInstances is requested with custom resources that don't exist on the local node. The original assumption was that the caller would free resources when it returns false, but that is an error-prone API, and it turns out that we don't do this anywhere.
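
A rough sketch of an all-or-nothing allocation that avoids this kind of leak; it is not the actual change. `TryAllocate` and `ResourceMap` are hypothetical names. The request is validated before anything is mutated, so a `false` return leaves nothing for the caller to free.

```cpp
#include <map>
#include <string>

using ResourceMap = std::map<std::string, double>;

// Hypothetical allocator: either allocates everything in 'request' or
// leaves 'available' untouched.
bool TryAllocate(ResourceMap &available, const ResourceMap &request) {
  // First pass: validate without mutating, so failure has no side effects.
  for (const auto &entry : request) {
    auto it = available.find(entry.first);
    if (it == available.end() || it->second < entry.second) return false;
  }
  // Second pass: commit. Every key is known to exist with enough capacity.
  for (const auto &entry : request) {
    available[entry.first] -= entry.second;
  }
  return true;
}
```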

Related issue number
Closes #17044
2021-10-28 11:00:34 -07:00
Eric Liang
f60d312259
Try fixing reference counting issue with manual _owner assignment (#19734) 2021-10-28 02:26:35 -07:00
SangBin Cho
c414eb20d5
[Internal Observability] Improve the per task/actor resource usage visibility (#19782)
* prototype done

* done
2021-10-28 00:21:22 -07:00
Chen Shen
224ed0fa5c
[Core][CoreWorker] graceful shutdown if GetCoreWorker is null (#19598)
There are cases where the language frontend calls GetCoreWorker() after the worker has already been shut down. Currently this results in a crash and causes confusion.

(pid=3714) [2021-10-21 10:50:23,596 C 3714 33544237] core_worker.cc:194:  Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
(pid=3714) *** StackTrace Information ***
(pid=3714)     ray::GetCallTrace()
(pid=3714)     ray::SpdLogMessage::Flush()
(pid=3714)     ray::SpdLogMessage::~SpdLogMessage()
(pid=3714)     ray::RayLog::~RayLog()
(pid=3714)     ray::core::CoreWorkerProcess::EnsureInitialized()
(pid=3714)     ray::core::CoreWorkerProcess::GetCoreWorker()
(pid=3714)     __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
(pid=3714)     _PyMethodDef_RawFastCallKeywords
(pid=3714)     _PyMethodDescr_FastCallKeywords
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     property_descr_get
(pid=3714)     _PyObject_GenericGetAttrWithDict
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     _PyEval_EvalCodeWithName
(pid=3714)     _PyFunction_FastCallKeywords
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     method_call
(pid=3714)     PyObject_Call
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     method_call
(pid=3714)     PyObject_Call
(pid=3714)     t_bootstrap
(pid=3714)     pythread_wrapper
(pid=3714)     _pthread_start
(pid=3714)     thread_start
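
To illustrate the alternative to the check failure in the trace above, here is a minimal sketch in which the accessor returns a null pointer and the caller handles it. The types below are hypothetical stand-ins, not Ray's real `CoreWorkerProcess`; `TryGetCoreWorker` and `GetWorkerIdOrDefault` are invented for illustration, and synchronization is omitted for brevity.

```cpp
#include <memory>

// Hypothetical stand-in for the worker object.
struct CoreWorker {
  int worker_id = 7;
};

// Hypothetical process-wide holder (no locking in this sketch).
class CoreWorkerProcess {
 public:
  static void Initialize() { Instance() = std::make_shared<CoreWorker>(); }
  static void Shutdown() { Instance().reset(); }

  // Returns nullptr instead of aborting when the worker is not initialized
  // yet or has already been shut down.
  static std::shared_ptr<CoreWorker> TryGetCoreWorker() { return Instance(); }

 private:
  static std::shared_ptr<CoreWorker> &Instance() {
    static std::shared_ptr<CoreWorker> worker;
    return worker;
  }
};

// Frontend-side caller: falls back to -1 when there is no worker.
int GetWorkerIdOrDefault() {
  std::shared_ptr<CoreWorker> worker = CoreWorkerProcess::TryGetCoreWorker();
  return worker ? worker->worker_id : -1;
}
```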
2021-10-27 23:11:53 -07:00
Alex Wu
46965e7672
[ARM] Use uint64_t instead of unsigned long (#13774)
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-10-27 21:08:25 -07:00
Yi Cheng
98961d1ee2
[core] Fix the wrong error message in gcs for worker exits (#19774) 2021-10-27 12:55:27 -07:00
mwtian
b238297bfb
[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687)
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.

The most important logic added is:
- GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
- GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
- GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}

Other files are modified to add boilerplate and plumbing, remove dead code, and clean up.
This PR can also be reviewed commit by commit. 418f065, 3279430 are cleanups. 028939c is a pure-refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to [Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher #19600. 286161f parameterized gcs_server_test to test GCS pubsub. The rest of commits have new logic added.
All new logic is behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.
The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.

Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.
2021-10-28 01:18:54 +08:00
SangBin Cho
418b4a94e6
[Core] Remove legacy scheduler code (#19780)
* Remove unused worker APIs

* Remove unused scheduling resources.

* lint
2021-10-27 06:57:08 -07:00
SangBin Cho
3e81506d90
[Threaded actor] Fix threaded actor race condition (#19751) 2021-10-26 15:17:53 -07:00
Yi Cheng
2ec9a70e24
[gcs] Fix the regression of enabling grpc based broadcasting in actor scheduling (#19664)
## Why are these changes needed?
Previously, we didn't send a request if there was already one in flight. But this is actually bad, because it prevents the raylet from getting the latest information. For example, if the request needs 200ms to arrive at the raylet, the raylet will lose one update. In this case, the next request will arrive after 200 + 100 + (in flight time) ms. So we should still send the request.

TODO:
- Push the snapshot to raylet if the message is lost.
- Handle message loss in raylet better.


## Related issue number
#19438
2021-10-26 12:00:37 -07:00
SangBin Cho
00ea716ada
Revert "Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)" (#19724)" (#19736)
This reverts commit d453afbab8.
2021-10-26 08:25:09 -07:00
SangBin Cho
e914ea930d
[Core] Stop reporting tasks spec to GCS that are unnecessary #19699 (#19699)
This RPC is from legacy code and is not needed anymore (the task spec is already in the actor table), but it adds a large number of keys to Redis.

Below is the sum of the byte sizes (? I am not sure if it is the byte size, but I grabbed the length of the value when I queried Redis) for each prefix when running many_ppo. As you can see, Task& and Task take up a large share although they are not really used.

In [82]: b
Out[82]:
defaultdict(int,
            {b'WORKE': 1080864,
             b'ACTOR': 1470931,
             b'TASK&': 1020646,
             b'TASK:': 870551,
             b'PROFI': 360000,
             b'PLACE': 10107,
             b'JOB:\x01': 8,
             b'JOB:\x04': 8,
             b'NODE:': 99,
             b'NODE_': 126,
             b'INTER': 44,
             b'JOB:\x03': 8,
             b'redis': 16,
             b'JOB:\x02': 8,
             b'JOB:\x05': 8})
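
For reference, a table like the one above could be produced by grouping keys by their first five bytes and summing the value lengths. This is only a sketch of that aggregation: the five-byte prefix is inferred from the output, `SumValueSizesByPrefix` is a hypothetical helper, and the key/value pairs are assumed to have been fetched from Redis already.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sums value lengths grouped by the first 'prefix_len' bytes of each key.
// Keys shorter than 'prefix_len' are grouped under the whole key.
std::map<std::string, std::size_t> SumValueSizesByPrefix(
    const std::vector<std::pair<std::string, std::string>> &entries,
    std::size_t prefix_len = 5) {
  std::map<std::string, std::size_t> totals;
  for (const auto &entry : entries) {
    totals[entry.first.substr(0, prefix_len)] += entry.second.size();
  }
  return totals;
}
```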
2021-10-26 04:17:58 -07:00
SangBin Cho
ba61c436ea
Revert "Try enabling event stats by default (#19650)" (#19735)
This reverts commit 6081cf870e.
2021-10-26 14:33:40 +09:00
SangBin Cho
d453afbab8
Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)" (#19724)
This reverts commit e3ced0e59e.
2021-10-26 09:14:25 +09:00
SangBin Cho
544f774245
[Autoscaler/Core] Drain node API (#19350)
* Initial version done. Graceful shutdown  is possible with direct raylet RPCs

* .

* .

* ip

* Done.

* done tests might fail

* fix lint + cpp tests

* fix 2

* Fix issues.

* Addressed code review.

* Fix another cpp test failure

* completed

* Skip windows tests

* Update the comment

* complete

* addressed code review.
2021-10-25 14:57:50 -07:00
DK.Pino
e3ced0e59e
[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)
* fixed

* lint

* add cxx ut

* fix comment

* Revert "fix comment"

This reverts commit 32ea2558166a7674d7efe2e0c0a66ea7409c7d99.

* fix comment
2021-10-25 14:15:36 -07:00
Eric Liang
6081cf870e
Try enabling event stats by default (#19650) 2021-10-25 12:19:34 -07:00
Jiajun Yao
a7b219fea1
[Core] Don't unpickle and run functions exported by other jobs (#19576) 2021-10-22 17:13:20 -07:00
Gagandeep Singh
358aa57474
Fixed usage of `cv_.wait_for` (#19582)
* Fixed usage of cv.wait_for

* Changed method to calculate remaining time out

* Modify timeout_ms -> remaining_timeout_ms
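
For context, a common pattern for this kind of fix is to compute a deadline once and pass only the remaining time to each `wait_for` call, so spurious wakeups cannot extend the total wait. The sketch below shows that pattern with the standard library only; `ReadyFlag` is a hypothetical class and is not the code touched by this PR.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical helper that waits for a flag with an overall timeout.
class ReadyFlag {
 public:
  void Set() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      ready_ = true;
    }
    cv_.notify_all();
  }

  // Waits until Set() has been called or timeout_ms has elapsed in total.
  // Returns true if the flag was set.
  bool WaitFor(int64_t timeout_ms) {
    const auto deadline = std::chrono::steady_clock::now() +
                          std::chrono::milliseconds(timeout_ms);
    std::unique_lock<std::mutex> lock(mu_);
    while (!ready_) {
      const auto now = std::chrono::steady_clock::now();
      if (now >= deadline) return false;
      // Only wait for what is left, not for the full timeout again.
      const auto remaining_timeout = deadline - now;
      cv_.wait_for(lock, remaining_timeout);
    }
    return true;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool ready_ = false;
};
```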
2021-10-22 16:23:13 -07:00
Yi Cheng
48fb86a978
[core] Fix the spilling back failure in case of node missing (#19564)
## Why are these changes needed?
When Ray spills back a task, it checks through GCS whether the node exists, so there is a race condition that sometimes crashes the raylet.

This PR filters out unavailable nodes when selecting the node.
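
As a rough illustration of the filtering described above (not the actual scheduler code), the selection can skip any candidate that is no longer in the set of known-alive nodes. `SelectSpillbackNode` and both parameters are hypothetical.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the first candidate that is still alive, or an empty string if
// none of the candidates is available.
std::string SelectSpillbackNode(const std::vector<std::string> &candidates,
                                const std::set<std::string> &alive_nodes) {
  for (const auto &node_id : candidates) {
    if (alive_nodes.count(node_id) > 0) return node_id;
  }
  return "";
}
```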

## Related issue number
#19438
2021-10-22 11:22:07 -07:00
mwtian
530f2d7c5e
[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher (#19600)
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
   -  Since it was decided that we want to use typed ID and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.
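
The forwarding described above could look roughly like the sketch below. It is not the real `GcsPublisher`: `RecordingBackend`, `PublisherWrapper`, `ActorUpdate` and `MarkMigrated` are hypothetical, and both backends are reduced to objects that record what they were asked to publish.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical backend that only records published messages.
struct RecordingBackend {
  std::vector<std::string> published;
  void Publish(const std::string &channel, const std::string &payload) {
    published.push_back(channel + ":" + payload);
  }
};

// Hypothetical typed message for the actor channel.
struct ActorUpdate {
  std::string actor_id;
  std::string state;
};

// Forwards each publish to the Redis-based or GCS-based backend depending
// on whether the channel has been migrated.
class PublisherWrapper {
 public:
  PublisherWrapper(RecordingBackend *redis_based, RecordingBackend *gcs_based)
      : redis_based_(redis_based), gcs_based_(gcs_based) {}

  void MarkMigrated(const std::string &channel) { migrated_.insert(channel); }

  // Typed entry point: callers pass a message struct, not raw strings.
  void PublishActor(const ActorUpdate &update) {
    Route("ACTOR").Publish("ACTOR", update.actor_id + "=" + update.state);
  }

 private:
  RecordingBackend &Route(const std::string &channel) {
    return migrated_.count(channel) > 0 ? *gcs_based_ : *redis_based_;
  }

  RecordingBackend *redis_based_;
  RecordingBackend *gcs_based_;
  std::set<std::string> migrated_;
};
```

Keeping the routing decision in one place lets channels be switched one at a time without touching the call sites.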

Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.

Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.

This change should make no functionality difference. If this looks ok, a similar change would be made for subscribers in GCS client.

## Related issue number
2021-10-22 10:52:36 -07:00
architkulkarni
030acf3857
[Serve] [Serve Autoscaler] Add upscale and downscale delay (#19290) 2021-10-22 10:33:28 -05:00
Stephanie Wang
499d6e9fc1
Turn on reconstruction tests in CI (#19497) 2021-10-21 22:34:44 -07:00
Yi Cheng
59b2f1f3f2
[gcs] Update select nodes to save cpu utilization (#19608)
## Why are these changes needed?
Recently we found that GCS uses a lot of CPU when scheduling actors because the code is not well organized. This PR improves the SelectNodes function. According to profiling of the many-nodes actor test, 50% of the CPU is wasted and could be saved here.

## Related issue number
2021-10-21 22:15:17 -07:00
SangBin Cho
cea7fda41a
Revert "Revert "[Dashboard] Disable unnecessary event messages. (#19490)" (#19574)" (#19577)
This reverts commit 699c5aeac6.
2021-10-21 15:36:22 -07:00
SangBin Cho
19e3280824
[Core] Fix shutdown Core worker crash when pg is removed. (#19549)
* fix core worker crash

* remove file

* done
2021-10-21 14:30:54 -07:00
Eric Liang
eb24b08ced
Relax the check on object size changing 2021-10-21 11:05:54 -07:00