Commit graph

2488 commits

Author SHA1 Message Date
Chen Shen
224ed0fa5c
[Core][CoreWorker] graceful shutdown if GetCoreWorker is null (#19598)
There are cases where the language frontend calls GetCoreWorker() after the worker has already been shut down. Currently this results in a crash and causes confusion.

(pid=3714) [2021-10-21 10:50:23,596 C 3714 33544237] core_worker.cc:194:  Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
(pid=3714) *** StackTrace Information ***
(pid=3714)     ray::GetCallTrace()
(pid=3714)     ray::SpdLogMessage::Flush()
(pid=3714)     ray::SpdLogMessage::~SpdLogMessage()
(pid=3714)     ray::RayLog::~RayLog()
(pid=3714)     ray::core::CoreWorkerProcess::EnsureInitialized()
(pid=3714)     ray::core::CoreWorkerProcess::GetCoreWorker()
(pid=3714)     __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
(pid=3714)     _PyMethodDef_RawFastCallKeywords
(pid=3714)     _PyMethodDescr_FastCallKeywords
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     property_descr_get
(pid=3714)     _PyObject_GenericGetAttrWithDict
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     _PyEval_EvalCodeWithName
(pid=3714)     _PyFunction_FastCallKeywords
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     method_call
(pid=3714)     PyObject_Call
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     method_call
(pid=3714)     PyObject_Call
(pid=3714)     t_bootstrap
(pid=3714)     pythread_wrapper
(pid=3714)     _pthread_start
(pid=3714)     thread_start
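A minimal sketch of the guard pattern described above, with hypothetical names (the real fix lives in the C++ CoreWorkerProcess): the accessor reports that the worker is gone instead of failing a fatal check, so the frontend can shut down gracefully.

```python
# Hypothetical illustration only -- not Ray's actual API.
import threading

_core_worker = object()   # stands in for the process-wide worker instance
_lock = threading.Lock()

def get_core_worker():
    """Return the core worker, or None if it is not initialized or already shut down."""
    with _lock:
        return _core_worker

def shutdown():
    global _core_worker
    with _lock:
        _core_worker = None   # later calls see None instead of hitting a fatal check

worker = get_core_worker()
if worker is None:
    pass  # caller handles the already-shut-down case gracefully
```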
2021-10-27 23:11:53 -07:00
Alex Wu
46965e7672
[ARM] Use uint64_t instead of unsigned long (#13774)
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-10-27 21:08:25 -07:00
Yi Cheng
98961d1ee2
[core] Fix the wrong error message in gcs for worker exits (#19774) 2021-10-27 12:55:27 -07:00
mwtian
b238297bfb
[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687)
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.

The most important logic added:
- GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
- GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
- GCS wrapper for the ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}

Other files are modified to add boilerplate and plumbing, remove dead code, and clean up.
This PR can also be reviewed commit by commit: 418f065 and 3279430 are cleanups; 028939c is a pure refactoring of how GCS clients subscribe to GCS updates and should not change behavior yet, similar to [Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher #19600; 286161f parameterized gcs_server_test to test GCS pubsub. The rest of the commits add new logic.
All new logic is behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.
The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.

The next step is to support SubscribeAll entities for a channel in ray::pubsub, and to test migrating more channels.
2021-10-28 01:18:54 +08:00
SangBin Cho
418b4a94e6
[Core] Remove legacy scheduler code (#19780)
* Remove unused worker APIs

* Remove unused scheduling resources.

* lint
2021-10-27 06:57:08 -07:00
SangBin Cho
3e81506d90
[Threaded actor] Fix threaded actor race condition (#19751) 2021-10-26 15:17:53 -07:00
Yi Cheng
2ec9a70e24
[gcs] Fix the regression of enabling grpc based broadcasting in actor scheduling (#19664)
## Why are these changes needed?
Previously, we didn't send a request if there was already an in-flight request. This is actually bad, because it prevents the raylet from getting the latest information. For example, if the request needs 200ms to arrive at the raylet, the raylet will lose one update. In this case, the next request will arrive after 200 + 100 + (in-flight time) ms. So we should still send the request.
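For illustration, a hedged sketch of the policy change (hypothetical names, not the Ray code): send the latest update even while an earlier request is still in flight, so the raylet never misses one.

```python
class ResourceBroadcaster:
    def __init__(self, send_rpc):
        self.send_rpc = send_rpc    # performs the async RPC and calls on_done afterwards
        self.in_flight = False

    def broadcast(self, update):
        # Old behavior: `if self.in_flight: return` -- the raylet misses this update
        # and only converges after the next periodic send (~200 + 100 ms later).
        # New behavior: always send, so the raylet sees the latest state.
        self.in_flight = True
        self.send_rpc(update, on_done=self._on_done)

    def _on_done(self):
        self.in_flight = False
```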

TODO:
- Push the snapshot to raylet if the message is lost.
- Handle message loss in raylet better.


## Related issue number
#19438
2021-10-26 12:00:37 -07:00
SangBin Cho
00ea716ada
Revert "Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)" (#19724)" (#19736)
This reverts commit d453afbab8.
2021-10-26 08:25:09 -07:00
SangBin Cho
e914ea930d
[Core] Stop reporting unnecessary task specs to GCS #19699 (#19699)
This RPC is from legacy code and is not needed anymore (the task spec is already in the actor table), but it adds quite a large number of keys to Redis.

Below is the sum of the value sizes (probably bytes; I grabbed the length of each value when I queried Redis) for each key prefix when running many_ppo. As you can see, TASK& and TASK: account for a large share although they are not really used.

defaultdict(int,
            {b'WORKE': 1080864,
             b'ACTOR': 1470931,
             b'TASK&': 1020646,
             b'TASK:': 870551,
             b'PROFI': 360000,
             b'PLACE': 10107,
             b'JOB:\x01': 8,
             b'JOB:\x04': 8,
             b'NODE:': 99,
             b'NODE_': 126,
             b'INTER': 44,
             b'JOB:\x03': 8,
             b'redis': 16,
             b'JOB:\x02': 8,
             b'JOB:\x05': 8})
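For reference, a sketch of how per-prefix numbers like these could be gathered with redis-py (assumed local address; MEMORY USAGE needs Redis 4.0+ and reports allocator bytes rather than the raw value length used above):

```python
from collections import defaultdict
import redis

r = redis.Redis(host="127.0.0.1", port=6379)   # assumed address of the GCS Redis shard

sizes = defaultdict(int)
for key in r.scan_iter(count=1000):
    prefix = bytes(key[:5])                    # group by the 5-byte prefix, as in the table
    sizes[prefix] += r.memory_usage(key) or 0  # MEMORY USAGE <key>

for prefix, total in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(prefix, total)
```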
2021-10-26 04:17:58 -07:00
SangBin Cho
ba61c436ea
Revert "Try enabling event stats by default (#19650)" (#19735)
This reverts commit 6081cf870e.
2021-10-26 14:33:40 +09:00
SangBin Cho
d453afbab8
Revert "[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)" (#19724)
This reverts commit e3ced0e59e.
2021-10-26 09:14:25 +09:00
SangBin Cho
544f774245
[Autoscaler/Core] Drain node API (#19350)
* Initial version done. Graceful shutdown  is possible with direct raylet RPCs

* .

* .

* ip

* Done.

* done tests might fail

* fix lint + cpp tests

* fix 2

* Fix issues.

* Addressed code review.

* Fix another cpp test failure

* completed

* Skip windows tests

* Update the comment

* complete

* addressed code review.
2021-10-25 14:57:50 -07:00
DK.Pino
e3ced0e59e
[Core] [Placement Group] Fix bundle reconstruction when raylet fo after gcs fo (#19452)
* fixed

* lint

* add cxx ut

* fix comment

* Revert "fix comment"

This reverts commit 32ea2558166a7674d7efe2e0c0a66ea7409c7d99.

* fix comment
2021-10-25 14:15:36 -07:00
Eric Liang
6081cf870e
Try enabling event stats by default (#19650) 2021-10-25 12:19:34 -07:00
Jiajun Yao
a7b219fea1
[Core] Don't unpickle and run functions exported by other jobs (#19576) 2021-10-22 17:13:20 -07:00
Gagandeep Singh
358aa57474
Fixed usage of `cv_.wait_for` (#19582)
* Fixed usage of cv.wait_for

* Changed method to calculate remaining timeout

* Modify timeout_ms -> remaining_timeout_ms
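The pitfall and the fix, sketched in Python for illustration (the actual change is in the C++ cv_.wait_for call): recompute the remaining timeout from a fixed deadline on every wakeup instead of reusing the full timeout.

```python
import threading
import time

cond = threading.Condition()
ready = False

def wait_until_ready(timeout_s):
    deadline = time.monotonic() + timeout_s
    with cond:
        while not ready:
            remaining = deadline - time.monotonic()   # recomputed on every wakeup
            if remaining <= 0:
                return False                          # timed out
            cond.wait(remaining)                      # not cond.wait(timeout_s)
        return True
```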
2021-10-22 16:23:13 -07:00
Yi Cheng
48fb86a978
[core] Fix the spilling back failure in case of node missing (#19564)
## Why are these changes needed?
When Ray spills back a task, it checks through the GCS whether the target node exists or not, so there is a race condition and sometimes the raylet crashes because of this.

This PR filters out nodes that are not available when selecting the node.
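Conceptually the selection change looks like this (hypothetical names, a sketch rather than the raylet code):

```python
def select_spillback_node(candidates, alive_node_ids):
    # Only consider nodes the GCS still reports as alive, so a node that has
    # gone away can no longer be selected and crash the raylet.
    alive = [n for n in candidates if n.node_id in alive_node_ids]
    return max(alive, key=lambda n: n.available_cpus, default=None)
```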

## Related issue number
#19438
2021-10-22 11:22:07 -07:00
mwtian
530f2d7c5e
[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher (#19600)
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
   -  Since it was decided that we want to use typed ID and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.
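A rough sketch of the wrapper pattern, with hypothetical Python names (the real `GcsPublisher` is C++): one facade with typed publish methods that forwards to whichever backend the migration flag selects.

```python
class GcsPublisherSketch:
    """Forwards typed publishes to the Redis-based or GCS-based backend."""

    def __init__(self, redis_pubsub=None, grpc_publisher=None):
        # Exactly one backend is expected, chosen by the migration flag per channel.
        self._redis = redis_pubsub
        self._grpc = grpc_publisher

    def publish_actor(self, actor_id, actor_table_data):
        if self._grpc is not None:       # GCS-based path
            self._grpc.publish("ACTOR", actor_id, actor_table_data)
        else:                            # Redis-based path
            self._redis.publish("ACTOR", actor_id, actor_table_data)
```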

Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.

Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.

This change should make no functional difference. If this looks OK, a similar change will be made for subscribers in the GCS client.

## Related issue number
2021-10-22 10:52:36 -07:00
architkulkarni
030acf3857
[Serve] [Serve Autoscaler] Add upscale and downscale delay (#19290) 2021-10-22 10:33:28 -05:00
Stephanie Wang
499d6e9fc1
Turn on reconstruction tests in CI (#19497) 2021-10-21 22:34:44 -07:00
Yi Cheng
59b2f1f3f2
[gcs] Update select nodes to save cpu utilization (#19608)
## Why are these changes needed?
Recently we found that GCS uses a lot of CPU when scheduling actors, and it's because the code is not well organized. This PR improves the SelectNodes function. From profiling the many-nodes actor test, 50% of the CPU was wasted and can be saved here.

## Related issue number
2021-10-21 22:15:17 -07:00
SangBin Cho
cea7fda41a
Revert "Revert "[Dashboard] Disable unnecessary event messages. (#19490)" (#19574)" (#19577)
This reverts commit 699c5aeac6.
2021-10-21 15:36:22 -07:00
SangBin Cho
19e3280824
[Core] Fix shutdown Core worker crash when pg is removed. (#19549)
* fix core worker crash

* remove file

* done
2021-10-21 14:30:54 -07:00
Eric Liang
eb24b08ced
Relax the check on object size changing 2021-10-21 11:05:54 -07:00
SangBin Cho
7cfd170d01
Temporarily disable event framework for 1.8 #19587
Although the event framework seems to work, it has an issue: it prints ERROR-severity events to stderr, which is eventually streamed to the driver. Before we ship this to production, we should fix this issue. To have enough time to fix it, we will turn the feature off temporarily.
2021-10-21 09:51:02 -07:00
SangBin Cho
9000f41aa6
[Nightly Test] Support memory profiling on Ray + implement memory monitor for nightly tests (#19539)
* random fixes

* Done

* done

* update the doc

* doc lint fix

* .

* .
2021-10-21 07:37:05 -07:00
Qing Wang
048e7f7d5d
[Core] Port concurrency groups with asyncio (#18567)
## Why are these changes needed?
This PR aims to port concurrency groups functionality with asyncio for Python.

### API
```python
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The annotation above the actor class `AsyncActor` declares that this actor has 2 concurrency groups with the given max concurrencies, plus a default concurrency group. Every concurrency group has an async event loop and a Python thread to execute the methods defined in it.

Method `f1` will be invoked in the `io` concurrency group, `f2` in `io`, `f3` in `compute`, and so on.
Note that `f5` and `__init__` will be invoked in the default concurrency group.

In the following call, `f2` will be invoked in the `compute` concurrency group, since dynamically specifying the group takes higher priority.
```python
a.f2.options(concurrency_group="compute").remote()
```

### Implementation
The implementation is straightforward:
- Before, an asyncio actor had 1 event loop bound to 1 Python thread. Now we create 1 event loop bound to 1 Python thread for every concurrency group of the asyncio actor.
- Before, there was 1 fiber state for every caller in the asyncio actor. Now we create a FiberStateManager for every caller, and the FiberStateManager manages the fiber states for the concurrency groups.
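A simplified sketch of that idea (illustrative only, not Ray's internals): each concurrency group owns an event loop running on its own thread, and calls are dispatched to the loop of the target group.

```python
import asyncio
import threading

class ConcurrencyGroups:
    def __init__(self, names):
        self.loops = {}
        for name in list(names) + ["default"]:
            loop = asyncio.new_event_loop()
            threading.Thread(target=loop.run_forever, daemon=True).start()
            self.loops[name] = loop   # one event loop + one thread per group

    def submit(self, group, coro):
        # Schedule the coroutine on the event loop owned by the target group.
        loop = self.loops.get(group, self.loops["default"])
        return asyncio.run_coroutine_threadsafe(coro, loop)

# Usage sketch:
#   groups = ConcurrencyGroups(["io", "compute"])
#   groups.submit("io", some_async_method()).result()
```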


## Related issue number
#16047
2021-10-21 21:46:56 +08:00
Yi Cheng
cba8480616
[dashboard] Fix the wrong metrics for grpc query execution time in server side (#19500)
## Why are these changes needed?
It looks like the metrics set on the server side are wrong. The time the query is constructed is sometimes not the time we receive the query. This PR fixes this.

## Related issue number
2021-10-20 23:06:35 -07:00
Oscar Knagg
5a05e89267
[Core] Add TLS/SSL support to gRPC channels (#18631) 2021-10-20 22:39:11 -07:00
Eric Liang
699c5aeac6
Revert "[Dashboard] Disable unnecessary event messages. (#19490)" (#19574)
This reverts commit 7fb681a35d.
2021-10-20 20:17:57 -07:00
SangBin Cho
7fb681a35d
[Dashboard] Disable unnecessary event messages. (#19490)
* Disable unnecessary event messages.

* use warning

* Fix tests
2021-10-20 17:40:25 -07:00
Eric Liang
7daf28f348
Revert "[Test] Fix flaky test_gpu test (#19524)" (#19562)
This reverts commit 39e54cd276.
2021-10-20 12:21:19 -07:00
mwtian
aaff6901dd
[Pubsub] refactor pubsub to support different channel types (#19498)
* refactor pubsub to support different channel types

* fix

* use std::string for key id

* fix mock

* fix
2021-10-20 07:02:55 -07:00
Jiajun Yao
39e54cd276
[Test] Fix flaky test_gpu test (#19524) 2021-10-19 22:36:34 -07:00
Yi Cheng
7a9cedfc5c
[nightly] Add grpc based broadcasting into nightly test for decision_tree (#19531)
* dbg

* up

* check

* up

* up

* put grpc based one into nightly test

* up
2021-10-19 19:59:39 -07:00
architkulkarni
b8941338d3
[runtime env] Raise error when creating runtime env when ray[default] is not installed (#19491) 2021-10-19 09:16:04 -05:00
mwtian
9742abb749
[Debugging] Print Python stack trace in addition to C++ stack trace, when Python worker crashes (#19423)
## Why are these changes needed?
Right now the failure signal handler registered in the Python worker is skipped on crashes like segfaults, because the C++ core worker overrides the failure signal handler here and does not call the previously registered handler. This prevents the Python stack trace from being printed on crashes. The fix is to make the C++ fault signal handler call the previous signal handler registered in Python. For example, with the script below, which segfaults,

import ray
ray.init()

@ray.remote
def f():
    import ctypes;
    ctypes.string_at(0)

ray.get(f.remote())
Ray currently only prints the following stack trace:

(pid=26693) *** SIGSEGV received at time=1634418743 ***
(pid=26693) PC: @     0x7fff203d9552  (unknown)  _platform_strlen
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: *** SIGSEGV received at time=1634418743 ***
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @     0x7fff203d9552  (unknown)  _platform_strlen
With this change, the Python stack trace will be printed in addition to the stack trace above:

(pid=26693) Fatal Python error: Segmentation fault
(pid=26693)
(pid=26693) Stack (most recent call first):
(pid=26693)   File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at
(pid=26693)   File "stack.py", line 7 in f
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module>
This should make debugging crashes in the Python worker easier for users and Ray devs.
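The chaining idea itself, sketched in Python for illustration (the actual fix is in the C++ fault handler): remember the previously registered handler when installing a new one, and invoke it after doing your own work.

```python
import signal

def install_chained_handler(signum, new_handler):
    previous = signal.getsignal(signum)      # whatever was registered before

    def chained(sig, frame):
        new_handler(sig, frame)              # our handling first (e.g. dump a trace)
        if callable(previous):
            previous(sig, frame)             # then fall through to the old handler

    signal.signal(signum, chained)

# Example with a benign signal (SIGSEGV itself is handled in C++/faulthandler):
install_chained_handler(signal.SIGTERM, lambda sig, frame: print("custom handler ran"))
```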

Also, try to initialize the symbolizer in the GCS, raylet, and core worker. This is a no-op on macOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but it should make Ray more likely to produce symbolized stack traces on other platforms.
2021-10-18 09:05:08 -07:00
Guyang Song
c04fb62f1d
[C++ worker] set native library path for shared library search (#19376) 2021-10-18 16:03:49 +08:00
Yi Cheng
a3dc07b1ee
[core] Fix some legacy issues (#19392)
## Why are these changes needed?
There are some issues left from previous PRs.

- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags. 

## Related issue number
2021-10-15 18:06:01 -07:00
Gagandeep Singh
d226cbf21a
Added StartupToken to identify a process at startup (#19014)
* Added StartupToken to identify a process at startup

* Applied linting formats

* Addressed reviews

* Fixing worker_pool_test

* Fixed worker_pool_test

* Applied linting formatting

* Added documentation for StartupToken

* Fixed linting

* Reordered initialisation of WorkerPool members

* Fixed Python docs

* Fixing bugs in cluster_mode_test

* Fixing Java tests

* Create and set shim process after verifying startup_token

* shim_process.GetId() -> worker_shim_pid

* Improvements in startup token and modifying java files

* update io_ray_runtime_RayNativeRuntime.h

* Fixed java tests by adding startup-token to conf

* Applied linting

* Increased arg count for startup_token

* Attempt to fix streaming tests

* Type correction

* applied linting

* Corrected index of startup token arg

* Modified, mock_worker.cc to accept startup tokens

* Applied linting

* Applied linting changes from CI

* Removed override from worker.h

* Applied linting from scripts/format.sh

* Addressed reviews and applied scripts/format.sh

* Applied linting script from ci/travis

* Removed unrequired methods from public scope

* Applied linting
2021-10-15 15:13:13 -07:00
SangBin Cho
9bfe43198f
Use cleaner code for the map (#19386) 2021-10-14 21:18:42 -07:00
Chen Shen
b8c201b7cb
[Core][CoreWorker] Make WorkerContext thread safe, fix race condition. #19343
## Why are these changes needed?
The theory around #19270 is that two create-actor requests are sent to the same threaded actor due to retry logic. Specifically:

- the first request comes in and calls CoreWorkerDirectTaskReceiver::HandleTask; it is queued to be executed by the thread pool;
- then the second request comes in and calls CoreWorkerDirectTaskReceiver::HandleTask again, before the first request is executed, and calls worker_context_.SetCurrentTask;
- this defeats the current dedupe logic and leads to SetMaxActorConcurrency being called twice, which fails the RAY_CHECK.

In this PR, we fix the dedupe logic by adding SetCurrentActorId and calling it in the task execution thread. This ensures the dedupe logic works for threaded actors.

We also noticed that the WorkerContext is actually not thread safe in threaded actors, so we make it thread safe in this PR as well.

## Related issue number
Closes #19270

2021-10-13 16:12:36 -07:00
hazeone
c2f0035fd2
[Java] Support getGpuIds API (#19031)
Add a Java getGpuIds() API, which is the same as get_gpu_ids in Python. We can get the device IDs if a GPU has been allocated to the worker.
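For reference, the Python counterpart get_gpu_ids mentioned above is used like this (requires a node with at least one GPU):

```python
import ray

ray.init()

@ray.remote(num_gpus=1)
def which_gpu():
    # Returns the IDs of the GPUs assigned to this worker, e.g. [0].
    return ray.get_gpu_ids()

print(ray.get(which_gpu.remote()))
```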
2021-10-13 23:40:26 +08:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… (#19330) 2021-10-12 19:02:30 -07:00
Jiajun Yao
d99b095eac
Set default max_pending_lease_requests_per_scheduling_category to 1 (#19328) 2021-10-12 15:59:32 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group (#19277)" (#19327)
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Akash Patel
b897b7b3be
add missing <memory> include (#19083) 2021-10-12 12:03:07 -07:00
Yi Cheng
bce6a498f3
Ensure job registered first before return. (#19307)
## Why are these changes needed?
Before this PR, there is a race condition where:
- job registration starts
- the driver starts to launch an actor
- the GCS registers the actor ===> crash
- job registration ends

Actor registration should be forced to happen after driver registration. This PR enforces that.
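A small sketch of the ordering constraint (hypothetical names, not the GCS code): actor registration waits on an event that is only set once job registration has completed.

```python
import threading

job_registered = threading.Event()

def register_job(job_id):
    # ... persist the job in the backing store ...
    job_registered.set()

def register_actor(actor_spec):
    job_registered.wait()   # block until the owning job is registered
    # ... proceed with actor registration ...
```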

## Related issue number
Closes #19172
2021-10-12 11:26:58 -07:00
SangBin Cho
4360b99803
[Placement Group] Fix the high load bug from the placement group (#19277) 2021-10-12 11:04:14 -07:00
SangBin Cho
2c93708324
Migrating to flat hash map [Raylet] (#19220)
* done

* Fix all unit tests

* done

* .

* Fix the build issue

* fix the compilation bug
2021-10-12 07:41:51 -07:00