Commit graph

2363 commits

Author SHA1 Message Date
SangBin Cho
9000f41aa6
[Nightly Test] Support memory profiling on Ray + implement memory monitor for nightly tests (#19539)
* random fixes

* Done

* done

* update the doc

* doc lint fix

* .

* .
2021-10-21 07:37:05 -07:00
Qing Wang
048e7f7d5d
[Core] Port concurrency groups with asyncio (#18567)
## Why are these changes needed?
This PR aims to port concurrency groups functionality with asyncio for Python.

### API
```python
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The annotation above the actor class `AsyncActor` declares that this actor has 2 concurrency groups and defines their max concurrencies; the actor also has a default concurrency group. Every concurrency group has an async event loop and a Python thread that execute the methods defined on it.

Method `f1` will be invoked in the `io` concurrency group, `f2` in `io`, `f3` in `compute`, and so on.
Note that `f5` and `__init__` will be invoked in the default concurrency group.

In the following call, `f2` will be invoked in the concurrency group `compute`, since a group specified dynamically at call time takes higher priority:
```python
a.f2.options(concurrency_group="compute").remote()
```
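
For illustration, a minimal usage sketch of the `AsyncActor` defined above (assumes `ray.init()` has run; this snippet is not part of the original PR text):

```python
a = AsyncActor.remote()   # __init__ runs in the default concurrency group
ray.get(a.f1.remote())    # executes on the "io" group's event loop
ray.get(a.f3.remote())    # executes on the "compute" group's threads
ray.get(a.f5.remote())    # executes in the default concurrency group
```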

### Implementation
The implementation details are straightforward:
- Previously, an asyncio actor had only 1 event loop bound to 1 Python thread. Now we create 1 event loop bound to 1 Python thread for every concurrency group of the asyncio actor (see the sketch after this list).
- Previously, there was 1 fiber state for every caller in the asyncio actor. Now we create a FiberStateManager for every caller in the asyncio actor, and the FiberStateManager manages the fiber states for the concurrency groups.
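
A toy Python sketch of the one-loop-one-thread-per-group model (illustrative only; Ray's actual implementation lives in the C++ core worker):

```python
import asyncio
import threading

def start_group_loop(name: str) -> asyncio.AbstractEventLoop:
    """Create one event loop bound to one dedicated thread for a group."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(
        target=loop.run_forever, name=f"{name}-thread", daemon=True)
    thread.start()
    return loop

# One loop/thread pair per concurrency group, plus the default group.
group_loops = {name: start_group_loop(name)
               for name in ("io", "compute", "default")}

async def handler():
    return threading.current_thread().name

# A method posted to a group runs on that group's dedicated thread:
future = asyncio.run_coroutine_threadsafe(handler(), group_loops["io"])
print(future.result())  # -> "io-thread"
```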


## Related issue number
#16047
2021-10-21 21:46:56 +08:00
Yi Cheng
cba8480616
[dashboard] Fix the wrong metrics for grpc query execution time in server side (#19500)
## Why are these changes needed?
It looks like the metrics set on the server side are wrong: the time at which the query was constructed is sometimes not the time at which we receive the query. This PR fixes this.
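
A toy sketch of the idea, with hypothetical helper names (`execute`, `record_latency_ms`): start the clock when the server receives the query, not when the query object was constructed:

```python
import time

def handle_query(query, execute, record_latency_ms):
    # Start timing at receipt on the server side, not at query construction.
    received_at = time.monotonic()
    result = execute(query)
    record_latency_ms((time.monotonic() - received_at) * 1000)
    return result

# Example usage with trivial stubs:
handle_query("SELECT 1", execute=lambda q: q, record_latency_ms=print)
```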

## Related issue number
2021-10-20 23:06:35 -07:00
Oscar Knagg
5a05e89267
[Core] Add TLS/SSL support to gRPC channels (#18631) 2021-10-20 22:39:11 -07:00
Eric Liang
699c5aeac6
Revert "[Dashboard] Disable unnecessary event messages. (#19490)" (#19574)
This reverts commit 7fb681a35d.
2021-10-20 20:17:57 -07:00
SangBin Cho
7fb681a35d
[Dashboard] Disable unnecessary event messages. (#19490)
* Disable unnecessary event messages.

* use warning

* Fix tests
2021-10-20 17:40:25 -07:00
Eric Liang
7daf28f348
Revert "[Test] Fix flaky test_gpu test (#19524)" (#19562)
This reverts commit 39e54cd276.
2021-10-20 12:21:19 -07:00
mwtian
aaff6901dd
[Pubsub] refactor pubsub to support different channel types (#19498)
* refactor pubsub to support different channel types

* fix

* use std::string for key id

* fix mock

* fix
2021-10-20 07:02:55 -07:00
Jiajun Yao
39e54cd276
[Test] Fix flaky test_gpu test (#19524) 2021-10-19 22:36:34 -07:00
Yi Cheng
7a9cedfc5c
[nightly] Add grpc based broadcasting into nightly test for decision_tree (#19531)
* dbg

* up

* check

* up

* up

* put grpc based one into nightly test

* up
2021-10-19 19:59:39 -07:00
architkulkarni
b8941338d3
[runtime env] Raise error when creating runtime env when ray[default] is not installed (#19491) 2021-10-19 09:16:04 -05:00
mwtian
9742abb749
[Debugging] Print Python stack trace in addition to C++ stack trace, when Python worker crashes (#19423)
Why are these changes needed?
Right now the failure signal handler registered in the Python worker is skipped on crashes like segfaults, because the C++ core worker overrides the failure signal handler and does not call the previously registered handler. This prevents the Python stack trace from being printed on crashes. The fix is to make the C++ fault signal handler call the previous signal handler registered in Python. For example, take the script below, which segfaults:

```python
import ray
ray.init()

@ray.remote
def f():
    import ctypes
    ctypes.string_at(0)

ray.get(f.remote())
```
Ray currently only prints the following stack trace:

```
(pid=26693) *** SIGSEGV received at time=1634418743 ***
(pid=26693) PC: @     0x7fff203d9552  (unknown)  _platform_strlen
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: *** SIGSEGV received at time=1634418743 ***
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @     0x7fff203d9552  (unknown)  _platform_strlen
```
With this change, the Python stack trace will be printed in addition to the stack trace above:

```
(pid=26693) Fatal Python error: Segmentation fault
(pid=26693)
(pid=26693) Stack (most recent call first):
(pid=26693)   File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at
(pid=26693)   File "stack.py", line 7 in f
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module>
```
This should make debugging crashes in the Python worker easier, for both users and Ray devs.

Also, try to initialize the symbolizer in GCS, the Raylet, and the core worker. This is a no-op on macOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but it should make Ray more likely to produce symbolized stack traces on other platforms.
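
For illustration, a minimal Python sketch of the handler-chaining pattern the fix applies (the real change is in Ray's C++ fault handler; SIGTERM is used here only for demonstration):

```python
import signal

# Save whatever handler was registered before us, then chain to it,
# instead of silently replacing it (the replacement is what caused this bug).
prev_handler = signal.getsignal(signal.SIGTERM)

def chained_handler(signum, frame):
    print("new handler runs first")
    if callable(prev_handler):
        prev_handler(signum, frame)  # then the previously registered one

signal.signal(signal.SIGTERM, chained_handler)
```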
2021-10-18 09:05:08 -07:00
Guyang Song
c04fb62f1d
[C++ worker] set native library path for shared library search (#19376) 2021-10-18 16:03:49 +08:00
Yi Cheng
a3dc07b1ee
[core] Fix some legacy issues (#19392)
## Why are these changes needed?
There are some issues left from previous PRs.

- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags. 

## Related issue number
2021-10-15 18:06:01 -07:00
Gagandeep Singh
d226cbf21a
Added StartupToken to identify a process at startup (#19014)
* Added StartupToken to identify a process at startup

* Applied linting formats

* Addressed reviews

* Fixing worker_pool_test

* Fixed worker_pool_test

* Applied linting formatting

* Added documentation for StartupToken

* Fixed linting

* Reordered initialisation of WorkerPool members

* Fixed Python docs

* Fixing bugs in cluster_mode_test

* Fixing Java tests

* Create and set shim process after verifying startup_token

* shim_process.GetId() -> worker_shim_pid

* Improvements in startup token and modifying java files

* update io_ray_runtime_RayNativeRuntime.h

* Fixed java tests by adding startup-token to conf

* Applied linting

* Increased arg count for startup_token

* Attempt to fix streaming tests

* Type correction

* applied linting

* Corrected index of startup token arg

* Modified, mock_worker.cc to accept startup tokens

* Applied linting

* Applied linting changes from CI

* Removed override from worker.h

* Applied linting from scripts/format.sh

* Addressed reviews and applied scripts/format.sh

* Applied linting script from ci/travis

* Removed unrequired methods from public scope

* Applied linting
2021-10-15 15:13:13 -07:00
SangBin Cho
9bfe43198f
Use cleaner code for the map (#19386) 2021-10-14 21:18:42 -07:00
Chen Shen
b8c201b7cb
[Core][CoreWorker] Make WorkerContext thread safe, fix race condition. #19343
Why are these changes needed?
The theory around #19270 is that there are two create-actor requests sent to the same threaded actor due to retry logic. Specifically:

- the first request comes in and calls CoreWorkerDirectTaskReceiver::HandleTask; it's queued to be executed by the thread pool;
- then the second request comes in and calls CoreWorkerDirectTaskReceiver::HandleTask again, before the first request has been executed, and calls worker_context_.SetCurrentTask;
- this defeats the current dedupe logic and leads to SetMaxActorConcurrency being called twice, which fails the RAY_CHECK.

In this PR, we fix the dedupe logic by adding SetCurrentActorId and calling it on the task execution thread. This ensures the dedupe logic works for threaded actors.

We also noticed that the WorkerContext is actually not thread safe in threaded actors, so we make it thread safe in this PR as well.
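
A toy Python sketch of the dedupe idea (names are illustrative; the actual change is in the C++ WorkerContext): the actor ID is set once, under a lock, on the task execution thread, so a retried creation request becomes a no-op:

```python
import threading

class WorkerContext:
    def __init__(self):
        self._lock = threading.Lock()
        self._actor_id = None

    def set_current_actor_id(self, actor_id):
        with self._lock:
            if self._actor_id == actor_id:
                return  # duplicate create-actor request: ignore
            assert self._actor_id is None, "actor ID set twice"
            self._actor_id = actor_id
```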

Related issue number
Closes #19270

2021-10-13 16:12:36 -07:00
hazeone
c2f0035fd2
[Java]Support getGpuIds API (#19031)
Add a Java getGpuIds() API, which is the same as get_gpu_ids in Python. It lets us get the device IDs of the GPUs allocated to a worker.
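
For reference, a minimal sketch of the pre-existing Python counterpart that the new Java API mirrors (not taken from this PR):

```python
import ray

@ray.remote(num_gpus=1)
def which_gpus():
    return ray.get_gpu_ids()  # device IDs of the GPUs assigned to this worker

ray.init()
print(ray.get(which_gpus.remote()))  # e.g. [0]
```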
2021-10-13 23:40:26 +08:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… (#19330) 2021-10-12 19:02:30 -07:00
Jiajun Yao
d99b095eac
Set default max_pending_lease_requests_per_scheduling_category to 1 (#19328) 2021-10-12 15:59:32 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group (#19277)" (#19327)
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Akash Patel
b897b7b3be
add missing <memory> include (#19083) 2021-10-12 12:03:07 -07:00
Yi Cheng
bce6a498f3
Ensure job registered first before return. (#19307)
## Why are these changes needed?
Before this PR, there is a race condition where:
- job registration starts
- the driver starts to launch an actor
- GCS registers the actor ===> crash
- job registration ends

Actor registration should be forced to happen after driver registration. This PR enforces that.
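
A minimal sketch of the enforced ordering (illustrative Python, not the actual GCS code): actor registration blocks until job registration has completed:

```python
import threading

job_registered = threading.Event()

def register_job(job_id):
    # ... persist job metadata in GCS ...
    job_registered.set()

def register_actor(actor_id):
    job_registered.wait()  # never register an actor before its job
    # ... safe to register the actor now ...
```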

## Related issue number
Closes #19172
2021-10-12 11:26:58 -07:00
SangBin Cho
4360b99803
[Placement Group] Fix the high load bug from the placement group (#19277) 2021-10-12 11:04:14 -07:00
SangBin Cho
2c93708324
Migrating to flat hash map [Raylet] (#19220)
* done

* Fix all unit tests

* done

* .

* Fix the build issue

* fix the compilation bug
2021-10-12 07:41:51 -07:00
Akash Patel
8241a03d31
resolve maybe uninitialized error (#19103) 2021-10-12 04:06:48 -07:00
Jiajun Yao
a781b10a50
[Release] Centralize c++ ray version string definition (#19297)
* Centralize c++ ray version string definition

* Centralize c++ ray version string definition
2021-10-12 11:09:29 +09:00
Eric Liang
6cacc54774
[RFC] Fake multi-node mode for autoscaler (#18987) 2021-10-11 18:27:29 -07:00
Jiajun Yao
92516981ea
[core] Increase worker lease parallelism (#18647) 2021-10-11 15:34:32 -07:00
Guyang Song
ab55b808c5
[runtime env] move worker env to runtime env in Java (#19060) 2021-10-11 17:25:09 +08:00
SangBin Cho
3b865b463a
[Core] Fix GPU first scheduling that is not working with placement group (#19141)
* done

* Revert "done"

This reverts commit 56b18f0a7d14c5466d726c3ed1264f3e1506771e.

* ip

* Revert "Revert "done""

This reverts commit a34c90b0920893f4efbf171b8159f0d08a10dca0.

* Done

* Remove unnecessary log message

* skip test on windows

* Handle the code review.
2021-10-11 00:12:25 -07:00
liuyang-my
5353c5c2f1
Define Java Proxy and RayServeHandle (#18630) 2021-10-10 23:39:04 -07:00
gjoliver
635010d460
Update build rules and patches for darwin_arm64 platform. (#19037)
* Update build rules and patches for darwin_arm64 platform.

Changes include:

Update the nelhage/rules_boost package from the current version (8/5/2020) to the 5/27/2021 version.
Remove rules_boost-undefine-boost_fallthrough.patch, since BOOST_FALLTHROUGH seems to be defined now.
Minor changes to rules_boost-windows-linkopts.patch to use the default condition to add the -lpthread flag for all platforms.
Add a darwin_arm64 config to the BUILD files for the civetweb lib pulled in via the prometheus dependency.

* upgrade boost to 1.74.0 from 1.71.0 to match the updated build file for windows.

* Fix ray_cpp_pkg

* Use boost/bind/bind.hpp

boost/bind.hpp and global namespace placeholders are deprecated.

* lint

* Use absl::bind_front when possible. Otherwise, NOLINT

* lint

* lint

* lint

* lint

* more lint

* final lint

* trigger build
2021-10-09 18:48:35 -07:00
Guyang Song
bae543c956
[runtime env] support eager_install in runtime env (#17949) 2021-10-09 17:59:57 +08:00
mwtian
b066627539
[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 (#19216) 2021-10-08 12:58:18 -07:00
chenk008
3780a73b45
[Core] Add worker resource info to runtime env (#18804) 2021-10-08 10:37:29 -07:00
Edward Oakes
9cf19b67cc
[serve] Remove log poll client from replicas (#19145)
In general, broadcasting changes to the replicas via the LongPollClient is hard to reason about (it circumvents our versioning semantics as there's no rolling update). Ideally we would only be using the LongPollClient to broadcast replica membership and nothing else.
2021-10-08 12:32:42 -05:00
Guyang Song
c4bc05bbab
set event_log_reporter_enabled True by default (#18112) 2021-10-07 23:09:36 -07:00
SangBin Cho
afaee05e1e
[Placement Group] Fix placement group removal leak (#19138) 2021-10-07 22:04:12 -07:00
Stephanie Wang
940f84cedb
[core] Remove unused plasma promotion path (#19122)
* remove unused

* lint

* lint

* lint
2021-10-07 10:55:50 -07:00
SangBin Cho
0ef0d9a77d
Revert "[core] Assign tasks to the first available worker (#18167)" (#19180)
This reverts commit 545db13800.
2021-10-07 10:38:37 -07:00
SangBin Cho
22f4ffed08
Disable cpu-only-nodes preferred scheduling that breaks placement groups. (#19129)
* Add a regression test for the short term

* done

* address code review

* lint
2021-10-07 05:34:04 -07:00
Chen Shen
1ed5f622c2
[Core] QuickExit CoreWorker when GetCoreWorker is called after shutdown 2021-10-06 15:07:57 -07:00
Stephanie Wang
545db13800
[core] Assign tasks to the first available worker (#18167)
* Convert worker pool to queue

* Start up to backlog size more workers

* fixes

* Prestart workers according to num available CPUs

* lint

* x

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* dedicated workers

* Fix tests

* x

* fix

* asan

* asan

* Workers can only exec tasks with same job ID

* size_t for runtime env hash, fix unit tests

* include job ID in runtime env hash, remove from worker registration msg

* x

* conflict

* debug

* Schedule and dispatch periodically, skip if no new tasks

* Update src/ray/common/task/task_spec.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-05 13:45:50 -07:00
Chen Shen
1efcf5c3d5
[Core][CoreWorker ThreadSafety 1/n] Ensure global_worker_ is protected by mutex #19073 2021-10-05 05:32:28 -07:00
Yi Cheng
2cff293810
fix (#19094) 2021-10-05 01:53:05 -07:00
Yi Cheng
056c3af699
[core] Update placement group retry implementation (#18842)
* exp backoff

* up

* format

* up

* up

* up

* up

* up

* format

* fix

* up

* format

* adjust ordering

* up

* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"

This reverts commit 2e99fb215f.

* up

* update

* format

* up

* format

* fix

* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""

This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.

* up

* format

* fix lint

* up

* up

* up

* up

* check

* add test1

* format

* up

* add test

* up

* up

* up

* fix

* up

* up

* up

* add test

* format

* up

* up

* fix lint

* format

* fix

* format

* fix

* up
2021-10-04 21:31:56 -07:00
SangBin Cho
83cb992d5b
Revert pull retry (#19068)
* Revert "[Object manager] fix comments"

This reverts commit 56debfc063.

* Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955)"

This reverts commit d12e35ce53.

* Fix a lint issue
2021-10-04 11:20:43 -07:00
SangBin Cho
7fcf1bf57e
[Dashboard] Refine the dashboard restart logic. (#18973)
* in progress

* Refine the dashboard agent retry logic

* refine

* done

* lint
2021-10-04 05:01:51 -07:00
mwtian
56debfc063
[Object manager] fix comments 2021-10-01 11:42:07 -07:00