## What do these changes do?
This PR exposes a command-line option for overriding a config parameter. This is important for certain tests (e.g., fault-tolerance tests that remove nodes) so that they run quickly.
Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.
#3239 depends on this.
TODO:
- [x] Add documentation to method arguments before merging.
- [x] Add test to verify this works?
## Related issue number
* Increase timeout to 10s
* Skip eviction reconstruction tests
* Add stress test for many actors to one
* Fix test by shortening it.
* lower number of processes in stress test
* Skip slow test
* Policy that flushes the lineage stash immediately
* Fix bug where remote tasks in uncommitted lineage weren't getting subscribed to, add reg test
* test
* Fix bug where waiting task was getting subscribed
* Cleanup
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* cleanup
* cleanup
* Add another test for task with many parents
* fix, unsubscribe to new waiting tasks
* Unsubscribe as soon as the commit notification is handled
We found that there are a large number of pub-sub keys with no content in them (this case is worse when the wait ID is used in the key name).
The logic for deleting empty pub-sub keys from the GCS existed in legacy Ray but not in the raylet.
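A minimal sketch of the idea, assuming hiredis and a hash-typed key (the command and key layout here are assumptions, not the actual GCS schema):
```c++
#include <hiredis/hiredis.h>

#include <string>

// Sketch only: after a pub-sub notification for `key` has been consumed,
// delete the key if it no longer holds any entries, so empty keys do not
// accumulate in Redis.
void DeleteKeyIfEmpty(redisContext *context, const std::string &key) {
  // Assumes the key stores a hash; HLEN returns 0 for an empty or absent key.
  redisReply *reply = static_cast<redisReply *>(
      redisCommand(context, "HLEN %s", key.c_str()));
  bool is_empty = false;
  if (reply != nullptr) {
    is_empty = reply->type == REDIS_REPLY_INTEGER && reply->integer == 0;
    freeReplyObject(reply);
  }
  if (is_empty) {
    redisReply *del_reply = static_cast<redisReply *>(
        redisCommand(context, "DEL %s", key.c_str()));
    if (del_reply != nullptr) {
      freeReplyObject(del_reply);
    }
  }
}
```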
This fixes a problem that @devin-petersohn observed on the Windows Subsystem for Linux.
In theory, Redis should already be up when the async connect happens, so no retries should be needed for the async connect. However, on the Windows Subsystem for Linux the async connect was failing even though the synchronous one was working. Maybe Windows has different semantics here than Linux.
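A minimal sketch of the retry idea (the hiredis calls are real, but the retry count, backoff, and function name are illustrative, not the actual RedisContext code):
```c++
#include <hiredis/async.h>

#include <chrono>
#include <thread>

// Sketch only: retry the asynchronous Redis connect a few times instead of
// trusting the first attempt, since on WSL the first async connect can fail
// even though Redis is already up and the synchronous connect succeeds.
redisAsyncContext *AsyncConnectWithRetries(const char *host, int port,
                                           int num_retries) {
  for (int attempt = 0; attempt < num_retries; attempt++) {
    redisAsyncContext *context = redisAsyncConnect(host, port);
    if (context != nullptr && context->err == 0) {
      return context;  // Connected; the caller attaches it to the event loop.
    }
    if (context != nullptr) {
      redisAsyncFree(context);
    }
    // Back off briefly before the next attempt.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  return nullptr;  // Caller treats this as a fatal connection error.
}
```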
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then crash. Related to #2790.
## What do these changes do?
```c++
// Try to execute the worker command.
int rv = execvp(worker_command_args[0],
                const_cast<char *const *>(worker_command_args.data()));
// The worker failed to start. This is a fatal error.
RAY_LOG(FATAL) << "Failed to start worker with return value " << rv;
```
When starting a process fails, the return value `rv` is always -1, which by itself is not useful.
The log message should include some meaningful information, such as the reason from `errno`.
For example, if Java is not installed, the message should be:
```shell
Failed to start worker: No such file or directory.
```
This helps us locate the issue quickly.
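A minimal sketch of the idea (not the exact patch): after `execvp` returns, `errno` describes why it failed, so the fatal log can include `strerror(errno)` instead of the unhelpful -1. The header path and the argv vector are assumptions based on the snippet above.
```c++
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <vector>

#include "ray/util/logging.h"  // assumed location of the RAY_LOG macro

// Sketch only: worker_command_args is the argv-style vector (terminated by a
// null pointer) that the worker pool builds before forking.
void ExecWorker(const std::vector<const char *> &worker_command_args) {
  execvp(worker_command_args[0],
         const_cast<char *const *>(worker_command_args.data()));
  // execvp only returns on failure; errno says why (e.g. ENOENT when the java
  // binary is missing), which is far more useful than the -1 return value.
  RAY_LOG(FATAL) << "Failed to start worker: " << strerror(errno) << ".";
}
```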
## Related issue number
N/A
This PR adds a `function_desc` field to the task spec. A function descriptor is a list of strings that can uniquely describe a function.
- For a Python function, it should be: [module_name, class_name, function_name]
- For a Java function, it should be: [class_name, method_name, type_descriptor]
There are a couple of purposes for adding this field:
In this PR:
- The Java worker needs to know a function's class name to load it. Previously, since the task spec didn't have a field to hold this info, we hacked around it by appending the class name to the argument list. With this change, we removed that hack and significantly simplified function management in Java.
Will be done in subsequent PRs:
- Support cross-language invocation (#2576): currently the Python worker manages functions by saving them in the GCS and passing the function ID in the task spec. However, if we want to call a Python function from Java, we cannot save it in the GCS and get the function ID. Instead, we can pass the function descriptor (module name, class name, function name) in the task spec and use it to load the function.
- Support deployment: one major problem with the Python worker's current function management mechanism is #2327. In a production environment, we should have a mechanism to deploy code and dependencies to the cluster. And when the code is already deployed, we don't need to save functions to the GCS any more and can use `function_desc` to manage functions.
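As an illustration of the descriptor layouts described above, here is a minimal sketch (the type alias and helper functions are hypothetical, not the actual task-spec code):
```c++
#include <string>
#include <vector>

// A function descriptor is just an ordered list of strings; the meaning of
// each entry depends on the language of the target function.
using FunctionDescriptor = std::vector<std::string>;

// Hypothetical helpers showing the layouts listed above.
FunctionDescriptor PythonDescriptor(const std::string &module_name,
                                    const std::string &class_name,
                                    const std::string &function_name) {
  return {module_name, class_name, function_name};
}

FunctionDescriptor JavaDescriptor(const std::string &class_name,
                                  const std::string &method_name,
                                  const std::string &type_descriptor) {
  return {class_name, method_name, type_descriptor};
}

int main() {
  // e.g. a module-level Python function my_pkg.my_module.f (empty class name)
  // and a Java method com.example.Foo.bar with JVM type descriptor ()V.
  FunctionDescriptor py = PythonDescriptor("my_pkg.my_module", "", "f");
  FunctionDescriptor java = JavaDescriptor("com.example.Foo", "bar", "()V");
  return py.size() == 3 && java.size() == 3 ? 0 : 1;
}
```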
* enable using thirdparty env variable to find installed dependency, to speed up the build process
* fix target dependencies in cmake. :-) too chaotic in each CMakeLists
* check env variable defined directory exists
When a new raylet starts, `ClientAdded` will be called with the disconnected client data. However, since the client was closed, the connection will fail.
Fixes issue where object manager sometimes crashes within the `Wait` method: The issue stems from inconsistent behavior of the boost deadline timer's `cancel` method, which is invoked within `WaitComplete` to enforce exactly one `WaitComplete` invocation for each `Wait` request. The `cancel` method sometimes fails to actually prevent the timer's invocation of the provided handler with non-zero error code.
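A minimal sketch of the guard, assuming boost::asio (the flag name and structure are illustrative, not the actual `ObjectManager::Wait` code):
```c++
#include <boost/asio.hpp>

#include <iostream>

int main() {
  boost::asio::io_service io_service;
  boost::asio::deadline_timer timer(io_service, boost::posix_time::seconds(1));
  bool wait_completed = false;  // Tracks whether WaitComplete already ran.

  timer.async_wait([&](const boost::system::error_code &error) {
    // Even after cancel(), a handler that is already queued may still run, and
    // the error code is not a reliable signal, so check our own flag instead.
    if (wait_completed) {
      return;
    }
    wait_completed = true;
    std::cout << "WaitComplete invoked exactly once" << std::endl;
  });

  // Elsewhere, satisfying the wait early would set wait_completed = true and
  // call timer.cancel(); the flag guarantees the completion logic runs once.
  io_service.run();
  return 0;
}
```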
This PR makes it so debugging logs are only evaluated during debugging. We found that for the current code, functions called in debug logging code are evaluated even in release mode (even though nothing is printed).
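A minimal sketch of the technique (not the actual RAY_LOG implementation): wrapping the stream in a branch that is never taken in release builds means the streamed expressions after `<<` are never evaluated there.
```c++
#include <iostream>

// Sketch only: a debug-logging macro whose arguments cost nothing in release
// (NDEBUG) builds because the branch guarding the stream is never taken.
#ifdef NDEBUG
#define SKETCH_LOG_DEBUG \
  if (false) std::cerr
#else
#define SKETCH_LOG_DEBUG std::cerr << "[DEBUG] "
#endif

int ExpensiveToFormat() {
  std::cout << "expensive call evaluated" << std::endl;
  return 42;
}

int main() {
  // In a release (NDEBUG) build, ExpensiveToFormat() is never called; in a
  // debug build it runs and the message is printed.
  SKETCH_LOG_DEBUG << "value = " << ExpensiveToFormat() << std::endl;
  return 0;
}
```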
* Trigger reconstruction in ray.wait and mark worker as blocked.
* Add test.
* Linting.
* Don't run new test with legacy Ray.
* Only call HandleClientUnblocked if it actually blocked in ray.wait.
* Reduce time to ray.wait in the test.
* use cmake to build ray project, no need to apply build.sh before cmake, fix some abuse of cmake, improve the build performance
* support boost external project, avoid using the system or build.sh boost
* keep compatible with build.sh, remove boost and arrow build from it.
* bugfix: parquet bison version control, plasma_java lib install problem
* bugfix: cmake, do not compile plasma java client if no need
* bugfix: component failures test timeout mechanism has a problem for the plasma manager failure case
* bugfix: arrow uses lib64 in centos; travis check-git-clang-format-output.sh does not support branches other than master
* revert some fix
* set arrow python executable, fix format error in component_failures_test.py
* make clean arrow python build directory
* update cmake code style, back to support cmake minimum version 3.4
This change addresses issue #2809. Test #2797 has been enabled for raylet and passes.
The following should happen when a driver exits (either gracefully or ungracefully):
* #2797 should be enabled and pass.
* Any actors created by the driver that are still running should be killed.
* Any workers running tasks for the driver should be killed.
* Any tasks for the driver in any node_manager queues should be removed (see the sketch after this list).
* Any future tasks received by a node manager for the driver should be ignored.
* The driver death notification should only be received once.
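A minimal sketch of the queue-cleanup item above (the `Task` struct and function name are hypothetical, not the actual node manager code):
```c++
#include <cstdint>
#include <list>

// Sketch only: drop every queued task that belongs to a driver that has exited.
struct Task {
  int64_t driver_id;
  int64_t task_id;
};

void RemoveTasksForDriver(std::list<Task> *queue, int64_t dead_driver_id) {
  queue->remove_if([dead_driver_id](const Task &task) {
    return task.driver_id == dead_driver_id;
  });
}

int main() {
  std::list<Task> ready_queue = {{1, 100}, {2, 200}, {1, 101}};
  RemoveTasksForDriver(&ready_queue, /*dead_driver_id=*/1);
  return ready_queue.size() == 1 ? 0 : 1;
}
```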
* Add signal handlers to improve debuggability.
* Fix Linux compiling
* Fix Lint
* Change SIGILL case that happens in both Linux and macOS
* Add signal handler to main functions.
* Change handler name.
* Address comment
* Address comment.
* Fix Linux building failure
* Introduce RAII mechanism to SignalHandlers.
* Add InitShutdownWrapper to handle all RAII requirements
* Change util_test to signal_test
* Make sure shutdown is not nullptr.
* Using google::InstallFailureSignalHandler() instead of our own signal handler
* Refine code according to comment
* Fix valgrind test failure.
* remove Shutdown template
* consistency
* linting
Basically a re-implementation of #2281, with the modifications of #2298 (a fix of #2334, redone due to rebasing issues).
[+] Implement sharding for GCS tables (a minimal shard-selection sketch follows this list).
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
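A minimal sketch of the shard-selection idea (hypothetical names, not the actual `AsyncGcsClient` code):
```c++
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Sketch only: entries of sharded tables are routed to a Redis shard by
// hashing their key, while ClientTable and ErrorTable updates always go to
// the primary shard.
size_t ShardIndexForKey(const std::string &key, size_t num_shards) {
  return std::hash<std::string>{}(key) % num_shards;
}

int main() {
  std::vector<std::string> shard_addresses = {"127.0.0.1:6380", "127.0.0.1:6381"};
  size_t index = ShardIndexForKey("object_id_abc", shard_addresses.size());
  return index < shard_addresses.size() ? 0 : 1;
}
```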
* Add some imports that make it easier to build with Bazel
* Use "/tmp" paths for sockets in tests
* Move `asio_test` into `run_gcs_tests.sh` instead of starting and stopping Redis within the test fixture with a `system` call.
1) Renamed the native JNI methods and some parameters of JNI methods.
2) Fixed native JNI methods' signatures with the `javah` tool.
3) Removed some useless native methods.
This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency workers are ever started concurrently. In particular, when the raylet starts up, it may start fewer than num_workers workers.
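A minimal sketch of the limit (a hypothetical helper, not the actual `WorkerPool` code):
```c++
#include <algorithm>
#include <cstddef>

// Sketch only: never launch more than maximum_startup_concurrency worker
// processes at once, even when the raylet boots with a large num_workers
// target; the remainder are started later as earlier ones register.
size_t NumWorkersToStart(size_t num_desired_workers, size_t num_starting_workers,
                         size_t maximum_startup_concurrency) {
  if (num_starting_workers >= maximum_startup_concurrency) {
    return 0;  // Already at the limit; wait for some workers to register.
  }
  return std::min(num_desired_workers,
                  maximum_startup_concurrency - num_starting_workers);
}

int main() {
  // On startup, 8 workers are desired but only 4 may be launched concurrently.
  return NumWorkersToStart(8, 0, 4) == 4 ? 0 : 1;
}
```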
* Limit number of concurrent workers started by hardware concurrency.
* Check if std::thread::hardware_concurrency() returns 0.
* Pass in max concurrency from Python.
* Fix Java call to startRaylet.
* Fix typo
* Remove unnecessary cast.
* Fix linting.
* Cleanups on Java side.
* Comment back in actor test.
* Require maximum_startup_concurrency to be at least 1.
* Fix linting and test.
* Improve documentation.
* Fix typo.
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as the sum of all resource demand induced by queued ready tasks on the local raylet (see the sketch after this list). This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
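A minimal sketch of the load definition above (the types and function are hypothetical, not the actual scheduling-resources code):
```c++
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: the load reported in a heartbeat is the element-wise sum of the
// resource demand of every task in the local ready queue, not a task count.
using ResourceSet = std::unordered_map<std::string, double>;

ResourceSet ComputeReadyQueueLoad(const std::vector<ResourceSet> &ready_task_demands) {
  ResourceSet load;
  for (const auto &demand : ready_task_demands) {
    for (const auto &resource : demand) {
      load[resource.first] += resource.second;
    }
  }
  return load;
}

int main() {
  // Two queued tasks: one needs 1 CPU, the other needs 2 CPUs and 1 GPU, so
  // the reported load is {CPU: 3, GPU: 1}.
  std::vector<ResourceSet> demands = {{{"CPU", 1.0}},
                                      {{"CPU", 2.0}, {"GPU", 1.0}}};
  ResourceSet load = ComputeReadyQueueLoad(demands);
  return load["CPU"] == 3.0 && load["GPU"] == 1.0 ? 0 : 1;
}
```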
## What do these changes do?
Because the logic of generating `TaskID`s in Java was different from Python's, many tests failed when we changed the `Ray Core` code.
In this change, I rewrote the logic of generating `TaskID`s in Java so that it is the same as Python's.
In Java, we now call the native method `_generateTaskId()` to generate a `TaskID`, the same method that Python uses. We changed `computePutId()`'s logic too.
## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)