This is the first PR to remove the code path for multiple core workers in one process. It removes the flags and APIs related to `num_workers`.
Once this PR is merged, we no longer need to consider multiple core workers.
Follow-up PRs will handle the deeper refactoring, such as eliminating the gap between core worker and core worker process and removing the multiple-worker logic from the worker pool, GCS, etc.
**BREAKING CHANGE**
This PR removes these APIs:
- Ray.wrapRunnable();
- Ray.wrapCallable();
- Ray.setAsyncContext();
- Ray.getAsyncContext();
The following APIs can no longer be invoked from a user-created thread in local mode:
- Ray.getRuntimeContext().getCurrentActorId();
- Ray.getRuntimeContext().getCurrentTaskId();
Note that this PR shouldn't be merged to 1.x.
These dependencies are widely used:
- com.google.common
- com.google.protobuf
- com.google.thirdparty
We therefore need to shade them to avoid conflicts with jars introduced by users.
This PR introduces a `bazel_jar_jar` rule to do this and also shades them in the Maven pom files.
* Factor out --keep_going in Bazel --config=ci
* Remove Bazel --test_timeout=600 for Windows
* Use global --test_output for Bazel CI
Co-authored-by: Mehrdad <noreply@github.com>
* Fix SC2006: Use $(...) notation instead of legacy backticked `...`.
* Fix SC2016: Expressions don't expand in single quotes, use double quotes for that.
* Fix SC2046: Quote this to prevent word splitting.
* Fix SC2053: Quote the right-hand side of == in [[ ]] to prevent glob matching.
* Fix SC2068: Double quote array expansions to avoid re-splitting elements.
* Fix SC2086: Double quote to prevent globbing and word splitting.
* Fix SC2102: Ranges can only match single chars (mentioned due to duplicates).
* Fix SC2140: Word is of the form "A"B"C" (B indicated). Did you mean "ABC" or "A\"B\"C"?
* Fix SC2145: Argument mixes string and array. Use * or separate argument.
* Fix SC2209: warning: Use var=$(command) to assign output (or quote to assign string).
Co-authored-by: Mehrdad <noreply@github.com>
* Delete LINT section of install-ray.sh since it appears unused
* Delete install.sh since it appears unused
* Delete run_test.sh since it appears unused
* Put environment variables on separate lines in .travis.yml
* Move --jobs 50 out of install-ray.sh
* Delete upgrade-syn.sh since it appears unused
* Move CI bazel flags to .bazelrc via --config
* Make installations quieter
* Get rid of verbose Maven messages
* Install Bazel system-wide for CI so that there's no need to update PATH
* Recognize Windows as valid platform
Co-authored-by: Mehrdad <noreply@github.com>
* use cmake to build the ray project, no need to apply build.sh before cmake, fix some misuse of cmake, improve the build performance
* support boost external project, avoid using the system or build.sh boost
* keep compatible with build.sh, remove boost and arrow build from it.
* bugfix: parquet bison version control, plasma_java lib install problem
* bugfix: cmake, do not compile plasma java client if not needed
* bugfix: the timeout mechanism in the component failures test mishandles the plasma manager failure case
* bugfix: arrow uses lib64 on centos, travis check-git-clang-format-output.sh does not support branches other than master
* revert some fix
* set arrow python executable, fix format error in component_failures_test.py
* make clean arrow python build directory
* update cmake code style, restore support for cmake minimum version 3.4
* Support building Java and Python version at the same time.
* Remove duplicated definition.
* Refine the building process of local_scheduler
* Refine
* Add comment for languages
* Modify instructions and add Python and Java builds to CI.
* change according to comment
## What do these changes do?
This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
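To illustrate the suppression idea, here is a minimal in-memory sketch. It assumes suppression behaves like a compare-and-set append: each raylet tries to append at the log index it last observed, and only the first append at that index succeeds. The class `ReconstructionLog`, the method `append_at`, and the node and task names are hypothetical and are not taken from the raylet code.

```python
import threading


class ReconstructionLog:
    """Hypothetical in-memory stand-in for a per-task reconstruction log."""

    def __init__(self):
        self._entries = {}  # task_id -> list of node ids that won an attempt
        self._lock = threading.Lock()

    def append_at(self, task_id, index, node_id):
        """Append node_id only if the log for task_id currently has `index` entries.

        Returns True if this caller won the right to reconstruct the task.
        """
        with self._lock:
            entries = self._entries.setdefault(task_id, [])
            if len(entries) != index:
                # Another node already claimed this reconstruction attempt.
                return False
            entries.append(node_id)
            return True


log = ReconstructionLog()

# Both raylets saw zero prior reconstructions, so both try to append at index 0.
results = {node: log.append_at("task-1", 0, node) for node in ("raylet-A", "raylet-B")}
winners = [node for node, ok in results.items() if ok]
print(winners)  # ['raylet-A']
```

Only the winning raylet would go on to resubmit the task (part 2); the loser drops its attempt.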
Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
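The effect of the proposed fast path on a chain of dependent tasks can be sketched as follows. This is a simplified model, not the raylet logic: it assumes each task in the chain pays one initial timeout before its reconstruction check, and it ignores lease waits and GCS latency. The function name and the 100 ms timeout are hypothetical.

```python
def total_wait_ms(chain_length, initial_timeout_ms, fast_path):
    """Sum the initial timeouts paid while reconstructing a chain of dependent tasks.

    Index 0 is the task whose reconstruction is detected first; every later
    index is a dependency of a task that is already being re-executed.
    """
    total = 0
    for depth in range(chain_length):
        is_dependency_of_reexecuted_task = depth > 0
        if fast_path and is_dependency_of_reexecuted_task:
            # Fast path: check the GCS immediately, skipping the initial timeout.
            continue
        total += initial_timeout_ms
    return total


slow = total_wait_ms(chain_length=5, initial_timeout_ms=100, fast_path=False)
fast = total_wait_ms(chain_length=5, initial_timeout_ms=100, fast_path=True)
print(slow, fast)  # 500 100
```

Under this model the waiting time without the fast path grows linearly with chain length, while with it only the first task pays the timeout.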
Since handling failures of other raylets is probably not yet complete in master, this PR only re-enables the Python tests for reconstructing evicted objects.