1638 commits

Author SHA1 Message Date
Hao Chen
e96817d074 fix a syntax error of initializing unordered_map (#2871)
The previous approach is incompatible with older versions of gcc.
2018-09-14 12:07:08 -07:00
Philipp Moritz
2c9a4f6b41 Evaluate debug logging only in debug mode (#2869)
This PR makes it so debugging logs are only evaluated during debugging. We found that for the current code, functions called in debug logging code are evaluated even in release mode (even though nothing is printed).
2018-09-14 11:40:44 -07:00
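The idea behind #2869 can be illustrated with a small sketch. The PR itself changes Ray's C++ logging code, which is not shown in this log; the Python below is only an analogue, and `expensive_summary` and `log_debug_lazily` are hypothetical names. The log message is built by a callable that is invoked only when debug logging is enabled, so functions inside it never run in a non-debug configuration.

```python
import logging

logger = logging.getLogger("raylet_sketch")  # hypothetical logger name


def expensive_summary(calls):
    # Stand-in for a costly function called inside a debug log statement.
    calls.append("evaluated")
    return "summary"


def log_debug_lazily(make_message):
    # Build the message only when DEBUG is enabled; otherwise the callable
    # is never invoked and its cost is never paid.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(make_message())
```

Passing a callable instead of a finished string is what defers the work; passing `expensive_summary(calls)` directly would evaluate it before the level check.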
Robert Nishihara
f16d33593b Mark worker as blocked and trigger reconstruction in ray.wait. (#2864)
* Trigger reconstruction in ray.wait and mark worker as blocked.

* Add test.

* Linting.

* Don't run new test with legacy Ray.

* Only call HandleClientUnblocked if it actually blocked in ray.wait.

* Reduce time to ray.wait in the test.
2018-09-13 15:28:17 -07:00
Hanwei Jin
fbf214e408 update ray cmake build process (#2853)
* use cmake to build ray project, no need to apply build.sh before cmake, fix some abuse of cmake, improve the build performance

* support boost external project, avoid using the system or build.sh boost

* keep compatible with build.sh, remove boost and arrow build from it.

* bugfix: parquet bison version control, plasma_java lib install problem

* bugfix: cmake, do not compile plasma java client if no need

* bugfix: component failures test timeout mechanism has problem for plasma manager failed case

* bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master

* revert some fix

* set arrow python executable, fix format error in component_failures_test.py

* make clean arrow python build directory

* update cmake code style, back to support cmake minimum version 3.4
2018-09-12 11:19:33 -07:00
Hao Chen
8414e413a2 [java] refine and simplify java worker code structure (#2838) 2018-09-10 10:48:17 -07:00
Zhijun Fu
753ba76141 [Issue 2809][xray] Cleanup on driver detach (#2826)
This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass.

The following should happen when a driver exits (either gracefully or ungracefully).

#2797 should be enabled and pass.
Any actors created by the driver that are still running should be killed.
Any workers running tasks for the driver should be killed.
Any tasks for the driver in any node_manager queues should be removed.
Any future tasks received by a node manager for the driver should be ignored.
The driver death notification should only be received once.
2018-09-07 16:11:32 +08:00
Wang Qing
7e13e1fd49 [Java] Remove non-raylet code in Java. (#2828) 2018-09-06 14:54:13 +08:00
Yuhong Guo
dfb7c2be1e [Java] Add Plasma Free to Java code path (#2802) 2018-09-04 15:28:23 +08:00
Robert Nishihara
0ac855e061 Push errors to all drivers when node is marked dead. (#2808)
* Push errors to all drivers when node is marked dead.

* Fix
2018-09-02 20:04:58 -07:00
Yuhong Guo
2691b3a11a Add signal handlers to improve debuggability (#2757)
* Add signal handlers to improve debuggability.

* Fix Linux compiling

* Fix Lint

* Change SIGILL case that happens in both Linux and macOS

* Add signal handler to main functions.

* Change handler name.

* Address comment

* Address comment.

* Fix Linux building failure

* Introduce RAII mechanism to SignalHandlers.

* Add InitShutdownWrapper to handle all RAII requirements

* Change util_test to signal_test

* Make sure shutdown is not nullptr.

* Using google::InstallFailureSignalHandler() instead of our own signal handler

* Refine code according to comment

* Fix valgrind test failure.

* remove Shutdown template

* consistency

* linting
2018-09-01 21:58:23 -07:00
Philipp Moritz
869ee8e25d Integrate plasma store list facility (#2752) 2018-09-01 16:53:51 -07:00
Alexey Tumanov
fdc9688226 [xray] push warning to driver for infeasible tasks (#2784)
This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.
2018-09-01 13:21:27 -07:00
Yucong He
5b45f0bdff [xray] Implementing Gcs sharding (#2409)
Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
2018-08-31 15:54:30 -07:00
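A rough sketch of the routing described in #2409: some tables always go to the primary shard, and other keys are spread across shards by a hash. The commit message does not say which hash is used or how shards are indexed, so CRC32, the use of index 0 for the primary shard, and the table name `ObjectTable` in the usage below are all assumptions for illustration.

```python
import zlib

# Tables that the commit message says stay on the primary shard.
PRIMARY_ONLY_TABLES = {"ClientTable", "ErrorTable", "TaskTable"}


def pick_shard(table, key, num_shards):
    """Return a shard index; index 0 stands in for the primary shard."""
    if num_shards < 1:
        raise ValueError("need at least one shard")
    if table in PRIMARY_ONLY_TABLES:
        return 0
    # Assumed hash: CRC32 of the key bytes, reduced modulo the shard count.
    return zlib.crc32(key) % num_shards
```

For example, `pick_shard("ClientTable", b"abc", 4)` always selects the primary shard, while `pick_shard("ObjectTable", b"object-1", 4)` selects the same shard every time it is called with that key.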
Ryan Sepassi
b6260003cb Some small changes (#2782)
* Add some imports that make it easier to build with Bazel
* Use "/tmp" paths for sockets in tests
* Move `asio_test` into `run_gcs_tests.sh` instead of starting and stopping Redis within the test fixture with a `system` call.
2018-08-30 22:42:49 -07:00
Wang Qing
514633456b [Java] Fix out-dated signatures of JNI methods (#2756)
1) Renamed the native JNI methods and some parameters of JNI methods. 
2) Fixed native JNI methods' signatures by `javah` tool.
3) Removed some useless native methods.
2018-08-30 17:59:29 +08:00
Robert Nishihara
ba7efafa67 Remove force_start argument from StartWorkerProcess. (#2762)
This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency are ever started concurrently. In particular, when the raylet starts up, it may start fewer than num_workers workers.
2018-08-30 13:43:47 +08:00
Robert Nishihara
132f133214 Limit number of concurrent workers started by hardware concurrency. (#2753)
* Limit number of concurrent workers started by hardware concurrency.

* Check if std::thread::hardware_concurrency() returns 0.

* Pass in max concurrency from Python.

* Fix Java call to startRaylet.

* Fix typo

* Remove unnecessary cast.

* Fix linting.

* Cleanups on Java side.

* Comment back in actor test.

* Require maximum_startup_concurrency to be at least 1.

* Fix linting and test.

* Improve documentation.

* Fix typo.
2018-08-29 14:53:40 +08:00
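The limit introduced in #2753 can be sketched as follows. This is a Python illustration rather than the raylet's C++ code: `os.cpu_count()` plays the role of `std::thread::hardware_concurrency()` (it may return `None` where the C++ call may return 0), and both function names are hypothetical.

```python
import os


def maximum_startup_concurrency(requested=None):
    """Return how many workers may be starting at once (always >= 1)."""
    detected = os.cpu_count()  # may be None when it cannot be determined
    limit = requested if requested is not None else detected
    # Fall back to 1 when the value is missing, zero, or negative.
    if not limit or limit < 1:
        limit = 1
    return limit


def workers_to_start_now(num_wanted, num_starting, limit):
    """How many more worker processes can be launched right now."""
    if limit < 1:
        raise ValueError("maximum_startup_concurrency must be at least 1")
    return max(0, min(num_wanted, limit - num_starting))
```

With a limit of 4 and one worker already starting, a request for 10 workers launches only 3 immediately; the rest wait until earlier starts finish.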
Alexey Tumanov
de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
2018-08-28 00:03:34 -07:00
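The notion of load defined in #2749 can be made concrete with a short sketch. It assumes each queued ready task is represented by a dictionary of resource demands; that representation and the name `raylet_load` are illustrative, not taken from the PR.

```python
def raylet_load(ready_queue):
    """Sum per-resource demand over all queued ready tasks."""
    load = {}
    for demand in ready_queue:
        for resource, amount in demand.items():
            load[resource] = load.get(resource, 0) + amount
    return load
```

A queue of three tasks demanding `{"CPU": 1, "GPU": 1}`, `{"CPU": 2}`, and `{"CPU": 1}` has a task count of 3 but a load of 4 CPUs and 1 GPU, which is the heterogeneity that a plain task count hides.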
Wang Qing
b4cba9a49f [java] Fix the logic of generating TaskID (#2747)
## What do these changes do?
Because the logic of generating `TaskID` in java is different from python's, many tests fail when we change the `Ray Core` code.
In this change, I rewrote the logic of generating `TaskID` in java so that it is the same as python's.

In java, we call the native method `_generateTaskId()` to generate a `TaskID` which is also used in python. We change `computePutId()`'s logic too.

## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)
2018-08-27 13:11:33 -07:00
Hao Chen
f37c260bdb [multi-language part 3] support multiple languages in raylet backend (#2672)
This PR enables multi-language support in the raylet backend.
- `Worker` class now has a `language` label;
- `WorkerPool`:
	- It now maintains one set of states for each language.
	- `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id.
	- `Size` and `StartWorkerProcess` functions now have an extra `language` parameter.
- `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.
2018-08-26 22:06:25 -07:00
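The `WorkerPool` behaviour described in #2672 might look roughly like the sketch below. It is a simplified Python model, not the raylet's C++ class: the task specification is assumed to be a dictionary with `language` and optional `actor_id` keys, and the class and method names are hypothetical.

```python
from collections import defaultdict


class WorkerPoolSketch:
    def __init__(self):
        # One independent set of state per language.
        self._idle = defaultdict(list)         # language -> idle workers
        self._idle_actors = defaultdict(dict)  # language -> {actor_id: worker}

    def push_worker(self, language, worker, actor_id=None):
        if actor_id is None:
            self._idle[language].append(worker)
        else:
            self._idle_actors[language][actor_id] = worker

    def pop_worker(self, task_spec):
        # Choose a worker based on both the task's language and actor id.
        language = task_spec["language"]
        actor_id = task_spec.get("actor_id")
        if actor_id is not None:
            return self._idle_actors[language].pop(actor_id, None)
        idle = self._idle[language]
        return idle.pop() if idle else None

    def size(self, language):
        return len(self._idle[language]) + len(self._idle_actors[language])
```

Keeping state keyed by language means a Java task can never be handed a Python worker, even when only Python workers are idle.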
Yuhong Guo
697bfb14db Hotfix for glog PR (#2734) 2018-08-24 16:30:51 -07:00
Philipp Moritz
b4c47a5861 Upgrade arrow to include more detailed flushing message (#2706) 2018-08-24 11:44:04 -07:00
Stephanie Wang
1b3de31ff1 [xray] Fix bug where driver task ID is assumed to be nil (#2725)
## What do these changes do?

#2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.
2018-08-23 14:44:47 -07:00
Yuhong Guo
eec1a3eb89 Support pluggable backend log lib with glog (#2695)
* [WIP] Support different backend log lib

* Refine code, unify level, address comment

* Address comment and change formatter

* Fix linux building failure.

* Fix lint

* Remove log4cplus.

* Add log init to raylet main and add test to travis.

* Address comment and refine.

* Update logging_test.cc
2018-08-23 09:43:38 -07:00
Stephanie Wang
8fd5757aaa [xray] Don't process any more messages from dead node managers (#2688) 2018-08-19 21:11:40 -07:00
Wang Qing
06a58016d8 [multi-language part 2] Change the command line arguments to start raylet (#2670) 2018-08-16 21:59:44 -07:00
Hao Chen
a719e089b0 [multi-language part 1] add a 'language' field to task specification (#2639) 2018-08-16 21:26:42 -07:00
Stephanie Wang
e3e0cfce87 [xray] Resubmit tasks that fail to be forwarded (#2645) 2018-08-16 00:12:56 -07:00
Philipp Moritz
6cb6dd30d1 silence shutdown callback (#2662) 2018-08-15 22:48:00 -07:00
tianyapiaozi
98fed67b45 fix offset by one issue in the local scheduler (#2652) 2018-08-15 10:10:30 -07:00
Yuhong Guo
eeb15771ba Add ray.internal.free (#2542) 2018-08-14 22:01:23 -07:00
Stephanie Wang
62649715ca [xray] Cache a task's object dependencies (#2623)
* Cache a Task's object dependencies

* Cache the parent task IDs for lineage cache entries

* Cache the parent task IDs in lineage cache entries

* revert

* Fix test

* remove unused line

* Fix test
2018-08-14 20:25:41 -07:00
Stephanie Wang
dede80f3df [xray] Reduce fatal checks in the lineage cache that fail during reconstruction (#2642)
* Loosen checks in the lineage cache and log appropriate warnings in the node manager

* revert test
2018-08-14 15:25:32 -07:00
Yuhong Guo
4bd98eed45 Support building Java and Python version at the same time. (#2640)
* Support building Java and Python version at the same time.

* Remove duplicated definition.

* Refine the building process of local_scheduler

* Refine

* Add comment for languages

* Modify instruction and add python, java building to CI.

* change according to comment
2018-08-14 11:33:51 -07:00
Stephanie Wang
806fdf2f05 [xray] Object manager retries Pull requests (#2630)
* Move all ObjectManager members to bottom of class def

* Better Pull requests
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients

* increase object manager Pull timeout

* Make the component failure test harder.

* note

* Notify SubscribeObjectLocations caller of empty list

* Address melih's comments

* Fix wait...

* Make component failure test easier for legacy ray

* lint
2018-08-13 19:15:55 -07:00
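The three Pull behaviours listed in #2630 (suppress duplicates, retry at the next client after a timeout, cancel when no client has the object) can be sketched as a small state machine. This is an illustrative Python model with hypothetical names; the real object manager drives these steps from timers and GCS notifications that are not modelled here.

```python
class PullSketch:
    def __init__(self):
        self._pulls = {}  # object_id -> {"clients": [...], "next": int}

    def pull(self, object_id, clients):
        """Start a Pull; return the client to ask, or None."""
        if object_id in self._pulls:
            return None  # duplicate Pull suppressed
        if not clients:
            return None  # no known location to pull from
        self._pulls[object_id] = {"clients": list(clients), "next": 1}
        return clients[0]

    def on_timeout(self, object_id):
        """Retry the Pull at the next known client after a timeout."""
        state = self._pulls.get(object_id)
        if state is None:
            return None
        clients = state["clients"]
        client = clients[state["next"] % len(clients)]
        state["next"] += 1
        return client

    def on_locations_changed(self, object_id, clients):
        """Update locations; cancel the Pull if no client has the object."""
        if object_id not in self._pulls:
            return
        if not clients:
            del self._pulls[object_id]
        else:
            self._pulls[object_id]["clients"] = list(clients)

    def is_active(self, object_id):
        return object_id in self._pulls
```

The retry cycles through the known clients in order, so a single unresponsive client delays the Pull by one timeout rather than blocking it.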
Stephanie Wang
4a7be6f46d [xray] Make sure raylet does not crash if remote raylet dies (#2619)
* Log a warning on remote object manager failures

* Mark a task that failed to be forwarded as pending

* Raylet component failure test and make it harder

* Turn on component failure test for xray

* Remove return status from ReleaseSender

* lint
2018-08-09 20:36:30 -07:00
Hao Chen
170e08cf02 fix a bug in killing unregistered workers (#2613) 2018-08-09 17:57:25 -07:00
Philipp Moritz
143a118fbf [xray] Fix valgrind crash when memory profiling raylet (#2583)
* use different random number generator to be compatible with older valgrind versions

* seed from time

* style

* fix

* remove more random devices

* also remove random_device from global scheduler

* rename mutex

* linting
2018-08-09 15:37:17 -07:00
Stephanie Wang
f093ed1fc6 [xray] Fix crash in case of spurious reconstruction (#2609)
* Exit if task already queued

* address comments
2018-08-09 14:46:46 -07:00
Stephanie Wang
2de9bfc7e3 [xray] Log warnings for asio handlers that take too long (#2601)
* Add fatal check for heartbeat drift

* Log warning messages for handlers that take too long

* Add debug labels to all ClientConnections
2018-08-09 14:39:23 -07:00
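The warning for slow handlers in #2601 amounts to timing each handler and logging when it exceeds a threshold. The sketch below shows that pattern in Python; the function name, the threshold parameter, and the injectable `clock` are assumptions made so the example is self-contained, and the real change applies to asio handlers in C++.

```python
import time


def run_handler(handler, warn_after_ms, warn, clock=time.monotonic):
    """Run handler() and call warn(...) if it took longer than the limit."""
    start = clock()
    result = handler()
    elapsed_ms = (clock() - start) * 1000.0
    if elapsed_ms > warn_after_ms:
        warn("handler took %.1f ms" % elapsed_ms)
    return result
```

Because handlers on a single event loop run one at a time, one slow handler delays everything queued behind it, including heartbeats, which is why a warning here is useful.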
Stephanie Wang
d49b4bef0a [xray] Basic task reconstruction mechanism (#2526)
## What do these changes do?

This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.

Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).

Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
2018-08-09 07:24:37 -07:00
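One way to get the suppression described in part 1 of #2526 is an append that succeeds only at an expected log position, so that of two raylets racing to reconstruct the same task, exactly one wins. Whether `TaskReconstructionLog` works this way is an assumption; the commit message does not describe the mechanism, and the class below is a hypothetical in-memory stand-in for a GCS table.

```python
class TaskReconstructionLogSketch:
    def __init__(self):
        self._entries = {}  # task_id -> list of node ids that reconstructed it

    def append_at(self, task_id, node_id, index):
        """Append only if the log currently has exactly `index` entries."""
        log = self._entries.setdefault(task_id, [])
        if len(log) != index:
            return False  # another node already claimed this attempt
        log.append(node_id)
        return True
```

If two nodes both observe an empty log and both call `append_at(task, node, 0)`, the first call returns `True` and that node resubmits the task; the second returns `False` and that node does nothing.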
Melih Elibol
8ae82180b4 [xray] Adds a driver table. (#2289)
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.

Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
2018-08-08 23:41:40 -07:00
Alexey Tumanov
df7ee7ff1e raylet memory corruption fixes (#2591)
* raylet memory corruption fixes

* add util function to translate boost error to ray status

* tcp client connection now using ray status utility function

* lint
2018-08-08 19:50:43 -07:00
Stephanie Wang
6ab01a2cad [xray] Fix bug when counting a task's lineage size (#2600) 2018-08-08 00:00:17 -07:00
Ujval Misra
a0691ee49b [xray] Prevent sending excessive uncommitted lineage on task forwarding (#2534)
* Add set to lineage cache entry to track nodes already forwarded to.

* Uncommitted lineage function naming, documentation.

* Simple test for uncommitted lineage with a marked task.

* Rebased, changed tests to use ClientID::nil.

* Bug fix, change MergeLineageHelper function type.

* Formatting.

* Checks and test changes based on PR comments.

* GetUncommittedLineage now always returns at least the requested task ID.

* Bug fix (return at least requested task ID)

* Formatting
2018-08-07 21:10:23 -07:00
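The approach in #2534 (a per-entry set of nodes already forwarded to, with the requested task always included in the result) can be sketched as below. The data layout and names are hypothetical, and marking entries as forwarded inside the same function is a simplification for this example rather than a description of the actual lineage cache.

```python
class LineageEntrySketch:
    def __init__(self, task_id, parents=()):
        self.task_id = task_id
        self.parents = list(parents)
        self.forwarded_to = set()  # nodes this entry was already sent to


def lineage_to_forward(entries, task_id, node_id):
    """Return task_id plus the uncommitted ancestors node_id has not seen."""
    to_send = [task_id]  # always contains at least the requested task
    entries[task_id].forwarded_to.add(node_id)
    stack = list(entries[task_id].parents)
    while stack:
        parent = stack.pop()
        entry = entries.get(parent)
        # Skip committed (missing) entries and ones already sent to node_id.
        if entry is None or node_id in entry.forwarded_to or parent in to_send:
            continue
        entry.forwarded_to.add(node_id)
        to_send.append(parent)
        stack.extend(entry.parents)
    return to_send
```

Stopping the traversal at an already-forwarded entry is what keeps repeated forwards to the same node from resending a long chain of ancestors each time.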
Philipp Moritz
e7f76d7914 [xray] Fix typo concerning heartbeat_timeout_milliseconds in monitor (#2586) 2018-08-07 13:45:51 -07:00
Philipp Moritz
25f0094ee4 Fix copying the plasma fbs directory from arrow (#2579) 2018-08-07 00:04:37 -07:00
Yuhong Guo
d35ce7fa63 Use real callback index in subscribe_callback_index_ (#2473) 2018-08-06 15:29:56 -07:00
Alexey Tumanov
85b8b2a395 mark all remaining placeable tasks pending with task dependency manager (#2528) 2018-08-06 13:08:11 -07:00
Melih Elibol
34d3a46f48 [xray] Revert dynamic chunk size optimization for ObjectManager. (#2557)
* Revert dynamic chunk size optimization.

* fix mac build issues.
2018-08-05 02:09:37 -07:00