551 commits

Author SHA1 Message Date
Eric Liang
9d23fa03c9 [xray] All messages on main asio event loop should be written asynchronously (#3023)
* copy over ref code

* wip async writes

* compiles

* fix error handling

* add test

* amend

* fix test

* clang fmt

* clang format

* wip

* yapf

* rename format script

* test error

* clangfmt

* add test to list

* warn

* ref test

* fix test

* comment

* add capture

* Update client_connection.cc

* wip

* fix compile
2018-10-18 21:56:22 -07:00
Yuhong Guo
653c5b114a [c++] Refine Log Code (#2816)
* Support setting logging level from env variable

* Remove Env Variable related code

* lint
2018-10-18 10:51:36 -07:00
Peter Schafhalter
a41bbc10ef Add password authentication to Redis ports (#2952)
* Implement Redis authentication

* Throw exception for legacy Ray

* Add test

* Formatting

* Fix bugs in CLI

* Fix bugs in Raylet

* Move default password to constants.h

* Use pytest.fixture

* Fix bug

* Authenticate using formatted strings

* Add missing passwords

* Add test

* Improve authentication of async contexts

* Disable Redis authentication for credis

* Update test for credis

* Fix rebase artifacts

* Fix formatting

* Add workaround for issue #3045

* Increase timeout for test

* Improve C++ readability

* Fixes for CLI

* Add security docs

* Address comments

* Address comments

* Address comments

* Use ray.get

* Fix lint
2018-10-16 22:48:30 -07:00
Robert Nishihara
faa31ae018 Introduce concept of resources required for placing a task. (#2837)
* Introduce concept of resources required for placement.
* Add placement resources to task spec
* Update java worker
* Update taskinfo.java
2018-10-04 10:35:39 -07:00
Richard Liaw
01bb073569 Suppress errors when worker or driver intentionally disconnects. (#2935) 2018-10-04 00:06:34 -07:00
Robert Nishihara
3ce8eb2d4c Test dying_worker_get and dying_worker_wait for xray. (#2997)
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.
2018-10-02 00:08:47 -07:00
Wang Qing
a879302355 Improve log message when failing to fork worker process (#2990)
## What do these changes do?
```c++
  // Try to execute the worker command.
  int rv = execvp(worker_command_args[0],
                  const_cast<char *const *>(worker_command_args.data()));
  // The worker failed to start. This is a fatal error.
  RAY_LOG(FATAL) << "Failed to start worker with return value " << rv;
```
When starting a process fails, the return value `rv` is always set to -1,
so logging it tells us nothing about the cause.
The log message should show the actual reason for the failure.

For example, if Java is not installed, the message shown should be:
```shell
 Failed to start worker: No such file or directory.
```
This helps us locate the issue quickly.
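A minimal sketch of the idea described above, assuming a POSIX system: `execvp` only returns on failure (always with -1), so the useful information is in `errno`. The helper names `DescribeExecFailure` and `TryExec` are hypothetical and are not taken from the PR; the real change presumably logs through `RAY_LOG(FATAL)` instead of returning a string.

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical helper (not from the PR): turns an errno value into the kind
// of message the commit asks for.
std::string DescribeExecFailure(int saved_errno) {
  return std::string("Failed to start worker: ") + std::strerror(saved_errno) + ".";
}

// Hypothetical helper: tries to exec the command. If execvp returns at all,
// it failed, so errno is captured immediately and converted to a message.
// `args` must be terminated by a nullptr entry, as execvp requires.
std::string TryExec(const std::vector<const char *> &args) {
  execvp(args[0], const_cast<char *const *>(args.data()));
  const int saved_errno = errno;  // Capture before any other call can change it.
  return DescribeExecFailure(saved_errno);
}
```

Capturing `errno` right after the failed call matters because later library calls may overwrite it.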

## Related issue number
N/A
2018-09-29 22:10:57 +08:00
Hao Chen
971df5ea8a [java] put function meta in task spec and load functions with function meta (#2881)
This PR adds a `function_desc` field into task spec. A function descriptor is a list of strings that can uniquely describe a function.
- For a Python function, it should be: [module_name, class_name, function_name]
- For a Java function, it should be: [class_name, method_name, type_descriptor]
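The two descriptor layouts above can be sketched as a plain list of strings. This is only an illustration: the alias `FunctionDescriptor` and the two factory functions are hypothetical names, and the source does not say how the field is represented inside the task spec.

```cpp
#include <string>
#include <vector>

// Hypothetical alias: the source only states that a descriptor is a list of
// strings that uniquely describes a function.
using FunctionDescriptor = std::vector<std::string>;

// Python layout from the commit message: [module_name, class_name, function_name].
FunctionDescriptor MakePythonDescriptor(const std::string &module_name,
                                        const std::string &class_name,
                                        const std::string &function_name) {
  return {module_name, class_name, function_name};
}

// Java layout from the commit message: [class_name, method_name, type_descriptor].
FunctionDescriptor MakeJavaDescriptor(const std::string &class_name,
                                      const std::string &method_name,
                                      const std::string &type_descriptor) {
  return {class_name, method_name, type_descriptor};
}
```

For example, `MakeJavaDescriptor("org.example.Foo", "bar", "(I)V")` would describe a made-up Java method `bar` that takes an `int` and returns `void`.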

There are a couple of reasons for adding this field:

In this PR:
- Java worker needs to know function's class name to load it. Previously, since task spec didn't have such a field to hold this info, we did a hack by appending the class name to the argument list. With this change, we fixed that hack and significantly simplified function management in Java.

Will be done in subsequent PRs:
- Support cross-language invocation (#2576): currently the Python worker manages functions by saving them in GCS and passing the function id in the task spec. However, if we want to call a Python function from Java, we cannot save it in GCS and get the function id. Instead, we can pass the function descriptor (module name, class name, function name) in the task spec and use it to load the function.
- Support deployment: one major problem of Python worker's current function management mechanism is #2327. In prod env, we should have a mechanism to deploy code and dependencies to the cluster. And when code is already deployed, we don't need to save functions to GCS any more and can use `function_desc` to manage functions.
2018-09-25 23:05:05 -07:00
Hanwei Jin
9f9e49e4a1 [cmake] enable using thirdparty env variable to find installed dependency (#2912)
* enable using thirdparty env variable to find installed dependency, to speed up the build process

* fix target dependency in cmake. :-) too chaos in each CMakeLists

* check env variable defined directory exists
2018-09-23 07:52:33 -07:00
Yuhong Guo
b29839a0a3 Fix node manager failure when ClientTable has a disconnected entry. (#2905)
When a new raylet starts, `ClientAdded` will be called with the disconnected client data. However, since the client was closed, the connection will fail.
2018-09-21 22:45:06 -07:00
Hao Chen
715ec1bca5 Modularize NodeManager::ProcessClientMessage (#2895)
Split NodeManager::ProcessClientMessage into a couple of smaller functions, each of which handles one type of message.
2018-09-18 14:18:34 -07:00
Yuhong Guo
a8248e8628 Fix ObjectManager Crash (#2833)
Fixes issue where object manager sometimes crashes within the `Wait` method: The issue stems from inconsistent behavior of the boost deadline timer's `cancel` method, which is invoked within `WaitComplete` to enforce exactly one `WaitComplete` invocation for each `Wait` request. The `cancel` method sometimes fails to actually prevent the timer's invocation of the provided handler with non-zero error code.
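One way to stay correct when a timer's `cancel` cannot be relied on is to guard the completion path with an explicit flag, so that whichever caller arrives second becomes a no-op. The sketch below illustrates that general technique only. `WaitRequest` and `Complete` are hypothetical names, and the source does not say this is how the PR fixed the crash.

```cpp
#include <functional>
#include <utility>

// Hypothetical sketch: one pending Wait request whose completion callback
// must run exactly once, no matter whether the timer handler or the
// "objects are ready" path gets there first.
class WaitRequest {
 public:
  explicit WaitRequest(std::function<void()> on_complete)
      : on_complete_(std::move(on_complete)) {}

  // Runs the callback on the first call only. Returns true if this call ran
  // the callback and false if the request had already completed.
  bool Complete() {
    if (completed_) {
      return false;
    }
    completed_ = true;
    on_complete_();
    return true;
  }

 private:
  std::function<void()> on_complete_;
  bool completed_ = false;
};
```

With this guard, a timer handler that still fires after `cancel` can call `Complete()` safely. The sketch assumes all calls happen on a single event loop thread, so the flag needs no locking.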
2018-09-16 02:14:13 -04:00
Philipp Moritz
47d2f82c6c Fix common cmake dependencies (#2876) 2018-09-15 22:11:12 -07:00
Hao Chen
e96817d074 fix a syntax error of initializing unordered_map (#2871)
The previous way is incompatible with older versions of gcc.
2018-09-14 12:07:08 -07:00
Philipp Moritz
2c9a4f6b41 Evaluate debug logging only in debug mode (#2869)
This PR makes it so debugging logs are only evaluated during debugging. We found that for the current code, functions called in debug logging code are evaluated even in release mode (even though nothing is printed).
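The effect described above can be sketched with a macro that puts the stream expression in a branch that is skipped when logging is disabled. `SKETCH_DEBUG_LOG` and `ExpensiveDebugValue` are hypothetical names used only for illustration; this is not Ray's `RAY_LOG` implementation.

```cpp
#include <iostream>

// Hypothetical macro: when `enabled` is false the else branch is never taken,
// so everything streamed after the macro is not evaluated at all.
#define SKETCH_DEBUG_LOG(enabled) \
  if (!(enabled)) {               \
  } else                          \
    std::cerr

// Stand-in for a costly function that appears inside a debug log statement.
// It counts its own invocations so the skipping can be observed.
int ExpensiveDebugValue(int *call_count) {
  ++(*call_count);
  return 42;
}
```

A statement such as `SKETCH_DEBUG_LOG(false) << ExpensiveDebugValue(&n);` leaves `n` unchanged, whereas a logger that merely discards its output would still pay for the call.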
2018-09-14 11:40:44 -07:00
Robert Nishihara
f16d33593b Mark worker as blocked and trigger reconstruction in ray.wait. (#2864)
* Trigger reconstruction in ray.wait and mark worker as blocked.

* Add test.

* Linting.

* Don't run new test with legacy Ray.

* Only call HandleClientUnblocked if it actually blocked in ray.wait.

* Reduce time to ray.wait in the test.
2018-09-13 15:28:17 -07:00
Hanwei Jin
fbf214e408 update ray cmake build process (#2853)
* use cmake to build ray project, no need to apply build.sh before cmake, fix some abuse of cmake, improve the build performance

* support boost external project, avoid using the system or build.sh boost

* keep compatible with build.sh, remove boost and arrow build from it.

* bugfix: parquet bison version control, plasma_java lib install problem

* bugfix: cmake, do not compile plasma java client if no need

* bugfix: component failures test timeout mechanism has problem for plasma manager failed case

* bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master

* revert some fix

* set arrow python executable, fix format error in component_failures_test.py

* make clean arrow python build directory

* update cmake code style, back to support cmake minimum version 3.4
2018-09-12 11:19:33 -07:00
Hao Chen
8414e413a2 [java] refine and simplify java worker code structure (#2838) 2018-09-10 10:48:17 -07:00
Zhijun Fu
753ba76141 [Issue 2809][xray] Cleanup on driver detach (#2826)
This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass.

The following should happen when a driver exits (either gracefully or ungracefully).

#2797 should be enabled and pass.
Any actors created by the driver that are still running should be killed.
Any workers running tasks for the driver should be killed.
Any tasks for the driver in any node_manager queues should be removed.
Any future tasks received by a node manager for the driver should be ignored.
The driver death notification should only be received once.
2018-09-07 16:11:32 +08:00
Wang Qing
7e13e1fd49 [Java] Remove non-raylet code in Java. (#2828) 2018-09-06 14:54:13 +08:00
Yuhong Guo
dfb7c2be1e [Java] Add Plasma Free to Java code path (#2802) 2018-09-04 15:28:23 +08:00
Robert Nishihara
0ac855e061 Push errors to all drivers when node is marked dead. (#2808)
* Push errors to all drivers when node is marked dead.

* Fix
2018-09-02 20:04:58 -07:00
Yuhong Guo
2691b3a11a Add signal handlers to improve debuggability (#2757)
* Add signal handlers to improve debuggability.

* Fix Linux compiling

* Fix Lint

* Change SIGILL case that happens in both Linux and macOS

* Add signal handler to main functions.

* Change handler name.

* Address comment

* Address comment.

* Fix Linux building failure

* Introduce RAII mechanism to SignalHandlers.

* Add InitShutdownWrapper to handle all RAII requirements

* Change util_test to signal_test

* Make sure shutdown is not nullptr.

* Using google::InstallFailureSignalHandler() instead of our own signal handler

* Refine code according to comment

* Fix valgrind test failure.

* remove Shutdown template

* consistency

* linting
2018-09-01 21:58:23 -07:00
Philipp Moritz
869ee8e25d Integrate plasma store list facility (#2752) 2018-09-01 16:53:51 -07:00
Alexey Tumanov
fdc9688226 [xray] push warning to driver for infeasible tasks (#2784)
This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.
2018-09-01 13:21:27 -07:00
Yucong He
5b45f0bdff [xray] Implementing Gcs sharding (#2409)
Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
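A rough sketch of the routing described above: some tables stay on the primary shard, and everything else is spread over the shards by key. The enum `TableKind`, the constant `kPrimaryShard`, the function `SelectShard`, and the use of `std::hash` with a modulo are all assumptions for illustration. The source does not state which hashing scheme the PR uses.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical table categories. kOther stands for any table that is sharded.
enum class TableKind { kClient, kError, kTask, kOther };

// Hypothetical sentinel meaning "use the primary shard".
constexpr int kPrimaryShard = -1;

// Returns kPrimaryShard for tables the commit keeps on the primary shard
// (ClientTable, ErrorTable and, for now, TaskTable). Otherwise returns a
// shard index in [0, num_shards) derived from the key.
int SelectShard(TableKind table, const std::string &id, std::size_t num_shards) {
  const bool pinned_to_primary = table == TableKind::kClient ||
                                 table == TableKind::kError ||
                                 table == TableKind::kTask;
  if (pinned_to_primary || num_shards == 0) {
    return kPrimaryShard;
  }
  return static_cast<int>(std::hash<std::string>{}(id) % num_shards);
}
```

The same key always maps to the same shard within one process, which is the property a sharded lookup needs. A real system would also need a hash that is stable across processes, which `std::hash` does not guarantee.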
2018-08-31 15:54:30 -07:00
Ryan Sepassi
b6260003cb Some small changes (#2782)
* Add some imports that make it easier to build with Bazel
* Use "/tmp" paths for sockets in tests
* Move `asio_test` into `run_gcs_tests.sh` instead of starting and stopping Redis within the test fixture with a `system` call.
2018-08-30 22:42:49 -07:00
Wang Qing
514633456b [Java] Fix out-dated signatures of JNI methods (#2756)
1) Renamed the native JNI methods and some parameters of JNI methods. 
2) Fixed native JNI methods' signatures by `javah` tool.
3) Removed some useless native methods.
2018-08-30 17:59:29 +08:00
Robert Nishihara
ba7efafa67 Remove force_start argument from StartWorkerProcess. (#2762)
This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency are ever started concurrently. In particular, when the raylet starts up, it may start fewer than num_workers workers.
2018-08-30 13:43:47 +08:00
Robert Nishihara
132f133214 Limit number of concurrent workers started by hardware concurrency. (#2753)
* Limit number of concurrent workers started by hardware concurrency.

* Check if std::thread::hardware_concurrency() returns 0.

* Pass in max concurrency from Python.

* Fix Java call to startRaylet.

* Fix typo

* Remove unnecessary cast.

* Fix linting.

* Cleanups on Java side.

* Comment back in actor test.

* Require maximum_startup_concurrency to be at least 1.

* Fix linting and test.

* Improve documentation.

* Fix typo.
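The bullets above mention two details that are easy to get wrong: `std::thread::hardware_concurrency()` may return 0 when the value cannot be determined, and `maximum_startup_concurrency` must be at least 1. A minimal sketch under those constraints follows. The three function names are hypothetical, and the commit actually passes the limit in from Python rather than computing it this way.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: falls back to 1 when the core count is reported as 0,
// which the standard allows when the value is not computable.
int ClampStartupConcurrency(unsigned int reported_cores) {
  return reported_cores == 0 ? 1 : static_cast<int>(reported_cores);
}

// Hypothetical default for maximum_startup_concurrency.
int DefaultStartupConcurrency() {
  return ClampStartupConcurrency(std::thread::hardware_concurrency());
}

// How many more workers may be started right now without exceeding the limit
// on workers that are starting concurrently.
int WorkersToStart(int requested, int currently_starting,
                   int maximum_startup_concurrency) {
  const int budget = std::max(0, maximum_startup_concurrency - currently_starting);
  return std::min(requested, budget);
}
```

With a limit of 4 and one worker already starting, a request for 10 workers is trimmed to 3. This matches the note that the raylet may initially start fewer than `num_workers` workers.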
2018-08-29 14:53:40 +08:00
Alexey Tumanov
de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing  heterogeneity-aware late-binding/work-stealing.
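The notion of load introduced above, the sum of all resource demand from queued ready tasks, can be sketched as follows. `ResourceSet` and `ComputeLoad` are hypothetical names, and representing a demand as a map from resource name to quantity is an assumption made for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical representation: resource name -> quantity demanded.
using ResourceSet = std::map<std::string, double>;

// Sums the demand of every queued ready task, per resource. The result is a
// heterogeneity-aware load measure, unlike a plain task count.
ResourceSet ComputeLoad(const std::vector<ResourceSet> &ready_task_demands) {
  ResourceSet load;
  for (const auto &demand : ready_task_demands) {
    for (const auto &entry : demand) {
      load[entry.first] += entry.second;
    }
  }
  return load;
}
```

Two queued tasks that need {CPU: 1, GPU: 1} and {CPU: 2} give a load of {CPU: 3, GPU: 1}. A task count would report 2 and hide the GPU demand.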
2018-08-28 00:03:34 -07:00
Wang Qing
b4cba9a49f [java] Fix the logic of generating TaskID (#2747)
## What do these changes do?
Because Java generated `TaskID`s differently from Python, many tests failed whenever the `Ray Core` code changed.
This change rewrites the `TaskID` generation logic in Java so that it matches Python's.

In Java, we now call the native method `_generateTaskId()`, which Python also uses, to generate a `TaskID`. The logic of `computePutId()` changes as well.

## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)
2018-08-27 13:11:33 -07:00
Hao Chen
f37c260bdb [multi-language part 3] support multiple languages in raylet backend (#2672)
This PR enables multi-language support in the raylet backend.
- `Worker` class now has a `language` label;
- `WorkerPool`:
	- It now maintains one set of states for each language.
	- `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id.
	- `Size` and `StartWorkerProcess` functions now have an extra `language` parameter.
- `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.
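The "one set of states for each language" idea can be sketched as a map keyed by language. This is a simplified illustration: `WorkerPoolSketch` is a hypothetical class, its `PopWorker` takes a language directly rather than a `TaskSpecification`, and actor ids are ignored.

```cpp
#include <map>
#include <vector>

enum class Language { PYTHON, JAVA };

// Simplified stand-in for a worker: an id plus its language label.
struct Worker {
  int id;
  Language language;
};

// Hypothetical sketch of a pool that keeps separate state per language.
class WorkerPoolSketch {
 public:
  // Adds an idle worker to the state that belongs to its language.
  void PushWorker(const Worker &worker) {
    states_[worker.language].idle.push_back(worker);
  }

  // Number of idle workers of the given language.
  int Size(Language language) const {
    auto it = states_.find(language);
    return it == states_.end() ? 0 : static_cast<int>(it->second.idle.size());
  }

  // Pops an idle worker of the given language into *out.
  // Returns false if no such worker is available.
  bool PopWorker(Language language, Worker *out) {
    auto it = states_.find(language);
    if (it == states_.end() || it->second.idle.empty()) {
      return false;
    }
    *out = it->second.idle.back();
    it->second.idle.pop_back();
    return true;
  }

 private:
  struct State {
    std::vector<Worker> idle;
  };
  std::map<Language, State> states_;
};
```

Keeping the states separate means a Java task can never be handed a Python worker, and adding a language only adds a map entry.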
2018-08-26 22:06:25 -07:00
Yuhong Guo
697bfb14db Hotfix for glog PR (#2734) 2018-08-24 16:30:51 -07:00
Philipp Moritz
b4c47a5861 Upgrade arrow to include more detailed flushing message (#2706) 2018-08-24 11:44:04 -07:00
Stephanie Wang
1b3de31ff1 [xray] Fix bug where driver task ID is assumed to be nil (#2725)
## What do these changes do?

#2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.
2018-08-23 14:44:47 -07:00
Yuhong Guo
eec1a3eb89 Support pluggable backend log lib with glog (#2695)
* [WIP] Support different backend log lib

* Refine code, unify level, address comment

* Address comment and change formatter

* Fix linux building failure.

* Fix lint

* Remove log4cplus.

* Add log init to raylet main and add test to travis.

* Address comment and refine.

* Update logging_test.cc
2018-08-23 09:43:38 -07:00
Stephanie Wang
8fd5757aaa [xray] Don't process any more messages from dead node managers (#2688) 2018-08-19 21:11:40 -07:00
Wang Qing
06a58016d8 [multi-language part 2] Change the command line arguments to start raylet (#2670) 2018-08-16 21:59:44 -07:00
Hao Chen
a719e089b0 [multi-language part 1] add a 'language' field to task specification (#2639) 2018-08-16 21:26:42 -07:00
Stephanie Wang
e3e0cfce87 [xray] Resubmit tasks that fail to be forwarded (#2645) 2018-08-16 00:12:56 -07:00
Philipp Moritz
6cb6dd30d1 silence shutdown callback (#2662) 2018-08-15 22:48:00 -07:00
tianyapiaozi
98fed67b45 fix offset by one issue in the local scheduler (#2652) 2018-08-15 10:10:30 -07:00
Yuhong Guo
eeb15771ba Add ray.internal.free (#2542) 2018-08-14 22:01:23 -07:00
Stephanie Wang
62649715ca [xray] Cache a task's object dependencies (#2623)
* Cache a Task's object dependencies

* Cache the parent task IDs for lineage cache entries

* Cache the parent task IDs in lineage cache entries

* revert

* Fix test

* remove unused line

* Fix test
2018-08-14 20:25:41 -07:00
Stephanie Wang
dede80f3df [xray] Reduce fatal checks in the lineage cache that fail during reconstruction (#2642)
* Loosen checks in the lineage cache and log appropriate warnings in the node manager

* revert test
2018-08-14 15:25:32 -07:00
Yuhong Guo
4bd98eed45 Support building Java and Python version at the same time. (#2640)
* Support building Java and Python version at the same time.

* Remove duplicated definition.

* Refine the building process of local_scheduler

* Refine

* Add comment for languages

* Modify instruction and add Python, Java building to CI.

* change according to comment
2018-08-14 11:33:51 -07:00
Stephanie Wang
806fdf2f05 [xray] Object manager retries Pull requests (#2630)
* Move all ObjectManager members to bottom of class def

* Better Pull requests
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients

* increase object manager Pull timeout

* Make the component failure test harder.

* note

* Notify SubscribeObjectLocations caller of empty list

* Address melih's comments

* Fix wait...

* Make component failure test easier for legacy ray

* lint
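The three Pull behaviors listed in this commit (suppress duplicate Pulls, retry at the next client after a timeout, cancel when the object is on no clients) can be sketched with a small bookkeeping class. `PullTrackerSketch` and its methods are hypothetical names. Timers and the network are left out, so `NextClient` stands for "what the timeout handler would try next".

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping for in-flight Pull requests.
class PullTrackerSketch {
 public:
  // Returns true if a new Pull was registered and false if one is already in
  // flight for this object (the duplicate is suppressed).
  bool StartPull(const std::string &object_id) {
    return pulls_.emplace(object_id, PullState{}).second;
  }

  // Records the clients that currently hold the object. An empty list means
  // the object is no longer on any client, so the Pull is cancelled.
  void UpdateLocations(const std::string &object_id,
                       const std::vector<std::string> &clients) {
    auto it = pulls_.find(object_id);
    if (it == pulls_.end()) {
      return;
    }
    if (clients.empty()) {
      pulls_.erase(it);
      return;
    }
    it->second.clients = clients;
    if (it->second.next >= clients.size()) {
      it->second.next = 0;
    }
  }

  // Called on timeout: returns the next client to try, cycling through the
  // known locations. Returns an empty string if there is nothing to try.
  std::string NextClient(const std::string &object_id) {
    auto it = pulls_.find(object_id);
    if (it == pulls_.end() || it->second.clients.empty()) {
      return "";
    }
    PullState &state = it->second;
    const std::string client = state.clients[state.next];
    state.next = (state.next + 1) % state.clients.size();
    return client;
  }

  bool IsPulling(const std::string &object_id) const {
    return pulls_.count(object_id) > 0;
  }

 private:
  struct PullState {
    std::vector<std::string> clients;
    std::size_t next = 0;
  };
  std::map<std::string, PullState> pulls_;
};
```

Cycling through the location list is one simple retry order. The source only says the Pull is retried "at the next client", so the wrap-around is an assumption.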
2018-08-13 19:15:55 -07:00
Stephanie Wang
4a7be6f46d [xray] Make sure raylet does not crash if remote raylet dies (#2619)
* Log a warning on remote object manager failures

* Mark a task that failed to be forwarded as pending

* Raylet component failure test and make it harder

* Turn on component failure test for xray

* Remove return status from ReleaseSender

* lint
2018-08-09 20:36:30 -07:00
Hao Chen
170e08cf02 fix a bug in killing unregistered workers (#2613) 2018-08-09 17:57:25 -07:00