hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-10 13:26:39 -04:00

Author	SHA1	Message	Date
Hao Chen	e96817d074	fix a syntax error of initializing unordered_map (#2871 ) The previous way is incompatible with older version of gcc.	2018-09-14 12:07:08 -07:00
Philipp Moritz	2c9a4f6b41	Evaluate debug logging only in debug mode (#2869 ) This PR makes it so debugging logs are only evaluated during debugging. We found that for the current code, functions called in debug logging code are evaluated even in release mode (even though nothing is printed).	2018-09-14 11:40:44 -07:00
Robert Nishihara	f16d33593b	Mark worker as blocked and trigger reconstruction in ray.wait. (#2864 ) * Trigger reconstruction in ray.wait and mark worker as blocked. * Add test. * Linting. * Don't run new test with legacy Ray. * Only call HandleClientUnblocked if it actually blocked in ray.wait. * Reduce time to ray.wait in the test.	2018-09-13 15:28:17 -07:00
Hanwei Jin	fbf214e408	update ray cmake build process (#2853 ) * use cmake to build ray project, no need to appply build.sh before cmake, fix some abuse of cmake, improve the build performance * support boost external project, avoid using the system or build.sh boost * keep compatible with build.sh, remove boost and arrow build from it. * bugfix: parquet bison version control, plasma_java lib install problem * bugfix: cmake, do not compile plasma java client if no need * bugfix: component failures test timeout machenism has problem for plasma manager failed case * bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master * revert some fix * set arrow python executable, fix format error in component_failures_test.py * make clean arrow python build directory * update cmake code style, back to support cmake minimum version 3.4	2018-09-12 11:19:33 -07:00
Hao Chen	8414e413a2	[java] refine and simplify java worker code structure (#2838 )	2018-09-10 10:48:17 -07:00
Zhijun Fu	753ba76141	[Issue 2809][xray] Cleanup on driver detach (#2826 ) This change addresses issue #2809. Test #2797 has been enabled for raylet and can pass. The following should happen when a driver exits (either gracefully or ungracefully). #2797 should be enabled and pass. Any actors created by the driver that are still running should be killed. Any workers running tasks for the driver should be killed. Any tasks for the driver in any node_manager queues should be removed. Any future tasks received by a node manager for the driver should be ignored. The driver death notification should only be received once.	2018-09-07 16:11:32 +08:00
Wang Qing	7e13e1fd49	[Java] Remove non-raylet code in Java. (#2828 )	2018-09-06 14:54:13 +08:00
Yuhong Guo	dfb7c2be1e	[Java] Add Plasma Free to Java code path (#2802 )	2018-09-04 15:28:23 +08:00
Robert Nishihara	0ac855e061	Push errors to all drivers when node is marked dead. (#2808 ) * Push errors to all drivers when node is marked dead. * Fix	2018-09-02 20:04:58 -07:00
Yuhong Guo	2691b3a11a	Add signal handlers to improve debuggability (#2757 ) * Add signal handlers to improve debuggability. * Fix Linux compiling * Fix Lint * Change SIGILL case that happens in both Linux and MaxOs * Add signal handler to main functions. * Change handler name. * Address comment * Address comment. * Fix Linux building failure * Introduce RAII mechanism to SignalHandlers. * Add InitShutdownWrapper to handle all RAII requirements * Change util_test to signal_test * Make sure shutdown is not nullptr. * Using google::InstallFailureSignalHandler() instead of our own signal handler * Refine code addording to comment * Fix valgrind test failure. * remove Shutdown template * consistency * linting	2018-09-01 21:58:23 -07:00
Philipp Moritz	869ee8e25d	Integrate plasma store list facility (#2752 )	2018-09-01 16:53:51 -07:00
Alexey Tumanov	fdc9688226	[xray] push warning to driver for infeasible tasks (#2784 ) This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.	2018-09-01 13:21:27 -07:00
Yucong He	5b45f0bdff	[xray] Implementing Gcs sharding (#2409 ) Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.). [+] Implement sharding for gcs tables. [+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented. [+] Move AsyncGcsClient's initialization into Connect function. [-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.	2018-08-31 15:54:30 -07:00
Ryan Sepassi	b6260003cb	Some small changes (#2782 ) * Add some imports that make it easier to build with Bazel * Use "/tmp" paths for sockets in tests * Move `asio_test` into `run_gcs_tests.sh` instead of starting and stopping Redis within the test fixture with a `system` call.	2018-08-30 22:42:49 -07:00
Wang Qing	514633456b	[Java] Fix out-dated signatures of JNI methods (#2756 ) 1) Renamed the native JNI methods and some parameters of JNI methods. 2) Fixed native JNI methods' signatures by `javah` tool. 3) Removed some useless native methods.	2018-08-30 17:59:29 +08:00
Robert Nishihara	ba7efafa67	Remove force_start argument from StartWorkerProcess. (#2762 ) This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency are ever started concurrently. In particular, when the raylet starts up, it my start fewer than num_workers workers.	2018-08-30 13:43:47 +08:00
Robert Nishihara	132f133214	Limit number of concurrent workers started by hardware concurrency. (#2753 ) * Limit number of concurrent workers started by hardware concurrency. * Check if std:🧵:hardware_concurrency() returns 0. * Pass in max concurrency from Python. * Fix Java call to startRaylet. * Fix typo * Remove unnecessary cast. * Fix linting. * Cleanups on Java side. * Comment back in actor test. * Require maximum_startup_concurrency to be at least 1. * Fix linting and test. * Improve documentation. * Fix typo.	2018-08-29 14:53:40 +08:00
Alexey Tumanov	de047daea7	[xray] raylet scheduling mechanism with a simple spillback policy (#2749 ) ## What do these changes do? * distribute load and resource information on a heartbeat * for each raylet, maintain total and available resource capacity as well as measure of current load * this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load. * modify the scheduling policy to perform capacity-based, load-aware, optimistically concurrent resource allocation * perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.	2018-08-28 00:03:34 -07:00
Wang Qing	b4cba9a49f	[java] Fix the logic of generating TaskID (#2747 ) ## What do these changes do? Because the logic of generating `TaskID` in java is different from python's, there are many tests fail when we change the `Ray Core` code. In this change, I rewrote the logic of generating `TaskID` in java which is the same as the python's. In java, we call the native method `_generateTaskId()` to generate a `TaskID` which is also used in python. We change `computePutId()`'s logic too. ## Related issue number [#2608](https://github.com/ray-project/ray/issues/2608)	2018-08-27 13:11:33 -07:00
Hao Chen	f37c260bdb	[multi-language part 3] support multiple languages in raylet backend (#2672 ) This PR enables multi-language support in the raylet backend. - `Worker` class now has a `language` label; - `WorkerPool`: - It now maintains one set of states for each language. - `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id. - `Size` and `StartWorkerProcess` functions now have an extra `language` parameter. - `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.	2018-08-26 22:06:25 -07:00
Yuhong Guo	697bfb14db	Hotfix for glog PR (#2734 )	2018-08-24 16:30:51 -07:00
Philipp Moritz	b4c47a5861	Upgrade arrow to include more detailed flushing message (#2706 )	2018-08-24 11:44:04 -07:00
Stephanie Wang	1b3de31ff1	[xray] Fix bug where driver task ID is assumed to be nil (#2725 ) ## What do these changes do? #2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.	2018-08-23 14:44:47 -07:00
Yuhong Guo	eec1a3eb89	Support pluggable backend log lib with glog (#2695 ) * [WIP] Support different backend log lib * Refine code, unify level, address comment * Address comment and change formatter * Fix linux building failure. * Fix lint * Remove log4cplus. * Add log init to raylet main and add test to travis. * Address comment and refine. * Update logging_test.cc	2018-08-23 09:43:38 -07:00
Stephanie Wang	8fd5757aaa	[xray] Don't process any more messages from dead node managers (#2688 )	2018-08-19 21:11:40 -07:00
Wang Qing	06a58016d8	[multi-language part 2] Change the command line arguments to start raylet (#2670 )	2018-08-16 21:59:44 -07:00
Hao Chen	a719e089b0	[multi-language part 1] add a 'language' field to task specification (#2639 )	2018-08-16 21:26:42 -07:00
Stephanie Wang	e3e0cfce87	[xray] Resubmit tasks that fail to be forwarded (#2645 )	2018-08-16 00:12:56 -07:00
Philipp Moritz	6cb6dd30d1	silence shutdown callback (#2662 )	2018-08-15 22:48:00 -07:00
tianyapiaozi	98fed67b45	fix offset by one issue in the local scheduler (#2652 )	2018-08-15 10:10:30 -07:00
Yuhong Guo	eeb15771ba	Add `ray.internal.free` (#2542 )	2018-08-14 22:01:23 -07:00
Stephanie Wang	62649715ca	[xray] Cache a task's object dependencies (#2623 ) * Cache a Task's object dependencies * Cache the parent task IDs for lineage cache entries * Cache the parent task IDs in lineage cache entries * revert * Fix test * remove unused line * Fix test	2018-08-14 20:25:41 -07:00
Stephanie Wang	dede80f3df	[xray] Reduce fatal checks in the lineage cache that fail during reconstruction (#2642 ) * Loosen checks in the lineage cache and log appropriate warnings in the node manager * revert test	2018-08-14 15:25:32 -07:00
Yuhong Guo	4bd98eed45	Support building Java and Python version at the same time. (#2640 ) * Support building Java and Python version at the same time. * Remove duplicated definition. * Refine the building process of local_scheduler * Refine * Add comment for languages * Modify instruction and add python,jave building to CI. * change according to comment	2018-08-14 11:33:51 -07:00
Stephanie Wang	806fdf2f05	[xray] Object manager retries Pull requests (#2630 ) * Move all ObjectManager members to bottom of class def * Better Pull requests - suppress duplicate Pulls - retry the Pull at the next client after a timeout - cancel a Pull if the object no longer appears on any clients * increase object manager Pull timeout * Make the component failure test harder. * note * Notify SubscribeObjectLocations caller of empty list * Address melih's comments * Fix wait... * Make component failure test easier for legacy ray * lint	2018-08-13 19:15:55 -07:00
Stephanie Wang	4a7be6f46d	[xray] Make sure raylet does not crash if remote raylet dies (#2619 ) * Log a warning on remote object manager failures * Mark a task that was failed to be forwarded as pending * Raylet component failure test and make it harder * Turn on component failure test for xray * Remove return status from ReleaseSender * lint	2018-08-09 20:36:30 -07:00
Hao Chen	170e08cf02	fix a bug in killing unregistered workers (#2613 )	2018-08-09 17:57:25 -07:00
Philipp Moritz	143a118fbf	[xray] Fix valgrind crash when memory profiling raylet (#2583 ) * use different random number generator to be compatible with older valgrind versions * seed from time * style * fix * remove more random devices * also remove random_device from global scheduler * rename mutex * linting	2018-08-09 15:37:17 -07:00
Stephanie Wang	f093ed1fc6	[xray] Fix crash in case of spurious reconstruction (#2609 ) * Exit if task already queued * address comments	2018-08-09 14:46:46 -07:00
Stephanie Wang	2de9bfc7e3	[xray] Log warnings for asio handlers that take too long (#2601 ) * Add fatal check for heartbeat drift * Log warning messages for handlers that take too long * Add debug labels to all ClientConnections	2018-08-09 14:39:23 -07:00
Stephanie Wang	d49b4bef0a	[xray] Basic task reconstruction mechanism (#2526 ) ## What do these changes do? This implements basic task reconstruction in raylet. There are two parts to this PR: 1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary. 2. Task resubmission once a raylet becomes responsible for reconstructing a task. Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this: 1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR. 2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted). Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.	2018-08-09 07:24:37 -07:00
Melih Elibol	8ae82180b4	[xray] Adds a driver table. (#2289 ) This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death. Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.	2018-08-08 23:41:40 -07:00
Alexey Tumanov	df7ee7ff1e	raylet memory corruption fixes (#2591 ) * raylet memory corruption fixes * add util function to translate boost error to ray status * tcp client connection now using ray status utility function * lint	2018-08-08 19:50:43 -07:00
Stephanie Wang	6ab01a2cad	[xray] Fix bug when counting a task's lineage size (#2600 )	2018-08-08 00:00:17 -07:00
Ujval Misra	a0691ee49b	[xray] Prevent sending excessive uncommitted lineage on task forwarding (#2534 ) * Add set to lineage cache entry to track nodes already forwarded to. * Uncommitted lineage function naming, documentation. * Simple test for uncommitted lineage with a marked task. * Rebased, changed tests to use ClientID::nil. * Bug fix, change MergeLineageHelper function type. * Formatting. * Checks and test changes based on PR comments. * GetUncommittedLineage now always returns at least the requested task ID. * Bug fix (return at least requested task ID) * Formatting	2018-08-07 21:10:23 -07:00
Philipp Moritz	e7f76d7914	[xray] Fix typo concerning heartbeat_timeout_milliseconds in monitor (#2586 )	2018-08-07 13:45:51 -07:00
Philipp Moritz	25f0094ee4	Fix copying the plasma fbs directory from arrow (#2579 )	2018-08-07 00:04:37 -07:00
Yuhong Guo	d35ce7fa63	Use real callback index in subscribe_callback_index_ (#2473 )	2018-08-06 15:29:56 -07:00
Alexey Tumanov	85b8b2a395	mark all remaining placeable tasks pending with task dependency manager (#2528 )	2018-08-06 13:08:11 -07:00
Melih Elibol	34d3a46f48	[xray] Revert dynamic chunk size optimization for ObjectManager. (#2557 ) * Revert dynamic chunk size optimization. * fix mac build issues.	2018-08-05 02:09:37 -07:00

... 21 22 23 24 25 ...

1638 commits