hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-10 21:36:39 -04:00

Author	SHA1	Message	Date
Robert Nishihara	ba7efafa67	Remove force_start argument from StartWorkerProcess. (#2762 ) This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency workers are ever started concurrently. In particular, when the raylet starts up, it may start fewer than num_workers workers.	2018-08-30 13:43:47 +08:00
Robert Nishihara	132f133214	Limit number of concurrent workers started by hardware concurrency. (#2753 ) * Limit number of concurrent workers started by hardware concurrency. * Check if std::thread::hardware_concurrency() returns 0. * Pass in max concurrency from Python. * Fix Java call to startRaylet. * Fix typo * Remove unnecessary cast. * Fix linting. * Cleanups on Java side. * Comment back in actor test. * Require maximum_startup_concurrency to be at least 1. * Fix linting and test. * Improve documentation. * Fix typo.	2018-08-29 14:53:40 +08:00
Alexey Tumanov	de047daea7	[xray] raylet scheduling mechanism with a simple spillback policy (#2749 ) ## What do these changes do? * distribute load and resource information on a heartbeat * for each raylet, maintain total and available resource capacity as well as measure of current load * this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load. * modify the scheduling policy to perform capacity-based, load-aware, optimistically concurrent resource allocation * perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.	2018-08-28 00:03:34 -07:00
Wang Qing	b4cba9a49f	[java] Fix the logic of generating TaskID (#2747 ) ## What do these changes do? Because the logic of generating `TaskID` in Java differs from Python's, many tests fail when we change the `Ray Core` code. This change rewrites the Java logic for generating `TaskID` so that it matches Python's. In Java, we call the native method `_generateTaskId()`, which is also used in Python, to generate a `TaskID`. We change `computePutId()`'s logic too. ## Related issue number [#2608](https://github.com/ray-project/ray/issues/2608)	2018-08-27 13:11:33 -07:00
Hao Chen	f37c260bdb	[multi-language part 3] support multiple languages in raylet backend (#2672 ) This PR enables multi-language support in the raylet backend. - `Worker` class now has a `language` label; - `WorkerPool`: - It now maintains one set of states for each language. - `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id. - `Size` and `StartWorkerProcess` functions now have an extra `language` parameter. - `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.	2018-08-26 22:06:25 -07:00
Yuhong Guo	697bfb14db	Hotfix for glog PR (#2734 )	2018-08-24 16:30:51 -07:00
Philipp Moritz	b4c47a5861	Upgrade arrow to include more detailed flushing message (#2706 )	2018-08-24 11:44:04 -07:00
Stephanie Wang	1b3de31ff1	[xray] Fix bug where driver task ID is assumed to be nil (#2725 ) ## What do these changes do? #2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.	2018-08-23 14:44:47 -07:00
Yuhong Guo	eec1a3eb89	Support pluggable backend log lib with glog (#2695 ) * [WIP] Support different backend log lib * Refine code, unify level, address comment * Address comment and change formatter * Fix linux building failure. * Fix lint * Remove log4cplus. * Add log init to raylet main and add test to travis. * Address comment and refine. * Update logging_test.cc	2018-08-23 09:43:38 -07:00
Stephanie Wang	8fd5757aaa	[xray] Don't process any more messages from dead node managers (#2688 )	2018-08-19 21:11:40 -07:00
Wang Qing	06a58016d8	[multi-language part 2] Change the command line arguments to start raylet (#2670 )	2018-08-16 21:59:44 -07:00
Hao Chen	a719e089b0	[multi-language part 1] add a 'language' field to task specification (#2639 )	2018-08-16 21:26:42 -07:00
Stephanie Wang	e3e0cfce87	[xray] Resubmit tasks that fail to be forwarded (#2645 )	2018-08-16 00:12:56 -07:00
Philipp Moritz	6cb6dd30d1	silence shutdown callback (#2662 )	2018-08-15 22:48:00 -07:00
tianyapiaozi	98fed67b45	fix offset by one issue in the local scheduler (#2652 )	2018-08-15 10:10:30 -07:00
Yuhong Guo	eeb15771ba	Add `ray.internal.free` (#2542 )	2018-08-14 22:01:23 -07:00
Stephanie Wang	62649715ca	[xray] Cache a task's object dependencies (#2623 ) * Cache a Task's object dependencies * Cache the parent task IDs for lineage cache entries * Cache the parent task IDs in lineage cache entries * revert * Fix test * remove unused line * Fix test	2018-08-14 20:25:41 -07:00
Stephanie Wang	dede80f3df	[xray] Reduce fatal checks in the lineage cache that fail during reconstruction (#2642 ) * Loosen checks in the lineage cache and log appropriate warnings in the node manager * revert test	2018-08-14 15:25:32 -07:00
Yuhong Guo	4bd98eed45	Support building Java and Python version at the same time. (#2640 ) * Support building Java and Python version at the same time. * Remove duplicated definition. * Refine the building process of local_scheduler * Refine * Add comment for languages * Modify instruction and add Python, Java building to CI. * change according to comment	2018-08-14 11:33:51 -07:00
Stephanie Wang	806fdf2f05	[xray] Object manager retries Pull requests (#2630 ) * Move all ObjectManager members to bottom of class def * Better Pull requests - suppress duplicate Pulls - retry the Pull at the next client after a timeout - cancel a Pull if the object no longer appears on any clients * increase object manager Pull timeout * Make the component failure test harder. * note * Notify SubscribeObjectLocations caller of empty list * Address melih's comments * Fix wait... * Make component failure test easier for legacy ray * lint	2018-08-13 19:15:55 -07:00
Stephanie Wang	4a7be6f46d	[xray] Make sure raylet does not crash if remote raylet dies (#2619 ) * Log a warning on remote object manager failures * Mark a task that failed to be forwarded as pending * Raylet component failure test and make it harder * Turn on component failure test for xray * Remove return status from ReleaseSender * lint	2018-08-09 20:36:30 -07:00
Hao Chen	170e08cf02	fix a bug in killing unregistered workers (#2613 )	2018-08-09 17:57:25 -07:00
Philipp Moritz	143a118fbf	[xray] Fix valgrind crash when memory profiling raylet (#2583 ) * use different random number generator to be compatible with older valgrind versions * seed from time * style * fix * remove more random devices * also remove random_device from global scheduler * rename mutex * linting	2018-08-09 15:37:17 -07:00
Stephanie Wang	f093ed1fc6	[xray] Fix crash in case of spurious reconstruction (#2609 ) * Exit if task already queued * address comments	2018-08-09 14:46:46 -07:00
Stephanie Wang	2de9bfc7e3	[xray] Log warnings for asio handlers that take too long (#2601 ) * Add fatal check for heartbeat drift * Log warning messages for handlers that take too long * Add debug labels to all ClientConnections	2018-08-09 14:39:23 -07:00
Stephanie Wang	d49b4bef0a	[xray] Basic task reconstruction mechanism (#2526 ) ## What do these changes do? This implements basic task reconstruction in raylet. There are two parts to this PR: 1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary. 2. Task resubmission once a raylet becomes responsible for reconstructing a task. Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this: 1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR. 2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted). Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.	2018-08-09 07:24:37 -07:00
Melih Elibol	8ae82180b4	[xray] Adds a driver table. (#2289 ) This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death. Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.	2018-08-08 23:41:40 -07:00
Alexey Tumanov	df7ee7ff1e	raylet memory corruption fixes (#2591 ) * raylet memory corruption fixes * add util function to translate boost error to ray status * tcp client connection now using ray status utility function * lint	2018-08-08 19:50:43 -07:00
Stephanie Wang	6ab01a2cad	[xray] Fix bug when counting a task's lineage size (#2600 )	2018-08-08 00:00:17 -07:00
Ujval Misra	a0691ee49b	[xray] Prevent sending excessive uncommitted lineage on task forwarding (#2534 ) * Add set to lineage cache entry to track nodes already forwarded to. * Uncommitted lineage function naming, documentation. * Simple test for uncommitted lineage with a marked task. * Rebased, changed tests to use ClientID::nil. * Bug fix, change MergeLineageHelper function type. * Formatting. * Checks and test changes based on PR comments. * GetUncommittedLineage now always returns at least the requested task ID. * Bug fix (return at least requested task ID) * Formatting	2018-08-07 21:10:23 -07:00
Philipp Moritz	e7f76d7914	[xray] Fix typo concerning heartbeat_timeout_milliseconds in monitor (#2586 )	2018-08-07 13:45:51 -07:00
Philipp Moritz	25f0094ee4	Fix copying the plasma fbs directory from arrow (#2579 )	2018-08-07 00:04:37 -07:00
Yuhong Guo	d35ce7fa63	Use real callback index in subscribe_callback_index_ (#2473 )	2018-08-06 15:29:56 -07:00
Alexey Tumanov	85b8b2a395	mark all remaining placeable tasks pending with task dependency manager (#2528 )	2018-08-06 13:08:11 -07:00
Melih Elibol	34d3a46f48	[xray] Revert dynamic chunk size optimization for ObjectManager. (#2557 ) * Revert dynamic chunk size optimization. * fix mac build issues.	2018-08-05 02:09:37 -07:00
Wang Qing	e4f68ff8cf	[Java Worker] Support raylet on Java (#2479 )	2018-08-01 17:52:49 -07:00
Zhijun Fu	ca36827f01	[Issues 2403][xray] Fix raylet performance issues on scheduling queue (#2438 ) * merge from ray * Revert "merge from ray" This reverts commit 32b181ebbb1fa184026631e1a7368112c4c3118d. * fix raylet performance regression * address comments * Update code after merging latest changes * fix lint * address comments	2018-08-01 14:41:20 -07:00
Stephanie Wang	e90ecef297	[xray] Try to flush children of a task that is evicted from the lineage cache (#2531 )	2018-08-01 00:23:02 -07:00
Stephanie Wang	a45f9cfafc	[xray] Implement task lease table, logic for deciding when to reconstruct a task (#2497 )	2018-07-30 14:42:28 -07:00
Ion	80db69d245	State transition diagram documentation. (#2502 ) * Added description of transition diagram and a few name changes for improved clarity. * rename some methods and update task_states.rst	2018-07-28 22:28:45 -07:00
Robert Nishihara	2be1ccbd8f	Raise application-level exceptions for some failure scenarios. (#2429 ) * Raise application level exception for actor methods that can't be executed and failed tasks. * Retry task forwarding for actor tasks. * Small cleanups * Move constant to ray_config. * Create ForwardTaskOrResubmit method. * Minor * Clean up queued tasks for dead actors. * Some cleanups. * Linting * Notify task_dependency_manager_ about failed tasks. * Manage timer lifetime better. * Use smart pointers to deallocate the timer. * Fix * add comment	2018-07-27 19:53:30 -04:00
Stephanie Wang	6675361684	[xray] Track `ray.get` calls as task dependencies (#2362 )	2018-07-27 11:59:17 -07:00
Zhijun Fu	9ad6a973a0	[xray] lineage optimization: avoid unnecessary lineage entry allocation & free (#2463 ) * merge from ray * Revert "merge from ray" This reverts commit 32b181ebbb1fa184026631e1a7368112c4c3118d. * [xray] avoid unnecessary lineage entry allocation & free * address comments * address review comments * address comments	2018-07-26 10:44:38 -04:00
Yuhong Guo	b35ce5dbf1	Update Arrow Package with breaking changes (#2440 ) * Merge the breaking change of Arrow Package. * Fix typo * Fix lint. * put forward declarations into header * fix * add protocol.h * fix linting	2018-07-25 14:28:33 -07:00
Philipp Moritz	e821f852ef	[xray] Silence some object manager logging (#2437 )	2018-07-20 13:10:03 -07:00
Robert Nishihara	eed39163f9	Add callback to node manager for client removed event. (#2417 ) * Add callback to node manager for client removed event. * Fix linting.	2018-07-18 16:59:04 -07:00
Philipp Moritz	4c82ac72df	Upgrade arrow to include the plasma TensorFlow op (#2412 )	2018-07-18 12:33:02 -07:00
Yuhong Guo	206254bcf3	Add const to to_plasma_id function to make it usable by const ObjectID (#2404 ) * Add const to to_plasma_id to make it usable by const ObjectID * Separate the building script to another PR.	2018-07-16 11:05:29 -07:00
Hao Chen	c1575e98c1	Make local scheduler client thread-safe (#2386 ) * Make local scheduler client thread-safe for python * lock write_messages * remove allow-threads * fix linter * rename _write_message to do_write_message	2018-07-13 16:19:00 -07:00
Philipp Moritz	fbde8cad74	Update apache arrow to include TensorFlow fix (#2345 )	2018-07-06 13:18:56 -07:00
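
The startup-concurrency cap described in commits 132f133214 and ba7efafa67 can be illustrated with a minimal Python sketch. It is not the raylet's code: both function names are hypothetical, `os.cpu_count()` returning `None` stands in for `std::thread::hardware_concurrency()` returning 0, and splitting the workers into fixed batches is a simplifying assumption about how "no more than maximum_startup_concurrency at once" plays out.

```python
import os


def maximum_startup_concurrency(num_cpus=None):
    """Hypothetical helper: pick a startup concurrency cap that is at least 1.

    If num_cpus is not given, fall back to os.cpu_count(), which may return
    None when the CPU count cannot be determined (the analogue of
    hardware_concurrency() returning 0 in C++).
    """
    detected = os.cpu_count() if num_cpus is None else num_cpus
    if not detected:  # None or 0: the count is unknown
        return 1
    return max(1, detected)


def plan_startup_batches(num_workers, max_concurrency):
    """Hypothetical helper: split num_workers into batches of bounded size.

    Each batch holds at most max_concurrency workers, so the first batch may
    be smaller than num_workers.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    batches = []
    remaining = num_workers
    while remaining > 0:
        batch = min(remaining, max_concurrency)
        batches.append(batch)
        remaining -= batch
    return batches


if __name__ == "__main__":
    cap = maximum_startup_concurrency()
    print(plan_startup_batches(10, cap))
```

With a cap of 4 and 10 requested workers, the first batch contains only 4 workers, which mirrors the note in ba7efafa67 that the raylet may initially start fewer than num_workers workers.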

1323 commits