hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-11 21:56:39 -04:00

Author	SHA1	Message	Date
Zhijun Fu	3df1e1c471	Add missing lock in FreeObjects of object buffer pool (#3647 ) Object manager uses multi-threading for transferring objects between different nodes, the plasma client used in object_buffer_pool_ needs to be protected by lock. We have met crashes caused by missing lock in FreeObjects() interface, this PR fixes that issue.	2018-12-28 11:47:31 -08:00
Hao Chen	0b682d043e	Fix memory leak in PyRayletCient (#3640 ) 1) if using `PyObject_GetIter`, the caller must call `Py_DECREF` to avoid memory leak. But with `PyList_GetItem`, `Py_DECREF` isn't needed. 2) the `Py_BuildValue` call in `wait` doesn't need to increment ref count.	2018-12-27 17:39:02 -08:00
Hao Chen	f4011754d6	Fix: ServerConnection should be closed before being removed (#3626 ) Otherwise, in the event of a remote raylet crashing, the connection might be held by boost asio forever, and the pending callbacks will never get invoked. See also #3586.	2018-12-25 11:01:53 -08:00
Robert Nishihara	ddd4c842f1	Initialize some variables in constructor instead of header file. (#3617 ) * Initialize some variables in constructor instead of header file	2018-12-23 02:44:23 -08:00
Alexey Tumanov	bada42c334	object store notification mgr: fix using uninitialized variables (#3592 ) Initialize private class variables to avoid valgrind errors. They are used before initialization.	2018-12-22 19:51:22 -08:00
Philipp Moritz	e578a38116	Fix TensorFlow and PyTorch compatibility (#3574 ) * remove tensorflow workaround * update docker * add boost threads * add date_time, too * change link order * cosmetics	2018-12-22 13:25:48 -08:00
Alexey Tumanov	6b179cb8a7	change the order of allocation for io_service and gcs client in raylet main (#3597 )	2018-12-21 00:13:28 -08:00
Hao Chen	132a23354e	Fix pending callback not called when ServerConnection destructs (#3572 )	2018-12-19 17:29:36 -08:00
Yuhong Guo	fb33fa9097	Enable function_descriptor in backend to replace the function_id (#3028 )	2018-12-18 18:53:59 -05:00
Stephanie Wang	26ca40817e	Convert UniqueID::nil() to a constructor (#3564 ) * Initialize UniqueID to nil * Return reference to static const variable	2018-12-18 11:59:02 -08:00
Yuhong Guo	75ddf7cca4	Fix 2 small bugs (#3573 )	2018-12-18 14:52:21 -05:00
Robert Nishihara	417c7f2d6f	Update arrow and remove plasma_manager references. (#3545 )	2018-12-15 23:36:02 -08:00
Philipp Moritz	b3bf608608	Update arrow to reduce plasma IPCs. (#3497 )	2018-12-14 23:49:37 -05:00
Stephanie Wang	fcc37021b2	Throw exception for `ray.get` of an evicted actor object (#3490 ) * Add a flag for whether an object has been created before * Add regression test * doc * Share object directory between object and node managers * Treat evicted actor tasks as failed * minor * Check return value * Fix bug where object locations weren't getting updated on client death * Fix mac build * Use RayTaskError	2018-12-14 11:41:27 -08:00
Yuhong Guo	a4abe6c0fe	Add test to test raylet client connection when raylet crashes. (#3518 )	2018-12-13 23:40:50 -08:00
Hao Chen	e7b51cbd1b	[xray] Implement Actor Reconstruction (#3332 ) * Implement Actor Reconstruction * fix * fix actor handle __del__ * fix lint * add comment * Remove actorCreationDummyObjectId * address comments * fix * address comments * avoid copy * change log to debug * fix error name	2018-12-13 21:28:58 -08:00
Alexey Tumanov	2455de78ce	save initial config instead of initial resource config (#3532 )	2018-12-13 20:39:42 -08:00
Si-Yuan	84fae57ab5	Convert the raylet client (the code in local_scheduler_client.cc) to proper C++. (#3511 ) * refactoring * fix bugs * create client class * create client class for java; bug fix * remove legacy code * improve code by using std::string, std::unique_ptr rename private fields and removing legacy code * rename class * improve naming * fix * rename files * fix names * change name * change return types * make a mutex private field * fix comments * fix bugs * lint * bug fix * bug fix * move too short functions into the header file * Loose crash conditions for some APIs. * Apply suggestions from code review Co-Authored-By: suquark <suquark@gmail.com> * format * update * rename python APIs * fix java * more fixes * change types of cpython interface * more fixes * improve error processing * improve error processing for java wrapper * lint * fix java * make fields const * use pointers for [out] parameters * fix java & error msg * fix resource leak, etc.	2018-12-13 13:39:10 -08:00
Eric Liang	20c7fad4f4	Move actor table to primary redis context	2018-12-12 16:51:29 -08:00
Eric Liang	cffe8f9806	Add option to evict keys LRU from the sharded redis tables (#3499 ) * wip * wip * format * wip * note * lint * fix * flag * typo * raise timeout * fix * optional get * fix flag * increase timeout in test * update docs * format	2018-12-09 05:48:52 -08:00
Yuhong Guo	0136af5aac	Add return value for recontruction RPC. (#3493 ) * Add return value for recontruct RPC. * Fix comment function name	2018-12-09 00:08:44 -08:00
Stephanie Wang	4abafd7e62	Fix bug in ray.wait (#3445 ) ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely: 1. Objects A and B are put in the cluster. 2. Client calls ray.wait([A, B], num_returns=1). 3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each. 4. Callback for A fires. The wait completes and the request is removed. 5. Callback for B fires. The wait request no longer exists and raylet crashes.	2018-12-01 19:40:33 -08:00
Stephanie Wang	48a5935224	Fault tolerance for actor creation (#3422 ) * Add regression test * Request actor creation if no actor location found * Comments * Address comments * Increase test timeout * Trigger test	2018-11-29 10:48:35 -08:00
Tianming Xu	139fbf7884	Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory (#3403 )	2018-11-27 23:51:18 -08:00
Eric Liang	c2108ca64f	Don't put entire actor registry in debug string since it's too long (#3395 )	2018-11-27 16:48:12 -08:00
Stephanie Wang	6b3236349c	Fix memory leak in lineage cache (#3366 ) * Move children_ map inside Lineage * Update lineage_cache.cc * Test and fixes * Remove unused	2018-11-21 16:18:39 -08:00
Stephanie Wang	3e33f6f71b	Fix failure handling for actor death (#3359 ) * Broadcast actor death, clean up dummy objects * Reduce logging and clean up state when failing a task * lint * Make actor failure test nicer, reduce node timeout	2018-11-21 12:26:22 -08:00
Eric Liang	686cf20951	Remove uses of std::list::size (#3358 ) * worker pool and client conn * Fix linting * unordered set * move	2018-11-20 14:47:55 -08:00
Philipp Moritz	d3697ce4e1	Ready queue refactor to make Dispatching tasks more efficient (#3324 ) * put queues outside * working version, still needs to be optimized * implement round robin * proper round robin * fix spillback * update * fix * cleanup * more cleanups * fix * fix * add documentation * explanation for hash combiner * speed it up * cleanup and linting * linting * comments * Update scheduling_queue.h * temp commit * fixes * update * fix * cleanup * cleanup * lint * more prints * more prints * increase sleep * documentation * sleep * fix * fix * sleep longer * update * fix * fix * fix * Add ordered_set container. * Fix * Linting * Constructors * Remove O(n) call to list.size(). * fixes * use ordered set * Fix. * Add documentation. * Add iterators to ordered_set container implementation. * iterator_type -> iterator * Make typedefs private * Add const_iterator * fix * fix test * linting * lint * update * add documentation * linting	2018-11-20 13:14:12 -08:00
Ujval Misra	b0bfd104f2	Batch heartbeats from node manager together in the monitor. (#3011 )	2018-11-20 09:52:27 -08:00
Robert Nishihara	f2b5500642	Add ordered_set container. (#3352 ) * Add ordered_set container. * Fix * Linting * Constructors * Remove O(n) call to list.size(). * Fix. * Add documentation. * Add iterators to ordered_set container implementation. * iterator_type -> iterator * Make typedefs private * Add const_iterator	2018-11-19 17:01:18 -08:00
Eric Liang	d4dbd27e0d	Don't retry IPC connect an absurd number of times (#3355 )	2018-11-19 16:23:59 -08:00
Robert Nishihara	5cbc597494	Suppress duplicate pre-emptive object pushes. (#3276 ) * Suppress duplicate pre-emptive object pushes. * Add test. * Fix linting * Remove timer and inline recent_pushes_ into local_objects_. * Improve test. * Fix * Fix linting * Enable retrying pull from same object manager. Randomize object manager. * Speed up test * Linting * Add test. * Minor * Lengthen pull timeout and reissue pull every time a new object becomes available. * Increase pull timeout in test. * Wait for nodes to start in object manager test. * Wait longer for nodes to start up in test. * Small fixes. * _submit -> _remote * Change assert to warning.	2018-11-16 23:02:45 -08:00
Robert Nishihara	60b22d9a72	Don't unsubscribe dependencies for infeasible tasks. (#3338 ) * Make scheduling queues RemoveTasks return task states as well. * Add test * Don't unsubscribe for infeasible tasks when spilling over. * Linting * Address comments.	2018-11-16 11:33:00 -08:00
Eric Liang	e0bf9d7305	Add debug string to raylet (#3317 ) * initial debug string * format * wip debug string * fix compile * fix * update * finished * to file * logs dir * use temp root * fix * override	2018-11-15 21:47:50 -08:00
Philipp Moritz	1be1455d86	Fix redis crash when duplicate messages are appended to log. (#3316 )	2018-11-15 15:09:39 -08:00
Philipp Moritz	b6a12d1f97	Fix socket retry message (#3325 )	2018-11-15 12:14:19 -08:00
Stephanie Wang	577c1dda74	Release sender connections as soon as WriteMessageAsync completes (#3313 )	2018-11-13 21:32:24 -05:00
Ion	d681893b0f	Speed up task dispatch. (#3234 ) * speed up task dispatch * minor changes * improved comments * improved comments * change argument of DispatchTasks to list of tasks * dispatch only tasks whose dependencies have been fullfiled * some updated comments * refactored DispatchQueue() and Assigntask() to avoid the copy of the ready list * minor fixes * some more minor fixes * some more minor fixes * added more comments * better comments? * fixed all feedback comments, minus making the argument of AssignTask() const * Assigntask() now taskes a const argument * Do the task copy outside of the callback * fix linting	2018-11-10 09:55:12 -08:00
Eric Liang	9b2794101d	[minor] Change chunk already exists to DEBUG, add flags for rllib multi node testing (#3228 )	2018-11-08 00:04:20 -08:00
Stephanie Wang	d950e92f63	Allow multiple threads to call ray.get and ray.wait (#3244 ) * Handle multiple threads calling ray.get * Multithreaded ray.wait * Pass in current task ID in java backend * Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get * Fix test * Some cleanups * Improve error message * Add assertion * Cleanup, throw error in HandleTaskUnblocked if task not actually blocked * lint * Fix python worker reset * Fix references to reconstruct_objects * Linting * java lint * Fix java * Fix iterator	2018-11-07 22:39:28 -08:00
Richard Liaw	0bab8ed95c	Expose internal config parameters for starting Ray (#3246 ) ## What do these changes do? This PR exposes the CL option for using a config parameter. This is important for certain tests (i.e., FT tests that removing nodes) to run quickly. Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible. #3239 depends on this. TODO: - [x] Add documentation to method arguments before merging. - [x] Add test to verify this works? ## Related issue number	2018-11-07 21:46:02 -08:00
Eric Liang	29e3362905	Better errors on process deaths (#3252 )	2018-11-07 14:08:16 -08:00
Robert Nishihara	1dd5d92789	Enable timeline visualizations of object transfers. (#3255 ) * Plot object transfers. * Linting	2018-11-07 12:45:59 -08:00
Philipp Moritz	4182b85611	Cache resources in SchedulingQueue (#3232 ) * cache resources * fix * documentation and remove old code * fix PR * update documentation * linting	2018-11-06 21:23:31 -08:00
Stephanie Wang	ca585703b2	Refactor ObjectDirectory to reduce and fix callback usage (#3227 )	2018-11-06 20:33:10 -08:00
Wang Qing	4968cc5d70	Fix a small typo (#3240 )	2018-11-05 18:30:53 -08:00
Stephanie Wang	bf88aa5013	Increase timeout before reconstruction is triggered (#3217 ) * Increase timeout to 10s * Skip eviction reconstruction tests * Add stress test for many actors to one * Fix test by shortening it. * lower number of processes in stress test * Skip slow test	2018-11-05 18:03:50 -08:00
Ion	d8ae9de99c	Caching task resource requirements. (#3231 ) * caching resource requirements * small fixes * avoid copying the resource map	2018-11-05 15:14:09 -08:00
Philipp Moritz	0da15b1c1f	Fix build system dependency for local_scheduler_client (#3215 )	2018-11-03 13:19:02 -07:00
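
The ray.wait fix listed above (4abafd7e62, #3445) describes a callback arriving for a wait request that has already completed. The toy model below is a minimal sketch of that sequence, not Ray's raylet code: the `WaitManager` class, its method names, and the choice to ignore a stale callback are all assumptions made for illustration. It only shows why a handler should tolerate a callback whose request has already been removed.

```python
from typing import Callable, Dict, List


class WaitManager:
    """Hypothetical toy model of wait requests completed by location callbacks."""

    def __init__(self) -> None:
        # wait_id -> state for requests that have not completed yet
        self._active: Dict[int, dict] = {}
        self._next_id = 0

    def wait(self, object_ids: List[str], num_returns: int,
             on_done: Callable[[List[str]], None]) -> int:
        """Register a wait request and return its id."""
        wait_id = self._next_id
        self._next_id += 1
        self._active[wait_id] = {
            "remaining": set(object_ids),
            "found": [],
            "num_returns": num_returns,
            "on_done": on_done,
        }
        return wait_id

    def on_object_located(self, wait_id: int, object_id: str) -> bool:
        """Handle one location callback; return False if it was stale."""
        state = self._active.get(wait_id)
        if state is None:
            # The wait already completed and was removed (step 5 in the
            # commit message). Ignore the callback instead of crashing.
            return False
        if object_id in state["remaining"]:
            state["remaining"].discard(object_id)
            state["found"].append(object_id)
        if len(state["found"]) >= state["num_returns"]:
            # Enough objects found: complete the wait and remove the request.
            del self._active[wait_id]
            state["on_done"](list(state["found"]))
        return True


# Replay the sequence from the commit message: wait on A and B with
# num_returns=1, then deliver a callback for each object.
results: List[List[str]] = []
manager = WaitManager()
wait_id = manager.wait(["A", "B"], num_returns=1, on_done=results.append)
first = manager.on_object_located(wait_id, "A")   # completes the wait
second = manager.on_object_located(wait_id, "B")  # stale, safely ignored
```

In this sketch the callback for B finds no active request and returns without effect, which is the case the commit message says used to crash the raylet.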

658 commits