* Add a flag for whether an object has been created before
* Add regression test
* doc
* Share object directory between object and node managers
* Treat evicted actor tasks as failed
* minor
* Check return value
* Fix bug where object locations weren't getting updated on client death
* Fix mac build
* Use RayTaskError
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this can actually happen, depending on the order of calls (see the sketch after the steps below). More precisely:
1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists, and the raylet crashes.
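A minimal Python sketch of the triggering call pattern (this reproduces the ordering above from the client side; it is illustrative, not a deterministic repro):

```python
import ray

ray.init()

# Both objects exist before the wait, so location callbacks for A and B
# are both posted from the cache (step 3 above).
a = ray.put("A")
b = ray.put("B")

# num_returns=1 means the wait request completes as soon as the first
# callback fires (step 4); the cached callback for the other object then
# arrives for a wait request that no longer exists (step 5).
ready, remaining = ray.wait([a, b], num_returns=1)
```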
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying a pull from the same object manager. Randomize the choice of object manager to pull from.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
* Make scheduling queues RemoveTasks return task states as well.
* Add test
* Don't unsubscribe for infeasible tasks when spilling over.
* Linting
* Address comments.
* speed up task dispatch
* minor changes
* improved comments
* improved comments
* change argument of DispatchTasks to list of tasks
* dispatch only tasks whose dependencies have been fulfilled (see the sketch after this list)
* some updated comments
* refactored DispatchQueue() and AssignTask() to avoid copying the ready list
* minor fixes
* some more minor fixes
* some more minor fixes
* added more comments
* better comments?
* addressed all review feedback, except making the argument of AssignTask() const
* AssignTask() now takes a const argument
* Do the task copy outside of the callback
* fix linting
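A minimal Python sketch of the resulting dispatch behavior (the real code is the C++ DispatchTasks()/AssignTask() in the raylet; the types and names here are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    dependencies: list = field(default_factory=list)

def dispatch_tasks(ready_tasks, available_objects, assign_task):
    # Take the candidate tasks as a list and iterate in place instead of
    # copying the ready queue, assigning only those tasks whose
    # dependencies have all been fulfilled locally.
    for task in ready_tasks:
        if all(dep in available_objects for dep in task.dependencies):
            assign_task(task)
```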
## What do these changes do?
This PR exposes a command-line option for setting a config parameter. This is important for letting certain tests (e.g., fault-tolerance tests that remove nodes) run quickly.
Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.
#3239 depends on this.
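A sketch of how such a parameter might be passed from a test (the `_internal_config` keyword and the `num_heartbeats_timeout` key are assumptions here, not the confirmed interface):

```python
import json
import ray

# Hypothetical: shorten the node-death timeout so fault-tolerance tests
# that remove nodes finish quickly. The parameter name and config key
# are illustrative, not the confirmed API.
ray.init(_internal_config=json.dumps({"num_heartbeats_timeout": 10}))
```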
TODO:
- [x] Add documentation to method arguments before merging.
- [x] Add test to verify this works?
* Increase timeout to 10s
* Skip eviction reconstruction tests
* Add stress test for many actors to one
* Fix test by shortening it.
* lower number of processes in stress test
* Skip slow test
* Policy that flushes the lineage stash immediately (see the sketch after this list)
* Fix bug where remote tasks in uncommitted lineage weren't getting subscribed to, add regression test
* test
* Fix bug where waiting task was getting subscribed
* Cleanup
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* Update src/ray/raylet/lineage_cache.cc
Co-Authored-By: stephanie-wang <swang@cs.berkeley.edu>
* cleanup
* cleanup
* Add another test for task with many parents
* fix, unsubscribe to new waiting tasks
* Unsubscribe as soon as the commit notification is handled
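A minimal Python sketch of the combined behavior (the real implementation is the C++ LineageCache; the names and structure here are assumptions):

```python
class LineageCache:
    def __init__(self, gcs):
        self.gcs = gcs
        self.uncommitted = {}   # task_id -> lineage entry
        self.subscribed = set()

    def add_task(self, task_id, entry):
        # Flush-immediately policy: write to the GCS right away rather
        # than stashing the entry and batching the flush later.
        self.uncommitted[task_id] = entry
        self.gcs.write(task_id, entry)
        # Subscribe so we learn when the GCS commits the entry.
        self.gcs.subscribe(task_id)
        self.subscribed.add(task_id)

    def handle_commit(self, task_id):
        # Unsubscribe as soon as the commit notification is handled.
        self.uncommitted.pop(task_id, None)
        if task_id in self.subscribed:
            self.gcs.unsubscribe(task_id)
            self.subscribed.discard(task_id)
```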
We found a large number of pub-sub keys in the GCS with no content in them (the problem is worse when the wait ID is used in the key name).
The logic for deleting empty pub-sub keys from the GCS existed in legacy Ray but was missing from the raylet.
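A Python sketch of the cleanup being added (the key layout and values are illustrative; the real change is in the raylet's GCS client):

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)

# Hypothetical layout: pending notifications for a wait request are kept
# in a list under a per-wait-ID key. Real GCS key names differ.
key = b"TABLE_PUBSUB:some-wait-id"
r.rpush(key, b"object-ready")
r.publish(key, b"object-ready")

# ...once the subscriber has consumed the notifications:
entries = r.lrange(key, 0, -1)
r.delete(key)  # delete the key instead of leaving an empty one behind
```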
This fixes a problem that @devin-petersohn observed on the Windows Subsystem for Linux.
In theory, Redis should already be up by the time the async connect happens, so no retries should be needed. However, on the Windows Subsystem for Linux, the async connect was failing even though the synchronous connect worked. Windows may have different semantics here than Linux.
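A sketch of the retry behavior in Python (the actual change is in the C++ Redis client's async connect; this just illustrates retrying instead of failing on the first attempt):

```python
import socket
import time

def connect_with_retries(host, port, num_retries=5, delay_s=0.4):
    # On WSL the first connect can fail even though Redis is up,
    # so retry a few times before giving up.
    for attempt in range(num_retries):
        try:
            return socket.create_connection((host, port))
        except OSError:
            if attempt == num_retries - 1:
                raise
            time.sleep(delay_s)

# conn = connect_with_retries("127.0.0.1", 6379)
```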
This tests the case in which a worker is blocked in a call to ray.get or ray.wait and then dies. Later, the object that the worker was waiting for becomes available. We need to make sure the raylet does not try to send a message to the dead worker and crash as a result. Related to #2790.
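A Python sketch of the scenario (how the worker is killed is left out; the point is the ordering of block, death, and object availability):

```python
import time
import ray

ray.init()

@ray.remote
def slow_value():
    time.sleep(60)
    return 1

@ray.remote
def waiter(refs):
    # Wrapping the ref in a list keeps it from being resolved before the
    # task starts, so the worker blocks inside ray.get itself.
    return ray.get(refs[0])

ref = slow_value.remote()
pending = waiter.remote([ref])

# Kill the worker process running `waiter` while it is blocked in ray.get.
# When `slow_value` later finishes and `ref` becomes available, the raylet
# must not crash trying to notify the now-dead worker.
```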