hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 04:46:38 -04:00

Author	SHA1	Message	Date
Stephanie Wang	ff8e7f8259	Actor checkpointing for distributed actor handles (#1498 ) * Expose calls to get and set the actor frontier * Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded * Prototype for actor checkpointing * Filter out duplicate tasks on the local scheduler * Clean up some of the Python checkpointing code * More cleanups * Documentation * cleanup and fix unit test * Allow remote checkpoint calls through actor handle * Check whether object is local before reconstructing * Enable checkpointing for distributed actor handles, refactor tests * Fix local scheduler tests * lint * Address comments * lint * Skip tests that fail on new GCS * style * Don't put same object twice when setting the actor frontier * Address Philipp's comments, cleaner fbs naming	2018-02-07 11:19:32 -08:00
Philipp Moritz	a3f8fa426b	Start integrating new GCS APIs (#1379 ) * Start integrating new GCS calls * fixes * tests * cleanup * cleanup and valgrind fix * update tests * fix valgrind * fix more valgrind * fixes * add separate tests for GCS * fix linting * update tests * cleanup * fix python linting * more fixes * fix linting * add plasma manager callback * add some documentation * fix linting * fix linting * fixes * update * fix linting * fix * add spillback count * fixes * linting * fixes * fix linting * fix * fix * fix	2018-01-31 11:01:12 -08:00
Stephanie Wang	74718efa73	Nondeterministic reconstruction for actors (#1344 ) * Add failing unit test for nondeterministic reconstruction * Retry scheduling actor tasks if reassigned to local scheduler * Update execution edges asynchronously upon dispatch for nondeterministic reconstruction * Fix bug for updating checkpoint task execution dependencies * Update comments for deterministic reconstruction * cleanup * Add (and skip) failing test case for nondeterministic reconstruction * Suppress test output	2018-01-21 13:44:13 -08:00
Robert Nishihara	96463c680c	Allow actor methods to return multiple object IDs. (#1296 ) * Allow actor methods to return multiple object IDs. * Add test. * Fixes * Remove outdated comment. * Add comment and assert	2017-12-09 10:37:57 -08:00
Robert Nishihara	c21e189371	Allow scheduling with arbitrary user-defined resource labels. (#1236 ) * Enable scheduling with custom resource labels. * Fix. * Minor fixes and ref counting fix. * Linting * Use .data() instead of .c_str(). * Fix linting. * Fix ResourcesTest.testGPUIDs test by waiting for workers to start up. * Sleep in test so that all tasks are submitted before any completes.	2017-12-01 11:41:40 -08:00
Robert Nishihara	477a40f76d	Prohibit returning actor handles and also update actor documentation. (#1246 ) * Prohibit returning actor handles and also update actor documentation. * Clarify documentation.	2017-11-23 09:37:24 -08:00
Stephanie Wang	af47737bd5	Prototype distributed actor handles (#1137 ) * Add actor handle ID to the task spec * Local scheduler dispatches actor tasks according to a task counter per handle * Fix python test * Allow passing actor handles into tasks. Not completely working yet. Also this is very messy. * Fixes, should be roughly working now. * Refactor actor handle wrapper * Fix __init__ tests * Terminate actor when the original handle goes out of scope * TODO and a couple test cases * Make tests for unsupported cases * Fix Python mode tests * Linting. * Cache actor definitions that occur before ray.init() is called. * Fix export actor class * Deterministically compute actor handle ID * Fix __getattribute__ * Fix string encoding for python3 * doc * Add comment and assertion.	2017-10-19 23:49:59 -07:00
Stephanie Wang	15486a14a0	Refactor actor task queues (#1118 ) * Refactor add_task_to_actor_queue into queue_actor_task and insert_actor_task_queue * Refactor actor task queue to share the waiting task queue * Fix	2017-10-13 20:52:11 -07:00
Stephanie Wang	3764f2f2e1	Actor checkpointing with object lineage reconstruction (#1004 ) * Worker reports error in previous task, actor task counter is incremented after task is successful * Refactor actor task execution - Return new task counter in GetTaskRequest - Update worker state for actor tasks inside of the actor method executor * Manually invoked checkpoint method * Scheduling for actor checkpoint methods * Fix python bugs in checkpointing * Return task success from worker to local scheduler instead of actor counter * Kill local schedulers halfway through actor execution instead of waiting for all tasks to execute once * Remove redundant actor tasks during dispatch, reconstruct missing dependencies for actor tasks * Make executor for temporary actor methods * doc * Set default argument for whether the previous task was a success * Refactor actor method call * Simplify checkpoint task submission * lint * fix philipp's comments * Add missing line * Make actor reconstruction tests run faster * Unimportant whitespace. * Unimportant whitespace. * Update checkpoint method signature * Documentation and handle exceptions during checkpoint save/resume * Rename get_task message field to actor_checkpoint_failed * Fix bug. * Remove debugging check, redirect test output	2017-10-12 09:53:32 -07:00
Robert Nishihara	7a954f4b5f	Use monotonic clock for some python tests. (#1112 )	2017-10-11 19:58:59 -07:00
Robert Nishihara	4669c59fa8	Release GPU resources as soon as an actor exits. (#1088 ) * Release GPU resources as soon as an actor exits. * Add a test. * Store local_scheduler_id and driver_id in the worker object instead of the actor object.	2017-10-06 17:58:19 -07:00
Stephanie Wang	aebe9f9374	Fix actor garbage collection by breaking cyclic references (#1064 ) * Fix bug in wait_for_pid_to_exit, add test for actor deletion. * Fix actor garbage collection by breaking cyclic references * Add test for calling actor method immediately after actor creation. * Fix bug, must dispatch tasks when workers are killed. * Fix python test * Fix cyclic reference problem by creating ActorMethod objects on the fly. * Try simply increasing the time allowed for many_drivers_test.py.	2017-10-05 00:55:33 -07:00
Stephanie Wang	99c8b1f38c	Actor fault tolerance using object lineage reconstruction (#902 ) * Revert Python actor reconstruction * Actor reconstruction using object lineage * Add dummy arguments and return values for actor tasks * Pin dummy outputs for actor tasks * Skip checkpointing test for now * TODOs * minor edits * Generate dummy object dependencies in Python, not C * Fix linting. * Move actor counter and dummy objects inside of the actor handle * Refactor Worker._process_task, suppress exception propagation for sequential actor tasks	2017-09-10 19:29:28 -07:00
Robert Nishihara	dbe3d9351c	Prototype actor checkpointing. (#814 ) * Initial testing of checkpointing functions. * Save checkpoints in Redis. * Pipe checkpoint_interval through remote decorator. * Add a test. * Small cleanups. * Submit dummy tasks when reconstructing tasks before the most recent tasks so that we don't end up reconstructing the arguments for those tasks. * Remove old checkpoints to save space. * Fix linting.	2017-08-07 17:52:39 -07:00
Robert Nishihara	cb84972f6b	Recreate actors when local schedulers die. (#804 ) * Reconstruct actor state when local schedulers fail. * Simplify construction of arguments to pass into default_worker.py from local scheduler. * Remove deprecated ray.actor. * Simplify actor reconstruction method. * Fix linting. * Small fixes.	2017-08-02 18:02:52 -07:00
Robert Nishihara	52a27be364	Better logging in tests. (#790 )	2017-07-31 22:30:46 -07:00
Robert Nishihara	e0867c8845	Switch Python indentation from 2 spaces to 4 spaces. (#726 ) * 4 space indentation for actor.py. * 4 space indentation for worker.py. * 4 space indentation for more files. * 4 space indentation for some test files. * Check indentation in Travis. * 4 space indentation for some rl files. * Fix failure test. * Fix multi_node_test. * 4 space indentation for more files. * 4 space indentation for remaining files. * Fixes.	2017-07-13 21:53:57 +00:00
Robert Nishihara	5ebc2f3f2e	Do resource bookkeeping for actor methods. (#682 ) * Dispatch regular and actor tasks when resources become available. * Make actor methods do resource bookkeeping and add test. * Remove unnecessary field. * Fix linting. * Fix actor test. * Maintain set of actors with pending tasks to speed up task dispatch. * Exit early from task dispatch if there are no resources available. * Fix linting. * Fix error. * Fix bug related to iterator invalidation. * When an actor is removed, remove it from the set of actors with pending tasks.	2017-06-21 05:52:45 +00:00
Robert Nishihara	019ba07e9c	Correct actor class name and module. (#675 ) * Correct actor class name and module. * Add test. * Fix linting.	2017-06-17 05:44:42 +00:00
Robert Nishihara	d0bfc0a849	Clean up actor workers when actor handle goes out of scope. (#617 )	2017-06-01 07:02:43 +00:00
Robert Nishihara	c647dd5f6c	Make it possible to use actor definitions within remote functions and other actors. (#587 ) * Enable remote function and actor definitions to close over actor definitions. * Give better error message if actor objects are pickled. * Add tests for closing over actor definitions. * Fix linting.	2017-05-24 15:43:32 -07:00
Robert Nishihara	ec2534422b	Remove register_class from API. (#550 ) * Perform ray.register_class under the hood. * Fix bug. * Release worker lock when waiting for imports to arrive in get. * Remove calls to register_class from examples and tests. * Clear serialization state between tests. * Fix bug and add test for multiple custom classes with same name. * Fix failure test. * Fix linting and cleanups to python code. * Fixes to documentation. * Implement recursion depth for recursively registering classes. * Fix linting. * Push warning to user if waiting for class for too long. * Fix typos. * Don't export FunctionToRun if pickling the function fails. * Don't broadcast class definition when pickling class.	2017-05-16 18:38:52 -07:00
Robert Nishihara	9f91eb8c91	Change API for remote function declaration, actor instantiation, and actor method invocation. (#541 ) * Direction substitution of @ray.remote -> @ray.task. * Changes to make '@ray.task' work. * Instantiate actors with Class.remote() instead of Class(). * Convert actor instantiation in tests and examples from Class() to Class.remote(). * Change actor method invocation from object.method() to object.method.remote(). * Update tests and examples to invoke actor methods with .remote(). * Fix bugs in jenkins tests. * Fix example applications. * Change @ray.task back to @ray.remote. * Changes to make @ray.actor -> @ray.remote work. * Direct substitution of @ray.actor -> @ray.remote. * Fixes. * Raise exception if @ray.actor decorator is used. * Simplify ActorMethod class.	2017-05-14 00:01:20 -07:00
Robert Nishihara	f32368bcbe	Prevent actors from being placed on removed nodes or nodes with no CPUs. (#527 ) * Make note about bug in which actor creation notification message is not received. * Prevent actors from being created on removed nodes. * Prevent actors from being created on nodes with no CPUs. * Fix linting. * Add test for scheduling actors on local schedulers with no CPUs. * Improve error message when actors created before ray.init called.	2017-05-08 20:39:43 -07:00
Robert Nishihara	c688a64235	Expose GPU IDs to remote functions. (#496 ) * Change local scheduler bookkeeping to use GPU IDs. * Update actor test. * Add tests for actors and tasks simultaneously using GPUs. * Add additional task GPU ID test. * Fix linting. * Make redis GPU assignment ignore GPU IDs. * Small fix.	2017-05-07 13:03:49 -07:00
Robert Nishihara	245c8ab888	Make sure user seeding does not affect actor ID generation. (#506 ) * Make sure user seeding does not affect actor ID generation. * Fix linting. * Add test.	2017-05-03 16:29:55 -07:00
Robert Nishihara	f4c1adae17	Unify function signature handling between remote functions and actor … (#441 ) * Unify function signature handling between remote functions and actor methods. * Fixes. * Fix tests.	2017-04-08 21:34:13 -07:00
Robert Nishihara	7cd00741b1	Suppress irrelevant Redis connection errors. (#434 ) * Suppress error messages in worker import thread when Redis terminates. * Suppress some warnings from one of the tests.	2017-04-07 23:19:24 -07:00
Robert Nishihara	ba02fc0eb0	Run flake8 in Travis and make code PEP8 compliant. (#387 )	2017-03-21 12:57:54 -07:00
Philipp Moritz	068429ffd8	Convert local scheduler messages to flatbuffers (#340 ) * use flatbuffer messages for local scheduler * make sure constructor gets called for C++ object ObjectInfoT * fix typo * fix Robert's comments * Small change to actor test. * fix valgrind error * linting * free notification * fix * valgrind * fix valgrind * fix other bugs * valgrind fix * fixes * more fixes * Small changes to comments.	2017-03-15 16:27:52 -07:00
Robert Nishihara	39b7abefc5	Fix test failures in actor_test.py. (#317 )	2017-03-01 23:26:39 -08:00
Robert Nishihara	e399f57e6b	Let actors use GPUs. (#302 ) * Add num_cpus and num_gpus to actor decorator. * Assign GPU IDs to actors. * Add additional actor test. * Remove duplicated line. * Factor out local scheduler selection method. * Add test and simplify local scheduler selection.	2017-02-21 01:13:04 -08:00
Robert Nishihara	abd9987e3b	Fix unreliable actor test. (#295 )	2017-02-18 00:51:08 -08:00
Philipp Moritz	12a68e84d2	Implement a first pass at actors in the API. (#242 ) * Implement actor field for tasks * Implement actor management in local scheduler. * initial python frontend for actors * import actors on worker * IPython code completion and tests * prepare creating actors through local schedulers * add actor id to PyTask * submit actor calls to local scheduler * starting to integrate * simple fix * Fixes from rebasing. * more work on python actors * Improve local scheduler actor handlers. * Pass actor ID to local scheduler when connecting a client. * first working version of actors * fixing actors * fix creating two copies of the same actor * fix actors * remove sleep * get rid of export synchronization * update * insert actor methods into the queue in the right order * remove print statements * make it compile again after rebase * Minor updates. * fix python actor ids * Pass actor_id to start_worker. * add test * Minor changes. * Update actor tests. * Temporary plan for import counter. * Temporarily fix import counters. * Fix some tests. * Fixes. * Make actor creation non-blocking. * Fix test? * Fix actors on Python 2. * fix rare case. * Fix python 2 test. * More tests. * Small fixes. * Linting. * Revert tensorflow version to 0.12.0 temporarily. * Small fix. * Enhance inheritance test.	2017-02-15 00:10:05 -08:00

34 commits