hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-11 21:56:39 -04:00

Author	SHA1	Message	Date
Melih Elibol	4b1c8be4fe	Fix setting log-level to debug. (#1432 )	2018-01-21 21:51:05 -08:00
Stephanie Wang	74718efa73	Nondeterministic reconstruction for actors (#1344 ) * Add failing unit test for nondeterministic reconstruction * Retry scheduling actor tasks if reassigned to local scheduler * Update execution edges asynchronously upon dispatch for nondeterministic reconstruction * Fix bug for updating checkpoint task execution dependencies * Update comments for deterministic reconstruction * cleanup * Add (and skip) failing test case for nondeterministic reconstruction * Suppress test output	2018-01-21 13:44:13 -08:00
Philipp Moritz	3d224c4edf	Second Part of Internal API Refactor (#1326 )	2017-12-26 16:22:04 -08:00
Robert Nishihara	76b6b4a2d3	When killing worker, release resources before dispatching tasks. (#1327 )	2017-12-15 18:12:03 -08:00
Stephanie Wang	12fdb3f53a	Convert actor dummy objects to task execution edges. (#1281 ) * Define execution dependencies flatbuffer and add to Redis commands * Convert TaskSpec to TaskExecutionSpec * Add execution dependencies to Python bindings * Submitting actor tasks uses execution dependency API instead of dummy argument * Fix dependency getters and some cleanup for fetching missing dependencies * C++ convention * Make TaskExecutionSpec a C++ class * Convert local scheduler to use TaskExecutionSpec class * Convert some pointers to references * Finish conversion to TaskExecutionSpec class * fix * Fix * Fix memory errors? * Cast flatbuffers GetSize to size_t * Fixes * add more retries in global scheduler unit test * fix linting and cast fbb.GetSize to size_t * Style and doc * Fix linting and simplify from_flatbuf.	2017-12-14 20:47:54 -08:00
Robert Nishihara	f75b51d178	Register Common.error with local scheduler extension module. (#1316 ) * Register Common.error with local scheduler extension module. * Add test.	2017-12-13 11:55:54 -08:00
Stephanie Wang	044548bcff	Mark the killed as done outside of loop (#1284 )	2017-12-02 14:42:16 -08:00
Robert Nishihara	c21e189371	Allow scheduling with arbitrary user-defined resource labels. (#1236 ) * Enable scheduling with custom resource labels. * Fix. * Minor fixes and ref counting fix. * Linting * Use .data() instead of .c_str(). * Fix linting. * Fix ResourcesTest.testGPUIDs test by waiting for workers to start up. * Sleep in test so that all tasks are submitted before any completes.	2017-12-01 11:41:40 -08:00
Eric Liang	9233e496cc	Raise exception when getting the task results of workers that died (#1224 ) * wip * with test * add timeout * also add test for f * remove on cleanup * update * wip * fix tests * mark actor removed in redis * clang-format * fix bug when no-inprogress tasks * try to set task status done * Add comment.	2017-11-20 15:18:39 -08:00
Peter Schafhalter	d986294c2b	Replace UT strings in local scheduler (#1213 ) * Convert to string using std::string * Fix linting issue * Fix linting * Construct db_connect_args using vector * Use vector size() instead of num_args * Hopefully fix linting now	2017-11-17 16:14:46 -08:00
Stephanie Wang	c70430f322	Fix bugs in plasma manager transfer (#1188 ) * Plasma client test for plasma abort * Use ray-project/arrow:abort-objects branch * Set plasma manager connection cursor to -1 when not in use * Handle transfer errors between plasma managers, abort unsealed objects * Add TODO for local scheduler exiting on plasma manager death * Revert "Plasma client test for plasma abort" This reverts commit e00fbd58dc4a632f58383549b19fb9057b305a14. * Upgrade arrow to version with PlasmaClient::Abort * Fix plasma manager test * Fix plasma test * Temporarily use arrow fork for testing * fix and set arrow commit * Fix plasma test * Fix plasma manager test and make write_object_chunk consistent with read_object_chunk * style * upgrade arrow	2017-11-15 22:32:38 -08:00
Peter Schafhalter	9a6a056609	Convert UT datastructures in tests (#1203 ) * bind_ipc_sock_retry returns std::string * snprintf -> std::snprintf * Fix formatting * Use stringstream instead of snprintf * Fix typo	2017-11-11 16:55:05 -08:00
Philipp Moritz	e798a652bc	Change TaskSpec to allow multiple object IDs per argument. (#1204 ) * Implement object ID bags * linting * fix tests * fix linting * fix comments	2017-11-10 16:33:34 -08:00
Stephanie Wang	07f0532b9b	Local scheduler filters out dead clients during reconstruction (#1182 ) * Object table lookup returns vector of DBClientID instead of address strings * Add node IP address to DBClient notification * DB client cache stores entire DB client, convert addresses to std::string * get cached db client returns the client * Expose a call to initialize the redis cache * Local scheduler filters out dead clients during reconstruction * Remove node ip address from dbclient, use aux_address for plasma managers * Get entire db client entry when not found in cache * Fix common tests * Fix address in tests * Push error to driver if driver task did the put * Address Robert's comments and cleanup * Remove unused Redis command * Fix db test	2017-11-10 11:29:24 -08:00
Robert Nishihara	1c6b30b5e2	Move all config constants into single file. (#1192 ) * Initial pass at factoring out C++ configuration into a single file. * Expose config through Python. * Forward declarations. * Fixes with Python extensions * Remove old code. * Consistent naming for constants. * Fixes * Fix linting. * More linting. * Whitespace * rename config -> _config. * Move config inside a class. * update naming convention * Fix linting. * More linting * More linting. * Add in some more constants. * Fix linting	2017-11-08 11:10:38 -08:00
Stephanie Wang	af47737bd5	Prototype distributed actor handles (#1137 ) * Add actor handle ID to the task spec * Local scheduler dispatches actor tasks according to a task counter per handle * Fix python test * Allow passing actor handles into tasks. Not completely working yet. Also this is very messy. * Fixes, should be roughly working now. * Refactor actor handle wrapper * Fix __init__ tests * Terminate actor when the original handle goes out of scope * TODO and a couple test cases * Make tests for unsupported cases * Fix Python mode tests * Linting. * Cache actor definitions that occur before ray.init() is called. * Fix export actor class * Deterministically compute actor handle ID * Fix __getattribute__ * Fix string encoding for python3 * doc * Add comment and assertion.	2017-10-19 23:49:59 -07:00
Robert Nishihara	1cdc2fb011	Clean up event loop and callbacks when processes exit. (#1125 ) * Clean up event loop and callbacks when processes exit. * Fix bug.	2017-10-19 17:07:03 -07:00
Robert Nishihara	f3e3c7ec71	Add is_actor_checkpoint_method to TaskSpec. (#1117 ) * Add is_actor_checkpoint_method to TaskSpec. * Fix linting. * Fix rebase error. * Fix errors from rebase.	2017-10-15 16:52:10 -07:00
Stephanie Wang	15486a14a0	Refactor actor task queues (#1118 ) * Refactor add_task_to_actor_queue into queue_actor_task and insert_actor_task_queue * Refactor actor task queue to share the waiting task queue * Fix	2017-10-13 20:52:11 -07:00
Robert Nishihara	486cb64e3f	Compile with -Werror and -Wall (#1116 ) * Compile global scheduler with -Werror -Wall. * Compile plasma manager with -Werror -Wall. * Compile local scheduler with -Werror -Wall. * Compile common code with -Werror -Wall. * Signed/unsigned comparisons. * More signed/unsigned fixes. * More signed/unsigned fixes and added extern keyword. * Fix linting. * Don't check strict-aliasing because Python.h doesn't pass.	2017-10-12 21:00:23 -07:00
Stephanie Wang	3764f2f2e1	Actor checkpointing with object lineage reconstruction (#1004 ) * Worker reports error in previous task, actor task counter is incremented after task is successful * Refactor actor task execution - Return new task counter in GetTaskRequest - Update worker state for actor tasks inside of the actor method executor * Manually invoked checkpoint method * Scheduling for actor checkpoint methods * Fix python bugs in checkpointing * Return task success from worker to local scheduler instead of actor counter * Kill local schedulers halfway through actor execution instead of waiting for all tasks to execute once * Remove redundant actor tasks during dispatch, reconstruct missing dependencies for actor tasks * Make executor for temporary actor methods * doc * Set default argument for whether the previous task was a success * Refactor actor method call * Simplify checkpoint task submission * lint * fix philipp's comments * Add missing line * Make actor reconstruction tests run faster * Unimportant whitespace. * Unimportant whitespace. * Update checkpoint method signature * Documentation and handle exceptions during checkpoint save/resume * Rename get_task message field to actor_checkpoint_failed * Fix bug. * Remove debugging check, redirect test output	2017-10-12 09:53:32 -07:00
Robert Nishihara	b585001881	When a task is passed to the global scheduler, if it is not received,… (#1106 ) * When a task is passed to the global scheduler, if it is not received, then try again. * Call give_task_to_global_scheduler directly (same with local).	2017-10-12 00:04:38 -07:00
Stephanie Wang	aebe9f9374	Fix actor garbage collection by breaking cyclic references (#1064 ) * Fix bug in wait_for_pid_to_exit, add test for actor deletion. * Fix actor garbage collection by breaking cyclic references * Add test for calling actor method immediately after actor creation. * Fix bug, must dispatch tasks when workers are killed. * Fix python test * Fix cyclic reference problem by creating ActorMethod objects on the fly. * Try simply increasing the time allowed for many_drivers_test.py.	2017-10-05 00:55:33 -07:00
Robert Nishihara	1488975d1b	Add timing statement to loop that calls redis_get_cached_db_client be… (#1045 ) * Add timing statement to loop that calls redis_get_cached_db_client because it has been slow in the past. * Fix linting. * Refactoring to make manager vectors into std::vector. * Fix linting. * Fixes.	2017-10-02 10:46:21 -07:00
Robert Nishihara	a31d138f21	Don't log when a worker can't be started. (#1056 )	2017-10-02 10:32:46 -07:00
Robert Nishihara	ce278aa06a	Fix valgrind tests. (#1037 ) * Comment out local scheduler valgrind test. * Fix free/delete error. * More free -> delete errors * One more free -> delete and also clean up callback state in plasma manager. * Add set -x to run_valgrind scripts. * Fix valgrind error in CreateLocalSchedulerInfoMessage.	2017-09-30 00:11:09 -07:00
Eric Liang	ba153adc4c	Downgrade severity of most common messages (#1039 ) * downgrade severity of most common messages * update	2017-09-30 00:01:49 -07:00
Stephanie Wang	74ac80631b	Local scheduler sends a null heartbeat to global scheduler (#962 ) * Local scheduler sends a null heartbeat to global scheduler to notify death * Add whitespace. * Speed up component failures test * Free local scheduler state upon plasma manager disconnection	2017-09-12 10:45:21 -07:00
Stephanie Wang	99c8b1f38c	Actor fault tolerance using object lineage reconstruction (#902 ) * Revert Python actor reconstruction * Actor reconstruction using object lineage * Add dummy arguments and return values for actor tasks * Pin dummy outputs for actor tasks * Skip checkpointing test for now * TODOs * minor edits * Generate dummy object dependencies in Python, not C * Fix linting. * Move actor counter and dummy objects inside of the actor handle * Refactor Worker._process_task, suppress exception propagation for sequential actor tasks	2017-09-10 19:29:28 -07:00
Philipp Moritz	7030ef366f	Rebase Ray on latest arrow (remove numbuf from Ray). (#910 ) * remove some stuff * put get roundtrip working * fixes * more fixes * cleanup * fix tests * latest arrow * fixes * fix tests * fix linting * rebase * fixes * fix bug * bring back libgcc error * fix linting * use official arrow repo * fixes	2017-09-04 22:58:49 -07:00
Stephanie Wang	7496c98010	Fault tolerance race (#894 ) * Remove race between local scheduler disconnecting and global scheduler assigning a task * Fix number of workers started in component failures test * Fix race between global scheduler retrying a task assignment and monitor cleaning up task table. The global scheduler should only retry the task assignment if the local scheduler is still alive. * Clean up task_table_update callback if failure * Look up current local scheduler mapping when retrying actor task submission * Log warning if no subscribers received a task table update * Clean up database handle memory in local scheduler	2017-08-30 22:20:50 -07:00
Alexey Tumanov	fc885bd918	Adding basic support for a user-interpretable resource label (#761 ) * adding support for the user-interpretable label(UIR) * more plumbing for num_uirs further upstream; set to infty when specified on cmd line * pass default num_uirs for actors; update GlobalStateAPI * support num_uirs in ray.init() * local scheduler resource accounting: support num_uirs; prep for vectorized resource accounting * global scheduler test updated * Fix bug introduced by rebase. * Rename UIR -> CustomResource and add test. * Small changes and use constexpr instead of macros. * Linting and some renaming. * Reorder some code. * Remove cpus_in_use and fix bug. * Add another test and make a small change. * Rephrase documentation about feature stability.	2017-08-08 02:53:59 -07:00
Philipp Moritz	054ae4180e	Fix installation instruction for ubuntu 14.04 (#805 ) * fix installation instruction for ubuntu 14.04 * upgrade cmake requirements * fix	2017-08-02 18:14:14 -07:00
Robert Nishihara	cb84972f6b	Recreate actors when local schedulers die. (#804 ) * Reconstruct actor state when local schedulers fail. * Simplify construction of arguments to pass into default_worker.py from local scheduler. * Remove deprecated ray.actor. * Simplify actor reconstruction method. * Fix linting. * Small fixes.	2017-08-02 18:02:52 -07:00
Philipp Moritz	c3b39b4d86	Pull Plasma from Apache Arrow and remove Plasma store from Ray. (#692 ) * Rebase Ray on top of Plasma in Apache Arrow * add thirdparty building scripts * use rebased arrow * fix * fix build * fix python visibility * comment out C tests for now * fix multithreading * fix * reduce logging * fix plasma manager multithreading * make sure old and new object IDs can coexist peacefully * more rebasing * update * fixes * fix * install pyarrow * install cython * fix * install newer cmake * fix * rebase on top of latest arrow * getting runtest.py run locally (needed to comment out a test for that to work) * work on plasma tests * more fixes * fix local scheduler tests * fix global scheduler test * more fixes * fix python 3 bytes vs string * fix manager tests valgrind * fix documentation building * fix linting * fix c++ linting * fix linting * add tests back in * Install without sudo. * Set PKG_CONFIG_PATH in build.sh so that Ray can find plasma. * Install pkg-config * Link -lpthread, note that find_package(Threads) doesn't seem to work reliably. * Comment in testGPUIDs in runtest.py. * Set PKG_CONFIG_PATH when building pyarrow. * Pull apache/arrow and not pcmoritz/arrow. * Fix installation in docker image. * adapt to changes of the plasma api * Fix installation of pyarrow module. * Fix linting. * Use correct python executable to build pyarrow.	2017-07-31 21:04:15 -07:00
alanamarzoev	8464d77c76	Change event logs to store one Redis ZSET per worker. (#705 ) * Changing to zset * Fixed bug. * Fixed another bug. * Modified task_profiles. * Removed extra file. * Modified task_profiles test. * WIP * WIP * Undid changes * Updated * WIP * Made changes according to comments. * Removed unneeded print. * Removed ujson usage. * failing test * tests passing * Fixed linting errors and modified style. * Fixed bug. * Fixed linting * Fixed according to comments. * Redis crashing? * Fixed linting * Fixed linting	2017-07-09 01:42:29 +02:00
Robert Nishihara	6c45657280	Reset the SIGCHLD handler after forking a worker to avoid influencing the worker. (#713 )	2017-07-07 14:50:37 +00:00
Robert Nishihara	1941e0f7b1	Fix compilation on CentOS. (#699 )	2017-06-26 05:54:21 +00:00
Robert Nishihara	ad480f8165	Don't reconstruct all objects in every fetch request in local scheduler. (#686 ) * Don't reconstruct all objects in every fetch request in local scheduler. * Separate out fetch timer and reconstruction timer. * Fix bug. * Bug fix. * Fix naming convention for global variables. * Address comments. * Make reconstruct_counter a static variable. * Fix linting. * Redo reconstruct handler using a set of objects to fetch. * Fix linting. * Replace set with vector.	2017-06-23 21:08:02 +00:00
Robert Nishihara	5ebc2f3f2e	Do resource bookkeeping for actor methods. (#682 ) * Dispatch regular and actor tasks when resources become available. * Make actor methods do resource bookkeeping and add test. * Remove unnecessary field. * Fix linting. * Fix actor test. * Maintain set of actors with pending tasks to speed up task dispatch. * Exit early from task dispatch if there are no resources available. * Fix linting. * Fix error. * Fix bug related to iterator invalidation. * When an actor is removed, remove it from the set of actors with pending tasks.	2017-06-21 05:52:45 +00:00
Robert Nishihara	3052ce25a6	Divide up large fetch requests from local scheduler, also print warni… (#683 ) * Divide up large fetch requests from local scheduler, also print warning if fetch handler is slow. * Fix linting. * Fix typo.	2017-06-19 22:57:51 +00:00
Robert Nishihara	9e4a3e4972	Replace some UT data structures in local scheduler with C++ STL. (#680 ) * Replace a local scheduler ut_array with a std::vector. * Replace vector of sizes in local scheduler with std::pair. * Remove utarray include. * Replace utarray with std::vector for reading local scheduler input messages. * Remove more UT data structures. * Remove UT includes. * Fix linting. * Include stdlib.h to find size_t. * Remove includes of stdbool.h. * Replace std::pair with TaskQueueEntry. * Fix redis tests. * Reinstate tests.	2017-06-19 21:58:42 +00:00
Robert Nishihara	f12db5f0e2	Divide large plasma requests into smaller chunks, and wait longer before reissuing large requests. (#678 ) * Divide large get requests into smaller chunks. * Divide fetches into smaller chunks. * Wait longer in worker and manager before reissuing fetch requests if there are many outstanding fetch requests. * Log warning if a handler in the local scheduler or plasma manager takes more than one second.	2017-06-18 04:42:15 +00:00
Robert Nishihara	96962cdee0	Log fatal error if plasma manager or local scheduler heartbeats take too long. (#676 ) * Log fatal error if plasma manager or local scheduler take too long to send heartbeat. * Fix linting. * Use int64_t for milliseconds since unix epoch.	2017-06-16 19:11:01 +00:00
Philipp Moritz	54925996ca	Allow remote functions to specify max executions and kill worker once limit is reached. (#660 ) * implement restarting workers after certain number of task executions * Clean up python code. * Don't start new worker when an actor disconnects. * Move wait_for_pid_to_exit to test_utils.py. * Add test. * Fix linting errors. * Fix linting. * Fix typo.	2017-06-13 00:34:58 -07:00
Robert Nishihara	a4d8e13094	Suppress excess warning messages related to intentional actor deaths. (#627 ) * Don't submit the actor destructor tasks when the job is exiting. * Don't propagate error messages to the driver when an actor exits intentionally.	2017-06-01 20:10:40 +00:00
Robert Nishihara	5f193afb87	Tell local scheduler to ignore SIGCHLD so that workers don't become zombies. (#620 )	2017-06-01 06:37:28 +00:00
Philipp Moritz	b94b4a35e0	Make the Plasma store ready for Arrow integration (#579 ) * port plasma to arrow * fixes * refactor plasma client * more modernization * fix plasma manager tests * everything compiles * fix plasma client tests * update plasma serialization tests * fix plasma manager tests * fix bug * updates * fix bug * fix tests * fix rebase * address comments * fix travis valgrind build * fix linting * fix include order again * fix linting * address comments	2017-05-31 16:24:23 -07:00
Richard Shin	16050eca8d	Don't link Python extensions to libpython*.so (#598 )	2017-05-25 19:01:12 -07:00
Philipp Moritz	3885d1b286	make builds with CMake incremental (#592 )	2017-05-24 21:52:33 -07:00

76 commits