hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-11 05:46:37 -04:00

Author	SHA1	Message	Date
Melih Elibol	6e06a9e338	XRay Task Forwarding Milestone (#1785 ) Summary: Able to run 1000 tasks with object dependencies on a set of distributed Raylets. Raylet Changes: Finalized ClientConnection class. Task forwarding. NM-to-NM heartbeats. NM resource accounting for tasks. Simple scheduling policy with task forwarding. Creating and maintaining NM 2 NM long-lived connections and reusing them for task forwarding. LineageCache Changes: LineageCache without cleanup of tasks committed by remote nodes. Lineage cache writeback and cleanup implementation. ObjectManager Changes: Object manager event loop/ClientConnection refactor. Multithreaded object manager (disabled in this PR). Testing Changes: Integration tests for task submission on multiple Raylets. Stress tests for object manager (with GCS and object store integration). Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Co-authored-by: Alexey Tumanov <atumanov@gmail.com>	2018-03-31 18:02:58 -07:00
Stephanie Wang	8704c8618c	Request and cancel notifications in the new GCS API (#1758 ) * Add TableRequestNotifications and TableCancelNotifications to Redis modules * Add RequestNotifications and CancelNotifications to generic GCS Table * Add tests for subscribing to specific keys * Remove TODO! * Return the current value at the key directly from RequestNotifications instead of through publish * Add unit test for Lookup failure callback * Modify tests to account for empty subscription response * Remove ObjectTable notification methods * Clean up message parsing and doc in redis context * Use vectors of DataT in all GCS callbacks * Clean up SubscriptionCallback * Move Table definitions into tables.cc * Refactor and document redis modules * doc * Fix new GCS build * Cleanups * Revert "Fix new GCS build" This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96. * Use vectors for internal callback interface, user-facing interface takes a reference to a single item * Fix new GCS build * Add unit test for Lookup failure callback * Fix compiler errors * Cleanup * Publish the entry ID with the notification * Check that the ID for a notification matches in client tests	2018-03-22 10:31:07 -07:00
Robert Nishihara	4658d0a180	Print error when actor takes too long to start, and refactor error me… (#1747 ) * Print error when actor takes too long to start, and refactor error message pushing. * Print warning every ten seconds. * Fix linting and tests. * Fix tests.	2018-03-19 20:24:35 -07:00
Melih Elibol	3c080f4baa	Add a callback for gcs table lookup failures. (#1702 ) * Add callback to gcs client for table lookup failures. * update plasma_manager reflecting changes to gcs callback.	2018-03-15 22:25:01 -07:00
Stephanie Wang	6114b6d20e	Implement the client table for the new GCS (#1674 ) * Add subscription callback to CallbackData * Implement ClientTable * Hook up ClientTable to AsyncGCSClient * Add client_info to GCSClient Connect interface * client table callbacks * Unit test for client table * Doc * Fix idempotency check * Fix mac build * Fix memory issues in gcs client test * Fix disconnection bug * lint	2018-03-11 19:17:18 -07:00
Robert Nishihara	0fcceef772	Update logging and check macros. (#1627 ) * Update logging and check macros. * Fix linting. * Fix RAY_DCHECK and unused variable. * Fix linting	2018-02-28 15:13:00 -08:00
Robert Nishihara	89db7841d2	Update arrow version. (#1512 )	2018-02-07 23:05:16 -08:00
Melih Elibol	d8850eac4b	Suppress object transfer requests when object is already being received. (#1430 ) * added deterministic check for objects received in fetch_timeout_handler. * use receive time, in case something goes wrong after object is received. * increase timeout for removal. * indentation fix. * make log info log debug. clean up debug log. * undo unnecessary changes. * changed description var. * shorten line 949. * incorporate feedback. * linting; make is_object_received function consts. * change semantics of received_objects to objects being received. added checks to both points at which objects are re-requested. updated object receive initialization accordingly. * eliminate erase on receive init. check call to request_transfer_from instead of request_transfer. * updated comments. * added todo for multiple object transfers. * linting.	2018-02-01 22:45:31 -08:00
Philipp Moritz	a3f8fa426b	Start integrating new GCS APIs (#1379 ) * Start integrating new GCS calls * fixes * tests * cleanup * cleanup and valgrind fix * update tests * fix valgrind * fix more valgrind * fixes * add separate tests for GCS * fix linting * update tests * cleanup * fix python linting * more fixes * fix linting * add plasma manager callback * add some documentation * fix linting * fix linting * fixes * update * fix linting * fix * add spillback count * fixes * linting * fixes * fix linting * fix * fix * fix	2018-01-31 11:01:12 -08:00
Robert Nishihara	5acc98e629	Update arrow with better dataframe serialization and get rid of custo… (#1413 ) * Update arrow with better dataframe serialization and get rid of custom dataframe serializers. * Update plasma client API. * Fix potential bug. * Bug fix. * Update arrow to use deduplicated file descriptors and mutable buffers. * Fix tests. * Update commit. * Update commit. * Update commit. * Update commit. * Update commit * Update commit back to arrow codebase.'	2018-01-24 10:03:29 -08:00
Philipp Moritz	3d224c4edf	Second Part of Internal API Refactor (#1326 )	2017-12-26 16:22:04 -08:00
Stephanie Wang	12fdb3f53a	Convert actor dummy objects to task execution edges. (#1281 ) * Define execution dependencies flatbuffer and add to Redis commands * Convert TaskSpec to TaskExecutionSpec * Add execution dependencies to Python bindings * Submitting actor tasks uses execution dependency API instead of dummy argument * Fix dependency getters and some cleanup for fetching missing dependencies * C++ convention * Make TaskExecutionSpec a C++ class * Convert local scheduler to use TaskExecutionSpec class * Convert some pointers to references * Finish conversion to TaskExecutionSpec class * fix * Fix * Fix memory errors? * Cast flatbuffers GetSize to size_t * Fixes * add more retries in global scheduler unit test * fix linting and cast fbb.GetSize to size_t * Style and doc * Fix linting and simplify from_flatbuf.	2017-12-14 20:47:54 -08:00
Robert Nishihara	2f750e9ba7	Add parentheses around one-line if statement. (#1318 )	2017-12-13 23:48:53 -08:00
Robert Nishihara	c21e189371	Allow scheduling with arbitrary user-defined resource labels. (#1236 ) * Enable scheduling with custom resource labels. * Fix. * Minor fixes and ref counting fix. * Linting * Use .data() instead of .c_str(). * Fix linting. * Fix ResourcesTest.testGPUIDs test by waiting for workers to start up. * Sleep in test so that all tasks are submitted before any completes.	2017-12-01 11:41:40 -08:00
Stephanie Wang	c70430f322	Fix bugs in plasma manager transfer (#1188 ) * Plasma client test for plasma abort * Use ray-project/arrow:abort-objects branch * Set plasma manager connection cursor to -1 when not in use * Handle transfer errors between plasma managers, abort unsealed objects * Add TODO for local scheduler exiting on plasma manager death * Revert "Plasma client test for plasma abort" This reverts commit e00fbd58dc4a632f58383549b19fb9057b305a14. * Upgrade arrow to version with PlasmaClient::Abort * Fix plasma manager test * Fix plasma test * Temporarily use arrow fork for testing * fix and set arrow commit * Fix plasma test * Fix plasma manager test and make write_object_chunk consistent with read_object_chunk * style * upgrade arrow	2017-11-15 22:32:38 -08:00
Stephanie Wang	07f0532b9b	Local scheduler filters out dead clients during reconstruction (#1182 ) * Object table lookup returns vector of DBClientID instead of address strings * Add node IP address to DBClient notification * DB client cache stores entire DB client, convert addresses to std::string * get cached db client returns the client * Expose a call to initialize the redis cache * Local scheduler filters out dead clients during reconstruction * Remove node ip address from dbclient, use aux_address for plasma managers * Get entire db client entry when not found in cache * Fix common tests * Fix address in tests * Push error to driver if driver task did the put * Address Robert's comments and cleanup * Remove unused Redis command * Fix db test	2017-11-10 11:29:24 -08:00
Robert Nishihara	1c6b30b5e2	Move all config constants into single file. (#1192 ) * Initial pass at factoring out C++ configuration into a single file. * Expose config through Python. * Forward declarations. * Fixes with Python extensions * Remove old code. * Consistent naming for constants. * Fixes * Fix linting. * More linting. * Whitespace * rename config -> _config. * Move config inside a class. * update naming convention * Fix linting. * More linting * More linting. * Add in some more constants. * Fix linting	2017-11-08 11:10:38 -08:00
Robert Nishihara	1cdc2fb011	Clean up event loop and callbacks when processes exit. (#1125 ) * Clean up event loop and callbacks when processes exit. * Fix bug.	2017-10-19 17:07:03 -07:00
Robert Nishihara	486cb64e3f	Compile with -Werror and -Wall (#1116 ) * Compile global scheduler with -Werror -Wall. * Compile plasma manager with -Werror -Wall. * Compile local scheduler with -Werror -Wall. * Compile common code with -Werror -Wall. * Signed/unsigned comparisons. * More signed/unsigned fixes. * More signed/unsigned fixes and added extern keyword. * Fix linting. * Don't check strict-aliasing because Python.h doesn't pass.	2017-10-12 21:00:23 -07:00
Robert Nishihara	9f1e385335	Return errno from handle_sigpipe. (#1051 )	2017-10-11 18:36:28 -07:00
Peter Schafhalter	46f6c163dc	Converted ClientConnection to C++ standard library (#1099 )	2017-10-11 11:12:15 -07:00
Robert Nishihara	1488975d1b	Add timing statement to loop that calls redis_get_cached_db_client be… (#1045 ) * Add timing statement to loop that calls redis_get_cached_db_client because it has been slow in the past. * Fix linting. * Refactoring to make manager vectors into std::vector. * Fix linting. * Fixes.	2017-10-02 10:46:21 -07:00
Robert Nishihara	ce278aa06a	Fix valgrind tests. (#1037 ) * Comment out local scheduler valgrind test. * Fix free/delete error. * More free -> delete errors * One more free -> delete and also clean up callback state in plasma manager. * Add set -x to run_valgrind scripts. * Fix valgrind error in CreateLocalSchedulerInfoMessage.	2017-09-30 00:11:09 -07:00
Eric Liang	ba153adc4c	Downgrade severity of most common messages (#1039 ) * downgrade severity of most common messages * update	2017-09-30 00:01:49 -07:00
Peter Schafhalter	10027974b1	Replaced ObjectWaitRequests with unordered map (#990 ) * Replaced ObjectWaitRequests with unordered map * Pass C++ STL object by reference * Formatting changes and typos.	2017-09-28 15:29:26 -07:00
Peter Schafhalter	bb76d4ca0a	PlasmaRequestBuffer data structure updates (#1023 ) * Replaced utstring with std::string * Converted transfer_queue to a list * Converted pending_object_transfers to unordered_map * Fix free/delete bug and small modifications.	2017-09-27 19:50:37 -07:00
Peter Schafhalter	6e9657e696	Replaced utstring with std::string (#1009 )	2017-09-24 22:42:17 -07:00
Peter Schafhalter	241612709e	Data structure updates to plasma manager (#937 ) * Implemented local_available_objects as an unordered set * Implemented fetch_requests as an unordered map * Fixed bug and changed fetch_requests from pointer to object * free(PlasmaManagerState ) -> delete PlasmaManagerState * removed unnecessary newline * Make local_available_objects not a pointer. * Attempt to safely iterate over unordered_map and remove elements.	2017-09-15 20:09:29 -07:00
Peter Schafhalter	8906a920f7	Implemented wait_requests as vector (#943 )	2017-09-08 13:39:54 -07:00
Robert Nishihara	37282330c0	Allow plasma manager to gracefully handle EPROTOTYPE. (#802 ) * Allow plasma manager to gracefully handle EPROTOTYPE. * Fix linting.	2017-08-01 23:33:25 -07:00
Philipp Moritz	c3b39b4d86	Pull Plasma from Apache Arrow and remove Plasma store from Ray. (#692 ) * Rebase Ray on top of Plasma in Apache Arrow * add thirdparty building scripts * use rebased arrow * fix * fix build * fix python visibility * comment out C tests for now * fix multithreading * fix * reduce logging * fix plasma manager multithreading * make sure old and new object IDs can coexist peacefully * more rebasing * update * fixes * fix * install pyarrow * install cython * fix * install newer cmake * fix * rebase on top of latest arrow * getting runtest.py run locally (needed to comment out a test for that to work) * work on plasma tests * more fixes * fix local scheduler tests * fix global scheduler test * more fixes * fix python 3 bytes vs string * fix manager tests valgrind * fix documentation building * fix linting * fix c++ linting * fix linting * add tests back in * Install without sudo. * Set PKG_CONFIG_PATH in build.sh so that Ray can find plasma. * Install pkg-config * Link -lpthread, note that find_package(Threads) doesn't seem to work reliably. * Comment in testGPUIDs in runtest.py. * Set PKG_CONFIG_PATH when building pyarrow. * Pull apache/arrow and not pcmoritz/arrow. * Fix installation in docker image. * adapt to changes of the plasma api * Fix installation of pyarrow module. * Fix linting. * Use correct python executable to build pyarrow.	2017-07-31 21:04:15 -07:00
Robert Nishihara	ad480f8165	Don't reconstruct all objects in every fetch request in local scheduler. (#686 ) * Don't reconstruct all objects in every fetch request in local scheduler. * Separate out fetch timer and reconstruction timer. * Fix bug. * Bug fix. * Fix naming convention for global variables. * Address comments. * Make reconstruct_counter a static variable. * Fix linting. * Redo reconstruct handler using a set of objects to fetch. * Fix linting. * Replace set with vector.	2017-06-23 21:08:02 +00:00
Robert Nishihara	5ebc2f3f2e	Do resource bookkeeping for actor methods. (#682 ) * Dispatch regular and actor tasks when resources become available. * Make actor methods do resource bookkeeping and add test. * Remove unnecessary field. * Fix linting. * Fix actor test. * Maintain set of actors with pending tasks to speed up task dispatch. * Exit early from task dispatch if there are no resources available. * Fix linting. * Fix error. * Fix bug related to iterator invalidation. * When an actor is removed, remove it from the set of actors with pending tasks.	2017-06-21 05:52:45 +00:00
Robert Nishihara	9e4a3e4972	Replace some UT data structures in local scheduler with C++ STL. (#680 ) * Replace a local scheduler ut_array with a std::vector. * Replace vector of sizes in local scheduler with std::pair. * Remove utarray include. * Replace utarray with std::vector for reading local scheduler input messages. * Remove more UT data structures. * Remove UT includes. * Fix linting. * Include stdlib.h to find size_t. * Remove includes of stdbool.h. * Replace std::pair with TaskQueueEntry. * Fix redis tests. * Reinstate tests.	2017-06-19 21:58:42 +00:00
Robert Nishihara	f12db5f0e2	Divide large plasma requests into smaller chunks, and wait longer before reissuing large requests. (#678 ) * Divide large get requests into smaller chunks. * Divide fetches into smaller chunks. * Wait longer in worker and manager before reissuing fetch requests if there are many outstanding fetch requests. * Log warning if a handler in the local scheduler or plasma manager takes more than one second.	2017-06-18 04:42:15 +00:00
Robert Nishihara	96962cdee0	Log fatal error if plasma manager or local scheduler heartbeats take too long. (#676 ) * Log fatal error if plasma manager or local scheduler take too long to send heartbeat. * Fix linting. * Use int64_t for milliseconds since unix epoch.	2017-06-16 19:11:01 +00:00
Robert Nishihara	1916475e14	Increase socket listen backlog from 5 to 128. (#661 )	2017-06-11 06:34:16 +00:00
Philipp Moritz	b94b4a35e0	Make the Plasma store ready for Arrow integration (#579 ) * port plasma to arrow * fixes * refactor plasma client * more modernization * fix plasma manager tests * everything compiles * fix plasma client tests * update plasma serialization tests * fix plasma manager tests * fix bug * updates * fix bug * fix tests * fix rebase * address comments * fix travis valgrind build * fix linting * fix include order again * fix linting * address comments	2017-05-31 16:24:23 -07:00
Stephanie Wang	ee08c8274b	Shard Redis. (#539 ) * Implement sharding in the Ray core * Single node Python modifications to do sharding * Do the sharding in redis.cc * Pipe num_redis_shards through start_ray.py and worker.py. * Use multiple redis shards in multinode tests. * first steps for sharding ray.global_state * Fix problem in multinode docker test. * fix runtest.py * fix some tests * fix redis shard startup * fix redis sharding * fix * fix bug introduced by the map-iterator being consumed * fix sharding bug * shard event table * update number of Redis clients to be 64K * Fix object table tests by flushing shards in between unit tests * Fix local scheduler tests * Documentation * Register shard locations in the primary shard * Add plasma unit tests back to build * lint * lint and fix build * Fix * Address Robert's comments * Refactor start_ray_processes to start Redis shard * lint * Fix global scheduler python tests * Fix redis module test * Fix plasma test * Fix component failure test * Fix local scheduler test * Fix runtest.py * Fix global scheduler test for python3 * Fix task_table_test_and_update bug, from actor task table submission race * Fix jenkins tests. * Retry Redis shard connections * Fix test cases * Convert database clients to DBClient struct * Fix race condition when subscribing to db client table * Remove unused lines, add APITest for sharded Ray * Fix * Fix memory leak * Suppress ReconstructionTests output * Suppress output for APITestSharded * Reissue task table add/update commands if initial command does not publish to any subscribers. * fix * Fix linting. * fix tests * fix linting * fix python test * fix linting	2017-05-18 17:40:41 -07:00
Stephanie Wang	e50a23b820	Fix bug with reused file descriptors (#471 ) * Fix bug with reused file descriptors * Remove client connection if write_object_chunk fails * Handle ECONNRESET on unsuccessful write * lint * Back to lowercase * fix compilation * fix linting	2017-05-02 19:45:27 -07:00
Alexey Tumanov	6f9225490b	Plasma manager performance: speed up wait with a wait request object map (#427 ) * plasma manager perf: speedup wait with a wait request object map * removing duplicate == operator in plasma store * fix serialization test * code cleanup * minor cleanup * factoring out uniqueid hash and equality operators into common * plasma manager: c++ify the WaitRequest struct * plasma manager: get rid of the initial object request malloc * cleanup * linting * cleanups and fix compiler warnings * compiler warnings and linting	2017-04-07 12:32:12 -07:00
Stephanie Wang	93679df724	Stopped nodes can rejoin immediately (#428 ) * Ignore deleted clients when reading address info from Redis * Remove self from db_client table when exiting cleanly * Fix valgrind test * Do not call plasma_perform_release when disconnecting	2017-04-05 23:50:38 -07:00
Stephanie Wang	083e7a28ad	Push an error to the driver when the workload hangs on `ray.put` reconstruction (#382 ) * Fix worker blocked bug * tmp * Push an error to the driver on ray.put for non-driver tasks * Fix result table tests * Fix test, logging * Address comments * Fix suppression bug * Fix redis module test * Edit error message * Get values in chunks during reconstruction * Test case for driver ray.put errors * Error for evicting ray.put objects from the driver * Fix tests * Reduce verbosity * Documentation	2017-03-21 00:16:48 -07:00
Stephanie Wang	12c9618c0c	Plasma and worker node failure. (#373 ) * Failing test case * Local scheduler exits cleanly after plasma store dies * Tolerate one plasma store failure * Tolerate plasma store failures on all nodes except head node * Plasma manager heartbeats * Component failure tests * Don't run the helper for Python testing * Fix C test * Fix hanging plasma transfer test * Fix python3 * Consolidate ClientConnection code * Fix valgrind test * fix c test * We can restart worker nodes! * Fix flatbuffers bug * Address comments * Only register actual workers with the local scheduler * Fix bug * Fix segfaults * Add test case that tests for driver liveness, fix local scheduler bug * Clean up after tests * Allocate retry info on the stack * Send SIGKILL before waiting * Relax unit test conditions * Driver liveness test case and documentation	2017-03-17 17:03:58 -07:00
Philipp Moritz	068429ffd8	Convert local scheduler messages to flatbuffers (#340 ) * use flatbuffer messages for local scheduler * make sure constructor gets called for C++ object ObjectInfoT * fix typo * fix Robert's comments * Small change to actor test. * fix valgrind error * linting * free notification * fix * valgrind * fix valgrind * fix other bugs * valgrind fix * fixes * more fixes * Small changes to comments.	2017-03-15 16:27:52 -07:00
Stephanie Wang	da06b4db82	Warn the user when a nondeterministic task is detected. (#339 ) * WARN instead of FATAL for object hash mismatches, push error to driver * Document the callback signature for object_table_add/remove * Error table * Wait for all errors in python test * Fix doc * Fix state test	2017-03-07 00:32:15 -08:00
Robert Nishihara	65a8659f3d	Some plasma manager transfer optimizations. (#334 ) * Change transfer queue to doubly-linked list to speed up append. * Maintain set of pending transfers to make deduplication easy. * Fix naming convention for structs in plasma manager.	2017-03-04 23:15:17 -08:00
Philipp Moritz	793a102846	Make Ray code C++ compatible (#321 ) * convert Ray to C++ * const correctness	2017-03-01 01:17:24 -08:00

48 commits