hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-12 06:06:39 -04:00

Author	SHA1	Message	Date
Philipp Moritz	4618fd45b1	Port Ray to latest Arrow version (#370 ) * rebase on top of latest arrow * clang-format * address comments * fix	2017-03-20 16:31:46 -07:00
Stephanie Wang	12c9618c0c	Plasma and worker node failure. (#373 ) * Failing test case * Local scheduler exits cleanly after plasma store dies * Tolerate one plasma store failure * Tolerate plasma store failures on all nodes except head node * Plasma manager heartbeats * Component failure tests * Don't run the helper for Python testing * Fix C test * Fix hanging plasma transfer test * Fix python3 * Consolidate ClientConnection code * Fix valgrind test * fix c test * We can restart worker nodes! * Fix flatbuffers bug * Address comments * Only register actual workers with the local scheduler * Fix bug * Fix segfaults * Add test case that tests for driver liveness, fix local scheduler bug * Clean up after tests * Allocate retry info on the stack * Send SIGKILL before waiting * Relax unit test conditions * Driver liveness test case and documentation	2017-03-17 17:03:58 -07:00
Philipp Moritz	068429ffd8	Convert local scheduler messages to flatbuffers (#340 ) * use flatbuffer messages for local scheduler * make sure constructor gets called for C++ object ObjectInfoT * fix typo * fix Robert's comments * Small change to actor test. * fix valgrind error * linting * free notification * fix * valgrind * fix valgrind * fix other bugs * valgrind fix * fixes * more fixes * Small changes to comments.	2017-03-15 16:27:52 -07:00
Robert Nishihara	53dffe0bf2	Use flatbuffers for some messages from Redis. (#341 ) * Compile the Ray redis module with C++. * Redo parsing of object table notifications with flatbuffers. * Update redis module python tests. * Redo parsing of task table notifications with flatbuffers. * Fix linting. * Redo parsing of db client notifications with flatbuffers. * Redo publishing of local scheduler heartbeats with flatbuffers. * Fix linting. * Remove usage of fixed-width formatting of scheduling state in channel name. * Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion. * Fix linting and tests. * fix * cleanup * simplify logic in ReplyWithTask	2017-03-10 18:35:25 -08:00
Philipp Moritz	0de57be085	upgrade flatbuffers to 1.6.0 (#345 )	2017-03-07 21:33:46 -08:00
Stephanie Wang	da06b4db82	Warn the user when a nondeterministic task is detected. (#339 ) * WARN instead of FATAL for object hash mismatches, push error to driver * Document the callback signature for object_table_add/remove * Error table * Wait for all errors in python test * Fix doc * Fix state test	2017-03-07 00:32:15 -08:00
Philipp Moritz	0b8d279ef2	Convert task_spec to flatbuffers (#255 ) * convert Ray to C++ * convert task_spec to flatbuffers * fix * it compiles * latest * tests are passing * task2 -> task * fix * fix * fix * fix * fix * linting * fix valgrind * upgrade flatbuffers * use debug mode for valgrind * fix naming and comments * downgrade flatbuffers * fix linting * reintroduce TaskSpec_free * rename TaskSpec -> TaskInfo * refactoring * linting	2017-03-05 02:05:02 -08:00
Stephanie Wang	41b8675d04	Availability after local scheduler failure (#329 ) * Clean up plasma subscribers on EPIPE First pass at a monitoring script - monitor can detect local scheduler death Clean up task table upon local scheduler death in monitoring script Don't schedule to dead local schedulers in global scheduler Have global scheduler update the db clients table, monitor script cleans up state Documentation Monitor script should scan tables before beginning to read from subscription channel Fix for python3 Redirect monitor output to redis logs, fix hanging in multinode tests * Publish auxiliary addresses as part of db_client deletion notifications * Fix test case? * Small changes. * Use SCAN instead of KEYS * Address comments * Address more comments * Free redis module strings	2017-03-02 19:51:20 -08:00
Philipp Moritz	793a102846	Make Ray code C++ compatible (#321 ) * convert Ray to C++ * const correctness	2017-03-01 01:17:24 -08:00
Alexey Tumanov	b91d9cba45	Adding flatbuffers and migrating flatcc to flatbuffers for plasma (#325 ) * adding flatbuffers and migrating flatcc to flatbuffers for plasma * variable name changes in plasma_protocol and plasma flatbuffers schema * quick fix * cleanups and remove flatcc * more cleanup * add doc * linting * fix linting * fix mac os x build * linting * cleanup * c++ fix for plasma flatbuffers * Remove flatcc from CMakeLists.txt. * linting; trigger travis	2017-02-28 18:47:40 -08:00
Robert Nishihara	1a997ed279	Move documentation to ReadTheDocs. (#326 )	2017-02-27 21:14:31 -08:00
Robert Nishihara	1ae7e7d29e	Rename photon -> local scheduler. (#322 )	2017-02-27 12:24:07 -08:00
Philipp Moritz	a30eed452e	Change type naming convention. (#315 ) * Rename object_id -> ObjectID. * Rename ray_logger -> RayLogger. * rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID * Rename plasma_store_info -> PlasmaStoreInfo. * Rename plasma_store_state -> PlasmaStoreState. * Rename plasma_object -> PlasmaObject. * Rename object_request -> ObjectRequests. * Rename eviction_state -> EvictionState. * Bug fix. * rename db_handle -> DBHandle * Rename local_scheduler_state -> LocalSchedulerState. * rename db_client_id -> DBClientID * rename task -> Task * make redis.c C++ compatible * Rename scheduling_algorithm_state -> SchedulingAlgorithmState. * Rename plasma_connection -> PlasmaConnection. * Rename client_connection -> ClientConnection. * Fixes from rebase. * Rename local_scheduler_client -> LocalSchedulerClient. * Rename object_buffer -> ObjectBuffer. * Rename client -> Client. * Rename notification_queue -> NotificationQueue. * Rename object_get_requests -> ObjectGetRequests. * Rename get_request -> GetRequest. * Rename object_info -> ObjectInfo. * Rename scheduler_object_info -> SchedulerObjectInfo. * Rename local_scheduler -> LocalScheduler and some fixes. * Rename local_scheduler_info -> LocalSchedulerInfo. * Rename global_scheduler_state -> GlobalSchedulerState. * Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState. * Rename object_size_entry -> ObjectSizeEntry. * Rename aux_address_entry -> AuxAddressEntry. * Rename various ID helper methods. * Rename Task helper methods. * Rename db_client_cache_entry -> DBClientCacheEntry. * Rename local_actor_info -> LocalActorInfo. * Rename actor_info -> ActorInfo. * Rename retry_info -> RetryInfo. * Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData. * Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData. * Rename table_callback_data -> TableCallbackData. * Rename object_info_subscribe_data -> ObjectInfoSubscribeData. * Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData. * Rename more redis call data structures. * Rename photon_conn PhotonConnection. * Rename photon_mock -> PhotonMock. * Fix formatting errors.	2017-02-26 00:32:43 -08:00
Stephanie Wang	be1618f041	Availability after worker failure (#316 ) * Availability after a killed worker * Workers exit cleanly * Memory cleanup in photon C tests * Worker failure in multinode * Consolidate worker cleanup handlers * Update the result table before handling a task submission * KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS * Log a warning instead of crashing if no result table entry found	2017-02-25 20:19:36 -08:00
Robert Nishihara	232601f90d	Change all table calls to use default retry behavior. (#312 ) * Change all table calls to use default retry behavior and change default retry behavior. * Add warning for table retries.	2017-02-24 12:41:32 -08:00
Robert Nishihara	7f5be96683	Remove object table tests that are failing. (#310 )	2017-02-23 13:39:59 -08:00
Robert Nishihara	3e67d28922	Address numbuf compiler warnings. (#300 )	2017-02-20 22:42:03 -08:00
Stephanie Wang	334aed9fa9	Fetch the object after requesting reconstruction during ray.get (#301 ) * Fetch the object after requesting reconstruction during ray.get * revert * Fix documentation and memory leak * Fix hanging reconstruction bug * Fix for python3	2017-02-20 21:41:34 -08:00
Stephanie Wang	67c591c33b	Retry connections in photon connect, consolidate code in io.c (#294 )	2017-02-17 23:41:21 -08:00
Philipp Moritz	12a68e84d2	Implement a first pass at actors in the API. (#242 ) * Implement actor field for tasks * Implement actor management in local scheduler. * initial python frontend for actors * import actors on worker * IPython code completion and tests * prepare creating actors through local schedulers * add actor id to PyTask * submit actor calls to local scheduler * starting to integrate * simple fix * Fixes from rebasing. * more work on python actors * Improve local scheduler actor handlers. * Pass actor ID to local scheduler when connecting a client. * first working version of actors * fixing actors * fix creating two copies of the same actor * fix actors * remove sleep * get rid of export synchronization * update * insert actor methods into the queue in the right order * remove print statements * make it compile again after rebase * Minor updates. * fix python actor ids * Pass actor_id to start_worker. * add test * Minor changes. * Update actor tests. * Temporary plan for import counter. * Temporarily fix import counters. * Fix some tests. * Fixes. * Make actor creation non-blocking. * Fix test? * Fix actors on Python 2. * fix rare case. * Fix python 2 test. * More tests. * Small fixes. * Linting. * Revert tensorflow version to 0.12.0 temporarily. * Small fix. * Enhance inheritance test.	2017-02-15 00:10:05 -08:00
Alexey Tumanov	dfb6107b22	General attribute-based heterogeneity support with hard and soft constraints (#248 ) * attribute-based heterogeneity-awareness in global scheduler and photon * minor post-rebase fix * photon: enforce dynamic capacity constraint on task dispatch * globalsched: cap the number of times we try to schedule a task in round robin * propagating ability to specify resource capacity to ray.init * adding resources to remote function export and fetch/register * globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat) * Add some integration tests. * globalsched: cleanup + factor out constraint checking * lots of style * task_spec_required_resource: global refactor * clang format * clang format + comment update in photon * clang format photon comment * valgrind * reduce verbosity for Travis * Add test for scheduler load balancing. * addressing comments * refactoring global scheduler algorithm * Minor cleanups. * Linting. * Fix array_test.py and linting. * valgrind fix for photon tests * Attempt to fix stress tests. * fix hashmap free * fix hashmap free comment * memset photon resource vectors to 0 in case they get used before the first heartbeat * More whitespace changes. * Undo whitespace error I introduced.	2017-02-09 01:34:14 -08:00
Philipp Moritz	fefc7d9b49	fix segfault in photon.Task (#253 )	2017-02-07 11:17:11 -08:00
Robert Nishihara	2d1c980ad7	Refactor local scheduler to remove worker indices. (#245 ) * Refactor local scheduler to remove worker indices. * Change scheduling state enum to int in all function signatures. * Bug fix, don't use pointers into a resizable array. * Remove total_num_workers. * Fix tests.	2017-02-05 14:52:28 -08:00
Stephanie Wang	241b539ff8	Reconstruction for evicted objects (#181 ) * First pass at reconstruction in the worker Modify reconstruction stress testing to start Plasma service before rest of Ray cluster TODO about reconstructing ray.puts Fix ray.put error for double creates Distinguish between empty entry and no entry in object table Fix test case Fix Python test Fix tests * Only call reconstruct on objects we have not yet received * Address review comments * Fix reconstruction for Python3 * remove unused code * Address Robert's comments, stress tests are crashing * Test and update the task's scheduling state to suppress duplicate reconstruction requests. * Split result table into two lookups, one for task ID and the other as a test-and-set for the task state * Fix object table tests * Fix redis module result_table_lookup test case * Multinode reconstruction tests * Fix python3 test case * rename * Use new start_redis * Remove unused code * lint * indent * Address Robert's comments * Use start_redis from ray.services in state table tests * Remove unnecessary memset	2017-02-01 19:18:46 -08:00
Robert Nishihara	ab8c3432f7	Add driver ID to task spec and add driver ID to Python error handling. (#225 ) * Add driver ID to task spec and add driver ID to Python error handling. * Make constants global variables. * Add test for error isolation.	2017-01-25 22:53:48 -08:00
Stephanie Wang	f1987cdc16	Split local scheduler task queue (#211 ) * Split local scheduler task queue into waiting and dispatch queue * Fix memory leak * Add a new task scheduling status for when a task has been queued locally * Fix global scheduler test case and add task status doc * Documentation * Address Philipp's comments * Move tasks back to the waiting queue if their dependencies become unavailable * Update existing task table entries instead of overwriting	2017-01-18 20:27:40 -08:00
Philipp Moritz	a708e36225	Switch build system to use CMake completely. (#200 ) * switch to CMake completely ... * cleanup * Run C tests, update installation instructions.	2017-01-17 16:56:40 -08:00
Robert Nishihara	973716d310	Use cloudpickle 0.2.2. (#189 )	2017-01-08 17:30:06 -08:00
Alexey Tumanov	674ec3a3cb	generate pytask from string and string from pytask (#188 ) * pytask creation from bytestring: saving work * pytask now works * documentation and tests * linting * Lint and fix test case	2017-01-08 02:16:40 -08:00
Robert Nishihara	651aa6007a	Log profiling information from worker. (#178 ) * Log timing events on workers. * Have workers log to the event log through the local scheduler. * Fixes and address comments. * bug fix * styling	2017-01-05 16:47:16 -08:00
Robert Nishihara	acf1703afd	Implement naive scheduling algorithm using local scheduler load. (#164 ) * Implement naive scheduling algorithm using local scheduler load. * Have the global scheduler estimate load on local schedulers better. * Fixes.	2016-12-28 22:33:20 -08:00
Robert Nishihara	10e067e5e5	Delay releasing a maximum number of bytes in the plasma client. (#160 ) * Send message from plasma client to get plasma store capacity. * Release objects from plasma client if they are too large. * Use doubly-linked list instead of ring buffer for plasma client release history. * Address comments. * Fix problem with slicing PlasmaBuffer objects. * Fix crash in plasma manager during transfer. * Formatting. * Make plasma client cache larger and make caching test not throw exceptions on Travis.	2016-12-27 19:51:26 -08:00
Robert Nishihara	985c424172	Use redismodules for task table and result table. (#156 ) * Switch to using redis modules for task table. * Switch to using redis modules for the task table. * Fix some tests. * Fix naming and remove code duplication. * Remove duplication in redis modules and add more cleanups. * Address comments.	2016-12-25 23:57:05 -08:00
Philipp Moritz	8309e3f355	Redis string formatting (#157 ) * redis string formatting * fixes * add documentation * fixes	2016-12-25 22:43:07 -08:00
Robert Nishihara	3d697c7ed2	Introduce local scheduler heartbeats which carry load information. (#155 ) * Introduce local scheduler heartbeats which carry load information.	2016-12-24 20:02:25 -08:00
Alexey Tumanov	46a887039e	Global scheduler - per-task transfer-aware policy (#145 ) * global scheduler with object transfer cost awareness -- upstream rebase * debugging global scheduler: multiple subscriptions * global scheduler: utarray push bug fix; tasks change state to SCHEDULED * change global scheduler test to be an integraton test * unit and integration tests are passing for global scheduler * improve global scheduler test: break up into several * global scheduler checkpoint: fix photon object id bug in test * test with timesync between object and task notifications; TODO: handle OoO object+task notifications in GS * fallback to base policy if no object dependencies are cached (may happen due to OoO object+task notification arrivals * clean up printfs; handle a missing LS in LS cache * Minor changes to Python test and factor out some common code. * refactoring handle task waiting * addressing comments * log_info -> log_debug * Change object ID printing. * PRId64 merge * Python 3 fix. * PRId64. * Python 3 fix. * resurrect differentiation between no args and missing object info; spacing * Valgrind fix. * Run all global scheduler tests in valgrind. * clang format * Comments and documentation changes. * Minor cleanups. * fix whitespace * Fix. * Documentation fix.	2016-12-22 03:11:46 -08:00
Robert Nishihara	6cd02d71f8	Fixes and cleanups for the multinode setting. (#143 ) * Add function for driver to get address info from Redis. * Use Redis address instead of Redis port. * Configure Redis to run in unprotected mode. * Add method for starting Ray processes on non-head node. * Pass in correct node ip address to start_plasma_manager. * Script for starting Ray processes. * Handle the case where an object already exists in the store. Maybe this should also compare the object hashes. * Have driver get info from Redis when start_ray_local=False. * Fix. * Script for killing ray processes. * Catch some errors when the main_loop in a worker throws an exception. * Allow redirecting stdout and stderr to /dev/null. * Wrap start_ray.py in a shell script. * More helpful error messages. * Fixes. * Wait for redis server to start up before configuring it. * Allow seeding of deterministic object ID generation. * Small change.	2016-12-21 18:53:12 -08:00
Robert Nishihara	c9c1b3e6af	Change db_connect to allow different arguments from different processes. (#142 ) * Allow db_connect to take a variable number of arguments. * Fix tests. * Fixes. * Formatting. * Fixes. * Simplifications. * Fix typo.	2016-12-20 20:21:35 -08:00
Philipp Moritz	0ca0864856	Use flatcc for serialization of IPC messages. (#140 ) * added Phllipp's updates * Switch to using flatbuffers for IPC. * Various changes. * convert remaining messages and cleanups * fix * fix function signatures * fix valgrind errors * clang-format * final commit * Fix valgrind test.	2016-12-20 14:46:25 -08:00
Stephanie Wang	d729f9b7ea	Object table remove (#139 ) * Object table remove redis module * Test case for object table remove redis module * Client code for object_table_remove * Delete object notifications in plasma * Test for object deletion notifications * Fix subscribe deletion test * Address Robert's comments * free hash table entry	2016-12-19 23:18:57 -08:00
Alexey Tumanov	cb3e6cde9e	passing object info information with redis module (#138 ) * adding object broadcast channel; published on each object table add * publishing data size to the bcast channel * bug fix: objectkey * update object tests to test for data size: C + py * remove debug * clang format * Minor changes. * Fix error. * merging with Robert's comments * clang format for the object table test upgrade	2016-12-19 21:07:25 -08:00
Robert Nishihara	269f37e26f	Implement object table notification subscriptions and switch to using Redis modules for object table. (#134 ) * Implement RAY.OBJECT_TABLE_REQUEST_NOTIFICATIONS. * Call object_table_request_notifications from plasma manager. * Use Redis modules for object table. * Cleaning up code. * More checks. * Formatting. * Make object table tests pass. * Formatting. * Add prefix to the object notification channel name. * Formatting. * Fixes. * Increase time in redismodule test.	2016-12-18 18:19:02 -08:00
Robert Nishihara	c89bf4e5bc	Fix improper handling of NULL characters when opening Redis keys. (#136 ) * Fix improper handling of NULL characters when opening Redis keys. * Add test.	2016-12-18 13:06:28 -08:00
Robert Nishihara	edf8d1ee9f	Fix Python3 error in tests. (#135 )	2016-12-17 12:42:37 -08:00
Stephanie Wang	e23661c375	Task table Redis module (#125 ) * Task table redis module implementation * Publish tasks and take in individual fields as args, not task object * Scheduling state integer has width 1, error on illegal put * Unit tests for task table and more documentation * Task table subscribe, fix publish topics and address Philipp and Alexey's comments * Helper function to create prefixed strings * Factor out the table prefixes in the test cases	2016-12-16 14:40:44 -08:00
Robert Nishihara	58a873eb20	Deploy Redis module and start using custom Redis commands. (#128 ) * Add RAY.CONNECT Redis command. * Add RAY.GET_CLIENT_ADDRESS command. * Build and clean Redis in common Makefile. * Use custom Redis module in Ray and use custom CONNECT and GET_CLIENT_ADDRESS commands. * Fixes. * Remove mapping from redis client ID to ray db client ID. * Fix.	2016-12-16 14:40:44 -08:00
Stephanie Wang	b0ba54e4c0	Fix psubscribe bug in object_table_subscribe (#126 ) * Fix psubscribe * Add TODO about subscription callbacks	2016-12-16 14:40:44 -08:00
Robert Nishihara	79dd1815a2	Python 3 compatibility. (#121 ) * Make common module Python 3 compatible. * Make plasma module Python 3 compatible. * Make photon module Python 3 compatible. * Make numbuf module Python 3 compatible. * Remaining changes for Python 3 compatibility. * Test Python 3 in Travis. * Fixes.	2016-12-16 14:40:37 -08:00
Alexey Tumanov	946242929f	Plasma photon association: passing through plasma address with photon db connection (#123 ) * passing plasma ip:port association with photon through redis to global scheduler * Fix test. * sanity-checking aux_address inside db_connect_extended * clang format * fix photon tests * clang format photon tests	2016-12-13 17:21:38 -08:00
Robert Nishihara	bce7e0fc07	Add include for usleep. (#124 )	2016-12-13 14:24:59 -08:00

1 2 3 4

193 commits