hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-12 06:06:39 -04:00

Author	SHA1	Message	Date
Robert Nishihara	cb84972f6b	Recreate actors when local schedulers die. (#804 ) * Reconstruct actor state when local schedulers fail. * Simplify construction of arguments to pass into default_worker.py from local scheduler. * Remove deprecated ray.actor. * Simplify actor reconstruction method. * Fix linting. * Small fixes.	2017-08-02 18:02:52 -07:00
Robert Nishihara	8c8258de20	Move worker methods into Worker class and expose more TaskSpec fields to Python. (#796 ) * Move worker methods inside worker class. Move some helper methods from actor.py into utils.py and state.py. * Add more methods exposing task spec fields to Python. * Fix linting. * Fix error. * Remove unused code in default worker.	2017-08-01 17:16:57 -07:00
Philipp Moritz	c3b39b4d86	Pull Plasma from Apache Arrow and remove Plasma store from Ray. (#692 ) * Rebase Ray on top of Plasma in Apache Arrow * add thirdparty building scripts * use rebased arrow * fix * fix build * fix python visibility * comment out C tests for now * fix multithreading * fix * reduce logging * fix plasma manager multithreading * make sure old and new object IDs can coexist peacefully * more rebasing * update * fixes * fix * install pyarrow * install cython * fix * install newer cmake * fix * rebase on top of latest arrow * getting runtest.py run locally (needed to comment out a test for that to work) * work on plasma tests * more fixes * fix local scheduler tests * fix global scheduler test * more fixes * fix python 3 bytes vs string * fix manager tests valgrind * fix documentation building * fix linting * fix c++ linting * fix linting * add tests back in * Install without sudo. * Set PKG_CONFIG_PATH in build.sh so that Ray can find plasma. * Install pkg-config * Link -lpthread, note that find_package(Threads) doesn't seem to work reliably. * Comment in testGPUIDs in runtest.py. * Set PKG_CONFIG_PATH when building pyarrow. * Pull apache/arrow and not pcmoritz/arrow. * Fix installation in docker image. * adapt to changes of the plasma api * Fix installation of pyarrow module. * Fix linting. * Use correct python executable to build pyarrow.	2017-07-31 21:04:15 -07:00
Robert Nishihara	8ad9ced99b	Fix task ID hash computation. (#774 )	2017-07-26 10:08:38 -07:00
alanamarzoev	8464d77c76	Change event logs to store one Redis ZSET per worker. (#705 ) * Changing to zset * Fixed bug. * Fixed another bug. * Modified task_profiles. * Removed extra file. * Modified task_profiles test. * WIP * WIP * Undid changes * Updated * WIP * Made changes according to comments. * Removed unneeded print. * Removed ujson usage. * failing test * tests passing * Fixed linting errors and modified style. * Fixed bug. * Fixed linting * Fixed according to comments. * Redis crashing? * Fixed linting * Fixed linting	2017-07-09 01:42:29 +02:00
Robert Nishihara	9e4a3e4972	Replace some UT data structures in local scheduler with C++ STL. (#680 ) * Replace a local scheduler ut_array with a std::vector. * Replace vector of sizes in local scheduler with std::pair. * Remove utarray include. * Replace utarray with std::vector for reading local scheduler input messages. * Remove more UT data structures. * Remove UT includes. * Fix linting. * Include stdlib.h to find size_t. * Remove includes of stdbool.h. * Replace std::pair with TaskQueueEntry. * Fix redis tests. * Reinstate tests.	2017-06-19 21:58:42 +00:00
alanamarzoev	4d5ac9dad5	Include object size and hash in the table returned by the object_table function in the GlobalStateAPI. (#665 ) * added log_table function and a test * fixed log_files and added task_profiles * fixed formatting * fixed linting errors * fixes * removed file * more fixes * hopefully fixed * Small changes. * Fix linting. * Fix bug in log monitor. * Small changes. * Fix bug in travis. * Including data_size and hash in the ResultTableReply. * Included data_size and hash info in object_table. * Fixed bugs in ray_redis_module.cc. * Removing commented out code. * Fixes * Freed hash and data_size strings after using, and checked if they're null along with task_id and is_put. * Changed it so that data_size is set correctly. * Removed iostream import. * Included a check to ensure that the Redis string to long long conversion was successful. * Included separate data_size and hash null checks. * Fixed bug. * Made linting changes. * Another linting error. * Slight simplication.	2017-06-16 23:17:11 -07:00
Robert Nishihara	96962cdee0	Log fatal error if plasma manager or local scheduler heartbeats take too long. (#676 ) * Log fatal error if plasma manager or local scheduler take too long to send heartbeat. * Fix linting. * Use int64_t for milliseconds since unix epoch.	2017-06-16 19:11:01 +00:00
Robert Nishihara	1916475e14	Increase socket listen backlog from 5 to 128. (#661 )	2017-06-11 06:34:16 +00:00
Eric Liang	d4d2c03ac5	Remove timeout for Redis commands. (#649 ) * update * Remove interaction between callback data identifier and event loop. * Remove tests that no longer apply.	2017-06-09 15:55:36 -07:00
Richard Shin	16050eca8d	Don't link Python extensions to libpython*.so (#598 )	2017-05-25 19:01:12 -07:00
Philipp Moritz	3885d1b286	make builds with CMake incremental (#592 )	2017-05-24 21:52:33 -07:00
Stephanie Wang	ee08c8274b	Shard Redis. (#539 ) * Implement sharding in the Ray core * Single node Python modifications to do sharding * Do the sharding in redis.cc * Pipe num_redis_shards through start_ray.py and worker.py. * Use multiple redis shards in multinode tests. * first steps for sharding ray.global_state * Fix problem in multinode docker test. * fix runtest.py * fix some tests * fix redis shard startup * fix redis sharding * fix * fix bug introduced by the map-iterator being consumed * fix sharding bug * shard event table * update number of Redis clients to be 64K * Fix object table tests by flushing shards in between unit tests * Fix local scheduler tests * Documentation * Register shard locations in the primary shard * Add plasma unit tests back to build * lint * lint and fix build * Fix * Address Robert's comments * Refactor start_ray_processes to start Redis shard * lint * Fix global scheduler python tests * Fix redis module test * Fix plasma test * Fix component failure test * Fix local scheduler test * Fix runtest.py * Fix global scheduler test for python3 * Fix task_table_test_and_update bug, from actor task table submission race * Fix jenkins tests. * Retry Redis shard connections * Fix test cases * Convert database clients to DBClient struct * Fix race condition when subscribing to db client table * Remove unused lines, add APITest for sharded Ray * Fix * Fix memory leak * Suppress ReconstructionTests output * Suppress output for APITestSharded * Reissue task table add/update commands if initial command does not publish to any subscribers. * fix * Fix linting. * fix tests * fix linting * fix python test * fix linting	2017-05-18 17:40:41 -07:00
Philipp Moritz	08e988aee5	Modernize plasma store (C to C++ changes). (#546 )	2017-05-15 01:19:44 -07:00
Philipp Moritz	1dddd5336a	Fix actor bug arising from overwriting task specifications in the local scheduler (#513 ) * copy task specifications put into the actor task cache so it won't get overwritten when the scheduler receives the next task * cleanup * cleanup and fix * linting * fix jenkins test * fix linting	2017-05-06 17:39:35 -07:00
Stephanie Wang	e50a23b820	Fix bug with reused file descriptors (#471 ) * Fix bug with reused file descriptors * Remove client connection if write_object_chunk fails * Handle ECONNRESET on unsuccessful write * lint * Back to lowercase * fix compilation * fix linting	2017-05-02 19:45:27 -07:00
Robert Nishihara	0ac125e9b2	Clean up when a driver disconnects. (#462 ) * Clean up state when drivers exit. * Remove unnecessary field in ActorMapEntry struct. * Have monitor release GPU resources in Redis when driver exits. * Enable multiple drivers in multi-node tests and test driver cleanup. * Make redis GPU allocation a redis transaction and small cleanups. * Fix multi-node test. * Small cleanups. * Make global scheduler take node_ip_address so it appears in the right place in the client table. * Cleanups. * Fix linting and cleanups in local scheduler. * Fix removed_driver_test. * Fix bug related to vector -> list. * Fix linting. * Cleanup. * Fix multi node tests. * Fix jenkins tests. * Add another multi node test with many drivers. * Fix linting. * Make the actor creation notification a flatbuffer message. * Revert "Make the actor creation notification a flatbuffer message." This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39. * Add comment explaining flatbuffer problems.	2017-04-24 18:10:21 -07:00
Alexey Tumanov	a67a107e0e	Fix int-type compilation problem on redhat. (#472 )	2017-04-19 02:43:33 -07:00
Richard Shin	cf68cf743c	Change UniqueID hash function to look at the lowest instead of highest bytes. (#469 )	2017-04-18 15:31:49 -07:00
Robert Nishihara	05fd4c2c37	Changes to local scheduler client protocol. (#435 ) * Make local scheduler clients receive reply upon registration. * Fix tests and linting.	2017-04-07 23:03:37 -07:00
Alexey Tumanov	6f9225490b	Plasma manager performance: speed up wait with a wait request object map (#427 ) * plasma manager perf: speedup wait with a wait request object map * removing duplicate == operator in plasma store * fix serialization test * code cleanup * minor cleanup * factoring out uniqueid hash and equality operators into common * plasma manager: c++ify the WaitRequest struct * plasma manager: get rid of the initial object request malloc * cleanup * linting * cleanups and fix compiler warnings * compiler warnings and linting	2017-04-07 12:32:12 -07:00
Robert Nishihara	fa363a5a3a	Notify driver when a worker dies while executing a task. (#419 ) * Notify driver when a worker dies while executing a task. * Fix linting. * Don't push error when local scheduler is cleaning up.	2017-04-06 00:02:39 -07:00
Stephanie Wang	93679df724	Stopped nodes can rejoin immediately (#428 ) * Ignore deleted clients when reading address info from Redis * Remove self from db_client table when exiting cleanly * Fix valgrind test * Do not call plasma_perform_release when disconnecting	2017-04-05 23:50:38 -07:00
Robert Nishihara	f1b48f2fd4	Avoid publishing in the task table unnecessarily. (#416 )	2017-03-30 13:41:32 -07:00
Stephanie Wang	036b873bf2	Implement local scheduler task queues using C++ data structures (#392 ) * Switch to using C++ lists for task queues * Init and free methods for TaskQueueEntry * Switch from utarray to c++ vector for TaskQueueEntry * Get rid of some pointers * Back to O(1) deletion from waiting_task_queue * Fix comments * Cut code * Non const iterators * Fix Alexey's comments	2017-03-30 00:40:01 -07:00
Alexey Tumanov	78e1167a42	Parallelize make in build.sh. (#371 ) * parallelize build.sh make * Encode in cmake the dependency of ray_redis_module on autogenerated flatbuffer files.	2017-03-27 20:55:50 -07:00
Stephanie Wang	083e7a28ad	Push an error to the driver when the workload hangs on `ray.put` reconstruction (#382 ) * Fix worker blocked bug * tmp * Push an error to the driver on ray.put for non-driver tasks * Fix result table tests * Fix test, logging * Address comments * Fix suppression bug * Fix redis module test * Edit error message * Get values in chunks during reconstruction * Test case for driver ray.put errors * Error for evicting ray.put objects from the driver * Fix tests * Reduce verbosity * Documentation	2017-03-21 00:16:48 -07:00
Philipp Moritz	4618fd45b1	Port Ray to latest Arrow version (#370 ) * rebase on top of latest arrow * clang-format * address comments * fix	2017-03-20 16:31:46 -07:00
Stephanie Wang	12c9618c0c	Plasma and worker node failure. (#373 ) * Failing test case * Local scheduler exits cleanly after plasma store dies * Tolerate one plasma store failure * Tolerate plasma store failures on all nodes except head node * Plasma manager heartbeats * Component failure tests * Don't run the helper for Python testing * Fix C test * Fix hanging plasma transfer test * Fix python3 * Consolidate ClientConnection code * Fix valgrind test * fix c test * We can restart worker nodes! * Fix flatbuffers bug * Address comments * Only register actual workers with the local scheduler * Fix bug * Fix segfaults * Add test case that tests for driver liveness, fix local scheduler bug * Clean up after tests * Allocate retry info on the stack * Send SIGKILL before waiting * Relax unit test conditions * Driver liveness test case and documentation	2017-03-17 17:03:58 -07:00
Philipp Moritz	068429ffd8	Convert local scheduler messages to flatbuffers (#340 ) * use flatbuffer messages for local scheduler * make sure constructor gets called for C++ object ObjectInfoT * fix typo * fix Robert's comments * Small change to actor test. * fix valgrind error * linting * free notification * fix * valgrind * fix valgrind * fix other bugs * valgrind fix * fixes * more fixes * Small changes to comments.	2017-03-15 16:27:52 -07:00
Robert Nishihara	53dffe0bf2	Use flatbuffers for some messages from Redis. (#341 ) * Compile the Ray redis module with C++. * Redo parsing of object table notifications with flatbuffers. * Update redis module python tests. * Redo parsing of task table notifications with flatbuffers. * Fix linting. * Redo parsing of db client notifications with flatbuffers. * Redo publishing of local scheduler heartbeats with flatbuffers. * Fix linting. * Remove usage of fixed-width formatting of scheduling state in channel name. * Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion. * Fix linting and tests. * fix * cleanup * simplify logic in ReplyWithTask	2017-03-10 18:35:25 -08:00
Philipp Moritz	0de57be085	upgrade flatbuffers to 1.6.0 (#345 )	2017-03-07 21:33:46 -08:00
Stephanie Wang	da06b4db82	Warn the user when a nondeterministic task is detected. (#339 ) * WARN instead of FATAL for object hash mismatches, push error to driver * Document the callback signature for object_table_add/remove * Error table * Wait for all errors in python test * Fix doc * Fix state test	2017-03-07 00:32:15 -08:00
Philipp Moritz	0b8d279ef2	Convert task_spec to flatbuffers (#255 ) * convert Ray to C++ * convert task_spec to flatbuffers * fix * it compiles * latest * tests are passing * task2 -> task * fix * fix * fix * fix * fix * linting * fix valgrind * upgrade flatbuffers * use debug mode for valgrind * fix naming and comments * downgrade flatbuffers * fix linting * reintroduce TaskSpec_free * rename TaskSpec -> TaskInfo * refactoring * linting	2017-03-05 02:05:02 -08:00
Stephanie Wang	41b8675d04	Availability after local scheduler failure (#329 ) * Clean up plasma subscribers on EPIPE First pass at a monitoring script - monitor can detect local scheduler death Clean up task table upon local scheduler death in monitoring script Don't schedule to dead local schedulers in global scheduler Have global scheduler update the db clients table, monitor script cleans up state Documentation Monitor script should scan tables before beginning to read from subscription channel Fix for python3 Redirect monitor output to redis logs, fix hanging in multinode tests * Publish auxiliary addresses as part of db_client deletion notifications * Fix test case? * Small changes. * Use SCAN instead of KEYS * Address comments * Address more comments * Free redis module strings	2017-03-02 19:51:20 -08:00
Philipp Moritz	793a102846	Make Ray code C++ compatible (#321 ) * convert Ray to C++ * const correctness	2017-03-01 01:17:24 -08:00
Alexey Tumanov	b91d9cba45	Adding flatbuffers and migrating flatcc to flatbuffers for plasma (#325 ) * adding flatbuffers and migrating flatcc to flatbuffers for plasma * variable name changes in plasma_protocol and plasma flatbuffers schema * quick fix * cleanups and remove flatcc * more cleanup * add doc * linting * fix linting * fix mac os x build * linting * cleanup * c++ fix for plasma flatbuffers * Remove flatcc from CMakeLists.txt. * linting; trigger travis	2017-02-28 18:47:40 -08:00
Robert Nishihara	1a997ed279	Move documentation to ReadTheDocs. (#326 )	2017-02-27 21:14:31 -08:00
Robert Nishihara	1ae7e7d29e	Rename photon -> local scheduler. (#322 )	2017-02-27 12:24:07 -08:00
Philipp Moritz	a30eed452e	Change type naming convention. (#315 ) * Rename object_id -> ObjectID. * Rename ray_logger -> RayLogger. * rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID * Rename plasma_store_info -> PlasmaStoreInfo. * Rename plasma_store_state -> PlasmaStoreState. * Rename plasma_object -> PlasmaObject. * Rename object_request -> ObjectRequests. * Rename eviction_state -> EvictionState. * Bug fix. * rename db_handle -> DBHandle * Rename local_scheduler_state -> LocalSchedulerState. * rename db_client_id -> DBClientID * rename task -> Task * make redis.c C++ compatible * Rename scheduling_algorithm_state -> SchedulingAlgorithmState. * Rename plasma_connection -> PlasmaConnection. * Rename client_connection -> ClientConnection. * Fixes from rebase. * Rename local_scheduler_client -> LocalSchedulerClient. * Rename object_buffer -> ObjectBuffer. * Rename client -> Client. * Rename notification_queue -> NotificationQueue. * Rename object_get_requests -> ObjectGetRequests. * Rename get_request -> GetRequest. * Rename object_info -> ObjectInfo. * Rename scheduler_object_info -> SchedulerObjectInfo. * Rename local_scheduler -> LocalScheduler and some fixes. * Rename local_scheduler_info -> LocalSchedulerInfo. * Rename global_scheduler_state -> GlobalSchedulerState. * Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState. * Rename object_size_entry -> ObjectSizeEntry. * Rename aux_address_entry -> AuxAddressEntry. * Rename various ID helper methods. * Rename Task helper methods. * Rename db_client_cache_entry -> DBClientCacheEntry. * Rename local_actor_info -> LocalActorInfo. * Rename actor_info -> ActorInfo. * Rename retry_info -> RetryInfo. * Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData. * Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData. * Rename table_callback_data -> TableCallbackData. * Rename object_info_subscribe_data -> ObjectInfoSubscribeData. * Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData. * Rename more redis call data structures. * Rename photon_conn PhotonConnection. * Rename photon_mock -> PhotonMock. * Fix formatting errors.	2017-02-26 00:32:43 -08:00
Stephanie Wang	be1618f041	Availability after worker failure (#316 ) * Availability after a killed worker * Workers exit cleanly * Memory cleanup in photon C tests * Worker failure in multinode * Consolidate worker cleanup handlers * Update the result table before handling a task submission * KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS * Log a warning instead of crashing if no result table entry found	2017-02-25 20:19:36 -08:00
Robert Nishihara	232601f90d	Change all table calls to use default retry behavior. (#312 ) * Change all table calls to use default retry behavior and change default retry behavior. * Add warning for table retries.	2017-02-24 12:41:32 -08:00
Robert Nishihara	7f5be96683	Remove object table tests that are failing. (#310 )	2017-02-23 13:39:59 -08:00
Robert Nishihara	3e67d28922	Address numbuf compiler warnings. (#300 )	2017-02-20 22:42:03 -08:00
Stephanie Wang	334aed9fa9	Fetch the object after requesting reconstruction during ray.get (#301 ) * Fetch the object after requesting reconstruction during ray.get * revert * Fix documentation and memory leak * Fix hanging reconstruction bug * Fix for python3	2017-02-20 21:41:34 -08:00
Stephanie Wang	67c591c33b	Retry connections in photon connect, consolidate code in io.c (#294 )	2017-02-17 23:41:21 -08:00
Philipp Moritz	12a68e84d2	Implement a first pass at actors in the API. (#242 ) * Implement actor field for tasks * Implement actor management in local scheduler. * initial python frontend for actors * import actors on worker * IPython code completion and tests * prepare creating actors through local schedulers * add actor id to PyTask * submit actor calls to local scheduler * starting to integrate * simple fix * Fixes from rebasing. * more work on python actors * Improve local scheduler actor handlers. * Pass actor ID to local scheduler when connecting a client. * first working version of actors * fixing actors * fix creating two copies of the same actor * fix actors * remove sleep * get rid of export synchronization * update * insert actor methods into the queue in the right order * remove print statements * make it compile again after rebase * Minor updates. * fix python actor ids * Pass actor_id to start_worker. * add test * Minor changes. * Update actor tests. * Temporary plan for import counter. * Temporarily fix import counters. * Fix some tests. * Fixes. * Make actor creation non-blocking. * Fix test? * Fix actors on Python 2. * fix rare case. * Fix python 2 test. * More tests. * Small fixes. * Linting. * Revert tensorflow version to 0.12.0 temporarily. * Small fix. * Enhance inheritance test.	2017-02-15 00:10:05 -08:00
Alexey Tumanov	dfb6107b22	General attribute-based heterogeneity support with hard and soft constraints (#248 ) * attribute-based heterogeneity-awareness in global scheduler and photon * minor post-rebase fix * photon: enforce dynamic capacity constraint on task dispatch * globalsched: cap the number of times we try to schedule a task in round robin * propagating ability to specify resource capacity to ray.init * adding resources to remote function export and fetch/register * globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat) * Add some integration tests. * globalsched: cleanup + factor out constraint checking * lots of style * task_spec_required_resource: global refactor * clang format * clang format + comment update in photon * clang format photon comment * valgrind * reduce verbosity for Travis * Add test for scheduler load balancing. * addressing comments * refactoring global scheduler algorithm * Minor cleanups. * Linting. * Fix array_test.py and linting. * valgrind fix for photon tests * Attempt to fix stress tests. * fix hashmap free * fix hashmap free comment * memset photon resource vectors to 0 in case they get used before the first heartbeat * More whitespace changes. * Undo whitespace error I introduced.	2017-02-09 01:34:14 -08:00
Philipp Moritz	fefc7d9b49	fix segfault in photon.Task (#253 )	2017-02-07 11:17:11 -08:00
Robert Nishihara	2d1c980ad7	Refactor local scheduler to remove worker indices. (#245 ) * Refactor local scheduler to remove worker indices. * Change scheduling state enum to int in all function signatures. * Bug fix, don't use pointers into a resizable array. * Remove total_num_workers. * Fix tests.	2017-02-05 14:52:28 -08:00

120 commits