hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-13 14:46:38 -04:00

Author	SHA1	Message	Date
Robert Nishihara	96913be939	Treat actor creation like a regular task. (#1668 ) * Treat actor creation like a regular task. * Small cleanups. * Change semantics of actor resource handling. * Bug fix. * Minor linting * Bug fix * Fix jenkins test. * Fix actor tests * Some cleanups * Bug fix * Fix bug. * Remove cached actor tasks when a driver is removed. * Add more info to taskspec in global state API. * Fix cyclic import bug in tune. * Fix * Fix linting. * Fix linting. * Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs. * Bug fix. * Add test for 0 CPU case * Fix linting * Address comments. * Fix typos and add comment. * Add assertion and fix test.	2018-03-16 11:18:07 -07:00
Robert Nishihara	0fcceef772	Update logging and check macros. (#1627 ) * Update logging and check macros. * Fix linting. * Fix RAY_DCHECK and unused variable. * Fix linting	2018-02-28 15:13:00 -08:00
Eric Liang	9233e496cc	Raise exception when getting the task results of workers that died (#1224 ) * wip * with test * add timeout * also add test for f * remove on cleanup * update * wip * fix tests * mark actor removed in redis * clang-format * fix bug when no-inprogress tasks * try to set task status done * Add comment.	2017-11-20 15:18:39 -08:00
Stephanie Wang	07f0532b9b	Local scheduler filters out dead clients during reconstruction (#1182 ) * Object table lookup returns vector of DBClientID instead of address strings * Add node IP address to DBClient notification * DB client cache stores entire DB client, convert addresses to std::string * get cached db client returns the client * Expose a call to initialize the redis cache * Local scheduler filters out dead clients during reconstruction * Remove node ip address from dbclient, use aux_address for plasma managers * Get entire db client entry when not found in cache * Fix common tests * Fix address in tests * Push error to driver if driver task did the put * Address Robert's comments and cleanup * Remove unused Redis command * Fix db test	2017-11-10 11:29:24 -08:00
Robert Nishihara	1c6b30b5e2	Move all config constants into single file. (#1192 ) * Initial pass at factoring out C++ configuration into a single file. * Expose config through Python. * Forward declarations. * Fixes with Python extensions * Remove old code. * Consistent naming for constants. * Fixes * Fix linting. * More linting. * Whitespace * rename config -> _config. * Move config inside a class. * update naming convention * Fix linting. * More linting * More linting. * Add in some more constants. * Fix linting	2017-11-08 11:10:38 -08:00
Stephanie Wang	74ac80631b	Local scheduler sends a null heartbeat to global scheduler (#962 ) * Local scheduler sends a null heartbeat to global scheduler to notify death * Add whitespace. * Speed up component failures test * Free local scheduler state upon plasma manager disconnection	2017-09-12 10:45:21 -07:00
Peter Schafhalter	2c19ae97a3	Implemented db_client_cache as unordered_map (#921 ) * Implemented db_client_cache as unordered_map * Fix for memory leak * Fixed linting	2017-09-03 17:26:05 -07:00
Stephanie Wang	ee08c8274b	Shard Redis. (#539 ) * Implement sharding in the Ray core * Single node Python modifications to do sharding * Do the sharding in redis.cc * Pipe num_redis_shards through start_ray.py and worker.py. * Use multiple redis shards in multinode tests. * first steps for sharding ray.global_state * Fix problem in multinode docker test. * fix runtest.py * fix some tests * fix redis shard startup * fix redis sharding * fix * fix bug introduced by the map-iterator being consumed * fix sharding bug * shard event table * update number of Redis clients to be 64K * Fix object table tests by flushing shards in between unit tests * Fix local scheduler tests * Documentation * Register shard locations in the primary shard * Add plasma unit tests back to build * lint * lint and fix build * Fix * Address Robert's comments * Refactor start_ray_processes to start Redis shard * lint * Fix global scheduler python tests * Fix redis module test * Fix plasma test * Fix component failure test * Fix local scheduler test * Fix runtest.py * Fix global scheduler test for python3 * Fix task_table_test_and_update bug, from actor task table submission race * Fix jenkins tests. * Retry Redis shard connections * Fix test cases * Convert database clients to DBClient struct * Fix race condition when subscribing to db client table * Remove unused lines, add APITest for sharded Ray * Fix * Fix memory leak * Suppress ReconstructionTests output * Suppress output for APITestSharded * Reissue task table add/update commands if initial command does not publish to any subscribers. * fix * Fix linting. * fix tests * fix linting * fix python test * fix linting	2017-05-18 17:40:41 -07:00
Robert Nishihara	0ac125e9b2	Clean up when a driver disconnects. (#462 ) * Clean up state when drivers exit. * Remove unnecessary field in ActorMapEntry struct. * Have monitor release GPU resources in Redis when driver exits. * Enable multiple drivers in multi-node tests and test driver cleanup. * Make redis GPU allocation a redis transaction and small cleanups. * Fix multi-node test. * Small cleanups. * Make global scheduler take node_ip_address so it appears in the right place in the client table. * Cleanups. * Fix linting and cleanups in local scheduler. * Fix removed_driver_test. * Fix bug related to vector -> list. * Fix linting. * Cleanup. * Fix multi node tests. * Fix jenkins tests. * Add another multi node test with many drivers. * Fix linting. * Make the actor creation notification a flatbuffer message. * Revert "Make the actor creation notification a flatbuffer message." This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39. * Add comment explaining flatbuffer problems.	2017-04-24 18:10:21 -07:00
Stephanie Wang	12c9618c0c	Plasma and worker node failure. (#373 ) * Failing test case * Local scheduler exits cleanly after plasma store dies * Tolerate one plasma store failure * Tolerate plasma store failures on all nodes except head node * Plasma manager heartbeats * Component failure tests * Don't run the helper for Python testing * Fix C test * Fix hanging plasma transfer test * Fix python3 * Consolidate ClientConnection code * Fix valgrind test * fix c test * We can restart worker nodes! * Fix flatbuffers bug * Address comments * Only register actual workers with the local scheduler * Fix bug * Fix segfaults * Add test case that tests for driver liveness, fix local scheduler bug * Clean up after tests * Allocate retry info on the stack * Send SIGKILL before waiting * Relax unit test conditions * Driver liveness test case and documentation	2017-03-17 17:03:58 -07:00
Stephanie Wang	da06b4db82	Warn the user when a nondeterministic task is detected. (#339 ) * WARN instead of FATAL for object hash mismatches, push error to driver * Document the callback signature for object_table_add/remove * Error table * Wait for all errors in python test * Fix doc * Fix state test	2017-03-07 00:32:15 -08:00
Stephanie Wang	41b8675d04	Availability after local scheduler failure (#329 ) * Clean up plasma subscribers on EPIPE First pass at a monitoring script - monitor can detect local scheduler death Clean up task table upon local scheduler death in monitoring script Don't schedule to dead local schedulers in global scheduler Have global scheduler update the db clients table, monitor script cleans up state Documentation Monitor script should scan tables before beginning to read from subscription channel Fix for python3 Redirect monitor output to redis logs, fix hanging in multinode tests * Publish auxiliary addresses as part of db_client deletion notifications * Fix test case? * Small changes. * Use SCAN instead of KEYS * Address comments * Address more comments * Free redis module strings	2017-03-02 19:51:20 -08:00
Philipp Moritz	a30eed452e	Change type naming convention. (#315 ) * Rename object_id -> ObjectID. * Rename ray_logger -> RayLogger. * rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID * Rename plasma_store_info -> PlasmaStoreInfo. * Rename plasma_store_state -> PlasmaStoreState. * Rename plasma_object -> PlasmaObject. * Rename object_request -> ObjectRequests. * Rename eviction_state -> EvictionState. * Bug fix. * rename db_handle -> DBHandle * Rename local_scheduler_state -> LocalSchedulerState. * rename db_client_id -> DBClientID * rename task -> Task * make redis.c C++ compatible * Rename scheduling_algorithm_state -> SchedulingAlgorithmState. * Rename plasma_connection -> PlasmaConnection. * Rename client_connection -> ClientConnection. * Fixes from rebase. * Rename local_scheduler_client -> LocalSchedulerClient. * Rename object_buffer -> ObjectBuffer. * Rename client -> Client. * Rename notification_queue -> NotificationQueue. * Rename object_get_requests -> ObjectGetRequests. * Rename get_request -> GetRequest. * Rename object_info -> ObjectInfo. * Rename scheduler_object_info -> SchedulerObjectInfo. * Rename local_scheduler -> LocalScheduler and some fixes. * Rename local_scheduler_info -> LocalSchedulerInfo. * Rename global_scheduler_state -> GlobalSchedulerState. * Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState. * Rename object_size_entry -> ObjectSizeEntry. * Rename aux_address_entry -> AuxAddressEntry. * Rename various ID helper methods. * Rename Task helper methods. * Rename db_client_cache_entry -> DBClientCacheEntry. * Rename local_actor_info -> LocalActorInfo. * Rename actor_info -> ActorInfo. * Rename retry_info -> RetryInfo. * Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData. * Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData. * Rename table_callback_data -> TableCallbackData. * Rename object_info_subscribe_data -> ObjectInfoSubscribeData. 
* Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData. * Rename more redis call data structures. * Rename photon_conn PhotonConnection. * Rename photon_mock -> PhotonMock. * Fix formatting errors.	2017-02-26 00:32:43 -08:00
Philipp Moritz	12a68e84d2	Implement a first pass at actors in the API. (#242 ) * Implement actor field for tasks * Implement actor management in local scheduler. * initial python frontend for actors * import actors on worker * IPython code completion and tests * prepare creating actors through local schedulers * add actor id to PyTask * submit actor calls to local scheduler * starting to integrate * simple fix * Fixes from rebasing. * more work on python actors * Improve local scheduler actor handlers. * Pass actor ID to local scheduler when connecting a client. * first working version of actors * fixing actors * fix creating two copies of the same actor * fix actors * remove sleep * get rid of export synchronization * update * insert actor methods into the queue in the right order * remove print statements * make it compile again after rebase * Minor updates. * fix python actor ids * Pass actor_id to start_worker. * add test * Minor changes. * Update actor tests. * Temporary plan for import counter. * Temporarily fix import counters. * Fix some tests. * Fixes. * Make actor creation non-blocking. * Fix test? * Fix actors on Python 2. * fix rare case. * Fix python 2 test. * More tests. * Small fixes. * Linting. * Revert tensorflow version to 0.12.0 temporarily. * Small fix. * Enhance inheritance test.	2017-02-15 00:10:05 -08:00
Stephanie Wang	241b539ff8	Reconstruction for evicted objects (#181 ) * First pass at reconstruction in the worker Modify reconstruction stress testing to start Plasma service before rest of Ray cluster TODO about reconstructing ray.puts Fix ray.put error for double creates Distinguish between empty entry and no entry in object table Fix test case Fix Python test Fix tests * Only call reconstruct on objects we have not yet received * Address review comments * Fix reconstruction for Python3 * remove unused code * Address Robert's comments, stress tests are crashing * Test and update the task's scheduling state to suppress duplicate reconstruction requests. * Split result table into two lookups, one for task ID and the other as a test-and-set for the task state * Fix object table tests * Fix redis module result_table_lookup test case * Multinode reconstruction tests * Fix python3 test case * rename * Use new start_redis * Remove unused code * lint * indent * Address Robert's comments * Use start_redis from ray.services in state table tests * Remove unnecessary memset	2017-02-01 19:18:46 -08:00
Robert Nishihara	3d697c7ed2	Introduce local scheduler heartbeats which carry load information. (#155 ) * Introduce local scheduler heartbeats which carry load information.	2016-12-24 20:02:25 -08:00
Stephanie Wang	d729f9b7ea	Object table remove (#139 ) * Object table remove redis module * Test case for object table remove redis module * Client code for object_table_remove * Delete object notifications in plasma * Test for object deletion notifications * Fix subscribe deletion test * Address Robert's comments * free hash table entry	2016-12-19 23:18:57 -08:00
Robert Nishihara	269f37e26f	Implement object table notification subscriptions and switch to using Redis modules for object table. (#134 ) * Implement RAY.OBJECT_TABLE_REQUEST_NOTIFICATIONS. * Call object_table_request_notifications from plasma manager. * Use Redis modules for object table. * Cleaning up code. * More checks. * Formatting. * Make object table tests pass. * Formatting. * Add prefix to the object notification channel name. * Formatting. * Fixes. * Increase time in redismodule test.	2016-12-18 18:19:02 -08:00
Robert Nishihara	c740b165f4	Retry first connection to redis in db_connect. (#112 ) * Retry first connection to redis in db_connect. * Declare usleep. * Formatting.	2016-12-09 17:21:49 -08:00
Alexey Tumanov	0abbf5a113	End-to-end object size information passthrough (#105 ) * rebase Alexey's PR on top * rebase on master * fix test failure waiting for plasma manager to exit * clang format * addressing comments * Minor formatting and naming fixes.	2016-12-09 00:51:44 -08:00
Wapaul1	9a513363f9	Init_table_callback now takes ownership of passed in data (#80 ) * temp commit * Stuff * Ownership is now taken by init table callback * Fixed lint errors * Fixed travis warnings * Fixed spacing * add .gitkeep * fix global scheduler * Whitespace.	2016-12-03 13:49:09 -08:00
Robert Nishihara	c8c3983195	Use sizeof(field) instead of sizeof(type) and other fixes. (#47 ) * Use sizeof(field) instead of sizeof(type) and other fixes. * Fix formatting. * Bug fix. * Zero-initialize structs. There are many more instances of these that I haven't changed yet. * Bug fix. * Revert from atexit to signaling to fix valgrind tests. * Address Philipp's comments.	2016-11-19 12:19:49 -08:00
Robert Nishihara	d77b685a90	Global scheduler skeleton (#45 ) * Initial scheduler commit * global scheduler * add global scheduler * Implement global scheduler skeleton. * Formatting. * Allow local scheduler to be started without a connection to redis so that we can test it without a global scheduler. * Fail if there are no local schedulers when the global scheduler receives a task. * Initialize uninitialized value and formatting fix. * Generalize local scheduler table to db client table. * Remove code duplication in local scheduler and add flag for whether a task came from the global scheduler or not. * Queue task specs in the local scheduler instead of tasks. * Simple global scheduler tests, including valgrind. * Factor out functions for starting processes. * Fixes.	2016-11-18 19:57:51 -08:00
Stephanie Wang	9d1e750e8f	Merge task table and task log into a single table (#30 ) * Merge task table and task log * Fix test in db tests * Address Robert's comments and some better error checking * Add a LOG_FATAL that exits the program	2016-11-10 18:13:26 -08:00
Ion	ee3718c80c	Ion and Philipp's table retries (#10 ) * Ion and Philipp's table retries * Refactor the retry struct: - Rename it from retry_struct to retry_info - Retry information contains the failure callback, not the retry callback - All functions take in retry information as an arg instead of its expanded fields * Rename cb -> callback * Remove prints * Fix compiler warnings * Change some CHECKs to greatest ASSERTs * Key outstanding callbacks hash table with timer ID instead of callback data pointer * Use the new retry API for table commands * Memory cleanup in plasma unit tests * fix Robert's comments * add valgrind for common	2016-10-29 15:22:33 -07:00
Robert Nishihara	1915539c5f	Rearrange files to prepare to merge into Ray.	2016-10-25 13:59:47 -07:00

26 commits