hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Robert Nishihara	320109a5bd	By default, start a number of workers equal to the number of CPUs. (#430 ) * By default, start a number of workers equal to the number of CPUs. * Fix stress tests.	2017-04-06 00:02:58 -07:00
Stephanie Wang	93679df724	Stopped nodes can rejoin immediately (#428 ) * Ignore deleted clients when reading address info from Redis * Remove self from db_client table when exiting cleanly * Fix valgrind test * Do not call plasma_perform_release when disconnecting	2017-04-05 23:50:38 -07:00
Philipp Moritz	4043769ba2	Make putting large objects work. (#411 ) * putting large objects * add more checks * support large objects * fix test * fix linting * upgrade to latest arrow version * check malloc return code * print mmap file sizes * printing * revert to dlmalloc * add prints * more prints * add printing * printing * fix * update * fix * update * print * initialization * temp * fix * update * fix linting * comment out object_store_full tests * fix test * fix test * evict objects if dlmalloc fails * fix stresstests * Fix linting. * Uncomment large-memory tests. * Increase memory for docker image for jenkins tests. * Reduce large memory tests. * Further reduce large memory tests.	2017-04-05 01:04:05 -07:00
Robert Nishihara	0925e11c48	Exclude function source from function ID hash in Python interpreter. (#395 ) * Exclude function source code from function ID hash in Python interpreter. * Remove try except block.	2017-03-25 11:31:21 -07:00
Robert Nishihara	ba02fc0eb0	Run flake8 in Travis and make code PEP8 compliant. (#387 )	2017-03-21 12:57:54 -07:00
Stephanie Wang	083e7a28ad	Push an error to the driver when the workload hangs on `ray.put` reconstruction (#382 ) * Fix worker blocked bug * tmp * Push an error to the driver on ray.put for non-driver tasks * Fix result table tests * Fix test, logging * Address comments * Fix suppression bug * Fix redis module test * Edit error message * Get values in chunks during reconstruction * Test case for driver ray.put errors * Error for evicting ray.put objects from the driver * Fix tests * Reduce verbosity * Documentation	2017-03-21 00:16:48 -07:00
Stephanie Wang	12c9618c0c	Plasma and worker node failure. (#373 ) * Failing test case * Local scheduler exits cleanly after plasma store dies * Tolerate one plasma store failure * Tolerate plasma store failures on all nodes except head node * Plasma manager heartbeats * Component failure tests * Don't run the helper for Python testing * Fix C test * Fix hanging plasma transfer test * Fix python3 * Consolidate ClientConnection code * Fix valgrind test * fix c test * We can restart worker nodes! * Fix flatbuffers bug * Address comments * Only register actual workers with the local scheduler * Fix bug * Fix segfaults * Add test case that tests for driver liveness, fix local scheduler bug * Clean up after tests * Allocate retry info on the stack * Send SIGKILL before waiting * Relax unit test conditions * Driver liveness test case and documentation	2017-03-17 17:03:58 -07:00
Robert Nishihara	f1d4dda8cb	Put all log files in redis and visualize them in UI. (#350 ) * Start process for monitoring log files and push changes to redis. * Display log files in UI. * Bug fix for recent tasks. * Use flatbuffers to parse local scheduler heartbeats.	2017-03-16 15:27:00 -07:00
Robert Nishihara	3333e1d6b9	Fix bug in parsing of tasks in monitor. (#372 )	2017-03-15 20:32:23 -07:00
Robert Nishihara	3b7788bf88	Disallow calling ray.put on an object ID. (#353 )	2017-03-11 12:09:28 -08:00
Robert Nishihara	53dffe0bf2	Use flatbuffers for some messages from Redis. (#341 ) * Compile the Ray redis module with C++. * Redo parsing of object table notifications with flatbuffers. * Update redis module python tests. * Redo parsing of task table notifications with flatbuffers. * Fix linting. * Redo parsing of db client notifications with flatbuffers. * Redo publishing of local scheduler heartbeats with flatbuffers. * Fix linting. * Remove usage of fixed-width formatting of scheduling state in channel name. * Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion. * Fix linting and tests. * fix * cleanup * simplify logic in ReplyWithTask	2017-03-10 18:35:25 -08:00
Wapaul1	c66178bcd7	Resnet Adapted to Ray (#229 ) * Initial conversion * Further changes * fixes * some changes * Fixes * Added data pipeline * Added updates to cifar * Currently broken need sep pr * Added test for retrieving variables from an optimizer * Removed FlAG ref in environment variables * Added comments to test * Addressed comments * Added updates * Made further changes for tfutils * Fixed finalized bug * Removed ipython * Added accuracy printing * Temp commit * added fixes * changes * Added writing to file * Fixes for gpus * Cleaned up code * Temp commit * Gpu support fully implemented * Updated to use num_gpus for actors * Finished testing gpus implementation * Changed to be more in line with origin implementation * Updated test to use actors * Added support for cpu only systems * Now works with no cpus * Minor changes and some documentation.	2017-03-07 01:07:32 -08:00
Stephanie Wang	da06b4db82	Warn the user when a nondeterministic task is detected. (#339 ) * WARN instead of FATAL for object hash mismatches, push error to driver * Document the callback signature for object_table_add/remove * Error table * Wait for all errors in python test * Fix doc * Fix state test	2017-03-07 00:32:15 -08:00
Robert Nishihara	a7ddac6fb1	Properly mock ray submodules when building documentation. (#337 )	2017-03-04 23:02:56 -08:00
Stephanie Wang	41b8675d04	Availability after local scheduler failure (#329 ) * Clean up plasma subscribers on EPIPE First pass at a monitoring script - monitor can detect local scheduler death Clean up task table upon local scheduler death in monitoring script Don't schedule to dead local schedulers in global scheduler Have global scheduler update the db clients table, monitor script cleans up state Documentation Monitor script should scan tables before beginning to read from subscription channel Fix for python3 Redirect monitor output to redis logs, fix hanging in multinode tests * Publish auxiliary addresses as part of db_client deletion notifications * Fix test case? * Small changes. * Use SCAN instead of KEYS * Address comments * Address more comments * Free redis module strings	2017-03-02 19:51:20 -08:00
Robert Nishihara	6a4bde54dc	Only install ray python packages. (#330 ) * Only install ray python packages. * Add some __init__.py files. * Install Ray before building documentation. * Fix install-ray.sh. * Fix.	2017-03-01 23:34:44 -08:00
Robert Nishihara	1a997ed279	Move documentation to ReadTheDocs. (#326 )	2017-02-27 21:14:31 -08:00
Robert Nishihara	1ae7e7d29e	Rename photon -> local scheduler. (#322 )	2017-02-27 12:24:07 -08:00
Stephanie Wang	be1618f041	Availability after worker failure (#316 ) * Availability after a killed worker * Workers exit cleanly * Memory cleanup in photon C tests * Worker failure in multinode * Consolidate worker cleanup handlers * Update the result table before handling a task submission * KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS * Log a warning instead of crashing if no result table entry found	2017-02-25 20:19:36 -08:00
Robert Nishihara	aa174e6311	Fix global scheduler test failure. (#314 )	2017-02-24 11:05:45 -08:00
Robert Nishihara	54238c4ad0	Propagate errors from importing actors. (#309 ) * Propagate errors from importing actors. * Fix bug.	2017-02-22 15:15:45 -08:00
Robert Nishihara	e399f57e6b	Let actors use GPUs. (#302 ) * Add num_cpus and num_gpus to actor decorator. * Assign GPU IDs to actors. * Add additional actor test. * Remove duplicated line. * Factor out local scheduler selection method. * Add test and simplify local scheduler selection.	2017-02-21 01:13:04 -08:00
Stephanie Wang	334aed9fa9	Fetch the object after requesting reconstruction during ray.get (#301 ) * Fetch the object after requesting reconstruction during ray.get * revert * Fix documentation and memory leak * Fix hanging reconstruction bug * Fix for python3	2017-02-20 21:41:34 -08:00
Robert Nishihara	2220a33b62	In UI, add timing information for tasks and show cluster scheduling. (#297 ) * In UI, add timing information for tasks and show cluster scheduling. * Factor out html generation as function.	2017-02-19 15:12:08 -08:00
Robert Nishihara	124baa7472	Fix bug in redis module tests. (#292 ) * Fix bug in redis module tests. * Sleep while waiting for next message.	2017-02-18 00:55:57 -08:00
Stephanie Wang	a0dd3a44c0	Dynamically grow worker pool to partially solve hanging workloads (#286 ) * First pass at a policy to solve deadlock * Address Robert's comments * stress test * unit test * Fix test cases * Fix test for python3 * add more logging * White space.	2017-02-17 17:08:52 -08:00
Robert Nishihara	0bbf08a4ac	Fix test_illegal_put failure in plasma test. (#289 ) * Fix test_illegal_put failure in plasma test. * Check that exactly one plasma manager has died.	2017-02-17 11:06:25 -08:00
Johann Schleier-Smith	c9bc488ee0	Redirect process output to log files (#267 ) * redirect process output to log files * formatting fixes * Generate all log files in start_ray_processes. * Fix bug.	2017-02-16 20:34:45 -08:00
Robert Nishihara	88a5b4e77b	Simplify imports and exports and provide driver isolation for remote functions. (#288 ) * Remove import counter and export counter. * Provide isolation between drivers for remote functions. * Add test for driver function isolation. * Hash source code into function ID to reduce likelihood of collisions. * Fix failure test example. * Replace assertTrue with assertIn to improve failure messages in tests. * Fix failure test.	2017-02-16 11:30:35 -08:00
Wapaul1	883f945db4	Updated tfutils to use new op naming (#284 ) * Updated tfutils to use new op naming * Reverted tensorflow 12.0.0	2017-02-15 17:47:53 -08:00
Philipp Moritz	12a68e84d2	Implement a first pass at actors in the API. (#242 ) * Implement actor field for tasks * Implement actor management in local scheduler. * initial python frontend for actors * import actors on worker * IPython code completion and tests * prepare creating actors through local schedulers * add actor id to PyTask * submit actor calls to local scheduler * starting to integrate * simple fix * Fixes from rebasing. * more work on python actors * Improve local scheduler actor handlers. * Pass actor ID to local scheduler when connecting a client. * first working version of actors * fixing actors * fix creating two copies of the same actor * fix actors * remove sleep * get rid of export synchronization * update * insert actor methods into the queue in the right order * remove print statements * make it compile again after rebase * Minor updates. * fix python actor ids * Pass actor_id to start_worker. * add test * Minor changes. * Update actor tests. * Temporary plan for import counter. * Temporarily fix import counters. * Fix some tests. * Fixes. * Make actor creation non-blocking. * Fix test? * Fix actors on Python 2. * fix rare case. * Fix python 2 test. * More tests. * Small fixes. * Linting. * Revert tensorflow version to 0.12.0 temporarily. * Small fix. * Enhance inheritance test.	2017-02-15 00:10:05 -08:00
Robert Nishihara	072eadd57f	Pipe num_cpus and num_gpus through from start_ray.py. (#275 ) * Pipe num_cpus and num_gpus through from start_ray.py. * Improve load balancing tests. * Fix bug. * Factor out some testing code.	2017-02-13 17:43:23 -08:00
Robert Nishihara	3934d5f6eb	Remove old files and remove old documentation for copying files around cluster. (#274 )	2017-02-13 11:20:04 -08:00
Robert Nishihara	cb7f6ca9b5	Attempt to start web UI when starting Ray. (#269 ) * Attempt to start web UI when starting Ray. * Add instructions for using web UI to cluster documentation. * Don't check if port 8080 is open. * Remove print statement.	2017-02-12 15:17:58 -08:00
Robert Nishihara	f6ce9dfa6c	Allow start_ray.sh to take an object manager port. (#272 ) * Allow start_ray.sh to take a object manager port. * Fix typo and add test. * Small cleanups.	2017-02-12 12:39:32 -08:00
Johann Schleier-Smith	7bf80b6b22	bug fix on printing exception traceback (#268 )	2017-02-10 21:05:05 -08:00
Stephanie Wang	2b8e6485e3	Start and clean up workers from the local scheduler. (#250 ) * Start and clean up workers from the local scheduler Ability to kill workers in photon scheduler Test for old method of starting workers Common codepath for killing workers Common codepath for killing workers Photon test case for starting and killing workers fix build Fix component failure test Register a worker's pid as part of initial connection Address comments and revert photon_connect Set PATH during travis install Fix * Fix photon test case to accept clients on plasma manager fd	2017-02-10 12:46:23 -08:00
Robert Nishihara	249b667b0e	Raise exception in Python if wait is called with duplicate object IDs. (#262 )	2017-02-09 23:32:19 -08:00
Robert Nishihara	0aa234fb9c	Fix CXX numbuf error message for Anaconda 3.6. (#258 )	2017-02-09 23:29:43 -08:00
Alexey Tumanov	dfb6107b22	General attribute-based heterogeneity support with hard and soft constraints (#248 ) * attribute-based heterogeneity-awareness in global scheduler and photon * minor post-rebase fix * photon: enforce dynamic capacity constraint on task dispatch * globalsched: cap the number of times we try to schedule a task in round robin * propagating ability to specify resource capacity to ray.init * adding resources to remote function export and fetch/register * globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat) * Add some integration tests. * globalsched: cleanup + factor out constraint checking * lots of style * task_spec_required_resource: global refactor * clang format * clang format + comment update in photon * clang format photon comment * valgrind * reduce verbosity for Travis * Add test for scheduler load balancing. * addressing comments * refactoring global scheduler algorithm * Minor cleanups. * Linting. * Fix array_test.py and linting. * valgrind fix for photon tests * Attempt to fix stress tests. * fix hashmap free * fix hashmap free comment * memset photon resource vectors to 0 in case they get used before the first heartbeat * More whitespace changes. * Undo whitespace error I introduced.	2017-02-09 01:34:14 -08:00
Wapaul1	1a7e1c47cb	Added example for compute grads in ray tutorial (#238 ) * Added example for compute grads in ray * Added formatting * Removed need for placeholders in apply gradient * Streamlined examples * Fixed docs * Added formatting * Removed old references * Simplified code some * Addressed comments * Changes to first code block * Added test for training and updated code snippets * Formatting * Removed mean * Removed all mention of mean * Added comments * Added comments	2017-02-07 18:07:21 -08:00
Robert Nishihara	1fec94ef00	Display drivers in web UI. (#252 ) * Display drivers in web UI. * Display more rows in grid and factor out function in webui backend.	2017-02-07 14:21:25 -08:00
Stephanie Wang	241b539ff8	Reconstruction for evicted objects (#181 ) * First pass at reconstruction in the worker Modify reconstruction stress testing to start Plasma service before rest of Ray cluster TODO about reconstructing ray.puts Fix ray.put error for double creates Distinguish between empty entry and no entry in object table Fix test case Fix Python test Fix tests * Only call reconstruct on objects we have not yet received * Address review comments * Fix reconstruction for Python3 * remove unused code * Address Robert's comments, stress tests are crashing * Test and update the task's scheduling state to suppress duplicate reconstruction requests. * Split result table into two lookups, one for task ID and the other as a test-and-set for the task state * Fix object table tests * Fix redis module result_table_lookup test case * Multinode reconstruction tests * Fix python3 test case * rename * Use new start_redis * Remove unused code * lint * indent * Address Robert's comments * Use start_redis from ray.services in state table tests * Remove unnecessary memset	2017-02-01 19:18:46 -08:00
Johann Schleier-Smith	6ad2b5d87a	Add Redis port option to startup script (#232 ) * specify redis address when starting head * cleanup * update starting cluster documentation * Whitespace. * Address Philipp's comments. * Change redis_host -> redis_ip_address.	2017-01-31 00:28:00 -08:00
Wapaul1	db7297865f	Added functionality for retrieving variables from control dependencies (#220 ) * Added test for retrieving variables from an optimizer * Added comments to test * Addressed comments * Fixed travis bug * Added fix to circular controls * Added set for explored operations and duplicate prefix stripping * Removed embedded ipython * Removed prefix, use separate graph for each network * Removed redundant imports * Addressed comments and added separate graph to initializer * fix typos * get rid of prefix in documentation	2017-01-30 19:17:42 -08:00
Robert Nishihara	6703f7be6f	Provide functionality for local scheduler to start new workers. (#230 ) * Provide functionality for local scheduler to start new workers. * Pass full command for starting new worker in to local scheduler. * Separate out configuration state of local scheduler.	2017-01-27 01:28:48 -08:00
Stephanie Wang	a5c8f28f33	Plasma subscribe (#227 ) * Use object_info as notification, not just the object_id * Add a regression test for plasma managers connecting to store after some objects have been created * Send notifications for existing objects to new plasma subscribers * Continuously try the request to the plasma manager instead of setting a timeout in the test case * Use ray.services to start Redis in plasma test cases * fix test case	2017-01-25 22:57:15 -08:00
Robert Nishihara	ab8c3432f7	Add driver ID to task spec and add driver ID to Python error handling. (#225 ) * Add driver ID to task spec and add driver ID to Python error handling. * Make constants global variables. * Add test for error isolation.	2017-01-25 22:53:48 -08:00
Stephanie Wang	3c6686db08	Photon optimizations (#219 ) * Optimizations: - Track mapping of missing object to dependent tasks to avoid iterating over task queue - Perform all fetch requests for missing objects using the same timer * Fix bug and add regression test * Record task dependencies and active fetch requests in the same hash table * fix typo * Fix memory leak and add test cases for scheduling when dependencies are evicted * Fix python3 test case * Minor details.	2017-01-23 19:44:15 -08:00
Richard Liaw	4575cd88b2	Improve error messages when nodes can't communicate with each other. (#223 ) * Good error messages when nodes can't communicate with each other * Print more information when starting the head node. * Change retries back to 5.	2017-01-22 14:53:15 -08:00

7507 commits