hiro/ray

Mirror of https://github.com/vale981/ray, synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Philipp Moritz	647e1d9fc3	Fix runtest.py on the ubuntu system python 3 (#599) * fix runtest.py on the ubuntu system python 3 * less strict version of the test	2017-05-26 15:22:36 -07:00
Robert Nishihara	c5bc76193f	Remove Ray environment variables from codebase. (#590)	2017-05-24 18:29:40 -07:00
Robert Nishihara	c647dd5f6c	Make it possible to use actor definitions within remote functions and other actors. (#587) * Enable remote function and actor definitions to close over actor definitions. * Give better error message if actor objects are pickled. * Add tests for closing over actor definitions. * Fix linting.	2017-05-24 15:43:32 -07:00
Robert Nishihara	07b21e057c	Print the driver stdout/stderr if we fail to decode it in jenkins. (#567) * Print the driver stdout/stderr if we fail to decode it in jenkins. * Fix whitespace. * Add explanation.	2017-05-20 23:11:19 -07:00
Stephanie Wang	ee08c8274b	Shard Redis. (#539) * Implement sharding in the Ray core * Single node Python modifications to do sharding * Do the sharding in redis.cc * Pipe num_redis_shards through start_ray.py and worker.py. * Use multiple redis shards in multinode tests. * first steps for sharding ray.global_state * Fix problem in multinode docker test. * fix runtest.py * fix some tests * fix redis shard startup * fix redis sharding * fix * fix bug introduced by the map-iterator being consumed * fix sharding bug * shard event table * update number of Redis clients to be 64K * Fix object table tests by flushing shards in between unit tests * Fix local scheduler tests * Documentation * Register shard locations in the primary shard * Add plasma unit tests back to build * lint * lint and fix build * Fix * Address Robert's comments * Refactor start_ray_processes to start Redis shard * lint * Fix global scheduler python tests * Fix redis module test * Fix plasma test * Fix component failure test * Fix local scheduler test * Fix runtest.py * Fix global scheduler test for python3 * Fix task_table_test_and_update bug, from actor task table submission race * Fix jenkins tests. * Retry Redis shard connections * Fix test cases * Convert database clients to DBClient struct * Fix race condition when subscribing to db client table * Remove unused lines, add APITest for sharded Ray * Fix * Fix memory leak * Suppress ReconstructionTests output * Suppress output for APITestSharded * Reissue task table add/update commands if initial command does not publish to any subscribers. * fix * Fix linting. * fix tests * fix linting * fix python test * fix linting	2017-05-18 17:40:41 -07:00
shane	0a4304725f	adding -x for clearer output in build console log (#565)	2017-05-18 17:04:56 -07:00
Philipp Moritz	28f0882387	Expose function table to python global control state API (#542) * expose function table to python global control state API * fix * fix linting * add test for function table	2017-05-16 20:06:13 -07:00
Robert Nishihara	ec2534422b	Remove register_class from API. (#550) * Perform ray.register_class under the hood. * Fix bug. * Release worker lock when waiting for imports to arrive in get. * Remove calls to register_class from examples and tests. * Clear serialization state between tests. * Fix bug and add test for multiple custom classes with same name. * Fix failure test. * Fix linting and cleanups to python code. * Fixes to documentation. * Implement recursion depth for recursively registering classes. * Fix linting. * Push warning to user if waiting for class for too long. * Fix typos. * Don't export FunctionToRun if pickling the function fails. * Don't broadcast class definition when pickling class.	2017-05-16 18:38:52 -07:00
Eric Liang	e2e9e4ce6f	Fix segmentation fault when calling ray.put on a dictionary with object keys (#548) * fix segfault when serializing dict key * fix style * fix test * Fix linting.	2017-05-15 01:09:13 -07:00
Robert Nishihara	9f91eb8c91	Change API for remote function declaration, actor instantiation, and actor method invocation. (#541) * Direction substitution of @ray.remote -> @ray.task. * Changes to make '@ray.task' work. * Instantiate actors with Class.remote() instead of Class(). * Convert actor instantiation in tests and examples from Class() to Class.remote(). * Change actor method invocation from object.method() to object.method.remote(). * Update tests and examples to invoke actor methods with .remote(). * Fix bugs in jenkins tests. * Fix example applications. * Change @ray.task back to @ray.remote. * Changes to make @ray.actor -> @ray.remote work. * Direct substitution of @ray.actor -> @ray.remote. * Fixes. * Raise exception if @ray.actor decorator is used. * Simplify ActorMethod class.	2017-05-14 00:01:20 -07:00
Robert Nishihara	f32368bcbe	Prevent actors from being placed on removed nodes or nodes with no CPUs. (#527) * Make note about bug in which actor creation notification message is not received. * Prevent actors from being created on removed nodes. * Prevent actors from being created on nodes with no CPUs. * Fix linting. * Add test for scheduling actors on local schedulers with no CPUs. * Improve error message when actors created before ray.init called.	2017-05-08 20:39:43 -07:00
Robert Nishihara	c688a64235	Expose GPU IDs to remote functions. (#496) * Change local scheduler bookkeeping to use GPU IDs. * Update actor test. * Add tests for actors and tasks simultaneously using GPUs. * Add additional task GPU ID test. * Fix linting. * Make redis GPU assignment ignore GPU IDs. * Small fix.	2017-05-07 13:03:49 -07:00
Robert Nishihara	8532ba4272	Serialize lambdas, sets, and types with pickle by default. (#511) * Serialize lambdas with pickle by default. * Serialize sets with pickle by default. * Serialize types with pickle by default. * Small update to documentation. * Update tests.	2017-05-04 00:16:35 -07:00
Robert Nishihara	245c8ab888	Make sure user seeding does not affect actor ID generation. (#506) * Make sure user seeding does not affect actor ID generation. * Fix linting. * Add test.	2017-05-03 16:29:55 -07:00
Robert Nishihara	1627f89945	Fix problem in which actors and workers running tasks are not killed by driver exit. (#490) * Augment test to verify that relevant workers and actors are killed during driver cleanup. * Fix bug in which we were only killing one worker when a driver exited. * Fix remove driver test. * Fix and augment test.	2017-04-26 15:13:39 -07:00
Robert Nishihara	0ac125e9b2	Clean up when a driver disconnects. (#462) * Clean up state when drivers exit. * Remove unnecessary field in ActorMapEntry struct. * Have monitor release GPU resources in Redis when driver exits. * Enable multiple drivers in multi-node tests and test driver cleanup. * Make redis GPU allocation a redis transaction and small cleanups. * Fix multi-node test. * Small cleanups. * Make global scheduler take node_ip_address so it appears in the right place in the client table. * Cleanups. * Fix linting and cleanups in local scheduler. * Fix removed_driver_test. * Fix bug related to vector -> list. * Fix linting. * Cleanup. * Fix multi node tests. * Fix jenkins tests. * Add another multi node test with many drivers. * Fix linting. * Make the actor creation notification a flatbuffer message. * Revert "Make the actor creation notification a flatbuffer message." This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39. * Add comment explaining flatbuffer problems.	2017-04-24 18:10:21 -07:00
Philipp Moritz	8ac6c59931	Remove n^2 algorithm in plasma get (#466) Remove n^2 algorithm in plasma get.	2017-04-17 23:37:33 -07:00
Robert Nishihara	c802e51d36	Re-enable recursive remote functions in a limited form. (#453) * Re-enable recursive remote functions in a limited form. * Fix linting.	2017-04-13 01:47:33 -07:00
Richard Liaw	c3a2505ffd	Loadbalancing Test issue (#452) * Limiting number of CPUs in loadbalancing test * fixes as requested	2017-04-11 22:33:58 -07:00
Robert Nishihara	f4c1adae17	Unify function signature handling between remote functions and actor … (#441) * Unify function signature handling between remote functions and actor methods. * Fixes. * Fix tests.	2017-04-08 21:34:13 -07:00
Robert Nishihara	7cd00741b1	Suppress irrelevant Redis connection errors. (#434) * Suppress error messages in worker import thread when Redis terminates. * Suppress some warnings from one of the tests.	2017-04-07 23:19:24 -07:00
Robert Nishihara	0eac3ccdd0	Reduce verbosity of component_failures_test.py. (#440)	2017-04-07 23:05:29 -07:00
Robert Nishihara	7af6f462fb	Add API for querying global control state. (#431) * Add API for querying global control state. * Fix linting. * Fix errors in Python 2. * Fix bug in test. * Fix bug in test.	2017-04-06 23:51:12 -07:00
Robert Nishihara	320109a5bd	By default, start a number of workers equal to the number of CPUs. (#430) * By default, start a number of workers equal to the number of CPUs. * Fix stress tests.	2017-04-06 00:02:58 -07:00
Robert Nishihara	fa363a5a3a	Notify driver when a worker dies while executing a task. (#419) * Notify driver when a worker dies while executing a task. * Fix linting. * Don't push error when local scheduler is cleaning up.	2017-04-06 00:02:39 -07:00
Philipp Moritz	4043769ba2	Make putting large objects work. (#411) * putting large objects * add more checks * support large objects * fix test * fix linting * upgrade to latest arrow version * check malloc return code * print mmap file sizes * printing * revert to dlmalloc * add prints * more prints * add printing * printing * fix * update * fix * update * print * initialization * temp * fix * update * fix linting * comment out object_store_full tests * fix test * fix test * evict objects if dlmalloc fails * fix stresstests * Fix linting. * Uncomment large-memory tests. * Increase memory for docker image for jenkins tests. * Reduce large memory tests. * Further reduce large memory tests.	2017-04-05 01:04:05 -07:00
Robert Nishihara	ba02fc0eb0	Run flake8 in Travis and make code PEP8 compliant. (#387)	2017-03-21 12:57:54 -07:00
Stephanie Wang	083e7a28ad	Push an error to the driver when the workload hangs on `ray.put` reconstruction (#382) * Fix worker blocked bug * tmp * Push an error to the driver on ray.put for non-driver tasks * Fix result table tests * Fix test, logging * Address comments * Fix suppression bug * Fix redis module test * Edit error message * Get values in chunks during reconstruction * Test case for driver ray.put errors * Error for evicting ray.put objects from the driver * Fix tests * Reduce verbosity * Documentation	2017-03-21 00:16:48 -07:00
Johann Schleier-Smith	29c8471fd4	Add multinode tests by simulating multiple nodes using Docker. (#378) * run test workloads for a Docker cluster * better manage docker image versions * Changes to make multinode docker tests work with Python 3. * option to mount local test directory on head node to speed development * Attempt to simplify multinode test setup. * Small change. * Add in development-mode to run multinode docker tests more easily during development. * add jenkins test script that links to Docker hash * Read docker SHA from build_docker.sh and add test that should fail. * Consolidate implementations and remove duplicate files. * Allow test to retry if it fails to schedule on all nodes. * Remove sleep when in docker multinode tests.	2017-03-18 23:44:54 -07:00
Stephanie Wang	12c9618c0c	Plasma and worker node failure. (#373) * Failing test case * Local scheduler exits cleanly after plasma store dies * Tolerate one plasma store failure * Tolerate plasma store failures on all nodes except head node * Plasma manager heartbeats * Component failure tests * Don't run the helper for Python testing * Fix C test * Fix hanging plasma transfer test * Fix python3 * Consolidate ClientConnection code * Fix valgrind test * fix c test * We can restart worker nodes! * Fix flatbuffers bug * Address comments * Only register actual workers with the local scheduler * Fix bug * Fix segfaults * Add test case that tests for driver liveness, fix local scheduler bug * Clean up after tests * Allocate retry info on the stack * Send SIGKILL before waiting * Relax unit test conditions * Driver liveness test case and documentation	2017-03-17 17:03:58 -07:00
Robert Nishihara	6b1e8caf2d	Reduce stress_test verbosity. (#377)	2017-03-16 20:10:56 -07:00
Philipp Moritz	068429ffd8	Convert local scheduler messages to flatbuffers (#340) * use flatbuffer messages for local scheduler * make sure constructor gets called for C++ object ObjectInfoT * fix typo * fix Robert's comments * Small change to actor test. * fix valgrind error * linting * free notification * fix * valgrind * fix valgrind * fix other bugs * valgrind fix * fixes * more fixes * Small changes to comments.	2017-03-15 16:27:52 -07:00
Robert Nishihara	3b7788bf88	Disallow calling ray.put on an object ID. (#353)	2017-03-11 12:09:28 -08:00
Robert Nishihara	53dffe0bf2	Use flatbuffers for some messages from Redis. (#341) * Compile the Ray redis module with C++. * Redo parsing of object table notifications with flatbuffers. * Update redis module python tests. * Redo parsing of task table notifications with flatbuffers. * Fix linting. * Redo parsing of db client notifications with flatbuffers. * Redo publishing of local scheduler heartbeats with flatbuffers. * Fix linting. * Remove usage of fixed-width formatting of scheduling state in channel name. * Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion. * Fix linting and tests. * fix * cleanup * simplify logic in ReplyWithTask	2017-03-10 18:35:25 -08:00
Wapaul1	c66178bcd7	Resnet Adapted to Ray (#229) * Initial conversion * Further changes * fixes * some changes * Fixes * Added data pipeline * Added updates to cifar * Currently borken need sep pr * Added test for retriving variables from an optimizer * Removed FlAG ref in environment variables * Added comments to test * Addressed comments * Added updates * Made further changes for tfutils * Fixed finalized bug * Removed ipython * Added accuracy printing * Temp commit * added fixes * changes * Added writing to file * Fixes for gpus * Cleaned up code * Temp commit * Gpu support fully implemented * Updated to use num_gpus for actors * Finished testing gpus implementation * Changed to be more in line with origin implementation * Updated test to use actors * Added support for cpu only systems * Now works with no cpus * Minor changes and some documentation.	2017-03-07 01:07:32 -08:00
Stephanie Wang	da06b4db82	Warn the user when a nondeterministic task is detected. (#339) * WARN instead of FATAL for object hash mismatches, push error to driver * Document the callback signature for object_table_add/remove * Error table * Wait for all errors in python test * Fix doc * Fix state test	2017-03-07 00:32:15 -08:00
Stephanie Wang	41b8675d04	Availability after local scheduler failure (#329) * Clean up plasma subscribers on EPIPE First pass at a monitoring script - monitor can detect local scheduler death Clean up task table upon local scheduler death in monitoring script Don't schedule to dead local schedulers in global scheduler Have global scheduler update the db clients table, monitor script cleans up state Documentation Monitor script should scan tables before beginning to read from subscription channel Fix for python3 Redirect monitor output to redis logs, fix hanging in multinode tests * Publish auxiliary addresses as part of db_client deletion notifications * Fix test case? * Small changes. * Use SCAN instead of KEYS * Address comments * Address more comments * Free redis module strings	2017-03-02 19:51:20 -08:00
Robert Nishihara	39b7abefc5	Fix test failures in actor_test.py. (#317)	2017-03-01 23:26:39 -08:00
Stephanie Wang	be1618f041	Availability after worker failure (#316) * Availability after a killed worker * Workers exit cleanly * Memory cleanup in photon C tests * Worker failure in multinode * Consolidate worker cleanup handlers * Update the result table before handling a task submission * KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS * Log a warning instead of crashing if no result table entry found	2017-02-25 20:19:36 -08:00
Robert Nishihara	54238c4ad0	Propagate errors from importing actors. (#309) * Propagate errors from importing actors. * Fix bug.	2017-02-22 15:15:45 -08:00
Robert Nishihara	e399f57e6b	Let actors use GPUs. (#302) * Add num_cpus and num_gpus to actor decorator. * Assign GPU IDs to actors. * Add additional actor test. * Remove duplicated line. * Factor out local scheduler selection method. * Add test and simplify local scheduler selection.	2017-02-21 01:13:04 -08:00
Stephanie Wang	334aed9fa9	Fetch the object after requesting reconstruction during ray.get (#301) * Fetch the object after requesting reconstruction during ray.get * revert * Fix documentation and memory leak * Fix hanging reconstruction bug * Fix for python3	2017-02-20 21:41:34 -08:00
Robert Nishihara	abd9987e3b	Fix unreliable actor test. (#295)	2017-02-18 00:51:08 -08:00
Stephanie Wang	a0dd3a44c0	Dynamically grow worker pool to partially solve hanging workloads (#286) * First pass at a policy to solve deadlock * Address Robert's comments * stress test * unit test * Fix test cases * Fix test for python3 * add more logging * White space.	2017-02-17 17:08:52 -08:00
Robert Nishihara	88a5b4e77b	Simplify imports and exports and provide driver isolation for remote functions. (#288) * Remove import counter and export counter. * Provide isolation between drivers for remote functions. * Add test for driver function isolation. * Hash source code into function ID to reduce likelihood of collisions. * Fix failure test example. * Replace assertTrue with assertIn to improve failure messages in tests. * Fix failure test.	2017-02-16 11:30:35 -08:00
Philipp Moritz	12a68e84d2	Implement a first pass at actors in the API. (#242) * Implement actor field for tasks * Implement actor management in local scheduler. * initial python frontend for actors * import actors on worker * IPython code completion and tests * prepare creating actors through local schedulers * add actor id to PyTask * submit actor calls to local scheduler * starting to integrate * simple fix * Fixes from rebasing. * more work on python actors * Improve local scheduler actor handlers. * Pass actor ID to local scheduler when connecting a client. * first working version of actors * fixing actors * fix creating two copies of the same actor * fix actors * remove sleep * get rid of export synchronization * update * insert actor methods into the queue in the right order * remove print statements * make it compile again after rebase * Minor updates. * fix python actor ids * Pass actor_id to start_worker. * add test * Minor changes. * Update actor tests. * Temporary plan for import counter. * Temporarily fix import counters. * Fix some tests. * Fixes. * Make actor creation non-blocking. * Fix test? * Fix actors on Python 2. * fix rare case. * Fix python 2 test. * More tests. * Small fixes. * Linting. * Revert tensorflow version to 0.12.0 temporarily. * Small fix. * Enhance inheritance test.	2017-02-15 00:10:05 -08:00
Robert Nishihara	072eadd57f	Pipe num_cpus and num_gpus through from start_ray.py. (#275) * Pipe num_cpus and num_gpus through from start_ray.py. * Improve load balancing tests. * Fix bug. * Factor out some testing code.	2017-02-13 17:43:23 -08:00
Robert Nishihara	f6ce9dfa6c	Allow start_ray.sh to take an object manager port. (#272) * Allow start_ray.sh to take a object manager port. * Fix typo and add test. * Small cleanups.	2017-02-12 12:39:32 -08:00
Stephanie Wang	2b8e6485e3	Start and clean up workers from the local scheduler. (#250) * Start and clean up workers from the local scheduler Ability to kill workers in photon scheduler Test for old method of starting workers Common codepath for killing workers Common codepath for killing workers Photon test case for starting and killing workers fix build Fix component failure test Register a worker's pid as part of initial connection Address comments and revert photon_connect Set PATH during travis install Fix * Fix photon test case to accept clients on plasma manager fd	2017-02-10 12:46:23 -08:00
Robert Nishihara	ec175b7dfb	Check if processes are alive in test. (#261)	2017-02-09 23:40:39 -08:00

358 commits