* Implement sharding in the Ray core (a shard-selection sketch follows this list)
* Single node Python modifications to do sharding
* Do the sharding in redis.cc
* Pipe num_redis_shards through start_ray.py and worker.py.
* Use multiple redis shards in multinode tests.
* First steps for sharding ray.global_state
* Fix problem in multinode Docker test.
* Fix runtest.py
* Fix some tests
* Fix Redis shard startup
* Fix Redis sharding
* Fix
* Fix a bug introduced by the map iterator being consumed
* Fix sharding bug
* Shard the event table
* Update the number of Redis clients to 64K
* Fix object table tests by flushing shards in between unit tests
* Fix local scheduler tests
* Documentation
* Register shard locations in the primary shard
* Add plasma unit tests back to build
* Lint
* Lint and fix build
* Fix
* Address Robert's comments
* Refactor start_ray_processes to start Redis shard
* Lint
* Fix global scheduler Python tests
* Fix Redis module test
* Fix plasma test
* Fix component failure test
* Fix local scheduler test
* Fix runtest.py
* Fix global scheduler test for Python 3
* Fix task_table_test_and_update bug caused by an actor task table submission race
* Fix Jenkins tests.
* Retry Redis shard connections
* Fix test cases
* Convert database clients to DBClient struct
* Fix race condition when subscribing to db client table
* Remove unused lines, add APITest for sharded Ray
* Fix
* Fix memory leak
* Suppress ReconstructionTests output
* Suppress output for APITestSharded
* Reissue task table add/update commands if initial command does not publish to any subscribers.
* Fix
* Fix linting.
* Fix tests
* Fix linting
* Fix Python test
* Fix linting
* Failing test case
* Local scheduler exits cleanly after plasma store dies
* Tolerate one plasma store failure
* Tolerate plasma store failures on all nodes except head node
* Plasma manager heartbeats
* Component failure tests
* Don't run the helper for Python testing
* Fix C test
* Fix hanging plasma transfer test
* Fix Python 3 compatibility
* Consolidate ClientConnection code
* Fix valgrind test
* Fix C test
* We can restart worker nodes!
* Fix flatbuffers bug
* Address comments
* Only register actual workers with the local scheduler
* Fix bug
* Fix segfaults
* Add test case that tests for driver liveness, fix local scheduler bug
* Clean up after tests
* Allocate retry info on the stack
* Send SIGKILL before waiting
* Relax unit test conditions
* Driver liveness test case and documentation
* Start a process for monitoring log files and push changes to Redis.
* Display log files in UI.
* Bug fix for recent tasks.
* Use flatbuffers to parse local scheduler heartbeats.
* Compile the Ray redis module with C++.
* Redo parsing of object table notifications with flatbuffers.
* Update redis module python tests.
* Redo parsing of task table notifications with flatbuffers.
* Fix linting.
* Redo parsing of db client notifications with flatbuffers.
* Redo publishing of local scheduler heartbeats with flatbuffers.
* Fix linting.
* Remove usage of fixed-width formatting of scheduling state in channel name.
* Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion.
* Fix linting and tests.
* Fix.
* Cleanup.
* Simplify logic in ReplyWithTask.
* Clean up plasma subscribers on EPIPE
* First pass at a monitoring script: the monitor can detect local scheduler death
* Clean up the task table upon local scheduler death in the monitoring script
* Don't schedule to dead local schedulers in the global scheduler
* Have the global scheduler update the db clients table; the monitor script cleans up state
* Documentation
* Monitor script should scan tables before beginning to read from the subscription channel
* Fix for Python 3
* Redirect monitor output to Redis logs; fix hanging in multinode tests
* Publish auxiliary addresses as part of db_client deletion notifications
* Fix test case?
* Small changes.
* Use SCAN instead of KEYS (see the sketch after this list)
* Address comments
* Address more comments
* Free redis module strings
* Attempt to start web UI when starting Ray.
* Add instructions for using web UI to cluster documentation.
* Don't check if port 8080 is open.
* Remove print statement.
* Add function for driver to get address info from Redis.
* Use Redis address instead of Redis port.
* Configure Redis to run in unprotected mode.
* Add method for starting Ray processes on non-head node.
* Pass in correct node ip address to start_plasma_manager.
* Script for starting Ray processes.
* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.
* Have driver get info from Redis when start_ray_local=False.
* Fix.
* Script for killing ray processes.
* Catch some errors when the main_loop in a worker throws an exception.
* Allow redirecting stdout and stderr to /dev/null.
* Wrap start_ray.py in a shell script.
* More helpful error messages.
* Fixes.
* Wait for the Redis server to start up before configuring it.
* Allow seeding of deterministic object ID generation.
* Small change.
* Update worker code and services code to use plasma and the local scheduler.
* Cleanups.
* Fix bug in which threads were started before the worker mode was set. This caused remote functions to be defined on workers before the worker knew it was in WORKER_MODE.
* Fix bug in install-dependencies.sh.
* Lengthen timeout in failure_test.py.
* Cleanups.
* Cleanup services.start_ray_local.
* Clean up random name generation.
* Cleanups.
* Make ray.init set the Python paths of workers.
* Decouple starting the cluster from copying user source code.
* Also add the current directory to the path.
* Add comments about deallocation.
* Add test for new code path.
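
The sharding work at the top of this list splits Ray's global control state across multiple Redis shards, with shard locations registered in the primary shard. As a rough illustration only (not Ray's actual implementation), the sketch below shows deterministic key-to-shard assignment; the shard addresses, key names, and CRC32-based hash are hypothetical stand-ins.

```python
import zlib

import redis

# Hypothetical shard addresses; in Ray these are looked up from the primary shard.
SHARD_ADDRESSES = [("127.0.0.1", 6380), ("127.0.0.1", 6381)]
shards = [redis.StrictRedis(host=host, port=port) for host, port in SHARD_ADDRESSES]


def shard_for_key(key):
    """Map a key to a shard deterministically so every process agrees."""
    # CRC32 is a stand-in for whatever hash the Redis module actually uses.
    return shards[zlib.crc32(key) % len(shards)]


# All reads and writes for the same key go to the same shard.
shard_for_key(b"object:1234").set(b"object:1234", b"manager-address")
```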
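One item above replaces KEYS with SCAN when walking the Redis keyspace: KEYS blocks the single-threaded Redis server while it matches every key, whereas SCAN iterates incrementally with a cursor. A minimal redis-py sketch, with a made-up key pattern:

```python
import redis

client = redis.StrictRedis(host="127.0.0.1", port=6379)

# Blocking variant: client.keys(b"TASK*") walks the whole keyspace at once.
# Incremental variant: SCAN returns keys in small batches via a cursor.
task_keys = [key for key in client.scan_iter(match=b"TASK*", count=100)]
```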