54 commits

Author SHA1 Message Date
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Robert Nishihara
f1d4dda8cb Put all log files in redis and visualize them in UI. (#350)
* Start process for monitoring log files and push changes to redis.

* Display log files in UI.

* Bug fix for recent tasks.

* Use flatbuffers to parse local scheduler heartbeats.
2017-03-16 15:27:00 -07:00
Robert Nishihara
53dffe0bf2 Use flatbuffers for some messages from Redis. (#341)
* Compile the Ray redis module with C++.

* Redo parsing of object table notifications with flatbuffers.

* Update redis module python tests.

* Redo parsing of task table notifications with flatbuffers.

* Fix linting.

* Redo parsing of db client notifications with flatbuffers.

* Redo publishing of local scheduler heartbeats with flatbuffers.

* Fix linting.

* Remove usage of fixed-width formatting of scheduling state in channel name.

* Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion.

* Fix linting and tests.

* fix

* cleanup

* simplify logic in ReplyWithTask
2017-03-10 18:35:25 -08:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Robert Nishihara
1ae7e7d29e Rename photon -> local scheduler. (#322) 2017-02-27 12:24:07 -08:00
Robert Nishihara
072eadd57f Pipe num_cpus and num_gpus through from start_ray.py. (#275)
* Pipe num_cpus and num_gpus through from start_ray.py.

* Improve load balancing tests.

* Fix bug.

* Factor out some testing code.
2017-02-13 17:43:23 -08:00
Robert Nishihara
3934d5f6eb Remove old files and remove old documentation for copying files around cluster. (#274) 2017-02-13 11:20:04 -08:00
Robert Nishihara
cb7f6ca9b5 Attempt to start web UI when starting Ray. (#269)
* Attempt to start web UI when starting Ray.

* Add instructions for using web UI to cluster documentation.

* Don't check if port 8080 is open.

* Remove print statement.
2017-02-12 15:17:58 -08:00
Robert Nishihara
f6ce9dfa6c Allow start_ray.sh to take an object manager port. (#272)
* Allow start_ray.sh to take an object manager port.

* Fix typo and add test.

* Small cleanups.
2017-02-12 12:39:32 -08:00
Johann Schleier-Smith
6ad2b5d87a Add Redis port option to startup script (#232)
* specify redis address when starting head

* cleanup

* update starting cluster documentation

* Whitespace.

* Address Philipp's comments.

* Change redis_host -> redis_ip_address.
2017-01-31 00:28:00 -08:00
Richard Liaw
4575cd88b2 Improve error messages when nodes can't communicate with each other. (#223)
* Good error messages when nodes can't communicate with each other

* Print more information when starting the head node.

* Change retries back to 5.
2017-01-22 14:53:15 -08:00
Robert Nishihara
9bb8162621 Improvements to documentation and error messages. (#221) 2017-01-19 20:27:46 -08:00
Robert Nishihara
84296c8905 Documentation for using Ray on a cluster. (#165) 2016-12-30 00:29:03 -08:00
Robert Nishihara
241c955707 Determine node IP address programmatically. (#151)
* Determine node IP address programmatically.

* Factor out methods for getting node IP addresses.

* Address comments.
2016-12-23 15:31:40 -08:00
Robert Nishihara
92010ca5b5 Check that we can connect to Redis and that there aren't existing redis clients on the same node in start_ray.py (#148) 2016-12-22 21:54:19 -08:00
Robert Nishihara
6cd02d71f8 Fixes and cleanups for the multinode setting. (#143)
* Add function for driver to get address info from Redis.

* Use Redis address instead of Redis port.

* Configure Redis to run in unprotected mode.

* Add method for starting Ray processes on non-head node.

* Pass in correct node ip address to start_plasma_manager.

* Script for starting Ray processes.

* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.

* Have driver get info from Redis when start_ray_local=False.

* Fix.

* Script for killing ray processes.

* Catch some errors when the main_loop in a worker throws an exception.

* Allow redirecting stdout and stderr to /dev/null.

* Wrap start_ray.py in a shell script.

* More helpful error messages.

* Fixes.

* Wait for redis server to start up before configuring it.

* Allow seeding of deterministic object ID generation.

* Small change.
2016-12-21 18:53:12 -08:00
Robert Nishihara
ddba1df802 Start working toward Python3 compatibility. (#117) 2016-12-11 12:25:31 -08:00
Robert Nishihara
072f442c1f Update worker.py and services.py to use plasma and the local scheduler. (#19)
* Update worker code and services code to use plasma and the local scheduler.

* Cleanups.

* Fix bug in which threads were started before the worker mode was set. This caused remote functions to be defined on workers before the worker knew it was in WORKER_MODE.

* Fix bug in install-dependencies.sh.

* Lengthen timeout in failure_test.py.

* Cleanups.

* Cleanup services.start_ray_local.

* Clean up random name generation.

* Cleanups.
2016-11-02 00:39:35 -07:00
Robert Nishihara
6ed641177d Remove unnecessary files. (#4) 2016-10-26 23:24:40 -07:00
Robert Nishihara
91f16a3df0 Migrate repositories to ray-project. (#438)
* Migrate repositories to ray-project.

* Update numbuf to the migrated version.
2016-09-17 00:52:05 -07:00
Robert Nishihara
e06311d415 Automatically add relevant directories to Python paths of workers (#380)
* Make ray.init set python paths of workers.

* Decouple starting cluster from copying user source code

* also add current directory to path

* Add comments about deallocation.

* Add test for new code path.
2016-08-16 14:53:55 -07:00
Robert Nishihara
13df8302e6 enable running example apps in cluster mode (#357) 2016-08-08 16:01:13 -07:00
Robert Nishihara
a6452aca47 Command for installing example applications dependencies on cluster (#353) 2016-08-05 14:54:32 -07:00
Robert Nishihara
1454c26693 fix bug with home directory on cluster (#352) 2016-08-05 11:49:11 -07:00
Robert Nishihara
ac363bf451 Let worker get worker address and object store address from scheduler (#350) 2016-08-04 17:47:08 -07:00
Johann Schleier-Smith
3ee0fd8f34 Update cluster guide (#347)
* clarify cluster setup instructions

* update multinode documentation, update cluster script, fix minor bug in worker.py

* clarify cluster documentation and fix update_user_code
2016-08-04 09:14:20 -07:00
Robert Nishihara
2040372084 unify starting local cluster with attaching to existing cluster (#327) 2016-07-31 19:26:35 -07:00
Robert Nishihara
bcd0e3781f remove example functions and remove imports from shell (#314) 2016-07-29 12:42:44 -07:00
Philipp Moritz
b5215f1e6a make it possible to use directory as user source directory that doesn't contain worker.py (#297) 2016-07-26 18:39:06 -07:00
Robert Nishihara
aa2f618ab7 add directory containing script to python path of workers (#296) 2016-07-26 16:18:39 -07:00
Robert Nishihara
3bae6f136b export remote functions and reusable variables that were defined before connect was called (#292) 2016-07-26 11:40:09 -07:00
Robert Nishihara
8465df1146 script for launching nodes on ec2 (#270)
* original spark-ec2 script

* modifying spark-ec2 for ray
2016-07-16 15:14:14 -07:00
mehrdadn
0f1d7c5835 Run IPython shell without embedding (#269) 2016-07-16 14:42:58 -07:00
Robert Nishihara
80526f7777 add documentation and refactor cluster.py (#238) 2016-07-12 23:54:18 -07:00
Robert Nishihara
8952ff8cf9 allow cluster script to update worker code on nodes (#243) 2016-07-11 17:58:16 -07:00
Robert Nishihara
e1a74eadbe remove installation of dependencies from setup script (#239) 2016-07-08 20:03:21 -07:00
Robert Nishihara
5dd411546d clean up imports (#230) 2016-07-08 12:46:47 -07:00
Robert Nishihara
875b20e397 only run cleanup if we've started ray in local mode and actually started the processes (#228) 2016-07-08 00:14:26 -07:00
Robert Nishihara
8e6b7929d6 make services.cleanup happen automatically (#224) 2016-07-07 14:05:25 -07:00
Robert Nishihara
5873831c21 basic tutorials (#204) 2016-07-06 13:51:32 -07:00
Robert Nishihara
0947024ad9 fix bug for functions with no return values and with one return value (#211) 2016-07-05 15:57:05 -07:00
Robert Nishihara
529e86ce64 add example functions to default worker (#210) 2016-07-05 14:39:42 -07:00
Robert Nishihara
0ffe657e27 enable restarting workers in singlenode case, plus cleanups to cluster.py (#190) 2016-07-01 14:10:51 -07:00
Robert Nishihara
7611fbce4d fixes to shell.py (#195) 2016-06-30 22:57:29 -07:00
Robert Nishihara
ad35da08f3 fix (#188) 2016-06-30 13:26:06 -07:00
Philipp Moritz
8d70dd15df Fix imports for default_worker and set SHELL_MODE for shell 2016-06-27 17:23:01 -07:00
Robert Nishihara
731280fd75 Merge pull request #177 from amplab/newshell
Implement launching cluster with shell
2016-06-27 16:38:01 -07:00
Philipp Moritz
a0df13b14f Implement launching cluster with shell 2016-06-27 16:33:12 -07:00
Philipp Moritz
7af0f1b221 Write computation graph to file 2016-06-27 12:20:30 -07:00
Robert Nishihara
902cac3089 arrays -> array (#172) 2016-06-27 11:35:31 -07:00