Commit graph

7 commits

Author SHA1 Message Date
Robert Nishihara
0eac3ccdd0 Reduce verbosity of component_failures_test.py. (#440) 2017-04-07 23:05:29 -07:00
Robert Nishihara
ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Stephanie Wang
be1618f041 Availability after worker failure (#316)
* Availability after a killed worker

* Workers exit cleanly

* Memory cleanup in photon C tests

* Worker failure in multinode

* Consolidate worker cleanup handlers

* Update the result table before handling a task submission

* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS

* Log a warning instead of crashing if no result table entry found
2017-02-25 20:19:36 -08:00
Stephanie Wang
2b8e6485e3 Start and clean up workers from the local scheduler. (#250)
* Start and clean up workers from the local scheduler

Ability to kill workers in photon scheduler

Test for old method of starting workers

Common codepath for killing workers

Common codepath for killing workers

Photon test case for starting and killing workers

fix build

Fix component failure test

Register a worker's pid as part of initial connection

Address comments and revert photon_connect

Set PATH during travis install

Fix

* Fix photon test case to accept clients on plasma manager fd
2017-02-10 12:46:23 -08:00
Robert Nishihara
303d0fed3e Prevent plasma store and manager from dying when a client dies. (#203)
* Prevent plasma store and manager from dying when a worker dies.

* Check errno inside of warn_if_sigpipe. Passing in errno doesn't work because the arguments to warn_if_sigpipe can be evaluated out of order.
2017-01-17 20:34:31 -08:00