Commit graph

17 commits

Author SHA1 Message Date
Stephanie Wang
be1618f041 Availability after worker failure (#316)
* Availability after a killed worker

* Workers exit cleanly

* Memory cleanup in photon C tests

* Worker failure in multinode

* Consolidate worker cleanup handlers

* Update the result table before handling a task submission

* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS

* Log a warning instead of crashing if no result table entry found
2017-02-25 20:19:36 -08:00
Stephanie Wang
a0dd3a44c0 Dynamically grow worker pool to partially solve hanging workloads (#286)
* First pass at a policy to solve deadlock

* Address Robert's comments

* stress test

* unit test

* Fix test cases

* Fix test for python3

* add more logging

* White space.
2017-02-17 17:08:52 -08:00
Philipp Moritz
12a68e84d2 Implement a first pass at actors in the API. (#242)
* Implement actor field for tasks

* Implement actor management in local scheduler.

* initial python frontend for actors

* import actors on worker

* IPython code completion and tests

* prepare creating actors through local schedulers

* add actor id to PyTask

* submit actor calls to local scheduler

* starting to integrate

* simple fix

* Fixes from rebasing.

* more work on python actors

* Improve local scheduler actor handlers.

* Pass actor ID to local scheduler when connecting a client.

* first working version of actors

* fixing actors

* fix creating two copies of the same actor

* fix actors

* remove sleep

* get rid of export synchronization

* update

* insert actor methods into the queue in the right order

* remove print statements

* make it compile again after rebase

* Minor updates.

* fix python actor ids

* Pass actor_id to start_worker.

* add test

* Minor changes.

* Update actor tests.

* Temporary plan for import counter.

* Temporarily fix import counters.

* Fix some tests.

* Fixes.

* Make actor creation non-blocking.

* Fix test?

* Fix actors on Python 2.

* fix rare case.

* Fix python 2 test.

* More tests.

* Small fixes.

* Linting.

* Revert tensorflow version to 0.12.0 temporarily.

* Small fix.

* Enhance inheritance test.
2017-02-15 00:10:05 -08:00
Stephanie Wang
2b8e6485e3 Start and clean up workers from the local scheduler. (#250)
* Start and clean up workers from the local scheduler

Ability to kill workers in photon scheduler

Test for old method of starting workers

Common codepath for killing workers

Common codepath for killing workers

Photon test case for starting and killing workers

fix build

Fix component failure test

Register a worker's pid as part of initial connection

Address comments and revert photon_connect

Set PATH during travis install

Fix

* Fix photon test case to accept clients on plasma manager fd
2017-02-10 12:46:23 -08:00
Alexey Tumanov
dfb6107b22 General attribute-based heterogeneity support with hard and soft constraints (#248)
* attribute-based heterogeneity-awareness in global scheduler and photon

* minor post-rebase fix

* photon: enforce dynamic capacity constraint on task dispatch

* globalsched: cap the number of times we try to schedule a task in round robin

* propagating ability to specify resource capacity to ray.init

* adding resources to remote function export and fetch/register

* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)

* Add some integration tests.

* globalsched: cleanup + factor out constraint checking

* lots of style

* task_spec_required_resource: global refactor

* clang format

* clang format + comment update in photon

* clang format photon comment

* valgrind

* reduce verbosity for Travis

* Add test for scheduler load balancing.

* addressing comments

* refactoring global scheduler algorithm

* Minor cleanups.

* Linting.

* Fix array_test.py and linting.

* valgrind fix for photon tests

* Attempt to fix stress tests.

* fix hashmap free

* fix hashmap free comment

* memset photon resource vectors to 0 in case they get used before the first heartbeat

* More whitespace changes.

* Undo whitespace error I introduced.
2017-02-09 01:34:14 -08:00
Robert Nishihara
2d1c980ad7 Refactor local scheduler to remove worker indices. (#245)
* Refactor local scheduler to remove worker indices.

* Change scheduling state enum to int in all function signatures.

* Bug fix, don't use pointers into a resizable array.

* Remove total_num_workers.

* Fix tests.
2017-02-05 14:52:28 -08:00
Robert Nishihara
6703f7be6f Provide functionality for local scheduler to start new workers. (#230)
* Provide functionality for local scheduler to start new workers.

* Pass full command for starting new worker in to local scheduler.

* Separate out configuration state of local scheduler.
2017-01-27 01:28:48 -08:00
Stephanie Wang
f1987cdc16 Split local scheduler task queue (#211)
* Split local scheduler task queue into waiting and dispatch queue

* Fix memory leak

* Add a new task scheduling status for when a task has been queued locally

* Fix global scheduler test case and add task status doc

* Documentation

* Address Philipp's comments

* Move tasks back to the waiting queue if their dependencies become unavailable

* Update existing task table entries instead of overwriting
2017-01-18 20:27:40 -08:00
Robert Nishihara
acf1703afd Implement naive scheduling algorithm using local scheduler load. (#164)
* Implement naive scheduling algorithm using local scheduler load.

* Have the global scheduler estimate load on local schedulers better.

* Fixes.
2016-12-28 22:33:20 -08:00
Robert Nishihara
3d697c7ed2 Introduce local scheduler heartbeats which carry load information. (#155)
* Introduce local scheduler heartbeats which carry load information.
2016-12-24 20:02:25 -08:00
Robert Nishihara
6cd02d71f8 Fixes and cleanups for the multinode setting. (#143)
* Add function for driver to get address info from Redis.

* Use Redis address instead of Redis port.

* Configure Redis to run in unprotected mode.

* Add method for starting Ray processes on non-head node.

* Pass in correct node ip address to start_plasma_manager.

* Script for starting Ray processes.

* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.

* Have driver get info from Redis when start_ray_local=False.

* Fix.

* Script for killing ray processes.

* Catch some errors when the main_loop in a worker throws an exception.

* Allow redirecting stdout and stderr to /dev/null.

* Wrap start_ray.py in a shell script.

* More helpful error messages.

* Fixes.

* Wait for redis server to start up before configuring it.

* Allow seeding of deterministic object ID generation.

* Small change.
2016-12-21 18:53:12 -08:00
Robert Nishihara
c9c1b3e6af Change db_connect to allow different arguments from different processes. (#142)
* Allow db_connect to take a variable number of arguments.

* Fix tests.

* Fixes.

* Formatting.

* Fixes.

* Simplifications.

* Fix typo.
2016-12-20 20:21:35 -08:00
Alexey Tumanov
946242929f Plasma photon association: passing through plasma address with photon db connection (#123)
* passing plasma ip:port association with photon through redis to global scheduler

* Fix test.

* sanity-checking aux_address inside db_connect_extended

* clang format

* fix photon tests

* clang format photon tests
2016-12-13 17:21:38 -08:00
Stephanie Wang
4bdb9f7224 Object reconstruction in Photon (#65)
* Object reconstruction in Photon and C test cases for Photon

* Fix hanging test case on mac

* Remove unnecessary event from photon tests

* make photon_disconnect not leak file descriptors

* fix some of the memory errors

* Fix valgrind

* lint

* Address Robert's comments and add test case for object reconstruction suppression

* Remove OWNER
2016-12-12 23:17:22 -08:00
Robert Nishihara
35b9dedb48 Remove scheduler_info. (#84) 2016-12-04 15:51:03 -08:00
Robert Nishihara
d77b685a90 Global scheduler skeleton (#45)
* Initial scheduler commit

* global scheduler

* add global scheduler

* Implement global scheduler skeleton.

* Formatting.

* Allow local scheduler to be started without a connection to redis so that we can test it without a global scheduler.

* Fail if there are no local schedulers when the global scheduler receives a task.

* Initialize uninitialized value and formatting fix.

* Generalize local scheduler table to db client table.

* Remove code duplication in local scheduler and add flag for whether a task came from the global scheduler or not.

* Queue task specs in the local scheduler instead of tasks.

* Simple global scheduler tests, including valgrind.

* Factor out functions for starting processes.

* Fixes.
2016-11-18 19:57:51 -08:00
Robert Nishihara
ad55166472 Rearrange local scheduler files to prepare to merge into Ray. 2016-10-25 14:16:23 -07:00
Renamed from photon_scheduler.h (Browse further)