Commit graph

32 commits

Author SHA1 Message Date
Robert Nishihara
232601f90d Change all table calls to use default retry behavior. (#312)
* Change all table calls to use default retry behavior and change default retry behavior.

* Add warning for table retries.
2017-02-24 12:41:32 -08:00
Alexey Tumanov
3159a78ad7 terminate photon task dispatch early when workers or resources are unavailable (#311)
* terminate photon task dispatch early when no workers or resources available

* style
2017-02-23 00:05:16 -08:00
Stephanie Wang
a0dd3a44c0 Dynamically grow worker pool to partially solve hanging workloads (#286)
* First pass at a policy to solve deadlock

* Address Robert's comments

* stress test

* unit test

* Fix test cases

* Fix test for python3

* add more logging

* White space.
2017-02-17 17:08:52 -08:00
Robert Nishihara
88a5b4e77b Simplify imports and exports and provide driver isolation for remote functions. (#288)
* Remove import counter and export counter.

* Provide isolation between drivers for remote functions.

* Add test for driver function isolation.

* Hash source code into function ID to reduce likelihood of collisions.

* Fix failure test example.

* Replace assertTrue with assertIn to improve failure messages in tests.

* Fix failure test.
2017-02-16 11:30:35 -08:00
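The "Hash source code into function ID" item in the commit above can be pictured with a small sketch. This is not the code from the commit: the helper name `compute_function_id`, the choice of SHA-1, and the fields that are hashed are all assumptions made for illustration. The point is only that two functions with the same name but different bodies end up with different IDs.

```python
import hashlib


def compute_function_id(module_name, function_name, source):
    """Return a hex ID derived from a function's name and its source text.

    Hypothetical helper: the hash function and the fields included are
    illustrative assumptions, not the scheme used in the commit.
    """
    digest = hashlib.sha1()
    for part in (module_name, function_name, source):
        encoded = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(str(len(encoded)).encode("ascii") + b":")
        digest.update(encoded)
    return digest.hexdigest()


# Same module and name, different bodies.
source_v1 = "def f(x):\n    return x + 1\n"
source_v2 = "def f(x):\n    return x + 2\n"

id_v1 = compute_function_id("__main__", "f", source_v1)
id_v2 = compute_function_id("__main__", "f", source_v2)
```

Hashing the name alone would give `id_v1 == id_v2`. Mixing in the source text makes a collision between redefined functions much less likely.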
Philipp Moritz
12a68e84d2 Implement a first pass at actors in the API. (#242)
* Implement actor field for tasks

* Implement actor management in local scheduler.

* initial python frontend for actors

* import actors on worker

* IPython code completion and tests

* prepare creating actors through local schedulers

* add actor id to PyTask

* submit actor calls to local scheduler

* starting to integrate

* simple fix

* Fixes from rebasing.

* more work on python actors

* Improve local scheduler actor handlers.

* Pass actor ID to local scheduler when connecting a client.

* first working version of actors

* fixing actors

* fix creating two copies of the same actor

* fix actors

* remove sleep

* get rid of export synchronization

* update

* insert actor methods into the queue in the right order

* remove print statements

* make it compile again after rebase

* Minor updates.

* fix python actor ids

* Pass actor_id to start_worker.

* add test

* Minor changes.

* Update actor tests.

* Temporary plan for import counter.

* Temporarily fix import counters.

* Fix some tests.

* Fixes.

* Make actor creation non-blocking.

* Fix test?

* Fix actors on Python 2.

* fix rare case.

* Fix python 2 test.

* More tests.

* Small fixes.

* Linting.

* Revert tensorflow version to 0.12.0 temporarily.

* Small fix.

* Enhance inheritance test.
2017-02-15 00:10:05 -08:00
Alexey Tumanov
dfb6107b22 General attribute-based heterogeneity support with hard and soft constraints (#248)
* attribute-based heterogeneity-awareness in global scheduler and photon

* minor post-rebase fix

* photon: enforce dynamic capacity constraint on task dispatch

* globalsched: cap the number of times we try to schedule a task in round robin

* propagating ability to specify resource capacity to ray.init

* adding resources to remote function export and fetch/register

* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)

* Add some integration tests.

* globalsched: cleanup + factor out constraint checking

* lots of style

* task_spec_required_resource: global refactor

* clang format

* clang format + comment update in photon

* clang format photon comment

* valgrind

* reduce verbosity for Travis

* Add test for scheduler load balancing.

* addressing comments

* refactoring global scheduler algorithm

* Minor cleanups.

* Linting.

* Fix array_test.py and linting.

* valgrind fix for photon tests

* Attempt to fix stress tests.

* fix hashmap free

* fix hashmap free comment

* memset photon resource vectors to 0 in case they get used before the first heartbeat

* More whitespace changes.

* Undo whitespace error I introduced.
2017-02-09 01:34:14 -08:00
Robert Nishihara
2d1c980ad7 Refactor local scheduler to remove worker indices. (#245)
* Refactor local scheduler to remove worker indices.

* Change scheduling state enum to int in all function signatures.

* Bug fix, don't use pointers into a resizable array.

* Remove total_num_workers.

* Fix tests.
2017-02-05 14:52:28 -08:00
Stephanie Wang
241b539ff8 Reconstruction for evicted objects (#181)
* First pass at reconstruction in the worker

Modify reconstruction stress testing to start Plasma service before rest of Ray cluster

TODO about reconstructing ray.puts

Fix ray.put error for double creates

Distinguish between empty entry and no entry in object table

Fix test case

Fix Python test

Fix tests

* Only call reconstruct on objects we have not yet received

* Address review comments

* Fix reconstruction for Python3

* remove unused code

* Address Robert's comments, stress tests are crashing

* Test and update the task's scheduling state to suppress duplicate
reconstruction requests.

* Split result table into two lookups, one for task ID and the other as a
test-and-set for the task state

* Fix object table tests

* Fix redis module result_table_lookup test case

* Multinode reconstruction tests

* Fix python3 test case

* rename

* Use new start_redis

* Remove unused code

* lint

* indent

* Address Robert's comments

* Use start_redis from ray.services in state table tests

* Remove unnecessary memset
2017-02-01 19:18:46 -08:00
Robert Nishihara
6703f7be6f Provide functionality for local scheduler to start new workers. (#230)
* Provide functionality for local scheduler to start new workers.

* Pass full command for starting new worker in to local scheduler.

* Separate out configuration state of local scheduler.
2017-01-27 01:28:48 -08:00
Stephanie Wang
3c6686db08 Photon optimizations (#219)
* Optimizations:
- Track mapping of missing object to dependent tasks to avoid iterating over task queue
- Perform all fetch requests for missing objects using the same timer

* Fix bug and add regression test

* Record task dependencies and active fetch requests in the same hash table

* fix typo

* Fix memory leak and add test cases for scheduling when dependencies are evicted

* Fix python3 test case

* Minor details.
2017-01-23 19:44:15 -08:00
Stephanie Wang
f1987cdc16 Split local scheduler task queue (#211)
* Split local scheduler task queue into waiting and dispatch queue

* Fix memory leak

* Add a new task scheduling status for when a task has been queued locally

* Fix global scheduler test case and add task status doc

* Documentation

* Address Philipp's comments

* Move tasks back to the waiting queue if their dependencies become unavailable

* Update existing task table entries instead of overwriting
2017-01-18 20:27:40 -08:00
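The waiting/dispatch split described in the commit above can be modeled in a few lines. This is a toy sketch, not the local scheduler's implementation: the class name `LocalTaskQueues`, its methods, and the use of string IDs are hypothetical. It shows tasks entering the dispatch queue only once all their dependencies are available, and moving back to the waiting queue if a dependency becomes unavailable.

```python
from collections import deque


class LocalTaskQueues:
    """Toy model of a scheduler with separate waiting and dispatch queues."""

    def __init__(self):
        self.waiting = deque()   # tasks with at least one missing dependency
        self.dispatch = deque()  # tasks whose dependencies are all available
        self.available = set()   # IDs of objects currently available locally

    def submit(self, task_id, dependencies):
        task = (task_id, frozenset(dependencies))
        if task[1] <= self.available:
            self.dispatch.append(task)
        else:
            self.waiting.append(task)

    def object_available(self, object_id):
        self.available.add(object_id)
        self._rebalance()

    def object_removed(self, object_id):
        self.available.discard(object_id)
        self._rebalance()

    def _rebalance(self):
        # Decide both moves before mutating, so no task is moved twice.
        ready = [t for t in self.waiting if t[1] <= self.available]
        blocked = [t for t in self.dispatch if not t[1] <= self.available]
        for task in ready:
            self.waiting.remove(task)
            self.dispatch.append(task)
        for task in blocked:
            self.dispatch.remove(task)
            self.waiting.append(task)


queues = LocalTaskQueues()
queues.submit("t1", ["a"])     # "a" is missing, so t1 waits
queues.submit("t2", [])        # no dependencies, so t2 is dispatchable
queues.object_available("a")   # t1 moves to the dispatch queue
queues.object_removed("a")     # t1 moves back to the waiting queue
```

The sketch rescans both queues on every change for brevity. A real scheduler would more likely index tasks by the objects they depend on.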
Stephanie Wang
6828d694ae Test object notifications from Plasma store (#141)
* Object notification test for Photon, and turn on valgrind for Photon C tests

* Test object notification handler in the plasma manager

* Fix hanging test case
2016-12-29 23:10:38 -08:00
Robert Nishihara
acf1703afd Implement naive scheduling algorithm using local scheduler load. (#164)
* Implement naive scheduling algorithm using local scheduler load.

* Have the global scheduler estimate load on local schedulers better.

* Fixes.
2016-12-28 22:33:20 -08:00
Robert Nishihara
3d697c7ed2 Introduce local scheduler heartbeats which carry load information. (#155)
* Introduce local scheduler heartbeats which carry load information.
2016-12-24 20:02:25 -08:00
Stephanie Wang
6a73711888 Update the task table (#129)
* Update the task table

* Move updating task table out of scheduling algorithm.
2016-12-20 00:13:39 -08:00
Stephanie Wang
d729f9b7ea Object table remove (#139)
* Object table remove redis module

* Test case for object table remove redis module

* Client code for object_table_remove

* Delete object notifications in plasma

* Test for object deletion notifications

* Fix subscribe deletion test

* Address Robert's comments

* free hash table entry
2016-12-19 23:18:57 -08:00
Stephanie Wang
4bdb9f7224 Object reconstruction in Photon (#65)
* Object reconstruction in Photon and C test cases for Photon

* Fix hanging test case on mac

* Remove unnecessary event from photon tests

* make photon_disconnect not leak file descriptors

* fix some of the memory errors

* Fix valgrind

* lint

* Address Robert's comments and add test case for object reconstruction suppression

* Remove OWNER
2016-12-12 23:17:22 -08:00
Robert Nishihara
9474d03912 Switch to updated Plasma API and consolidate wait and fetch implementations. (#116)
* Consolidate wait implementations.

* Consolidate fetch implementations.

* Share callback between wait and fetch to address issue in which only one callback can be run for a given subscribe channel.

* Reactivate manager tests.

* Remove more code.

* Add some documentation.
2016-12-10 21:22:05 -08:00
Robert Nishihara
b3c05655a0 Enable fetching objects from remote object stores. (#87)
* Fetch missing dependencies from local scheduler.

* Factor out global scheduler policy state.

* Use object_table_subscribe instead of object_table_lookup.

* Fix bug in which timer was being created twice for a single fetch request.

* Free old manager vector.
2016-12-06 15:47:31 -08:00
Robert Nishihara
35b9dedb48 Remove scheduler_info. (#84)
2016-12-04 15:51:03 -08:00
Wapaul1
9a513363f9 Init_table_callback now takes ownership of passed in data (#80)
* temp commit

* Stuff

* Ownership is now taken by init table callback

* Fixed lint errors

* Fixed travis warnings

* Fixed spacing

* add .gitkeep

* fix global scheduler

* Whitespace.
2016-12-03 13:49:09 -08:00
Robert Nishihara
c8c3983195 Use sizeof(field) instead of sizeof(type) and other fixes. (#47)
* Use sizeof(field) instead of sizeof(type) and other fixes.

* Fix formatting.

* Bug fix.

* Zero-initialize structs. There are many more instances of these that I haven't changed yet.

* Bug fix.

* Revert from atexit to signaling to fix valgrind tests.

* Address Philipp's comments.
2016-11-19 12:19:49 -08:00
Robert Nishihara
d77b685a90 Global scheduler skeleton (#45)
* Initial scheduler commit

* global scheduler

* add global scheduler

* Implement global scheduler skeleton.

* Formatting.

* Allow local scheduler to be started without a connection to redis so that we can test it without a global scheduler.

* Fail if there are no local schedulers when the global scheduler receives a task.

* Initialize uninitialized value and formatting fix.

* Generalize local scheduler table to db client table.

* Remove code duplication in local scheduler and add flag for whether a task came from the global scheduler or not.

* Queue task specs in the local scheduler instead of tasks.

* Simple global scheduler tests, including valgrind.

* Factor out functions for starting processes.

* Fixes.
2016-11-18 19:57:51 -08:00
Stephanie Wang
7babe0d22f Logging level (#38)
* Set logging levels in Makefile using -DRAY_COMMON_LOG_LEVEL=level

* Lower level of some LOG_ERROR messages, log the name of the table operation on failure

* Address rest of Robert's comments

* Fix spurious log message
2016-11-15 20:33:29 -08:00
Stephanie Wang
9d1e750e8f Merge task table and task log into a single table (#30)
* Merge task table and task log

* Fix test in db tests

* Address Robert's comments and some better error checking

* Add a LOG_FATAL that exits the program
2016-11-10 18:13:26 -08:00
Robert Nishihara
194bdb1d96 Compute task IDs and object IDs deterministically. (#31)
* Put infrastructure in place to compute task IDs and object IDs.

* Fix version number for common library.

* Compute task IDs and object IDs deterministically.

* Address Stephanie's comments.

* Update task documentation.

* Fix formatting.

* Add more tests and checks.

* Fix formatting.

* Enable DCHECKs and change CHECKs to DCHECKs.
2016-11-08 14:46:34 -08:00
Philipp Moritz
90a2aa4bf7 Various performance improvements (#24)
* switch from array to linked list for photon queue

* performance optimizations

* fix tests

* various fixes
2016-11-04 00:41:20 -07:00
Robert Nishihara
072f442c1f Update worker.py and services.py to use plasma and the local scheduler. (#19)
* Update worker code and services code to use plasma and the local scheduler.

* Cleanups.

* Fix bug in which threads were started before the worker mode was set. This caused remote functions to be defined on workers before the worker knew it was in WORKER_MODE.

* Fix bug in install-dependencies.sh.

* Lengthen timeout in failure_test.py.

* Cleanups.

* Cleanup services.start_ray_local.

* Clean up random name generation.

* Cleanups.
2016-11-02 00:39:35 -07:00
Ion
ee3718c80c Ion and Philipp's table retries (#10)
* Ion and Philipp's table retries

* Refactor the retry struct:
- Rename it from retry_struct to retry_info
- Retry information contains the failure callback, not the retry callback
- All functions take in retry information as an arg instead of its expanded fields

* Rename cb -> callback

* Remove prints

* Fix compiler warnings

* Change some CHECKs to greatest ASSERTs

* Key outstanding callbacks hash table with timer ID instead of callback data pointer

* Use the new retry API for table commands

* Memory cleanup in plasma unit tests

* fix Robert's comments

* add valgrind for common
2016-10-29 15:22:33 -07:00
Philipp Moritz
b4b462809f fix valgrind
2016-10-27 15:09:50 -07:00
Robert Nishihara
6f75c738b5 [WIP] Fix valgrind tests. (#5)
* Make tests fail when valgrind finds a memory leak.

* Properly clean up scheduler state.

* Remove unnecessary malloc.
2016-10-26 23:23:46 -07:00
Robert Nishihara
ad55166472 Rearrange local scheduler files to prepare to merge into Ray.
2016-10-25 14:16:23 -07:00
Renamed from photon_algorithm.c