1354 commits

Author SHA1 Message Date
Robert Nishihara
6703f7be6f Provide functionality for local scheduler to start new workers. (#230)
* Provide functionality for local scheduler to start new workers.

* Pass full command for starting new worker in to local scheduler.

* Separate out configuration state of local scheduler.
2017-01-27 01:28:48 -08:00
Stephanie Wang
a5c8f28f33 Plasma subscribe (#227)
* Use object_info as notification, not just the object_id

* Add a regression test for plasma managers connecting to store after some objects have been created

* Send notifications for existing objects to new plasma subscribers

* Continuously try the request to the plasma manager instead of setting a timeout in the test case

* Use ray.services to start Redis in plasma test cases

* fix test case
2017-01-25 22:57:15 -08:00
Robert Nishihara
ab8c3432f7 Add driver ID to task spec and add driver ID to Python error handling. (#225)
* Add driver ID to task spec and add driver ID to Python error handling.

* Make constants global variables.

* Add test for error isolation.
2017-01-25 22:53:48 -08:00
Stephanie Wang
3c6686db08 Photon optimizations (#219)
* Optimizations:
- Track mapping of missing object to dependent tasks to avoid iterating over task queue
- Perform all fetch requests for missing objects using the same timer

* Fix bug and add regression test

* Record task dependencies and active fetch requests in the same hash table

* fix typo

* Fix memory leak and add test cases for scheduling when dependencies are evicted

* Fix python3 test case

* Minor details.
2017-01-23 19:44:15 -08:00
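
The first optimization in the commit above (tracking which tasks depend on each missing object, so the task queue is not rescanned) can be illustrated with a small reverse index. This is a minimal Python sketch, not the actual C implementation in Photon; the class `DependencyIndex` and its method names are hypothetical.

```python
from collections import defaultdict


class DependencyIndex:
    """Hypothetical reverse index: missing object ID -> tasks waiting on it."""

    def __init__(self):
        # object_id -> set of task IDs blocked on that object
        self.waiting_on = defaultdict(set)
        # task_id -> set of object IDs the task still needs
        self.missing = {}

    def add_task(self, task_id, dependencies, available):
        """Register a task; return True if it can run right away."""
        needed = {obj for obj in dependencies if obj not in available}
        if not needed:
            return True
        self.missing[task_id] = needed
        for obj in needed:
            self.waiting_on[obj].add(task_id)
        return False

    def object_available(self, object_id):
        """Mark an object as present; return the tasks that became runnable."""
        ready = []
        # Only tasks that depend on this object are touched, not the whole queue.
        for task_id in self.waiting_on.pop(object_id, ()):
            needed = self.missing[task_id]
            needed.discard(object_id)
            if not needed:
                del self.missing[task_id]
                ready.append(task_id)
        return sorted(ready)
```

When an object arrives, the work done is proportional to the number of tasks waiting on that object rather than to the length of the task queue.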
Robert Nishihara
b98a63fd3a Change get to take a timeout and multiple object IDs. (#212)
* Change plasma_get to take a timeout and an array of object IDs.

* Address comments.

* Bug fix related to computing object hashes.

* Add test.

* Fix file descriptor leak.

* Fix valgrind.

* Formatting.

* Remove call to plasma_contains from the plasma client. Use timeout internally in ray.get.

* small fixes
2017-01-19 12:21:12 -08:00
Robert Nishihara
677a019cbd Remove unnecessary bookkeeping in utlist in plasma client. (#215) 2017-01-18 23:03:08 -08:00
Stephanie Wang
f1987cdc16 Split local scheduler task queue (#211)
* Split local scheduler task queue into waiting and dispatch queue

* Fix memory leak

* Add a new task scheduling status for when a task has been queued locally

* Fix global scheduler test case and add task status doc

* Documentation

* Address Philipp's comments

* Move tasks back to the waiting queue if their dependencies become unavailable

* Update existing task table entries instead of overwriting
2017-01-18 20:27:40 -08:00
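
The waiting/dispatch split described in the commit above, including moving tasks back to the waiting queue when a dependency becomes unavailable, might look roughly like this. It is a Python sketch under assumed names (`TwoQueueScheduler`, `object_added`, `object_removed`), not the local scheduler's real C code.

```python
from collections import deque


class TwoQueueScheduler:
    """Hypothetical scheduler with a waiting queue and a dispatch queue."""

    def __init__(self):
        self.available = set()   # object IDs currently present locally
        self.waiting = deque()   # tasks with at least one missing dependency
        self.dispatch = deque()  # tasks whose dependencies are all present

    def submit(self, task_id, deps):
        task = (task_id, frozenset(deps))
        if task[1] <= self.available:
            self.dispatch.append(task)
        else:
            self.waiting.append(task)

    def object_added(self, obj):
        """Promote waiting tasks whose dependencies are now all present."""
        self.available.add(obj)
        still_waiting = deque()
        for task in self.waiting:
            if task[1] <= self.available:
                self.dispatch.append(task)
            else:
                still_waiting.append(task)
        self.waiting = still_waiting

    def object_removed(self, obj):
        """Demote dispatchable tasks that depended on the removed object."""
        self.available.discard(obj)
        still_ready = deque()
        for task in self.dispatch:
            if obj in task[1]:
                self.waiting.append(task)
            else:
                still_ready.append(task)
        self.dispatch = still_ready

    def next_task(self):
        """Return the next runnable task ID, or None if nothing is ready."""
        return self.dispatch.popleft()[0] if self.dispatch else None
```

Keeping the two queues separate means a worker asking for work only ever looks at the dispatch queue.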
Robert Nishihara
303d0fed3e Prevent plasma store and manager from dying when a client dies. (#203)
* Prevent plasma store and manager from dying when a worker dies.

* Check errno inside of warn_if_sigpipe. Passing in errno doesn't work because the arguments to warn_if_sigpipe can be evaluated out of order.
2017-01-17 20:34:31 -08:00
Philipp Moritz
7f329db4b2 wait until kill operation was successful (#210) 2017-01-17 20:15:48 -08:00
Philipp Moritz
a708e36225 Switch build system to use CMake completely. (#200)
* switch to CMake completely

...

* cleanup

* Run C tests, update installation instructions.
2017-01-17 16:56:40 -08:00
Philipp Moritz
ab3448a9b4 Plasma Optimizations (#190)
* bypass python when storing objects into the object store

* clang-format

* Bug fixes.

* fix include paths

* Fixes.

* fix bug

* clang-format

* fix

* fix release after disconnect
2017-01-09 20:15:54 -08:00
Robert Nishihara
973716d310 Use cloudpickle 0.2.2. (#189) 2017-01-08 17:30:06 -08:00
Alexey Tumanov
674ec3a3cb generate pytask from string and string from pytask (#188)
* pytask creation from bytestring: saving work

* pytask now works

* documentation and tests

* linting

* Lint and fix test case
2017-01-08 02:16:40 -08:00
Stephanie Wang
c13d73b4c9 Suppress duplicate transfer requests (#185) 2017-01-06 22:14:51 -08:00
Robert Nishihara
651aa6007a Log profiling information from worker. (#178)
* Log timing events on workers.

* Have workers log to the event log through the local scheduler.

* Fixes and address comments.

* bug fix

* styling
2017-01-05 16:47:16 -08:00
Johann Schleier-Smith
b1e76e582e Check /dev/shm on Linux (#174)
* check available shared memory when starting object store

* exit with error if not enough shared memory available for object store

* Some comments and formatting.
2017-01-03 12:33:29 -08:00
Stephanie Wang
6828d694ae Test object notifications from Plasma store (#141)
* Object notification test for Photon, and turn on valgrind for Photon C tests

* Test object notification handler in the plasma manager

* Fix hanging test case
2016-12-29 23:10:38 -08:00
Robert Nishihara
acf1703afd Implement naive scheduling algorithm using local scheduler load. (#164)
* Implement naive scheduling algorithm using local scheduler load.

* Have the global scheduler estimate load on local schedulers better.

* Fixes.
2016-12-28 22:33:20 -08:00
Robert Nishihara
baf835efcd Throw Python exception if plasma store cannot create new object. (#162)
* Propagate error messages through plasma create.

* Use custom exception types instead of exception messages.
2016-12-28 11:56:16 -08:00
Robert Nishihara
10e067e5e5 Delay releasing a maximum number of bytes in the plasma client. (#160)
* Send message from plasma client to get plasma store capacity.

* Release objects from plasma client if they are too large.

* Use doubly-linked list instead of ring buffer for plasma client release history.

* Address comments.

* Fix problem with slicing PlasmaBuffer objects.

* Fix crash in plasma manager during transfer.

* Formatting.

* Make plasma client cache larger and make caching test not throw exceptions on Travis.
2016-12-27 19:51:26 -08:00
Robert Nishihara
26941e02aa Attempt to free up to 20% of the plasma store capacity during eviction. (#159) 2016-12-27 12:12:33 -08:00
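
One way to read "free up to 20% of the capacity during eviction" is sketched below in Python. The least-recently-used ordering, the rule of taking the larger of the requested bytes and the 20% target, and the function name `choose_evictions` are all assumptions made for illustration; the commit title does not spell out the policy.

```python
from collections import OrderedDict


def choose_evictions(lru, capacity, required_bytes, fraction=0.2):
    """Pick least recently used objects until the eviction target is met.

    lru maps object ID -> size in bytes, ordered least recently used first.
    Returns (evicted_ids, bytes_freed).
    """
    # Assumption: free at least the requested bytes, and at least
    # `fraction` of the store capacity, so evictions happen in batches.
    target = max(required_bytes, int(capacity * fraction))
    evicted, freed = [], 0
    for object_id, size in lru.items():
        if freed >= target:
            break
        evicted.append(object_id)
        freed += size
    return evicted, freed


# Example store: 100 bytes of capacity, four objects, oldest first.
store = OrderedDict([("a", 10), ("b", 15), ("c", 30), ("d", 45)])
small_request = choose_evictions(store, capacity=100, required_bytes=5)
```

Evicting in batches like this trades some cached data for fewer eviction passes when many small objects are created in a row.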
Robert Nishihara
985c424172 Use redismodules for task table and result table. (#156)
* Switch to using redis modules for task table.

* Switch to using redis modules for the task table.

* Fix some tests.

* Fix naming and remove code duplication.

* Remove duplication in redis modules and add more cleanups.

* Address comments.
2016-12-25 23:57:05 -08:00
Philipp Moritz
d6695c867a fix wait test (#158) 2016-12-25 23:43:01 -08:00
Philipp Moritz
8309e3f355 Redis string formatting (#157)
* redis string formatting

* fixes

* add documentation

* fixes
2016-12-25 22:43:07 -08:00
Robert Nishihara
3d697c7ed2 Introduce local scheduler heartbeats which carry load information. (#155)
* Introduce local scheduler heartbeats which carry load information.
2016-12-24 20:02:25 -08:00
Robert Nishihara
9bb9f8cb54 Fix bug in ray.wait. (#153)
* Fix bug in wait implementation.

* Add test that exposes previous bug.
2016-12-23 16:22:41 -08:00
Robert Nishihara
86b211f5c2 Change run_function_on_all_workers to take a worker_info dictionary including a counter. (#149)
* Suppress Redis warnings and remove some global scheduler logging.

* Pass a counter into run_function_on_all_workers indicating how many workers have begun executing this function.
2016-12-22 22:05:58 -08:00
Alexey Tumanov
46a887039e Global scheduler - per-task transfer-aware policy (#145)
* global scheduler with object transfer cost awareness -- upstream rebase

* debugging global scheduler: multiple subscriptions

* global scheduler: utarray push bug fix; tasks change state to SCHEDULED

* change global scheduler test to be an integration test

* unit and integration tests are passing for global scheduler

* improve global scheduler test: break up into several

* global scheduler checkpoint: fix photon object id bug in test

* test with timesync between object and task notifications; TODO: handle OoO object+task notifications in GS

* fallback to base policy if no object dependencies are cached (may happen due to OoO object+task notification arrivals)

* clean up printfs; handle a missing LS in LS cache

* Minor changes to Python test and factor out some common code.

* refactoring handle task waiting

* addressing comments

* log_info -> log_debug

* Change object ID printing.

* PRId64 merge

* Python 3 fix.

* PRId64.

* Python 3 fix.

* resurrect differentiation between no args and missing object info; spacing

* Valgrind fix.

* Run all global scheduler tests in valgrind.

* clang format

* Comments and documentation changes.

* Minor cleanups.

* fix whitespace

* Fix.

* Documentation fix.
2016-12-22 03:11:46 -08:00
Robert Nishihara
6cd02d71f8 Fixes and cleanups for the multinode setting. (#143)
* Add function for driver to get address info from Redis.

* Use Redis address instead of Redis port.

* Configure Redis to run in unprotected mode.

* Add method for starting Ray processes on non-head node.

* Pass in correct node ip address to start_plasma_manager.

* Script for starting Ray processes.

* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.

* Have driver get info from Redis when start_ray_local=False.

* Fix.

* Script for killing ray processes.

* Catch some errors when the main_loop in a worker throws an exception.

* Allow redirecting stdout and stderr to /dev/null.

* Wrap start_ray.py in a shell script.

* More helpful error messages.

* Fixes.

* Wait for redis server to start up before configuring it.

* Allow seeding of deterministic object ID generation.

* Small change.
2016-12-21 18:53:12 -08:00
Robert Nishihara
c9c1b3e6af Change db_connect to allow different arguments from different processes. (#142)
* Allow db_connect to take a variable number of arguments.

* Fix tests.

* Fixes.

* Formatting.

* Fixes.

* Simplifications.

* Fix typo.
2016-12-20 20:21:35 -08:00
Philipp Moritz
0ca0864856 Use flatcc for serialization of IPC messages. (#140)
* added Philipp's updates

* Switch to using flatbuffers for IPC.

* Various changes.

* convert remaining messages and cleanups

* fix

* fix function signatures

* fix valgrind errors

* clang-format

* final commit

* Fix valgrind test.
2016-12-20 14:46:25 -08:00
Stephanie Wang
6a73711888 Update the task table (#129)
* Update the task table

* Move updating task table out of scheduling algorithm.
2016-12-20 00:13:39 -08:00
Stephanie Wang
d729f9b7ea Object table remove (#139)
* Object table remove redis module

* Test case for object table remove redis module

* Client code for object_table_remove

* Delete object notifications in plasma

* Test for object deletion notifications

* Fix subscribe deletion test

* Address Robert's comments

* free hash table entry
2016-12-19 23:18:57 -08:00
Alexey Tumanov
cb3e6cde9e passing object info information with redis module (#138)
* adding object broadcast channel; published on each object table add

* publishing data size to the bcast channel

* bug fix: objectkey

* update object tests to test for data size: C + py

* remove debug

* clang format

* Minor changes.

* Fix error.

* merging with Robert's comments

* clang format for the object table test upgrade
2016-12-19 21:07:25 -08:00
Robert Nishihara
269f37e26f Implement object table notification subscriptions and switch to using Redis modules for object table. (#134)
* Implement RAY.OBJECT_TABLE_REQUEST_NOTIFICATIONS.

* Call object_table_request_notifications from plasma manager.

* Use Redis modules for object table.

* Cleaning up code.

* More checks.

* Formatting.

* Make object table tests pass.

* Formatting.

* Add prefix to the object notification channel name.

* Formatting.

* Fixes.

* Increase time in redismodule test.
2016-12-18 18:19:02 -08:00
Robert Nishihara
c89bf4e5bc Fix improper handling of NULL characters when opening Redis keys. (#136)
* Fix improper handling of NULL characters when opening Redis keys.

* Add test.
2016-12-18 13:06:28 -08:00
Robert Nishihara
edf8d1ee9f Fix Python3 error in tests. (#135) 2016-12-17 12:42:37 -08:00
Stephanie Wang
e23661c375 Task table Redis module (#125)
* Task table redis module implementation

* Publish tasks and take in individual fields as args, not task object

* Scheduling state integer has width 1, error on illegal put

* Unit tests for task table and more documentation

* Task table subscribe, fix publish topics and address Philipp and Alexey's comments

* Helper function to create prefixed strings

* Factor out the table prefixes in the test cases
2016-12-16 14:40:44 -08:00
Robert Nishihara
58a873eb20 Deploy Redis module and start using custom Redis commands. (#128)
* Add RAY.CONNECT Redis command.

* Add RAY.GET_CLIENT_ADDRESS command.

* Build and clean Redis in common Makefile.

* Use custom Redis module in Ray and use custom CONNECT and GET_CLIENT_ADDRESS commands.

* Fixes.

* Remove mapping from redis client ID to ray db client ID.

* Fix.
2016-12-16 14:40:44 -08:00
Stephanie Wang
b0ba54e4c0 Fix psubscribe bug in object_table_subscribe (#126)
* Fix psubscribe

* Add TODO about subscription callbacks
2016-12-16 14:40:44 -08:00
Robert Nishihara
79dd1815a2 Python 3 compatibility. (#121)
* Make common module Python 3 compatible.

* Make plasma module Python 3 compatible.

* Make photon module Python 3 compatible.

* Make numbuf module Python 3 compatible.

* Remaining changes for Python 3 compatibility.

* Test Python 3 in Travis.

* Fixes.
2016-12-16 14:40:37 -08:00
Alexey Tumanov
946242929f Plasma photon association: passing through plasma address with photon db connection (#123)
* passing plasma ip:port association with photon through redis to global scheduler

* Fix test.

* sanity-checking aux_address inside db_connect_extended

* clang format

* fix photon tests

* clang format photon tests
2016-12-13 17:21:38 -08:00
Robert Nishihara
bce7e0fc07 Add include for usleep. (#124) 2016-12-13 14:24:59 -08:00
Philipp Moritz
2152cd9f31 Fix seed bug for generating object ids for put (#120)
* fix seed bug for generating object ids for put

* fix clang-format
2016-12-13 00:54:38 -08:00
Stephanie Wang
24d2b42d86 Fix object table subscriptions (#122)
* First attempt at fixing psubscribe. psubscribe_success_test will fail

* psubscribe test

* SUBSCRIBE returns the number of subscriptions, not success

* Comment out failing test.
2016-12-13 00:47:21 -08:00
Stephanie Wang
4bdb9f7224 Object reconstruction in Photon (#65)
* Object reconstruction in Photon and C test cases for Photon

* Fix hanging test case on mac

* Remove unnecessary event from photon tests

* make photon_disconnect not leak file descriptors

* fix some of the memory errors

* Fix valgrind

* lint

* Address Robert's comments and add test case for object reconstruction suppression

* Remove OWNER
2016-12-12 23:17:22 -08:00
Philipp Moritz
817f1e730c Implement tables with redis modules (#114)
* initial redis module

* temp commit

* temp commit

* temp commit

* Empty object table functions and broken object_table_lookup

* fix segfault and clean up code

* cleanup and tests

* try to ignore redismodule.h

* check if data_size is integer

* Minor changes to redis-module tests.

* try to exclude redismodule from clang-format

* try something different

* fix clang-format and tests

* sleep a bit

* Result table

* fix redis_module tests

* fix tests and add tests for result table

* more tests

* randomize ports

* Minor changes.

* More fixes.
2016-12-11 17:40:19 -08:00
Philipp Moritz
311e2be7dc clean common when cleaning photon (#118) 2016-12-11 17:30:52 -08:00
Robert Nishihara
ddba1df802 Start working toward Python3 compatibility. (#117) 2016-12-11 12:25:31 -08:00
Robert Nishihara
9474d03912 Switch to updated Plasma API and consolidate wait and fetch implementations. (#116)
* Consolidate wait implementations.

* Consolidate fetch implementations.

* Share callback between wait and fetch to address issue in which only one callback can be run for a given subscribe channel.

* Reactivate manager tests.

* Remove more code.

* Add some documentation.
2016-12-10 21:22:05 -08:00