703 commits

Author SHA1 Message Date
Stephanie Wang
c403ab11ab Allow ray.init to take in address information about existing services. (#161)
* Refactor ray.init and ray.services to allow processes that are already running

* Fix indexing error

* Address Robert's comments
2016-12-28 14:17:29 -08:00
Robert Nishihara
baf835efcd Throw Python exception if plasma store cannot create new object. (#162)
* Propagate error messages through plasma create.

* Use custom exception types instead of exception messages.
2016-12-28 11:56:16 -08:00
Robert Nishihara
10e067e5e5 Delay releasing a maximum number of bytes in the plasma client. (#160)
* Send message from plasma client to get plasma store capacity.

* Release objects from plasma client if they are too large.

* Use doubly-linked list instead of ring buffer for plasma client release history.

* Address comments.

* Fix problem with slicing PlasmaBuffer objects.

* Fix crash in plasma manager during transfer.

* Formatting.

* Make plasma client cache larger and make caching test not throw exceptions on Travis.
2016-12-27 19:51:26 -08:00
Robert Nishihara
26941e02aa Attempt to free up to 20% of the plasma store capacity during eviction. (#159) 2016-12-27 12:12:33 -08:00
Robert Nishihara
985c424172 Use redismodules for task table and result table. (#156)
* Switch to using redis modules for task table.

* Switch to using redis modules for the task table.

* Fix some tests.

* Fix naming and remove code duplication.

* Remove duplication in redis modules and add more cleanups.

* Address comments.
2016-12-25 23:57:05 -08:00
Philipp Moritz
d6695c867a fix wait test (#158) 2016-12-25 23:43:01 -08:00
Philipp Moritz
8309e3f355 Redis string formatting (#157)
* redis string formatting

* fixes

* add documentation

* fixes
2016-12-25 22:43:07 -08:00
Robert Nishihara
3d697c7ed2 Introduce local scheduler heartbeats which carry load information. (#155)
* Introduce local scheduler heartbeats which carry load information.
2016-12-24 20:02:25 -08:00
Robert Nishihara
9bb9f8cb54 Fix bug in ray.wait. (#153)
* Fix bug in wait implementation.

* Add test that exposes previous bug.
2016-12-23 16:22:41 -08:00
Robert Nishihara
241c955707 Determine node IP address programmatically. (#151)
* Determine node ip address programmatically.

* Factor out methods for getting node IP addresses.

* Address comments.
2016-12-23 15:31:40 -08:00
Robert Nishihara
8d90c9f432 Experimental utils for copying directories to other machines in the c… (#150)
* Experimental utils for copying directories to other machines in the cluster using Ray.

* Test copying directory functionality.

* Small fix.
2016-12-23 00:43:16 -08:00
Robert Nishihara
86b211f5c2 Allow run_function_on_all_workers to take a worker_info dictionary including a counter. (#149)
* Suppress Redis warnings and remove some global scheduler logging.

* Pass a counter into run_function_on_all_workers indicating how many workers have begun executing this function.
2016-12-22 22:05:58 -08:00
Robert Nishihara
92010ca5b5 Check that we can connect to Redis and that there aren't existing redis clients on the same node in start_ray.py (#148) 2016-12-22 21:54:19 -08:00
Alexey Tumanov
46a887039e Global scheduler - per-task transfer-aware policy (#145)
* global scheduler with object transfer cost awareness -- upstream rebase

* debugging global scheduler: multiple subscriptions

* global scheduler: utarray push bug fix; tasks change state to SCHEDULED

* change global scheduler test to be an integration test

* unit and integration tests are passing for global scheduler

* improve global scheduler test: break up into several

* global scheduler checkpoint: fix photon object id bug in test

* test with timesync between object and task notifications; TODO: handle OoO object+task notifications in GS

* fallback to base policy if no object dependencies are cached (may happen due to OoO object+task notification arrivals)

* clean up printfs; handle a missing LS in LS cache

* Minor changes to Python test and factor out some common code.

* refactoring handle task waiting

* addressing comments

* log_info -> log_debug

* Change object ID printing.

* PRId64 merge

* Python 3 fix.

* PRId64.

* Python 3 fix.

* resurrect differentiation between no args and missing object info; spacing

* Valgrind fix.

* Run all global scheduler tests in valgrind.

* clang format

* Comments and documentation changes.

* Minor cleanups.

* fix whitespace

* Fix.

* Documentation fix.
2016-12-22 03:11:46 -08:00
Robert Nishihara
6cd02d71f8 Fixes and cleanups for the multinode setting. (#143)
* Add function for driver to get address info from Redis.

* Use Redis address instead of Redis port.

* Configure Redis to run in unprotected mode.

* Add method for starting Ray processes on non-head node.

* Pass in correct node ip address to start_plasma_manager.

* Script for starting Ray processes.

* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.

* Have driver get info from Redis when start_ray_local=False.

* Fix.

* Script for killing ray processes.

* Catch some errors when the main_loop in a worker throws an exception.

* Allow redirecting stdout and stderr to /dev/null.

* Wrap start_ray.py in a shell script.

* More helpful error messages.

* Fixes.

* Wait for redis server to start up before configuring it.

* Allow seeding of deterministic object ID generation.

* Small change.
2016-12-21 18:53:12 -08:00
Robert Nishihara
c9c1b3e6af Change db_connect to allow different arguments from different processes. (#142)
* Allow db_connect to take a variable number of arguments.

* Fix tests.

* Fixes.

* Formatting.

* Fixes.

* Simplifications.

* Fix typo.
2016-12-20 20:21:35 -08:00
Philipp Moritz
0ca0864856 Use flatcc for serialization of IPC messages. (#140)
* added Philipp's updates

* Switch to using flatbuffers for IPC.

* Various changes.

* convert remaining messages and cleanups

* fix

* fix function signatures

* fix valgrind errors

* clang-format

* final commit

* Fix valgrind test.
2016-12-20 14:46:25 -08:00
Stephanie Wang
6a73711888 Update the task table (#129)
* Update the task table

* Move updating task table out of scheduling algorithm.
2016-12-20 00:13:39 -08:00
Stephanie Wang
d729f9b7ea Object table remove (#139)
* Object table remove redis module

* Test case for object table remove redis module

* Client code for object_table_remove

* Delete object notifications in plasma

* Test for object deletion notifications

* Fix subscribe deletion test

* Address Robert's comments

* free hash table entry
2016-12-19 23:18:57 -08:00
Alexey Tumanov
cb3e6cde9e passing object info information with redis module (#138)
* adding object broadcast channel; published on each object table add

* publishing data size to the bcast channel

* bug fix: objectkey

* update object tests to test for data size: C + py

* remove debug

* clang format

* Minor changes.

* Fix error.

* merging with Robert's comments

* clang format for the object table test upgrade
2016-12-19 21:07:25 -08:00
Robert Nishihara
269f37e26f Implement object table notification subscriptions and switch to using Redis modules for object table. (#134)
* Implement RAY.OBJECT_TABLE_REQUEST_NOTIFICATIONS.

* Call object_table_request_notifications from plasma manager.

* Use Redis modules for object table.

* Cleaning up code.

* More checks.

* Formatting.

* Make object table tests pass.

* Formatting.

* Add prefix to the object notification channel name.

* Formatting.

* Fixes.

* Increase time in redismodule test.
2016-12-18 18:19:02 -08:00
Robert Nishihara
c89bf4e5bc Fix improper handling of NULL characters when opening Redis keys. (#136)
* Fix improper handling of NULL characters when opening Redis keys.

* Add test.
2016-12-18 13:06:28 -08:00
Robert Nishihara
a9e6a53360 Print helpful error message when libgcc error occurs. (#133) 2016-12-17 15:38:48 -08:00
Robert Nishihara
edf8d1ee9f Fix Python3 error in tests. (#135) 2016-12-17 12:42:37 -08:00
Stephanie Wang
e23661c375 Task table Redis module (#125)
* Task table redis module implementation

* Publish tasks and take in individual fields as args, not task object

* Scheduling state integer has width 1, error on illegal put

* Unit tests for task table and more documentation

* Task table subscribe, fix publish topics and address Philipp and Alexey's comments

* Helper function to create prefixed strings

* Factor out the table prefixes in the test cases
2016-12-16 14:40:44 -08:00
Robert Nishihara
58a873eb20 Deploy Redis module and start using custom Redis commands. (#128)
* Add RAY.CONNECT Redis command.

* Add RAY.GET_CLIENT_ADDRESS command.

* Build and clean Redis in common Makefile.

* Use custom Redis module in Ray and use custom CONNECT and GET_CLIENT_ADDRESS commands.

* Fixes.

* Remove mapping from redis client ID to ray db client ID.

* Fix.
2016-12-16 14:40:44 -08:00
Robert Nishihara
1c95840765 Revert cloudpickle -> pickling change, which no longer seems necessary. (#127) 2016-12-16 14:40:44 -08:00
Stephanie Wang
b0ba54e4c0 Fix psubscribe bug in object_table_subscribe (#126)
* Fix psubscribe

* Add TODO about subscription callbacks
2016-12-16 14:40:44 -08:00
Robert Nishihara
79dd1815a2 Python 3 compatibility. (#121)
* Make common module Python 3 compatible.

* Make plasma module Python 3 compatible.

* Make photon module Python 3 compatible.

* Make numbuf module Python 3 compatible.

* Remaining changes for Python 3 compatibility.

* Test Python 3 in Travis.

* Fixes.
2016-12-16 14:40:37 -08:00
Alexey Tumanov
946242929f Plasma photon association: passing through plasma address with photon db connection (#123)
* passing plasma ip:port association with photon through redis to global scheduler

* Fix test.

* sanity-checking aux_address inside db_connect_extended

* clang format

* fix photon tests

* clang format photon tests
2016-12-13 17:21:38 -08:00
Robert Nishihara
bce7e0fc07 Add include for usleep. (#124) 2016-12-13 14:24:59 -08:00
Philipp Moritz
2152cd9f31 Fix seed bug for generating object ids for put (#120)
* fix seed bug for generating object ids for put

* fix clang-format
2016-12-13 00:54:38 -08:00
Stephanie Wang
24d2b42d86 Fix object table subscriptions (#122)
* First attempt at fixing psubscribe. psubscribe_success_test will fail

* psubscribe test

* SUBSCRIBE returns the number of subscriptions, not success

* Comment out failing test.
2016-12-13 00:47:21 -08:00
Stephanie Wang
4bdb9f7224 Object reconstruction in Photon (#65)
* Object reconstruction in Photon and C test cases for Photon

* Fix hanging test case on mac

* Remove unnecessary event from photon tests

* make photon_disconnect not leak file descriptors

* fix some of the memory errors

* Fix valgrind

* lint

* Address Robert's comments and add test case for object reconstruction suppression

* Remove OWNER
2016-12-12 23:17:22 -08:00
Philipp Moritz
817f1e730c Implement tables with redis modules (#114)
* initial redis module

* temp commit

* temp commit

* temp commit

* Empty object table functions and broken object_table_lookup

* fix segfault and clean up code

* cleanup and tests

* try to ignore redismodule.h

* check if data_size is integer

* Minor changes to redis-module tests.

* try to exclude redismodule from clang-format

* try something different

* fix clang-format and tests

* sleep a bit

* Result table

* fix redis_module tests

* fix tests and add tests for result table

* more tests

* randomize ports

* Minor changes.

* More fixes.
2016-12-11 17:40:19 -08:00
Philipp Moritz
311e2be7dc clean common when cleaning photon (#118) 2016-12-11 17:30:52 -08:00
Robert Nishihara
ddba1df802 Start working toward Python3 compatibility. (#117) 2016-12-11 12:25:31 -08:00
Robert Nishihara
3d083c8b58 Build arrow without tests. (#85) 2016-12-10 21:25:52 -08:00
Robert Nishihara
0f7091099d Throw an exception if a Ray method is called from a thread that isn't the main thread. (#97) 2016-12-10 21:24:50 -08:00
Robert Nishihara
9474d03912 Switch to updated Plasma API and consolidate wait and fetch implementations. (#116)
* Consolidate wait implementations.

* Consolidate fetch implementations.

* Share callback between wait and fetch to address issue in which only one callback can be run for a given subscribe channel.

* Reactivate manager tests.

* Remove more code.

* Add some documentation.
2016-12-10 21:22:05 -08:00
Robert Nishihara
86973059de Switch to new wait implementation. (#113)
* Duplicate wait1 implementation and separate out wait data structures.

* Address Philipp's comments.

* Temporarily address test failure problem by increasing timeout and reducing load in tests.

* Update stress tests to include distributed wait.
2016-12-09 19:26:11 -08:00
Robert Nishihara
6441571d31 Introduce some stress tests. (#106)
* Retry first connection to redis in db_connect.

* Declare usleep.

* Formatting.

* Introduce some stress tests.
2016-12-09 17:49:31 -08:00
Robert Nishihara
c740b165f4 Retry first connection to redis in db_connect. (#112)
* Retry first connection to redis in db_connect.

* Declare usleep.

* Formatting.
2016-12-09 17:21:49 -08:00
Robert Nishihara
46d0e6bdfb In tests, check that processes are still alive before killing them to catch crashes. (#110) 2016-12-09 13:04:08 -08:00
Robert Nishihara
5819e90962 Remove all object files when doing make clean. (#109) 2016-12-09 11:00:00 -08:00
Alexey Tumanov
0abbf5a113 End-to-end object size information passthrough (#105)
* rebase Alexey's PR on top

* rebase on master

* fix test failure waiting for plasma manager to exit

* clang format

* addressing comments

* Minor formatting and naming fixes.
2016-12-09 00:51:44 -08:00
Stephanie Wang
61904c4c3e Object hashes (#104)
* factoring out object_info for general use by several Ray components

* addressing comments

* Replace SHA256 task hash with MD5

Add object hash to object table (always overwrites)

Support for table operations that span multiple asynchronous Redis
commands

Add a new object location in a transaction, using Redis's optimistic
concurrency

Use Redis GETSET instead of transactions and Python frontend code for object hashing

Remove spurious log message

Fix for object_table_add

Revert "Replace SHA256 task hash with MD5"

This reverts commit e599de473c8dad9189ccb0600429534b469b76a2.

Revert to sha256

Test case for illegal puts

Use SETNX to set object hashes

Initialize digest with zeros

Initialize plasma_request with zeros

* Fixes

* replace SHA256 with a faster hash in the object store

* Fix valgrind

* Address Robert's comments

* Check that plasma_compute_object_hash succeeds.

* Don't run test_illegal_put test with valgrind because it causes an intentional crash which causes valgrind to complain.

* Debugging after rebase.

* handling Robert's comments

* Fix bugs after rebase.

* final fixes for Stephanie's PR

* fix
2016-12-08 20:57:08 -08:00
Robert Nishihara
4a62a3c5d7 Cause plasma client tests failures to show up on travis. (#99)
* Cause plasma client tests failures to show up on travis.

* Don't run flaky test.
2016-12-08 19:32:24 -08:00
atumanov
1c946b2f6a Factoring out object_info structure for use in several Ray components (#101)
* change plasma object notifications to carry a struct of information

* factoring out object_info for general use by several Ray components

* fixing a bug in python test

* addressing comments

* handling Robert's comments

* clang format

* Fix valgrind.
2016-12-08 19:14:10 -08:00
atumanov
88206417cb unifying plasma seal path through the store to mgr to redis (#96) 2016-12-07 17:25:40 -08:00