Commit graph

146 commits

Author SHA1 Message Date
Stephanie Wang
da06b4db82 Warn the user when a nondeterministic task is detected. (#339)
* WARN instead of FATAL for object hash mismatches, push error to driver

* Document the callback signature for object_table_add/remove

* Error table

* Wait for all errors in python test

* Fix doc

* Fix state test
2017-03-07 00:32:15 -08:00
Philipp Moritz
0b8d279ef2 Convert task_spec to flatbuffers (#255)
* convert Ray to C++

* convert task_spec to flatbuffers

* fix

* it compiles

* latest

* tests are passing

* task2 -> task

* fix

* fix

* fix

* fix

* fix

* linting

* fix valgrind

* upgrade flatbuffers

* use debug mode for valgrind

* fix naming and comments

* downgrade flatbuffers

* fix linting

* reintroduce TaskSpec_free

* rename TaskSpec -> TaskInfo

* refactoring

* linting
2017-03-05 02:05:02 -08:00
Robert Nishihara
65a8659f3d Some plasma manager transfer optimizations. (#334)
* Change tranfer queue to doubly-linked list to speed up append.

* Maintain set of pending transfers to make deduplication easy.

* Fix naming convention for structs in plasma manager.
2017-03-04 23:15:17 -08:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Alexey Tumanov
4f9e74469e Fix segfault induced by getting more than 200k objects (#333)
* [RAY-567]: allocate memory on the heap for large gets

* linting
2017-03-02 01:35:10 -08:00
Philipp Moritz
793a102846 Make Ray code C++ compatible (#321)
* convert Ray to C++
* const correctness
2017-03-01 01:17:24 -08:00
Alexey Tumanov
b91d9cba45 Adding flatbuffers and migrating flatcc to flatbuffers for plasma (#325)
* adding flatbuffers and migrating flatcc to flatbuffers for plasma

* variable name changes in plasma_protocol and plasma flatbuffers schema

* quick fix

* cleanups and remove flatcc

* more cleanup

* add doc

* linting

* fix linting

* fix mac os x build

* linting

* cleanup

* c++ fix for plasma flatbuffers

* Remove flatcc from CMakeLists.txt.

* linting; trigger travis
2017-02-28 18:47:40 -08:00
Philipp Moritz
a30eed452e Change type naming convention. (#315)
* Rename object_id -> ObjectID.

* Rename ray_logger -> RayLogger.

* rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID

* Rename plasma_store_info -> PlasmaStoreInfo.

* Rename plasma_store_state -> PlasmaStoreState.

* Rename plasma_object -> PlasmaObject.

* Rename object_request -> ObjectRequests.

* Rename eviction_state -> EvictionState.

* Bug fix.

* rename db_handle -> DBHandle

* Rename local_scheduler_state -> LocalSchedulerState.

* rename db_client_id -> DBClientID

* rename task -> Task

* make redis.c C++ compatible

* Rename scheduling_algorithm_state -> SchedulingAlgorithmState.

* Rename plasma_connection -> PlasmaConnection.

* Rename client_connection -> ClientConnection.

* Fixes from rebase.

* Rename local_scheduler_client -> LocalSchedulerClient.

* Rename object_buffer -> ObjectBuffer.

* Rename client -> Client.

* Rename notification_queue -> NotificationQueue.

* Rename object_get_requests -> ObjectGetRequests.

* Rename get_request -> GetRequest.

* Rename object_info -> ObjectInfo.

* Rename scheduler_object_info -> SchedulerObjectInfo.

* Rename local_scheduler -> LocalScheduler and some fixes.

* Rename local_scheduler_info -> LocalSchedulerInfo.

* Rename global_scheduler_state -> GlobalSchedulerState.

* Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState.

* Rename object_size_entry -> ObjectSizeEntry.

* Rename aux_address_entry -> AuxAddressEntry.

* Rename various ID helper methods.

* Rename Task helper methods.

* Rename db_client_cache_entry -> DBClientCacheEntry.

* Rename local_actor_info -> LocalActorInfo.

* Rename actor_info -> ActorInfo.

* Rename retry_info -> RetryInfo.

* Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData.

* Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData.

* Rename table_callback_data -> TableCallbackData.

* Rename object_info_subscribe_data -> ObjectInfoSubscribeData.

* Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData.

* Rename more redis call data structures.

* Rename photon_conn PhotonConnection.

* Rename photon_mock -> PhotonMock.

* Fix formatting errors.
2017-02-26 00:32:43 -08:00
Robert Nishihara
232601f90d Change all table calls to use default retry behavior. (#312)
* Change all table calls to use default retry behavior and change default retry behavior.

* Add warning for table retries.
2017-02-24 12:41:32 -08:00
Stephanie Wang
334aed9fa9 Fetch the object after requesting reconstruction during ray.get (#301)
* Fetch the object after requesting reconstruction during ray.get

* revert

* Fix documentation and memory leak

* Fix hanging reconstruction bug

* Fix for python3
2017-02-20 21:41:34 -08:00
Stephanie Wang
67c591c33b Retry connections in photon connect, consolidate code in io.c (#294) 2017-02-17 23:41:21 -08:00
Alexey Tumanov
dfb6107b22 General attribute-based heterogeneity support with hard and soft constraints (#248)
* attribute-based heterogeneity-awareness in global scheduler and photon

* minor post-rebase fix

* photon: enforce dynamic capacity constraint on task dispatch

* globalsched: cap the number of times we try to schedule a task in round robin

* propagating ability to specify resource capacity to ray.init

* adding resources to remote function export and fetch/register

* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)

* Add some integration tests.

* globalsched: cleanup + factor out constraint checking

* lots of style

* task_spec_required_resource: global refactor

* clang format

* clang format + comment update in photon

* clang format photon comment

* valgrind

* reduce verbosity for Travis

* Add test for scheduler load balancing.

* addressing comments

* refactoring global scheduler algorithm

* Minor cleanups.

* Linting.

* Fix array_test.py and linting.

* valgrind fix for photon tests

* Attempt to fix stress tests.

* fix hashmap free

* fix hashmap free comment

* memset photon resource vectors to 0 in case they get used before the first heartbeat

* More whitespace changes.

* Undo whitespace error I introduced.
2017-02-09 01:34:14 -08:00
Philipp Moritz
ca254b8689 Fix stack overflow if many objects are fetched. (#237)
* fix stack overflow if many objects are fetched

* fix other stack allocations

* add tests and fix linting

* address stephanie's comments

* fix linting

* fix tests
2017-02-04 16:49:36 -08:00
Stephanie Wang
241b539ff8 Reconstruction for evicted objects (#181)
* First pass at reconstruction in the worker

Modify reconstruction stress testing to start Plasma service before rest of Ray cluster

TODO about reconstructing ray.puts

Fix ray.put error for double creates

Distinguish between empty entry and no entry in object table

Fix test case

Fix Python test

Fix tests

* Only call reconstruct on objects we have not yet received

* Address review comments

* Fix reconstruction for Python3

* remove unused code

* Address Robert's comments, stress tests are crashing

* Test and update the task's scheduling state to suppress duplicate
reconstruction requests.

* Split result table into two lookups, one for task ID and the other as a
test-and-set for the task state

* Fix object table tests

* Fix redis module result_table_lookup test case

* Multinode reconstruction tests

* Fix python3 test case

* rename

* Use new start_redis

* Remove unused code

* lint

* indent

* Address Robert's comments

* Use start_redis from ray.services in state table tests

* Remove unnecessary memset
2017-02-01 19:18:46 -08:00
Robert Nishihara
f69d4aaaa7 Change fetch requests in plasma manager to use a single timer. (#234)
* Change fetch requests in plasma manager to use a single timer.

* Fix manager tests, other cleanups.
2017-02-01 12:21:52 -08:00
Stephanie Wang
a5c8f28f33 Plasma subscribe (#227)
* Use object_info as notification, not just the object_id

* Add a regression test for plasma managers connecting to store after some objects have been created

* Send notifications for existing objects to new plasma subscribers

* Continuously try the request to the plasma manager instead of setting a timeout in the test case

* Use ray.services to start Redis in plasma test cases

* fix test case
2017-01-25 22:57:15 -08:00
Robert Nishihara
b98a63fd3a Change get to take a timeout and multiple object IDs. (#212)
* Change plasma_get to take a timeout and an array of object IDs.

* Address comments.

* Bug fix related to computing object hashes.

* Add test.

* Fix file descriptor leak.

* Fix valgrind.

* Formatting.

* Remove call to plasma_contains from the plasma client. Use timeout internally in ray.get.

* small fixes
2017-01-19 12:21:12 -08:00
Robert Nishihara
677a019cbd Remove unnecessary bookkeepping in utlist in plasma client. (#215) 2017-01-18 23:03:08 -08:00
Robert Nishihara
303d0fed3e Prevent plasma store and manager from dying when a client dies. (#203)
* Prevent plasma store and manager from dying when a worker dies.

* Check errno inside of warn_if_sigpipe. Passing in errno doesn't work because the arguments to warn_if_sigpipe can be evaluated out of order.
2017-01-17 20:34:31 -08:00
Philipp Moritz
7f329db4b2 wait until kill operation was successful (#210) 2017-01-17 20:15:48 -08:00
Philipp Moritz
a708e36225 Switch build system to use CMake completely. (#200)
* switch to CMake completely

...

* cleanup

* Run C tests, update installation instructions.
2017-01-17 16:56:40 -08:00
Philipp Moritz
ab3448a9b4 Plasma Optimizations (#190)
* bypass python when storing objects into the object store

* clang-format

* Bug fixes.

* fix include paths

* Fixes.

* fix bug

* clang-format

* fix

* fix release after disconnect
2017-01-09 20:15:54 -08:00
Stephanie Wang
c13d73b4c9 Suppress duplicate transfer requests (#185) 2017-01-06 22:14:51 -08:00
Johann Schleier-Smith
b1e76e582e Check /dev/shm on Linux (#174)
* check available shared memory when starting object store

* exit with error if not enough shared memory available for object store

* Some comments and formatting.
2017-01-03 12:33:29 -08:00
Stephanie Wang
6828d694ae Test object notifications from Plasma store (#141)
* Object notification test for Photon, and turn on valgrind for Photon C tests

* Test object notification handler in the plasma manager

* Fix hanging test case
2016-12-29 23:10:38 -08:00
Robert Nishihara
baf835efcd Throw Python exception if plasma store cannot create new object. (#162)
* Propagate error messages through plasma create.

* Use custom exception types instead of exception messages.
2016-12-28 11:56:16 -08:00
Robert Nishihara
10e067e5e5 Delay releasing a maximum number of bytes in the plasma client. (#160)
* Send message from plasma client to get plasma store capacity.

* Release objects from plasma client if they are too large.

* Use doubly-linked list instead of ring buffer for plasma client release history.

* Address comments.

* Fix problem with slicing PlasmaBuffer objects.

* Fix crash in plasma manager during transfer.

* Formatting.

* Make plasma client cache larger and make caching test not throw exceptions on Travis.
2016-12-27 19:51:26 -08:00
Robert Nishihara
26941e02aa Attempt to free up to 20% of the plasma store capacity during eviction. (#159) 2016-12-27 12:12:33 -08:00
Philipp Moritz
d6695c867a fix wait test (#158) 2016-12-25 23:43:01 -08:00
Robert Nishihara
9bb9f8cb54 Fix bug in ray.wait. (#153)
* Fix bug in wait implementation.

* Add test that exposes previous bug.
2016-12-23 16:22:41 -08:00
Alexey Tumanov
46a887039e Global scheduler - per-task transfer-aware policy (#145)
* global scheduler with object transfer cost awareness -- upstream rebase

* debugging global scheduler: multiple subscriptions

* global scheduler: utarray push bug fix; tasks change state to SCHEDULED

* change global scheduler test to be an integraton test

* unit and integration tests are passing for global scheduler

* improve global scheduler test: break up into several

* global scheduler checkpoint: fix photon object id bug in test

* test with timesync between object and task notifications; TODO: handle OoO object+task notifications in GS

* fallback to base policy if no object dependencies are cached (may happen due to OoO object+task notification arrivals

* clean up printfs; handle a missing LS in LS cache

* Minor changes to Python test and factor out some common code.

* refactoring handle task waiting

* addressing comments

* log_info -> log_debug

* Change object ID printing.

* PRId64 merge

* Python 3 fix.

* PRId64.

* Python 3 fix.

* resurrect differentiation between no args and missing object info; spacing

* Valgrind fix.

* Run all global scheduler tests in valgrind.

* clang format

* Comments and documentation changes.

* Minor cleanups.

* fix whitespace

* Fix.

* Documentation fix.
2016-12-22 03:11:46 -08:00
Robert Nishihara
6cd02d71f8 Fixes and cleanups for the multinode setting. (#143)
* Add function for driver to get address info from Redis.

* Use Redis address instead of Redis port.

* Configure Redis to run in unprotected mode.

* Add method for starting Ray processes on non-head node.

* Pass in correct node ip address to start_plasma_manager.

* Script for starting Ray processes.

* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.

* Have driver get info from Redis when start_ray_local=False.

* Fix.

* Script for killing ray processes.

* Catch some errors when the main_loop in a worker throws an exception.

* Allow redirecting stdout and stderr to /dev/null.

* Wrap start_ray.py in a shell script.

* More helpful error messages.

* Fixes.

* Wait for redis server to start up before configuring it.

* Allow seeding of deterministic object ID generation.

* Small change.
2016-12-21 18:53:12 -08:00
Robert Nishihara
c9c1b3e6af Change db_connect to allow different arguments from different processes. (#142)
* Allow db_connect to take a variable number of arguments.

* Fix tests.

* Fixes.

* Formatting.

* Fixes.

* Simplifications.

* Fix typo.
2016-12-20 20:21:35 -08:00
Philipp Moritz
0ca0864856 Use flatcc for serialization of IPC messages. (#140)
* added Phllipp's updates

* Switch to using flatbuffers for IPC.

* Various changes.

* convert remaining messages and cleanups

* fix

* fix function signatures

* fix valgrind errors

* clang-format

* final commit

* Fix valgrind test.
2016-12-20 14:46:25 -08:00
Stephanie Wang
d729f9b7ea Object table remove (#139)
* Object table remove redis module

* Test case for object table remove redis module

* Client code for object_table_remove

* Delete object notifications in plasma

* Test for object deletion notifications

* Fix subscribe deletion test

* Address Robert's comments

* free hash table entry
2016-12-19 23:18:57 -08:00
Alexey Tumanov
cb3e6cde9e passing object info information with redis module (#138)
* adding object broadcast channel; published on each object table add

* publishing data size to the bcast channel

* bug fix: objectkey

* update object tests to test for data size: C + py

* remove debug

* clang format

* Minor changes.

* Fix error.

* merging with Robert's comments

* clang format for the object table test upgrade
2016-12-19 21:07:25 -08:00
Robert Nishihara
269f37e26f Implement object table notification subscriptions and switch to using Redis modules for object table. (#134)
* Implement RAY.OBJECT_TABLE_REQUEST_NOTIFICATIONS.

* Call object_table_request_notifications from plasma manager.

* Use Redis modules for object table.

* Cleaning up code.

* More checks.

* Formatting.

* Make object table tests pass.

* Formatting.

* Add prefix to the object notification channel name.

* Formatting.

* Fixes.

* Increase time in redismodule test.
2016-12-18 18:19:02 -08:00
Robert Nishihara
58a873eb20 Deploy Redis module and start using custom Redis commands. (#128)
* Add RAY.CONNECT Redis command.

* Add RAY.GET_CLIENT_ADDRESS command.

* Build and clean Redis in common Makefile.

* Use custom Redis module in Ray and use custom CONNECT and GET_CLIENT_ADDRESS commands.

* Fixes.

* Remove mapping from redis client ID to ray db client ID.

* Fix.
2016-12-16 14:40:44 -08:00
Robert Nishihara
79dd1815a2 Python 3 compatibility. (#121)
* Make common module Python 3 compatible.

* Make plasma module Python 3 compatible.

* Make photon module Python 3 compatible.

* Make numbuf module Python 3 compatible.

* Remaining changes for Python 3 compatibility.

* Test Python 3 in Travis.

* Fixes.
2016-12-16 14:40:37 -08:00
Stephanie Wang
4bdb9f7224 Object reconstruction in Photon (#65)
* Object reconstruction in Photon and C test cases for Photon

* Fix hanging test case on mac

* Remove unnecessary event from photon tests

* make photon_disconnect not leak file descriptors

* fix some of the memory errors

* Fix valgrind

* lint

* Address Robert's comments and add test case for object reconstruction suppression

* Remove OWNER
2016-12-12 23:17:22 -08:00
Robert Nishihara
ddba1df802 Start working toward Python3 compatibility. (#117) 2016-12-11 12:25:31 -08:00
Robert Nishihara
9474d03912 Switch to updated Plasma API and consolidate wait and fetch implementations. (#116)
* Consolidate wait implementations.

* Consolidate fetch implementations.

* Share callback between wait and fetch to address issue in which only one callback can be run for a given subscribe channel.

* Reactivate manager tests.

* Remove more code.

* Add some documentation.
2016-12-10 21:22:05 -08:00
Robert Nishihara
86973059de Switch to new wait implementation. (#113)
* Duplicate wait1 implementation and seperate out wait datastructures.

* Address Philipp's comments.

* Temporarily address test failure problem by increasing timeout and reducing load in tests.

* Update stress tests to include distributed wait.
2016-12-09 19:26:11 -08:00
Robert Nishihara
46d0e6bdfb In tests, check that processes are still alive before killing them to catch crashes. (#110) 2016-12-09 13:04:08 -08:00
Robert Nishihara
5819e90962 Remove all object files when doing make clean. (#109) 2016-12-09 11:00:00 -08:00
Alexey Tumanov
0abbf5a113 End-to-end object size information passthrough (#105)
* rebase Alexey's PR on top

* rebase on master

* fix test failure waiting for plasma manager to exit

* clang format

* addressing comments

* Minor formatting and naming fixes.
2016-12-09 00:51:44 -08:00
Stephanie Wang
61904c4c3e Object hashes (#104)
* factoring out object_info for general use by several Ray components

* addressing comments

* Replace SHA256 task hash with MD5

Add object hash to object table (always overwrites)

Support for table operations that span multiple asynchronous Redis
commands

Add a new object location in a transaction, using Redis's optimistic
concurrency

Use Redis GETSET instead of transactions and Python frontend code for object hashing

Remove spurious log message

Fix for object_table_add

Revert "Replace SHA256 task hash with MD5"

This reverts commit e599de473c8dad9189ccb0600429534b469b76a2.

Revert to sha256

Test case for illegal puts

Use SETNX to set object hashes

Initialize digest with zeros

Initialize plasma_request with zeros

* Fixes

* replace SHA256 with a faster hash in the object store

* Fix valgrind

* Address Robert's comments

* Check that plasma_compute_object_hash succeeds.

* Don't run test_illegal_put test with valgrind because it causes an intentional crash which causes valgrind to complain.

* Debugging after rebase.

* handling Robert's comments

* Fix bugs after rebase.

* final fixes for Stephanie's PR

* fix
2016-12-08 20:57:08 -08:00
Robert Nishihara
4a62a3c5d7 Cause plasma client tests failures to show up on travis. (#99)
* Cause plasma client tests failures to show up on travis.

* Don't run flaky test.
2016-12-08 19:32:24 -08:00
atumanov
1c946b2f6a Factoring out object_info structure for use in several Ray components (#101)
* change plasma object notifications to carry a struct of information

* factoring out object_info for general use by several Ray components

* fixing a bug in python test

* addressing comments

* handling Robert's comments

* clang format

* Fix valgrind.
2016-12-08 19:14:10 -08:00
atumanov
88206417cb unifying plasma seal path through the store to mgr to redis (#96) 2016-12-07 17:25:40 -08:00