Commit graph

474 commits

Author SHA1 Message Date
Robert Nishihara
c688a64235 Expose GPU IDs to remote functions. (#496)
* Change local scheduler bookkeeping to use GPU IDs.

* Update actor test.

* Add tests for actors and tasks simultaneously using GPUs.

* Add additional task GPU ID test.

* Fix linting.

* Make redis GPU assignment ignore GPU IDs.

* Small fix.
2017-05-07 13:03:49 -07:00
Philipp Moritz
1dddd5336a Fix actor bug arising from overwriting task specifications in the local scheduler (#513)
* copy task specifications put into the actor task cache so it won't get overwritten when the scheduler receives the next task

* cleanup

* cleanup and fix

* linting

* fix jenkins test

* fix linting
2017-05-06 17:39:35 -07:00
Stephanie Wang
e50a23b820 Fix bug with reused file descriptors (#471)
* Fix bug with reused file descriptors

* Remove client connection if write_object_chunk fails

* Handle ECONNRESET on unsuccessful write

* lint

* Back to lowercase

* fix compilation

* fix linting
2017-05-02 19:45:27 -07:00
Robert Nishihara
2bbfc5da8d Dispatch actor tasks when actor connects. (#495) 2017-04-28 17:36:43 -07:00
Robert Nishihara
6d301d9079 Simplify resource bookkeeping in local scheduler. (#494)
* Simplify resource bookkeeping in local scheduler.

* Change ints to doubles.
2017-04-28 12:09:47 -07:00
Robert Nishihara
eea19371b7 Suppress warning about worker dying when driver exits. (#492) 2017-04-26 23:52:13 -07:00
Robert Nishihara
1627f89945 Fix problem in which actors and workers running tasks are not killed by driver exit. (#490)
* Augment test to verify that relevant workers and actors are killed during driver cleanup.

* Fix bug in which we were only killing one worker when a driver exited.

* Fix remove driver test.

* Fix and augment test.
2017-04-26 15:13:39 -07:00
Philipp Moritz
b7ace01b5f Convert Plasma client to STL (#486)
* convert mmap table to STL

* update

* fix

* convert objects_in_use

* fix

* convert release_history

* cleanup

* linting

* update

* fix

* linting
2017-04-25 01:25:40 -07:00
Robert Nishihara
0ac125e9b2 Clean up when a driver disconnects. (#462)
* Clean up state when drivers exit.

* Remove unnecessary field in ActorMapEntry struct.

* Have monitor release GPU resources in Redis when driver exits.

* Enable multiple drivers in multi-node tests and test driver cleanup.

* Make redis GPU allocation a redis transaction and small cleanups.

* Fix multi-node test.

* Small cleanups.

* Make global scheduler take node_ip_address so it appears in the right place in the client table.

* Cleanups.

* Fix linting and cleanups in local scheduler.

* Fix removed_driver_test.

* Fix bug related to vector -> list.

* Fix linting.

* Cleanup.

* Fix multi node tests.

* Fix jenkins tests.

* Add another multi node test with many drivers.

* Fix linting.

* Make the actor creation notification a flatbuffer message.

* Revert "Make the actor creation notification a flatbuffer message."

This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39.

* Add comment explaining flatbuffer problems.
2017-04-24 18:10:21 -07:00
Philipp Moritz
8194b71f32 Convert pending_notifications to STL (#484)
* temp commit

* converted more plasma notifications

* cleanup

* rename

* linting

* fixes

* fixes
2017-04-24 14:41:34 -07:00
Philipp Moritz
892e53d69e Convert plasma client array and object notification queue to STL (#482)
* Convert plasma clients to STL

* use a deque for object notifications in plasma store for perf

* cleanup

* linting

* fix include order
2017-04-24 00:43:48 -07:00
Philipp Moritz
e36de2dad1 Convert object table to STL (#480)
* convert object table to stl

* temp commit

* fix

* comments

* linting
2017-04-23 22:24:05 -07:00
Alexey Tumanov
a67a107e0e Fix int-type compilation problem on redhat. (#472) 2017-04-19 02:43:33 -07:00
Richard Shin
cf68cf743c Change UniqueID hash function to look at the lowest instead of highest bytes. (#469) 2017-04-18 15:31:49 -07:00
Philipp Moritz
8ac6c59931 Remove n^2 algorithm in plasma get (#466)
Remove n^2 algorithm in plasma get.
2017-04-17 23:37:33 -07:00
Guru Medasani
0189b09581 Fixes Mac OSX installation error (#464)
* changes to address ARROW-826 and ARROW-444

* changes to address ARROW-826 and ARROW-444

* ignoring cmake-build-debug

* additional IDEA ignore files

* additional IDEA ignore files

* remove arrow ipc and arrow io libraries

* add boost dependencies

* fix arrow origin and remove submodule
2017-04-16 15:02:15 -07:00
Robert Nishihara
dad57e3b62 Convert actor data structures to C++. (#454) 2017-04-12 01:18:16 -07:00
Robert Nishihara
fb4525f833 Convert some local scheduler data structures to C++ STL. (#445)
* Convert more local scheduler data structures to C++ STL.

* Convert vector pointer to vector.

* Convert some of the UT_arrays to std::vector.

* Simplify worker vectors.

* Simplify remote_object and local_object containers.

* Change some unnecessary checks to DCHECK.
2017-04-10 21:02:36 -07:00
Philipp Moritz
6ffc849d23 Use Arrow Tensors for serializing numpy arrays and get rid of extra memcpy. (#436)
* Use Arrow Tensors for serializing numpy arrays and get rid of extra memcpy

* fix nondeterminism problem

* mark array as immutable

* make arrays contiguous

* fix serialize_list and deserialize_list

* fix numbuf tests

* linting

* add optimization flags

* fixes

* roll back arrow
2017-04-10 01:37:34 -07:00
Robert Nishihara
c9d66555e2 Fix bug in queue_task function in local scheduler. (#443) 2017-04-09 19:34:43 -07:00
Robert Nishihara
05fd4c2c37 Changes to local scheduler client protocol. (#435)
* Make local scheduler clients receive reply upon registration.

* Fix tests and linting.
2017-04-07 23:03:37 -07:00
Alexey Tumanov
6f9225490b Plasma manager performance: speed up wait with a wait request object map (#427)
* plasma manager perf: speedup wait with a wait request object map

* removing duplicate == operator in plasma store

* fix serialization test

* code cleanup

* minor cleanup

* factoring out uniqueid hash and equality operators into common

* plasma manager: c++ify the WaitRequest struct

* plasma manager: get rid of the initial object request malloc

* cleanup

* linting

* cleanups and fix compiler warnings

* compiler warnings and linting
2017-04-07 12:32:12 -07:00
Robert Nishihara
fa363a5a3a Notify driver when a worker dies while executing a task. (#419)
* Notify driver when a worker dies while executing a task.

* Fix linting.

* Don't push error when local scheduler is cleaning up.
2017-04-06 00:02:39 -07:00
Stephanie Wang
93679df724 Stopped nodes can rejoin immediately (#428)
* Ignore deleted clients when reading address info from Redis

* Remove self from db_client table when exiting cleanly

* Fix valgrind test

* Do not call plasma_perform_release when disconnecting
2017-04-05 23:50:38 -07:00
Philipp Moritz
4043769ba2 Make putting large objects work. (#411)
* putting large objects

* add more checks

* support large objects

* fix test

* fix linting

* upgrade to latest arrow version

* check malloc return code

* print mmap file sizes

* printing

* revert to dlmalloc

* add prints

* more prints

* add printing

* printing

* fix

* update

* fix

* update

* print

* initialization

* temp

* fix

* update

* fix linting

* comment out object_store_full tests

* fix test

* fix test

* evict objects if dlmalloc fails

* fix stresstests

* Fix linting.

* Uncomment large-memory tests.

* Increase memory for docker image for jenkins tests.

* Reduce large memory tests.

* Further reduce large memory tests.
2017-04-05 01:04:05 -07:00
Robert Nishihara
1e84747e13 Remove incorrect check. (#421) 2017-04-03 14:51:53 -07:00
Richard Shin
227c916c25 Convert plasma/plasma_store.cc to use STL (#324)
* Change plasma_store.c to C++ (clobbering existing FlatBuffers usage).

* Convert plasma_store.cc to use STL (with a caveat)

* Fix CMakeLists and mutation-while-iterating problem

* Remove extra extern "C" declarations

* Remove redundant -std=c++11 from plasma/CMakeLists.txt
2017-03-31 22:58:10 -07:00
Robert Nishihara
f1b48f2fd4 Avoid publishing in the task table unnecessarily. (#416) 2017-03-30 13:41:32 -07:00
Stephanie Wang
036b873bf2 Implement local scheduler task queues using C++ data structures (#392)
* Switch to using C++ lists for task queues

* Init and free methods for TaskQueueEntry

* Switch from utarray to c++ vector for TaskQueueEntry

* Get rid of some pointers

* Back to O(1) deletion from waiting_task_queue

* Fix comments

* Cut code

* Non const iterators

* Fix Alexey's comments
2017-03-30 00:40:01 -07:00
Alexey Tumanov
78e1167a42 Parallelize make in build.sh. (#371)
* parallelize build.sh make

* Encode in cmake the dependency of ray_redis_module on autogenerated flatbuffer files.
2017-03-27 20:55:50 -07:00
Alexey Tumanov
a3d58607bf parallelize numbuf memcpy and plasma object hash construction (#366)
* parallelizing memcopy and object hash construction in numbuf/plasma

* clang format

* whitespace

* refactoring compute object hash: get rid of the prefix chunk

* clang format

* Document performance optimization.

* Remove check for 64-byte alignment, since it may not be guaranteed.
2017-03-21 16:17:35 -07:00
Robert Nishihara
ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Stephanie Wang
083e7a28ad Push an error to the driver when the workload hangs on ray.put reconstruction (#382)
* Fix worker blocked bug

* tmp

* Push an error to the driver on ray.put for non-driver tasks

* Fix result table tests

* Fix test, logging

* Address comments

* Fix suppression bug

* Fix redis module test

* Edit error message

* Get values in chunks during reconstruction

* Test case for driver ray.put errors

* Error for evicting ray.put objects from the driver

* Fix tests

* Reduce verbosity

* Documentation
2017-03-21 00:16:48 -07:00
Philipp Moritz
4618fd45b1 Port Ray to latest Arrow version (#370)
* rebase on top of latest arrow

* clang-format

* address comments

* fix
2017-03-20 16:31:46 -07:00
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Philipp Moritz
068429ffd8 Convert local scheduler messages to flatbuffers (#340)
* use flatbuffer messages for local scheduler

* make sure constructor gets called for C++ object ObjectInfoT

* fix typo

* fix Robert's comments

* Small change to actor test.

* fix valgrind error

* linting

* free notification

* fix

* valgrind

* fix valgrind

* fix other bugs

* valgrind fix

* fixes

* more fixes

* Small changes to comments.
2017-03-15 16:27:52 -07:00
Robert Nishihara
53dffe0bf2 Use flatbuffers for some messages from Redis. (#341)
* Compile the Ray redis module with C++.

* Redo parsing of object table notifications with flatbuffers.

* Update redis module python tests.

* Redo parsing of task table notifications with flatbuffers.

* Fix linting.

* Redo parsing of db client notifications with flatbuffers.

* Redo publishing of local scheduler heartbeats with flatbuffers.

* Fix linting.

* Remove usage of fixed-width formatting of scheduling state in channel name.

* Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion.

* Fix linting and tests.

* fix

* cleanup

* simplify logic in ReplyWithTask
2017-03-10 18:35:25 -08:00
Philipp Moritz
0de57be085 upgrade flatbuffers to 1.6.0 (#345) 2017-03-07 21:33:46 -08:00
Stephanie Wang
da06b4db82 Warn the user when a nondeterministic task is detected. (#339)
* WARN instead of FATAL for object hash mismatches, push error to driver

* Document the callback signature for object_table_add/remove

* Error table

* Wait for all errors in python test

* Fix doc

* Fix state test
2017-03-07 00:32:15 -08:00
Philipp Moritz
0b8d279ef2 Convert task_spec to flatbuffers (#255)
* convert Ray to C++

* convert task_spec to flatbuffers

* fix

* it compiles

* latest

* tests are passing

* task2 -> task

* fix

* fix

* fix

* fix

* fix

* linting

* fix valgrind

* upgrade flatbuffers

* use debug mode for valgrind

* fix naming and comments

* downgrade flatbuffers

* fix linting

* reintroduce TaskSpec_free

* rename TaskSpec -> TaskInfo

* refactoring

* linting
2017-03-05 02:05:02 -08:00
Robert Nishihara
65a8659f3d Some plasma manager transfer optimizations. (#334)
* Change transfer queue to doubly-linked list to speed up append.

* Maintain set of pending transfers to make deduplication easy.

* Fix naming convention for structs in plasma manager.
2017-03-04 23:15:17 -08:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Alexey Tumanov
4f9e74469e Fix segfault induced by getting more than 200k objects (#333)
* [RAY-567]: allocate memory on the heap for large gets

* linting
2017-03-02 01:35:10 -08:00
Robert Nishihara
6a4bde54dc Only install ray python packages. (#330)
* Only install ray python packages.

* Add some __init__.py files.

* Install Ray before building documentation.

* Fix install-ray.sh.

* Fix.
2017-03-01 23:34:44 -08:00
Philipp Moritz
793a102846 Make Ray code C++ compatible (#321)
* convert Ray to C++
* const correctness
2017-03-01 01:17:24 -08:00
Alexey Tumanov
b91d9cba45 Adding flatbuffers and migrating flatcc to flatbuffers for plasma (#325)
* adding flatbuffers and migrating flatcc to flatbuffers for plasma

* variable name changes in plasma_protocol and plasma flatbuffers schema

* quick fix

* cleanups and remove flatcc

* more cleanup

* add doc

* linting

* fix linting

* fix mac os x build

* linting

* cleanup

* c++ fix for plasma flatbuffers

* Remove flatcc from CMakeLists.txt.

* linting; trigger travis
2017-02-28 18:47:40 -08:00
Robert Nishihara
1a997ed279 Move documentation to ReadTheDocs. (#326) 2017-02-27 21:14:31 -08:00
Robert Nishihara
1ae7e7d29e Rename photon -> local scheduler. (#322) 2017-02-27 12:24:07 -08:00
Philipp Moritz
a30eed452e Change type naming convention. (#315)
* Rename object_id -> ObjectID.

* Rename ray_logger -> RayLogger.

* rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID

* Rename plasma_store_info -> PlasmaStoreInfo.

* Rename plasma_store_state -> PlasmaStoreState.

* Rename plasma_object -> PlasmaObject.

* Rename object_request -> ObjectRequests.

* Rename eviction_state -> EvictionState.

* Bug fix.

* rename db_handle -> DBHandle

* Rename local_scheduler_state -> LocalSchedulerState.

* rename db_client_id -> DBClientID

* rename task -> Task

* make redis.c C++ compatible

* Rename scheduling_algorithm_state -> SchedulingAlgorithmState.

* Rename plasma_connection -> PlasmaConnection.

* Rename client_connection -> ClientConnection.

* Fixes from rebase.

* Rename local_scheduler_client -> LocalSchedulerClient.

* Rename object_buffer -> ObjectBuffer.

* Rename client -> Client.

* Rename notification_queue -> NotificationQueue.

* Rename object_get_requests -> ObjectGetRequests.

* Rename get_request -> GetRequest.

* Rename object_info -> ObjectInfo.

* Rename scheduler_object_info -> SchedulerObjectInfo.

* Rename local_scheduler -> LocalScheduler and some fixes.

* Rename local_scheduler_info -> LocalSchedulerInfo.

* Rename global_scheduler_state -> GlobalSchedulerState.

* Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState.

* Rename object_size_entry -> ObjectSizeEntry.

* Rename aux_address_entry -> AuxAddressEntry.

* Rename various ID helper methods.

* Rename Task helper methods.

* Rename db_client_cache_entry -> DBClientCacheEntry.

* Rename local_actor_info -> LocalActorInfo.

* Rename actor_info -> ActorInfo.

* Rename retry_info -> RetryInfo.

* Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData.

* Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData.

* Rename table_callback_data -> TableCallbackData.

* Rename object_info_subscribe_data -> ObjectInfoSubscribeData.

* Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData.

* Rename more redis call data structures.

* Rename photon_conn -> PhotonConnection.

* Rename photon_mock -> PhotonMock.

* Fix formatting errors.
2017-02-26 00:32:43 -08:00
Stephanie Wang
be1618f041 Availability after worker failure (#316)
* Availability after a killed worker

* Workers exit cleanly

* Memory cleanup in photon C tests

* Worker failure in multinode

* Consolidate worker cleanup handlers

* Update the result table before handling a task submission

* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS

* Log a warning instead of crashing if no result table entry found
2017-02-25 20:19:36 -08:00