Commit graph

76 commits

Author SHA1 Message Date
Stephanie Wang
ee08c8274b Shard Redis. (#539)
* Implement sharding in the Ray core

* Single node Python modifications to do sharding

* Do the sharding in redis.cc

* Pipe num_redis_shards through start_ray.py and worker.py.

* Use multiple redis shards in multinode tests.

* first steps for sharding ray.global_state

* Fix problem in multinode docker test.

* fix runtest.py

* fix some tests

* fix redis shard startup

* fix redis sharding

* fix

* fix bug introduced by the map-iterator being consumed

* fix sharding bug

* shard event table

* update number of Redis clients to be 64K

* Fix object table tests by flushing shards in between unit tests

* Fix local scheduler tests

* Documentation

* Register shard locations in the primary shard

* Add plasma unit tests back to build

* lint

* lint and fix build

* Fix

* Address Robert's comments

* Refactor start_ray_processes to start Redis shard

* lint

* Fix global scheduler python tests

* Fix redis module test

* Fix plasma test

* Fix component failure test

* Fix local scheduler test

* Fix runtest.py

* Fix global scheduler test for python3

* Fix task_table_test_and_update bug, from actor task table submission race

* Fix jenkins tests.

* Retry Redis shard connections

* Fix test cases

* Convert database clients to DBClient struct

* Fix race condition when subscribing to db client table

* Remove unused lines, add APITest for sharded Ray

* Fix

* Fix memory leak

* Suppress ReconstructionTests output

* Suppress output for APITestSharded

* Reissue task table add/update commands if initial command does not publish to any subscribers.

* fix

* Fix linting.

* fix tests

* fix linting

* fix python test

* fix linting
2017-05-18 17:40:41 -07:00
Robert Nishihara
9018dffd7f Fix bug in actor task dispatch. (#552)
* Fix bug in actor task dispatch.

* Return early from dispatch_actor_task if creation notification has not arrived. Also fix comment.
2017-05-15 23:47:15 -07:00
Robert Nishihara
c688a64235 Expose GPU IDs to remote functions. (#496)
* Change local scheduler bookkeeping to use GPU IDs.

* Update actor test.

* Add tests for actors and tasks simultaneously using GPUs.

* Add additional task GPU ID test.

* Fix linting.

* Make redis GPU assignment ignore GPU IDs.

* Small fix.
2017-05-07 13:03:49 -07:00
Philipp Moritz
1dddd5336a Fix actor bug arising from overwriting task specifications in the local scheduler (#513)
* copy task specifications put into the actor task cache so it won't get overwritten when the scheduler receives the next task

* cleanup

* cleanup and fix

* linting

* fix jenkins test

* fix linting
2017-05-06 17:39:35 -07:00
Robert Nishihara
2bbfc5da8d Dispatch actor tasks when actor connects. (#495) 2017-04-28 17:36:43 -07:00
Robert Nishihara
6d301d9079 Simplify resource bookkeeping in local scheduler. (#494)
* Simplify resource bookkeeping in local scheduler.

* Change ints to doubles.
2017-04-28 12:09:47 -07:00
Robert Nishihara
eea19371b7 Suppress warning about worker dying when driver exits. (#492) 2017-04-26 23:52:13 -07:00
Robert Nishihara
1627f89945 Fix problem in which actors and workers running tasks are not killed by driver exit. (#490)
* Augment test to verify that relevant workers and actors are killed during driver cleanup.

* Fix bug in which we were only killing one worker when a driver exited.

* Fix remove driver test.

* Fix and augment test.
2017-04-26 15:13:39 -07:00
Robert Nishihara
0ac125e9b2 Clean up when a driver disconnects. (#462)
* Clean up state when drivers exit.

* Remove unnecessary field in ActorMapEntry struct.

* Have monitor release GPU resources in Redis when driver exits.

* Enable multiple drivers in multi-node tests and test driver cleanup.

* Make redis GPU allocation a redis transaction and small cleanups.

* Fix multi-node test.

* Small cleanups.

* Make global scheduler take node_ip_address so it appears in the right place in the client table.

* Cleanups.

* Fix linting and cleanups in local scheduler.

* Fix removed_driver_test.

* Fix bug related to vector -> list.

* Fix linting.

* Cleanup.

* Fix multi node tests.

* Fix jenkins tests.

* Add another multi node test with many drivers.

* Fix linting.

* Make the actor creation notification a flatbuffer message.

* Revert "Make the actor creation notification a flatbuffer message."

This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39.

* Add comment explaining flatbuffer problems.
2017-04-24 18:10:21 -07:00
Robert Nishihara
dad57e3b62 Convert actor data structures to C++. (#454) 2017-04-12 01:18:16 -07:00
Robert Nishihara
fb4525f833 Convert some local scheduler data structures to C++ STL. (#445)
* Convert more local scheduler data structures to C++ STL.

* Convert vector pointer to vector.

* Convert some of the UT_arrays to std::vector.

* Simplify worker vectors.

* Simplify remote_object and local_object containers.

* Change some unnecessary checks to DCHECK.
2017-04-10 21:02:36 -07:00
Robert Nishihara
c9d66555e2 Fix bug in queue_task function in local scheduler. (#443) 2017-04-09 19:34:43 -07:00
Robert Nishihara
05fd4c2c37 Changes to local scheduler client protocol. (#435)
* Make local scheduler clients receive reply upon registration.

* Fix tests and linting.
2017-04-07 23:03:37 -07:00
Robert Nishihara
fa363a5a3a Notify driver when a worker dies while executing a task. (#419)
* Notify driver when a worker dies while executing a task.

* Fix linting.

* Don't push error when local scheduler is cleaning up.
2017-04-06 00:02:39 -07:00
Stephanie Wang
93679df724 Stopped nodes can rejoin immediately (#428)
* Ignore deleted clients when reading address info from Redis

* Remove self from db_client table when exiting cleanly

* Fix valgrind test

* Do not call plasma_perform_release when disconnecting
2017-04-05 23:50:38 -07:00
Robert Nishihara
1e84747e13 Remove incorrect check. (#421) 2017-04-03 14:51:53 -07:00
Stephanie Wang
036b873bf2 Implement local scheduler task queues using C++ data structures (#392)
* Switch to using C++ lists for task queues

* Init and free methods for TaskQueueEntry

* Switch from utarray to c++ vector for TaskQueueEntry

* Get rid of some pointers

* Back to O(1) deletion from waiting_task_queue

* Fix comments

* Cut code

* Non const iterators

* Fix Alexey's comments
2017-03-30 00:40:01 -07:00
Stephanie Wang
083e7a28ad Push an error to the driver when the workload hangs on ray.put reconstruction (#382)
* Fix worker blocked bug

* tmp

* Push an error to the driver on ray.put for non-driver tasks

* Fix result table tests

* Fix test, logging

* Address comments

* Fix suppression bug

* Fix redis module test

* Edit error message

* Get values in chunks during reconstruction

* Test case for driver ray.put errors

* Error for evicting ray.put objects from the driver

* Fix tests

* Reduce verbosity

* Documentation
2017-03-21 00:16:48 -07:00
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Philipp Moritz
068429ffd8 Convert local scheduler messages to flatbuffers (#340)
* use flatbuffer messages for local scheduler

* make sure constructor gets called for C++ object ObjectInfoT

* fix typo

* fix Robert's comments

* Small change to actor test.

* fix valgrind error

* linting

* free notification

* fix

* valgrind

* fix valgrind

* fix other bugs

* valgrind fix

* fixes

* more fixes

* Small changes to comments.
2017-03-15 16:27:52 -07:00
Stephanie Wang
da06b4db82 Warn the user when a nondeterministic task is detected. (#339)
* WARN instead of FATAL for object hash mismatches, push error to driver

* Document the callback signature for object_table_add/remove

* Error table

* Wait for all errors in python test

* Fix doc

* Fix state test
2017-03-07 00:32:15 -08:00
Philipp Moritz
0b8d279ef2 Convert task_spec to flatbuffers (#255)
* convert Ray to C++

* convert task_spec to flatbuffers

* fix

* it compiles

* latest

* tests are passing

* task2 -> task

* fix

* fix

* fix

* fix

* fix

* linting

* fix valgrind

* upgrade flatbuffers

* use debug mode for valgrind

* fix naming and comments

* downgrade flatbuffers

* fix linting

* reintroduce TaskSpec_free

* rename TaskSpec -> TaskInfo

* refactoring

* linting
2017-03-05 02:05:02 -08:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Robert Nishihara
6a4bde54dc Only install ray python packages. (#330)
* Only install ray python packages.

* Add some __init__.py files.

* Install Ray before building documentation.

* Fix install-ray.sh.

* Fix.
2017-03-01 23:34:44 -08:00
Philipp Moritz
793a102846 Make Ray code C++ compatible (#321)
* convert Ray to C++

* const correctness
2017-03-01 01:17:24 -08:00
Robert Nishihara
1ae7e7d29e Rename photon -> local scheduler. (#322) 2017-02-27 12:24:07 -08:00