Commit graph

1638 commits

Author SHA1 Message Date
Robert Nishihara
27a0d58e54 Include resource string in error message for infeasible actors. (#1768) 2018-04-02 00:31:30 -07:00
Philipp Moritz
71829a2af9 [XRay] Pass in node IP address to Raylet (#1808) 2018-04-02 00:21:19 -07:00
Philipp Moritz
0bda11e009 [XRay] Fix linting (#1809) 2018-04-01 23:11:06 -07:00
Melih Elibol
6e06a9e338 XRay Task Forwarding Milestone (#1785)
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.

Raylet Changes:

Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining NM 2 NM long-lived connections and reusing them for task forwarding.
LineageCache Changes:

LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.
ObjectManager Changes:

Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).
Testing Changes:

Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).


Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
2018-03-31 18:02:58 -07:00
Stephanie Wang
925e392b2d Add an Append call to the GCS Log that checks for current length (#1788)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log

* AppendAt Redis and gcs Log command

* unit test for AppendAt

* Add a Log for task reconstruction data

* Add check for unique entries in TABLE_APPEND

* Documentation
2018-03-27 13:04:43 -07:00
Stephanie Wang
51fdbe3867 Convert the ObjectTable implementation to a Log (#1779)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log
2018-03-26 20:36:48 -07:00
Stephanie Wang
0fd4112354 Introduce a log interface for the new GCS (#1771)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint
2018-03-26 16:00:43 -07:00
Stephanie Wang
0ad1054b8b
Add a GCS table for the xray task flatbuffer (#1775)
* Introduce Task flatbuffer into xray, add to GCS

* Compile and test raylet TaskTable
2018-03-23 13:18:23 -07:00
Stephanie Wang
8704c8618c
Request and cancel notifications in the new GCS API (#1758)
* Add TableRequestNotifications and TableCancelNotifications to Redis modules

* Add RequestNotifications and CancelNotifications to generic GCS Table

* Add tests for subscribing to specific keys

* Remove TODO!

* Return the current value at the key directly from RequestNotifications instead of through publish

* Add unit test for Lookup failure callback

* Modify tests to account for empty subscription response

* Remove ObjectTable notification methods

* Clean up message parsing and doc in redis context

* Use vectors of DataT in all GCS callbacks

* Clean up SubscriptionCallback

* Move Table definitions into tables.cc

* Refactor and document redis modules

* doc

* Fix new GCS build

* Cleanups

* Revert "Fix new GCS build"

This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.

* Use vectors for internal callback interface, user-facing interface takes a reference to a single item

* Fix new GCS build

* Add unit test for Lookup failure callback

* Fix compiler errors

* Cleanup

* Publish the entry ID with the notification

* Check that the ID for a notification matches in client tests
2018-03-22 10:31:07 -07:00
Robert Nishihara
0c835a379f Fix resource bookkeeping for blocked actor methods. (#1766) 2018-03-21 20:48:04 -07:00
Stephanie Wang
5c7ef34b05
Define string prefixes for all tables in the new GCS API (#1755)
* Define string prefixes for all tables in the new GCS API

* Extra check for TablePrefix enum

* Remove unused field and add doc for existing fields
2018-03-20 20:27:11 -07:00
Robert Nishihara
4bccabd910 Redirect output of all processes by default. (#1752)
* Redirect output of all processes by default.

* Add separate flag for redirecting worker output.

* Fix tests.
2018-03-20 18:14:54 -07:00
Robert Nishihara
794f547d0a Always send actor creation tasks to the global scheduler. (#1757) 2018-03-20 14:55:20 -07:00
Robert Nishihara
4658d0a180 Print error when actor takes too long to start, and refactor error me… (#1747)
* Print error when actor takes too long to start, and refactor error message pushing.

* Print warning every ten seconds.

* Fix linting and tests.

* Fix tests.
2018-03-19 20:24:35 -07:00
Robert Nishihara
96913be939 Treat actor creation like a regular task. (#1668)
* Treat actor creation like a regular task.

* Small cleanups.

* Change semantics of actor resource handling.

* Bug fix.

* Minor linting

* Bug fix

* Fix jenkins test.

* Fix actor tests

* Some cleanups

* Bug fix

* Fix bug.

* Remove cached actor tasks when a driver is removed.

* Add more info to taskspec in global state API.

* Fix cyclic import bug in tune.

* Fix

* Fix linting.

* Fix linting.

* Don't schedule any tasks (especially actor creaiton tasks) on local schedulers with 0 CPUs.

* Bug fix.

* Add test for 0 CPU case

* Fix linting

* Address comments.

* Fix typos and add comment.

* Add assertion and fix test.
2018-03-16 11:18:07 -07:00
Melih Elibol
3c080f4baa
Add a callback for gcs table lookup failures. (#1702)
* Add callback to gcs client for table lookup failures.

* update plasma_manager reflecting changes to gcs callback.
2018-03-15 22:25:01 -07:00
Stephanie Wang
6114b6d20e
Implement the client table for the new GCS (#1674)
* Add subscription callback to CallbackData

* Implement ClientTable

* Hook up ClientTable to AsyncGCSClient

* Add client_info to GCSClient Connect interface

* client table callbacks

* Unit test for client table

* Doc

* Fix idempotency check

* Fix mac build

* Fix memory issues in gcs client test

* Fix disconnection bug

* lint
2018-03-11 19:17:18 -07:00
Robert Nishihara
cae108d019 Replace special single quote with regular single quote. (#1693) 2018-03-10 20:36:01 -08:00
Philipp Moritz
5ef0892236 Compile boost from source to fix macOS wheels (#1688) 2018-03-08 23:22:23 -08:00
Alexey Tumanov
91464a56dd [XRay] Raylet node and object manager unification/backend redesign. (#1640)
* directory for raylet

* some initial class scaffolding -- in progress

* node_manager build code and test stub files.

* class scaffolding for resources, workers, and the worker pool

* Node manager server loop

* raylet policy and queue - wip checkpoint

* fix dependencies

* add gen_nm_fbs as target.

* object manager build, stub, and test code.

* Start integrating WorkerPool into node manager

* fix build on mac

* tmp

* adding LsResources boilerplate

* add/build Task spec boilerplate

* checkpoint ActorInformation and LsQueue

* Worker pool maintains started and removed workers

* todos for e2e task assignment

* fix build on mac

* build/add lsqueue interface

* channel resource config through from NodeServer to LsResources; prep LsResources to replace/provide worker_pool

* progress on LsResources class: resource availability check implementation

* Read task submission messages from a client

* Submit tasks from the client to the local scheduler

* Assign a task to a worker from the WorkerPool

* change the way node_manager is built to prevent build issues for object_manager.

* add namespaces. fix build.

* Move ClientConnection message handling into server, remove reference to
WorkerPool

* Add raw constructors for TaskSpecification

* Define TaskArgument by reference and by value

* Flatbuffer serialization for TaskSpec

* expand resource implementation

* Start integrating TaskExecutionSpecification into Task

* Separate WorkerPool from LsResources, give ownership to NodeServer

* checkpoint queue and resource code

* resoving merge conflicts

* lspolicy::schedule ; adding lsqueue and lspolicy to the nodeserver

* Implement LsQueue RemoveTasks and QueueReadyTasks

* Fill in some LsQueue code for assigning a task

* added suport for test_asio

* Implement LsQueue queue tasks methods, queue running tasks

* calling into policy from nodeserver; adding cluster resource map

* Feedback and Testing.
Incorporate Alexey's feedback. Actually test some code. Clean up callback imp.

* end to end task assignment

* Decouple local scheduler from node server

* move TODO

* Move local scheduler to separate file

* Add scaffolding for reconstruction policy, task dependency manager, and object manager

* fix

* asio for store client notifications.
added asio for plasma store connection.
added tests for store notifications.
encapsulate store interaction under store_messenger.

* Move Worker inside of ClientConnection

* Set the assigned task ID in the worker

* Several changes toward object manager implementation.
Store client integration with asio.
Complete OM/OD scaffolding.

* simple simulator to estimate number of retry timeouts

* changing dbclientid --> clientid

* fix build (include sandbox after it's fixed).

* changes to object manager, adding lambdas to the interface

* changing void * callbacks to std::function typed callbacks

* remove use namespace std from .h files.
use ray:: for Status everywhere.

* minor

* lineage cache interfaces

* TODO for object IDs

* Interface for the GCS client table

* Revert "Set the assigned task ID in the worker"

This reverts commit a770dd31048a289ef431c56d64e491fa7f9b2737.

* Revert "Move Worker inside of ClientConnection"

This reverts commit dfaa0d662a76976c05be6d76b214b45d88482818.

* OD/OM: ray::Status

* mock gcs integration.

* gcs mock clientinfo assignment

* Allow lookup of a Worker in the WorkerPool

* Split out Worker and ClientConnection source files

* Allow assignment of a task ID to a worker, skeleton for finishing a task

* integrate mock gcs with om tests.

* added tcp connection acceptor

* integrated OM with NM.
integrated GcsClient with NM.
Added multi-node integration tests.

* OM to receive incoming tcp connections.

* implemented object manager connection protocol.

* Added todos.

* slight adjustment to add/remove handler invocation on object store client.

* Simplify Task interface for getting dependencies

* Remove unused object manager file

* TaskDependencyManager tracks missing task dependencies and processes object add notifications

* Local scheduler queues tasks according to argument availability

* Fill in TaskSpecification methods to get arguments

* Implemented push.

* Queue tasks that have been scheduled but that are waiting for a worker

* Pull + mock gcs cleanup.

* OD/OM/GCS mock code review, fixing unused-result issues, eliminating copy ctor

* Remove unique_ptr from object_store_client

* Fix object manager Push memory error

* Pull task arguments in task dependency manager

* Add a demo script for remote task dependencies

* Some comments for the TaskDependencyManager

* code cleanup; builds on mac

* Make ClientConnection a templated type based on the connection protocol

* Add gmock to build

* Add WorkerPool unit tests

* clean up.

* clean up connection code.

* instantiate a template instance in the module

* Virtual destructors

* Document public api.

* Separate read and write buffers in ClientConnection; documentation

* Remove ObjectDirectory from NodeServer constructor, make directory InitGcs call a separate constructor

* Convert NodeServer Terminate to a destructor

* NodeServer documentation

* WorkerPool documentation

* TaskDependencyManager doc

* unifying naming conventions

* unifying naming conventions

* Task cleanup and documentation

* unifying naming conventions

* unifying naming conventions

* code cleanup and naming conventions

* code cleanup

* Rename om --> object_manager

* Merge with master

* SchedulingQueue doc

* Docs and implementation skeleton for ClientTable

* Node manager documentation

* ReconstructionPolicy doc

* Replace std::bind with lambda in TaskDependencyManager

* lineage cache doc

* Use \param style for doc

* documentation for scheduling policy and resources

* minor code cleanup

* SchedulingResources class documentation + code cleanup

* referencing ray/raylet directory; doxygen documentation

* updating trivial policy

* Fix bug where event loop stops after task submission

* Define entry point for ClientManager for handling new connections

* Node manager to node manager protocol, heartbeat protocol

* Fix flatbuffer

* Fix GCS flatbuffer naming conflict

* client connection moved to common dir.

* rename based on feedback.

* Added google style and 90 char lines clang-format file under src/ray.

* const ref ClientID.

* Incorporated feedback from PR.

* raylet: includes and namespaces

* raylet/om/gcs logging/using

* doxygen style

* camel casing, comments, other style; DBClientID -> ClientID

* object_manager : naming, defines, style

* consistent caps and naming; misc style

* cleaning up client connection + other stylistic fixes

* cmath, std::nan

* more style polish: OM, Raylet, gcs tables

* removing sandbox (moved to ray-project/sandbox)

* raylet linting

* object manager linting

* gcs linting

* all other linting


Co-authored-by: Melih <elibol@gmail.com>
Co-authored-by: Stephanie <swang@cs.berkeley.edu>
2018-03-08 12:53:24 -08:00
Stephanie Wang
0a6edb55a8 Implement the Subscribe call for the new GCS API (#1652)
* Implement the Subscribe call for the new GCS API

* Document tests

* Upper case function name

* Fix build errors

* lint
2018-03-06 09:56:12 -08:00
Philipp Moritz
a683cf2c70 Gcs Asio integration (#1633) 2018-03-04 14:51:04 -08:00
Zhenyu Guo
f1e5789c26 restructure how to organize 3rd party libs (#1630)
* restructure how to organize 3rd party libs

* Minor whitespace changes.

* Fix compilation on Linux.

* Pass around Python executable so that the correct version of Python is used.
2018-03-01 14:29:56 -08:00
Robert Nishihara
ec9dfe7748 Allow setting INCLUDE_UI=0 to disable building the UI. (#1618) 2018-03-01 02:17:15 -08:00
Robert Nishihara
0fcceef772 Update logging and check macros. (#1627)
* Update logging and check macros.

* Fix linting.

* Fix RAY_DCHECK and unused variable.

* Fix linting
2018-02-28 15:13:00 -08:00
Robert Nishihara
ba1ce85f58 Download Redis and flatbuffers differently. (#1602)
* Download Redis differently.

* Get flatbuffers with curl
2018-02-25 20:32:33 -08:00
Robert Nishihara
f4b1881fec Update arrow to use updated pandas serializer. (#1582) 2018-02-22 11:10:52 -08:00
Robert Nishihara
db4a920bdb Cleanup parquet installation. (#1549)
* Cleanup parquet installation.

* Fix

* Small changes.

* Add brew installs

* Modify paths for compilation of parquet.

* Remove LD_LIBRARY_PATH

* Don't set unnecessary environment variables on Linux.

* Set environment variables for make.

* Brew installs for macos wheels.

* Update

* Pass PARQUET_HOME when building pyarrow.

* Don't exit with error code.
2018-02-20 15:21:32 -08:00
Philipp Moritz
eabc4027c8 Hiredis asio integration (#1547) 2018-02-20 13:37:09 -08:00
Simon Mo
a24cc28773 [DataFrame] Add Parquet Support in Build Process (#1531)
* Add shell script for building parquet

* Use parquet ci script; remove anaconda

* Remove gcc flag, use default

* add boost_root

* Fix $TP_DIR reference issue

* fix the PR

* check out specific parquet-cpp commit
2018-02-16 07:18:42 -08:00
Alexey Tumanov
844a6afcdd Implement simple random spillback policy. (#1493)
* spillback policy implementation: global + local scheduler

* modernize global scheduler policy state; factor out random number engine and generator

* Minimal version.

* Fix test.

* Make load balancing test less strenuous.
2018-02-13 00:09:35 -08:00
Philipp Moritz
1ab2e63dbd Tune transfer buffer size (#1363)
Increase buffsize from `4096` to `80*1024`.
2018-02-09 14:56:36 -08:00
Robert Nishihara
89db7841d2 Update arrow version. (#1512) 2018-02-07 23:05:16 -08:00
Stephanie Wang
ff8e7f8259
Actor checkpointing for distributed actor handles (#1498)
* Expose calls to get and set the actor frontier

* Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded

* Prototype for actor checkpointing

* Filter out duplicate tasks on the local scheduler

* Clean up some of the Python checkpointing code

* More cleanups

* Documentation

* cleanup and fix unit test

* Allow remote checkpoint calls through actor handle

* Check whether object is local before reconstructing

* Enable checkpointing for distributed actor handles, refactor tests

* Fix local scheduler tests

* lint

* Address comments

* lint

* Skip tests that fail on new GCS

* style

* Don't put same object twice when setting the actor frontier

* Address Philipp's comments, cleaner fbs naming
2018-02-07 11:19:32 -08:00
Melih Elibol
d8850eac4b Suppress object transfer requests when object is already being received. (#1430)
* added deterministic check for objects received in fetch_timeout_handler.

* use receive time, in case something goes wrong after object is received.

* increase timeout for removal.

* indentation fix.

* make log info log debug. clean up debug log.

* undo unecessary changes.

* changed description var.

* shorten line 949.

* incorporate feedback.

* linting; make is_object_received function consts.

* change semantics of received_objects to objects being received.
added checks to both points at which objects are re-requested.
updated object receive initialization accordingly.

* eliminate erase on receive init. check call to request_transfer_from instead of request_transfer.

* updated comments.

* added todo for multiple object transfers.

* linting.
2018-02-01 22:45:31 -08:00
Philipp Moritz
a3f8fa426b Start integrating new GCS APIs (#1379)
* Start integrating new GCS calls

* fixes

* tests

* cleanup

* cleanup and valgrind fix

* update tests

* fix valgrind

* fix more valgrind

* fixes

* add separate tests for GCS

* fix linting

* update tests

* cleanup

* fix python linting

* more fixes

* fix linting

* add plasma manager callback

* add some documentation

* fix linting

* fix linting

* fixes

* update

* fix linting

* fix

* add spillback count

* fixes

* linting

* fixes

* fix linting

* fix

* fix

* fix
2018-01-31 11:01:12 -08:00
Robert Nishihara
3195c6aa63 Fix local scheduler crash when driver creates actor and exits. (#1474)
* Make check failures in redis.cc more informative.

* Fix bug by calling task_table_add_task.

* Add test.
2018-01-26 14:29:53 -08:00
Stephanie Wang
668737f383 Replace actor dummy objects with mock calls to the local scheduler (#1467)
* Replace putting the dummy object with a call to the local scheduler

* Mark dummy objects as locally available
2018-01-26 14:18:45 -08:00
Robert Nishihara
5acc98e629 Update arrow with better dataframe serialization and get rid of custo… (#1413)
* Update arrow with better dataframe serialization and get rid of custom dataframe serializers.

* Update plasma client API.

* Fix potential bug.

* Bug fix.

* Update arrow to use deduplicated file descriptors and mutable buffers.

* Fix tests.

* Update commit.

* Update commit.

* Update commit.

* Update commit.

* Update commit

* Update commit back to arrow codebase.'
2018-01-24 10:03:29 -08:00
Alexey Tumanov
f1303291b4 Ray scheduler spillback plumbing + mechanism (#1362)
* spillback mechanism and plumbing : adding spillback counter + timestamp

* linting fix

* documentation

* Fix argument name.
2018-01-23 20:18:12 -08:00
Melih Elibol
4b1c8be4fe Fix setting log-level to debug. (#1432) 2018-01-21 21:51:05 -08:00
Stephanie Wang
74718efa73
Nondeterministic reconstruction for actors (#1344)
* Add failing unit test for nondeterministic reconstruction

* Retry scheduling actor tasks if reassigned to local scheduler

* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction

* Fix bug for updating checkpoint task execution dependencies

* Update comments for deterministic reconstruction

* cleanup

* Add (and skip) failing test case for nondeterministic reconstruction

* Suppress test output
2018-01-21 13:44:13 -08:00
Robert Nishihara
088f01496c Remove unused object info table code. (#1388) 2018-01-05 11:00:06 -08:00
Robert Nishihara
e970e24ea5 Update arrow, and pass memcopy_threads into put. (#1374) 2017-12-31 13:32:06 -08:00
Philipp Moritz
3d224c4edf Second Part of Internal API Refactor (#1326) 2017-12-26 16:22:04 -08:00
Melih Elibol
4a2d62e7ef fix thirdparty install bug. (#1354) 2017-12-20 23:08:53 -08:00
Philipp Moritz
3c4408cf51 Rebase Ray on Arrow 0.8 (#1323)
* rebase Ray on Arrow 0.8

* rebase on apache repo
2017-12-19 14:24:21 -08:00
Robert Nishihara
76b6b4a2d3 When killing worker, release resources before dispatching tasks. (#1327) 2017-12-15 18:12:03 -08:00
Stephanie Wang
12fdb3f53a Convert actor dummy objects to task execution edges. (#1281)
* Define execution dependencies flatbuffer and add to Redis commands

* Convert TaskSpec to TaskExecutionSpec

* Add execution dependencies to Python bindings

* Submitting actor tasks uses execution dependency API instead of dummy argument

* Fix dependency getters and some cleanup for fetching missing dependencies

* C++ convention

* Make TaskExecutionSpec a C++ class

* Convert local scheduler to use TaskExecutionSpec class

* Convert some pointers to references

* Finish conversion to TaskExecutionSpec class

* fix

* Fix

* Fix memory errors?

* Cast flatbuffers GetSize to size_t

* Fixes

* add more retries in global scheduler unit test

* fix linting and cast fbb.GetSize to size_t

* Style and doc

* Fix linting and simplify from_flatbuf.
2017-12-14 20:47:54 -08:00
Philipp Moritz
cac5f47600 First Part of Internal Ray API Refactor (#1173)
* add Ray status class

* add C++ util files

* add ID types

* more APIs

* build system integration

* add test infrastructure and implement some APIs

* add more tests

* fix bugs

* add task table tests

* update

* add toolchain file

* fix

* test

* link with pthread

* update

* fix

* more fixes

* fixes

* always vendor gtest and gflags

* linting

* fixes

* add constants file

* comments

* more fixes

* fix linting
2017-12-14 14:54:09 -08:00