551 commits

Author SHA1 Message Date
Philipp Moritz
143a118fbf [xray] Fix valgrind crash when memory profiling raylet (#2583)
* use different random number generator to be compatible with older valgrind versions

* seed from time

* style

* fix

* remove more random devices

* also remove random_device from global scheduler

* rename mutex

* linting
2018-08-09 15:37:17 -07:00
Stephanie Wang
f093ed1fc6 [xray] Fix crash in case of spurious reconstruction (#2609)
* Exit if task already queued

* address comments
2018-08-09 14:46:46 -07:00
Stephanie Wang
2de9bfc7e3 [xray] Log warnings for asio handlers that take too long (#2601)
* Add fatal check for heartbeat drift

* Log warning messages for handlers that take too long

* Add debug labels to all ClientConnections
2018-08-09 14:39:23 -07:00
Stephanie Wang
d49b4bef0a [xray] Basic task reconstruction mechanism (#2526)
## What do these changes do?

This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
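
One way to picture the suppression step is as a per-task append-only log in which an append succeeds only at the index the writer expects. The sketch below is an in-memory illustration, not the `TaskReconstructionLog` from this PR: the class `ReconstructionLog`, the method `append_at`, and the helper `try_reconstruct` are hypothetical names, and the "append only if the log has the expected length" rule is an assumption about how two simultaneous deciders could be reduced to one.

```python
from collections import defaultdict


class ReconstructionLog:
    """Hypothetical in-memory stand-in for a per-task reconstruction log."""

    def __init__(self):
        # task_id -> list of node IDs that won a reconstruction attempt
        self._entries = defaultdict(list)

    def append_at(self, task_id, index, node_id):
        """Append node_id only if the log currently holds exactly `index` entries."""
        log = self._entries[task_id]
        if len(log) != index:
            # Another node already recorded this attempt.
            return False
        log.append(node_id)
        return True

    def attempts(self, task_id):
        """Return a copy of the recorded attempts for task_id."""
        return list(self._entries[task_id])


def try_reconstruct(log, task_id, observed_attempts, node_id, resubmit):
    """Resubmit the task only if this node wins the append (assumed protocol)."""
    if log.append_at(task_id, observed_attempts, node_id):
        resubmit(task_id)
        return True
    return False
```

If two nodes both observe zero prior attempts and call `try_reconstruct` with `observed_attempts=0`, only the first append succeeds, so the task is resubmitted once.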

Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
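
The two improvements above can be sketched together. This is a minimal illustration under stated assumptions, not the code in this PR: `LeaseTable` imitates key expiry in memory with an injected clock instead of calling Redis (`PEXPIRE` sets a key's time to live in milliseconds), and `wait_before_check_ms` with its `parent_was_reconstructed` flag is a hypothetical way to express the fast path.

```python
class LeaseTable:
    """Hypothetical lease table whose entries expire after a TTL."""

    def __init__(self, clock):
        # clock: zero-argument callable returning the current time in ms
        self._clock = clock
        self._expiry = {}

    def acquire(self, task_id, ttl_ms):
        """Record or renew a lease that lapses ttl_ms from now."""
        self._expiry[task_id] = self._clock() + ttl_ms

    def is_live(self, task_id):
        """A lease is live only until its expiry time is reached."""
        expiry = self._expiry.get(task_id)
        return expiry is not None and self._clock() < expiry


def wait_before_check_ms(initial_timeout_ms, parent_was_reconstructed):
    """Fast path (assumed): skip the initial timeout for dependencies of a
    task that itself needed reconstruction."""
    return 0 if parent_was_reconstructed else initial_timeout_ms
```

With expiring entries, a node no longer waits on a lease held by a task that has stopped renewing it, and with the fast path a long chain of dependent tasks pays the initial timeout once rather than once per task.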

Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
2018-08-09 07:24:37 -07:00
Melih Elibol
8ae82180b4 [xray] Adds a driver table. (#2289)
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.

Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
2018-08-08 23:41:40 -07:00
Alexey Tumanov
df7ee7ff1e raylet memory corruption fixes (#2591)
* raylet memory corruption fixes

* add util function to translate boost error to ray status

* tcp client connection now using ray status utility function

* lint
2018-08-08 19:50:43 -07:00
Stephanie Wang
6ab01a2cad [xray] Fix bug when counting a task's lineage size (#2600) 2018-08-08 00:00:17 -07:00
Ujval Misra
a0691ee49b [xray] Prevent sending excessive uncommitted lineage on task forwarding (#2534)
* Add set to lineage cache entry to track nodes already forwarded to.

* Uncommitted lineage function naming, documentation.

* Simple test for uncommitted lineage with a marked task.

* Rebased, changed tests to use ClientID::nil.

* Bug fix, change MergeLineageHelper function type.

* Formatting.

* Checks and test changes based on PR comments.

* GetUncommittedLineage now always returns at least the requested task ID.

* Bug fix (return at least requested task ID)

* Formatting
2018-08-07 21:10:23 -07:00
Philipp Moritz
e7f76d7914 [xray] Fix typo concerning heartbeat_timeout_milliseconds in monitor (#2586) 2018-08-07 13:45:51 -07:00
Philipp Moritz
25f0094ee4 Fix copying the plasma fbs directory from arrow (#2579) 2018-08-07 00:04:37 -07:00
Yuhong Guo
d35ce7fa63 Use real callback index in subscribe_callback_index_ (#2473) 2018-08-06 15:29:56 -07:00
Alexey Tumanov
85b8b2a395 mark all remaining placeable tasks pending with task dependency manager (#2528) 2018-08-06 13:08:11 -07:00
Melih Elibol
34d3a46f48 [xray] Revert dynamic chunk size optimization for ObjectManager. (#2557)
* Revert dynamic chunk size optimization.

* fix mac build issues.
2018-08-05 02:09:37 -07:00
Wang Qing
e4f68ff8cf [Java Worker] Support raylet on Java (#2479) 2018-08-01 17:52:49 -07:00
Zhijun Fu
ca36827f01 [Issues 2403][xray] Fix raylet performance issues on scheduling queue (#2438)
* merge from ray
* Revert "merge from ray"
This reverts commit 32b181ebbb1fa184026631e1a7368112c4c3118d.
* fix raylet performance regression
* address comments
* Update code after merging latest changes
* fix lint
* address comments
2018-08-01 14:41:20 -07:00
Stephanie Wang
e90ecef297 [xray] Try to flush children of a task that is evicted from the lineage cache (#2531) 2018-08-01 00:23:02 -07:00
Stephanie Wang
a45f9cfafc [xray] Implement task lease table, logic for deciding when to reconstruct a task (#2497) 2018-07-30 14:42:28 -07:00
Ion
80db69d245 State transition diagram documentation. (#2502)
* Added description of transition diagram and a few name changes for improved clarity.

* rename some methods and update task_states.rst
2018-07-28 22:28:45 -07:00
Robert Nishihara
2be1ccbd8f Raise application-level exceptions for some failure scenarios. (#2429)
* Raise application level exception for actor methods that can't be executed and failed tasks.

* Retry task forwarding for actor tasks.

* Small cleanups

* Move constant to ray_config.

* Create ForwardTaskOrResubmit method.

* Minor

* Clean up queued tasks for dead actors.

* Some cleanups.

* Linting

* Notify task_dependency_manager_ about failed tasks.

* Manage timer lifetime better.

* Use smart pointers to deallocate the timer.

* Fix

* add comment
2018-07-27 19:53:30 -04:00
Stephanie Wang
6675361684 [xray] Track ray.get calls as task dependencies (#2362) 2018-07-27 11:59:17 -07:00
Zhijun Fu
9ad6a973a0 [xray] lineage optimization: avoid unnecessary lineage entry allocation & free (#2463)
* merge from ray

* Revert "merge from ray"

This reverts commit 32b181ebbb1fa184026631e1a7368112c4c3118d.

* [xray] avoid unnecessary lineage entry allocation & free

* address comments

* address review comments

* address comments
2018-07-26 10:44:38 -04:00
Yuhong Guo
b35ce5dbf1 Update Arrow Package with breaking changes (#2440)
* Merge the breaking change of Arrow Package.

* Fix typo

* Fix lint.

* put forward declarations into header

* fix

* add protocol.h

* fix linting
2018-07-25 14:28:33 -07:00
Philipp Moritz
e821f852ef [xray] Silence some object manager logging (#2437) 2018-07-20 13:10:03 -07:00
Robert Nishihara
eed39163f9 Add callback to node manager for client removed event. (#2417)
* Add callback to node manager for client removed event.

* Fix linting.
2018-07-18 16:59:04 -07:00
Philipp Moritz
4c82ac72df Upgrade arrow to include the plasma TensorFlow op (#2412) 2018-07-18 12:33:02 -07:00
Yuhong Guo
206254bcf3 Add const to to_plasma_id function to make it usable by const ObjectID (#2404)
* Add const to to_plasma_id to make it usable by const ObjectID

* Separate the building script to another PR.
2018-07-16 11:05:29 -07:00
Hao Chen
c1575e98c1 Make local scheduler client thread-safe (#2386)
* Make local scheduler client thread-safe for python

* lock write_messages

* remove allow-threads

* fix linter

* rename _write_message to do_write_message
2018-07-13 16:19:00 -07:00
Philipp Moritz
fbde8cad74 Update apache arrow to include TensorFlow fix (#2345) 2018-07-06 13:18:56 -07:00
Stephanie Wang
5b7475a2e0 [xray] Unsubscribe to task dependencies when task starts execution (#2354)
* Add back call to unsubscribe to task dependencies

* fix
2018-07-05 21:08:58 -07:00
Stephanie Wang
c50f1966e0 Publish a notification for empty keys in the GCS (#2347)
* Publish an empty notification for empty keys

* Add failure callback to Table::Subscribe, add unit test for new behavior
2018-07-05 13:39:07 -07:00
Robert Nishihara
b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Zongheng Yang
ba28dddf6f Make xray object table credis-managed and hence flushable. (#2338)
* monitor.py: issue flushes to data shard

* ResultTableAdd & ObjectTableAdd: add credis-managed versions

* Fix return codes

* Credis-manage xray object table & associated ray.table_append cmd

* Fix incorrect return code from TableAppend_DoWrite()

* Revert "ResultTableAdd & ObjectTableAdd: add credis-managed versions"

This reverts commit 628c2ea190df4c861dda0c284fab7ca6faa1ea24.

* Address comments

* Lint: fix indent

* Address comment
2018-07-03 17:32:44 -07:00
Philipp Moritz
f21d783e6d Remove new gcs code from legacy Ray codepath (#2329) 2018-07-03 11:48:50 -07:00
Peter Schafhalter
bb1d7eaece Replenish workers for disconnected actors (#2307) 2018-07-02 08:26:10 -07:00
Philipp Moritz
762bdf646e [xray] Put GCS data into the redis data shard (#2298) 2018-06-30 15:42:10 -10:00
Alexey Tumanov
965e182384 [xray] raylet task queue transition discipline (#2302)
* add queueing interface to move tasks between queues internally
* queueing discipline change: ready->waiting->scheduled->running
* rename task states : ready -> placeable; update documentation
* rename task states : scheduled -> ready; update documentation
* cleanup comments
* cleanup; transition placeable actor tasks
* minor comment cleanup
* addressing comments
* linting
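
Applying the two renames listed above (ready -> placeable, scheduled -> ready) to the stated discipline suggests the order placeable -> waiting -> ready -> running. That ordering is an inference from this commit message, and the names `TaskState`, `NEXT_STATE`, and `advance` below are hypothetical; the sketch only shows the transitions as a small state machine.

```python
from enum import Enum


class TaskState(Enum):
    PLACEABLE = "placeable"  # called "ready" before the rename
    WAITING = "waiting"
    READY = "ready"          # called "scheduled" before the rename
    RUNNING = "running"


# Forward transitions only, as inferred from the commit message.
NEXT_STATE = {
    TaskState.PLACEABLE: TaskState.WAITING,
    TaskState.WAITING: TaskState.READY,
    TaskState.READY: TaskState.RUNNING,
}


def advance(state):
    """Return the next queue state, or raise if the task is already running."""
    try:
        return NEXT_STATE[state]
    except KeyError:
        raise ValueError(f"no transition out of {state.name}") from None
```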
2018-06-27 14:23:41 -07:00
Yuhong Guo
aa42331844 Fix build failure while using make -j1. Issue 2257 (#2279)
* Fix build failure while using make -j1

* Fix java test failure
2018-06-21 15:18:00 -07:00
Robert Nishihara
ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Zongheng Yang
8190ff1fd0 Experimental: enable automatic GCS flushing with configurable policy. (#2266)
* build_credis.sh: use an up-to-date credis commit.

* build_credis.sh: leveldb is updated, so update build cmds for it

* WIP: make monitor.py issue flush; switch gcs client to use credis

* Experimental: enable automatic GCS flushing with configurable policy.

* Fix linux compilation error

* Fix leveldb build

* Use optimized build for credis

* Address comments

* Attempt to fix tests
2018-06-20 14:40:57 -07:00
Melih Elibol
60bc3a014f [xray] Sets good object manager defaults. (#2255)
* better object manager defaults. added max for number of chunks.

* change source of cores.
2018-06-20 14:10:57 -07:00
Yuhong Guo
51744459f3 Mitigate random build failures: add gen_local_scheduler_fbs to raylet lib. (#2271) 2018-06-19 15:29:57 -07:00
Hao Chen
8efd0f7b1b [xray] support multi-workers per process (#2244)
* support multi-workers per process

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* use RayConfig

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

* remove clear

* address comments

* fix lint

* fix bug

* make WorkerPool and WorkerPoolMock more consistent
2018-06-13 10:14:05 -07:00
Robert Nishihara
61139e1509 Enable fractional resources and resource IDs for xray. (#2187)
* Implement GPU IDs and fractional resources.

* Add documentation and python exceptions.

* Fix signed/unsigned comparison.

* Fix linting.

* Fixes from rebase.

* Re-enable tests that use ray.wait.

* Don't kill the raylet if an infeasible task is submitted.

* Ignore tests that require better load balancing.

* Linting

* Ignore array test.

* Ignore stress test reconstructions tests.

* Don't kill node manager if remote node manager disconnects.

* Ignore more stress tests.

* Naming changes

* Remove outdated todo

* Small fix

* Re-enable test.

* Linting

* Fix resource bookkeeping for blocked tasks.

* Fix linting

* Fix Java client.

* Ignore test

* Ignore put error tests
2018-06-10 15:31:43 -07:00
Philipp Moritz
4ec5bea03b [xray] Implement fetch (#2195) 2018-06-09 23:36:27 -07:00
Stephanie Wang
cb5e6e6d68 Add dependency between copy_ray and python extensions (#2221) 2018-06-08 20:41:54 -07:00
Yuhong Guo
0a34bea0b0 Use scoped enums in C++ and flatbuffers. (#2194)
* Enable --scoped-enums in flatbuffer compiler.

* Change enum to c++11 style (enum class).

* Resolve conflicts.

* Solve building failure when RAY_USE_NEW_GCS=on and remove ERROR_INDEX suffix.

* Merge with master and fix CI failure.
2018-06-07 01:01:21 -07:00
Hao Chen
f0907a6ee9 Optimize lineage eviction efficiency (#2196)
* Java in vscode.

* Optimize lineage eviction

* minor fix

* fix ut

* fix comment and lint

* format

* format

* remove unneeded code
2018-06-07 00:35:15 -07:00
Philipp Moritz
343f29801b [xray] Fix compilation on mac (#2199) 2018-06-06 22:33:46 -07:00
Melih Elibol
7246ff80a4 [xray] Implements ray.wait (#2162)
Implements ray.wait for xray. Fixes #1128.
2018-06-06 16:56:44 -07:00
songqing
451cdb43f6 Fix redefinition of flatbuffer types (#2189) 2018-06-05 00:08:05 -07:00