Commit graph

474 commits

Author SHA1 Message Date
Philipp Moritz
fbde8cad74 Update apache arrow to include TensorFlow fix (#2345) 2018-07-06 13:18:56 -07:00
Stephanie Wang
5b7475a2e0
[xray] Unsubscribe to task dependencies when task starts execution (#2354)
* Add back call to unsubscribe to task dependencies

* fix
2018-07-05 21:08:58 -07:00
Stephanie Wang
c50f1966e0 Publish a notification for empty keys in the GCS (#2347)
* Publish an empty notification for empty keys

* Add failure callback to Table::Subscribe, add unit test for new behavior
2018-07-05 13:39:07 -07:00
Robert Nishihara
b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Zongheng Yang
ba28dddf6f Make xray object table credis-managed and hence flushable. (#2338)
* monitor.py: issue flushes to data shard

* ResultTableAdd & ObjectTableAdd: add credis-managed versions

* Fix return codes

* Credis-manage xray object table & associated ray.table_append cmd

* Fix incorrect return code from TableAppend_DoWrite()

* Revert "ResultTableAdd & ObjectTableAdd: add credis-managed versions"

This reverts commit 628c2ea190df4c861dda0c284fab7ca6faa1ea24.

* Address comments

* Lint: fix indent

* Address comment
2018-07-03 17:32:44 -07:00
Philipp Moritz
f21d783e6d Remove new gcs code from legacy Ray codepath (#2329) 2018-07-03 11:48:50 -07:00
Peter Schafhalter
bb1d7eaece Replenish workers for disconnected actors (#2307) 2018-07-02 08:26:10 -07:00
Philipp Moritz
762bdf646e [xray] Put GCS data into the redis data shard (#2298) 2018-06-30 15:42:10 -10:00
Alexey Tumanov
965e182384
[xray] raylet task queue transition discipline (#2302)
* add queueing interface to move tasks between queues internally
* queueing discipline change: ready->waiting->scheduled->running
* rename task states : ready -> placeable; update documentation
* rename task states : scheduled -> ready; update documentation
* cleanup comments
* cleanup; transition placeable actor tasks
* minor comment cleanup
* addressing comments
* linting
2018-06-27 14:23:41 -07:00
Yuhong Guo
aa42331844 Fix build failure while using make -j1. Issue 2257 (#2279)
* Fix build failure while using make -j1

* Fix java test failure
2018-06-21 15:18:00 -07:00
Robert Nishihara
ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Zongheng Yang
8190ff1fd0 Experimental: enable automatic GCS flushing with configurable policy. (#2266)
* build_credis.sh: use an up-to-date credis commit.

* build_credis.sh: leveldb is updated, so update build cmds for it

* WIP: make monitor.py issue flush; switch gcs client to use credis

* Experimental: enable automatic GCS flushing with configurable policy.

* Fix linux compilation error

* Fix leveldb build

* Use optimized build for credis

* Address comments

* Attempt to fix tests
2018-06-20 14:40:57 -07:00
Melih Elibol
60bc3a014f [xray] Sets good object manager defaults. (#2255)
* better object manager defaults. added max for number of chunks.

* change source of cores.
2018-06-20 14:10:57 -07:00
Yuhong Guo
51744459f3 Mitigate randomly building failure: adding gen_local_scheduler_fbs to raylet lib. (#2271) 2018-06-19 15:29:57 -07:00
Hao Chen
8efd0f7b1b [xray] support multi-workers per process (#2244)
* support multi-workers per process

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* use RayConfig

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

* remove clear

* address comments

* fix lint

* fix bug

* make WorkerPool and WorkerPoolMock more consistent
2018-06-13 10:14:05 -07:00
Robert Nishihara
61139e1509 Enable fractional resources and resource IDs for xray. (#2187)
* Implement GPU IDs and fractional resources.

* Add documentation and python exceptions.

* Fix signed/unsigned comparison.

* Fix linting.

* Fixes from rebase.

* Re-enable tests that use ray.wait.

* Don't kill the raylet if an infeasible task is submitted.

* Ignore tests that require better load balancing.

* Linting

* Ignore array test.

* Ignore stress test reconstructions tests.

* Don't kill node manager if remote node manager disconnects.

* Ignore more stress tests.

* Naming changes

* Remove outdated todo

* Small fix

* Re-enable test.

* Linting

* Fix resource bookkeeping for blocked tasks.

* Fix linting

* Fix Java client.

* Ignore test

* Ignore put error tests
2018-06-10 15:31:43 -07:00
Philipp Moritz
4ec5bea03b [xray] Implement fetch (#2195) 2018-06-09 23:36:27 -07:00
Stephanie Wang
cb5e6e6d68 Add dependency between copy_ray and python extensions (#2221) 2018-06-08 20:41:54 -07:00
Yuhong Guo
0a34bea0b0 Use scoped enums in C++ and flatbuffers. (#2194)
* Enable --scoped-enums in flatbuffer compiler.

* Change enum to c++11 style (enum class).

* Resolve conflicts.

* Solve building failure when RAY_USE_NEW_GCS=on and remove ERROR_INDEX suffix.

* Merge with master and fix CI failure.
2018-06-07 01:01:21 -07:00
Hao Chen
f0907a6ee9 Optimize lineage eviction efficiency (#2196)
* Java in vscode.

* Optimize lineage eviction

* minor fix

* fix ut

* fix comment and lint

* format

* format

* remove unneeded code
2018-06-07 00:35:15 -07:00
Philipp Moritz
343f29801b [xray] Fix compilation on mac (#2199) 2018-06-06 22:33:46 -07:00
Melih Elibol
7246ff80a4
[xray] Implements ray.wait (#2162)
Implements ray.wait for xray. Fixes #1128.
2018-06-06 16:56:44 -07:00
songqing
451cdb43f6 Fix redefinition of flatbuffer types (#2189) 2018-06-05 00:08:05 -07:00
Philipp Moritz
d699bfbf10 Use hashing function that takes into account all UniqueID bytes (#2174) 2018-06-01 23:07:29 -07:00
Philipp Moritz
e1024d84e9 [xray] Start actor workers in parallel (#2168) 2018-06-01 23:04:16 -07:00
songqing
4dd4698564 unify build dir for Python and Java (#2171)
* unify build dir for Python and Java

* enable executables auto installed when just running 'make'

* fix plasma_store copy error

* fix cmake error about copying executables

* lint fix

* recover python/setup.py

* enable to copy optional file automatically

* a small fix of path

* lint fix

* lint fix

* lint fix

* Add comment.
2018-06-01 16:28:27 -07:00
Yuhong Guo
c1de03acac Add timeout mechanism to Push function instead of retries (#2148)
Use timer instead of retries in Push when objects are not local.
2018-06-01 01:21:05 -07:00
Stephanie Wang
117107cb15 [xray] Evict tasks from the lineage cache (#2152) 2018-05-31 00:24:39 -07:00
Robert Nishihara
6172f94c04 Implement Python global state API for xray. (#2125)
* Implement global state API for xray.

* Fix object table.

* Fixes for log structure.

* Implement cluster_resources.

* Add driver task to task table.

* Remove python flatbuffers code

* Get some global state API tests running.

* Python linting.

* Fix linting.

* Fix mock modules for doc

* Copy over flatbuffer bindings.

* Fix for tests.

* Linting

* Fix monitor crash.
2018-05-29 16:25:54 -07:00
Stephanie Wang
166000b089
[xray] Improve flush algorithm for the lineage cache (#2130)
* Private method to flush a single task from the lineage cache

* Track parent->child relationships for faster flushing

* doc

* Only flush the newly ready task

* Flush() returns void

* x
2018-05-28 21:03:15 -07:00
caopeng428
bb8bfce403 bugfix: use array redis_primary_addr out of its scope (#2139) 2018-05-25 21:40:23 -07:00
Yuhong Guo
a8517cc82a Fix infinite retry in Push function. (#2133) 2018-05-25 01:16:44 -07:00
Yujie Liu
5c2b2c7b49 [JavaWorker] Changes to the directory under src for support java worker (#2093)
* Changes to the directory under src for support java worker
--------------------------
This commit includes changes to the directory under src, which is part of the java worker support of Ray.
It consists of the following changes:
 src/common/task.cc - just fix null point problem
 org_ray_spi_impl_DefaultLocalSchedulerClient.* - JNI support for local scheduler client, and the org_ray_spi_impl_DefaultLocalSchedulerClient.cc file is not autogenerated
2018-05-25 00:59:05 -07:00
Zongheng Yang
fa97acbc89 Integrate credis with Ray & route task table entries into credis. (#1841) 2018-05-24 23:35:25 -07:00
Philipp Moritz
225608ec66 Update arrow to latest master (#2100) 2018-05-24 00:26:13 -07:00
yuyiming
9ff3d57429 do not fetch from dead Plasma Manager (#2116) 2018-05-23 16:13:09 -07:00
Robert Nishihara
9b9ff19dd0 Use automatic memory management in Redis modules. (#1797) 2018-05-22 01:05:09 -07:00
eric-jj
eb078766d8 Performance fix (#2110) 2018-05-20 18:07:55 -07:00
Kunal Gosar
eba73449cc fix unused lambda capture (#2102) 2018-05-19 13:27:10 -07:00
Melih Elibol
f1da721522
[xray] Use pubsub instead of timeout for ObjectManager Pull. (#2079)
Use pubsub instead of timeout for Pull.
2018-05-18 21:35:12 -07:00
Yujie Liu
5918776dd4 [JavaWorker] Changes to the build system for support java worker (#2092)
* Changes to the build system for support java worker
--------------------------
This commit includes changes to the build system, which is part of the java worker support of Ray.
It consists of the following changes:
 - the changes of CMakeLists.txt files
 - the changes of the python setup.py and init files for the adaptation of the changed build system
 - move the location of local_scheduler_extension.cc for the adaptation of the changed build system which maybe better support multi-language worker

* minor whitespace

* Linting
2018-05-18 19:09:23 -07:00
Stephanie Wang
71e5cca59f
[xray] Fix bug in updating actor execution dependencies (#2064)
* [xray] FIX: bugs in actor execution

* comments

* Stronger check
2018-05-18 12:45:17 -07:00
Melih Elibol
25e7aa1e79 [xray] Better error messaging when pulling from self. (#2068)
* complain more loudly when object pulls from self.

* Add checks for node manager, and internal checks for object manager.

* linting
2018-05-18 10:26:47 -07:00
Robert Nishihara
15b72f9893 Fix compilation error for RAY_USE_NEW_GCS with latest clang. (#2086) 2018-05-17 23:10:02 -07:00
Melih Elibol
3c245f66d4 [xray] Corrects Error Handling During Push and Pull. (#2059)
* Makes bad status during Pull non-fatal.
Makes a bad status during Push fatal.

* pretty logs

* Stephanie's feedback.
2018-05-17 17:51:55 -07:00
Stephanie Wang
6ca122f723 [xray] Sophisticated task dependency management (#2035) 2018-05-17 17:18:30 -07:00
Stephanie Wang
796864d887
[xray] Lineage cache only requests notifications about remote parent tasks (#2066)
* Only request notifications about a parent task that is remote

* Fix typo

* Fix lineage cache test
2018-05-17 13:01:40 -07:00
Stephanie Wang
88fa98e851
[xray] Fix GCS table prefixes (#2065)
* Fix GCS table prefixes

* More explicit documentation
2018-05-16 13:15:03 -07:00
Stephanie Wang
ad48e47120 Don't crash on duplicate actor notifications (#2043) 2018-05-14 14:26:37 -07:00
Melih Elibol
3ac0c08daa use jobid_nil (#2044) 2018-05-13 14:22:09 -07:00