1910 commits

Author SHA1 Message Date
Hao Chen
8efd0f7b1b [xray] support multi-workers per process (#2244)
* support multi-workers per process

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* use RayConfig

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

* remove clear

* address comments

* fix lint

* fix bug

* make WorkerPool and WorkerPoolMock more consistent
2018-06-13 10:14:05 -07:00
Robert Nishihara
61139e1509 Enable fractional resources and resource IDs for xray. (#2187)
* Implement GPU IDs and fractional resources.

* Add documentation and python exceptions.

* Fix signed/unsigned comparison.

* Fix linting.

* Fixes from rebase.

* Re-enable tests that use ray.wait.

* Don't kill the raylet if an infeasible task is submitted.

* Ignore tests that require better load balancing.

* Linting

* Ignore array test.

* Ignore stress test reconstructions tests.

* Don't kill node manager if remote node manager disconnects.

* Ignore more stress tests.

* Naming changes

* Remove outdated todo

* Small fix

* Re-enable test.

* Linting

* Fix resource bookkeeping for blocked tasks.

* Fix linting

* Fix Java client.

* Ignore test

* Ignore put error tests
2018-06-10 15:31:43 -07:00
Philipp Moritz
4ec5bea03b [xray] Implement fetch (#2195) 2018-06-09 23:36:27 -07:00
Stephanie Wang
cb5e6e6d68 Add dependency between copy_ray and python extensions (#2221) 2018-06-08 20:41:54 -07:00
Yuhong Guo
0a34bea0b0 Use scoped enums in C++ and flatbuffers. (#2194)
* Enable --scoped-enums in flatbuffer compiler.

* Change enum to c++11 style (enum class).

* Resolve conflicts.

* Solve building failure when RAY_USE_NEW_GCS=on and remove ERROR_INDEX suffix.

* Merge with master and fix CI failure.
2018-06-07 01:01:21 -07:00
Hao Chen
f0907a6ee9 Optimize lineage eviction efficiency (#2196)
* Java in vscode.

* Optimize lineage eviction

* minor fix

* fix ut

* fix comment and lint

* format

* format

* remove unneeded code
2018-06-07 00:35:15 -07:00
Philipp Moritz
343f29801b [xray] Fix compilation on mac (#2199) 2018-06-06 22:33:46 -07:00
Melih Elibol
7246ff80a4 [xray] Implements ray.wait (#2162)
Implements ray.wait for xray. Fixes #1128.
2018-06-06 16:56:44 -07:00
songqing
451cdb43f6 Fix redefinition of flatbuffer types (#2189) 2018-06-05 00:08:05 -07:00
Philipp Moritz
d699bfbf10 Use hashing function that takes into account all UniqueID bytes (#2174) 2018-06-01 23:07:29 -07:00
Philipp Moritz
e1024d84e9 [xray] Start actor workers in parallel (#2168) 2018-06-01 23:04:16 -07:00
songqing
4dd4698564 unify build dir for Python and Java (#2171)
* unify build dir for Python and Java

* enable executables auto installed when just running 'make'

* fix plasma_store copy error

* fix cmake error about copying executables

* lint fix

* recover python/setup.py

* enable to copy optional file automatically

* a small fix of path

* lint fix

* lint fix

* lint fix

* Add comment.
2018-06-01 16:28:27 -07:00
Yuhong Guo
c1de03acac Add timeout mechanism to Push function instead of retries (#2148)
Use timer instead of retries in Push when objects are not local.
2018-06-01 01:21:05 -07:00
Stephanie Wang
117107cb15 [xray] Evict tasks from the lineage cache (#2152) 2018-05-31 00:24:39 -07:00
Robert Nishihara
6172f94c04 Implement Python global state API for xray. (#2125)
* Implement global state API for xray.

* Fix object table.

* Fixes for log structure.

* Implement cluster_resources.

* Add driver task to task table.

* Remove python flatbuffers code

* Get some global state API tests running.

* Python linting.

* Fix linting.

* Fix mock modules for doc

* Copy over flatbuffer bindings.

* Fix for tests.

* Linting

* Fix monitor crash.
2018-05-29 16:25:54 -07:00
Stephanie Wang
166000b089 [xray] Improve flush algorithm for the lineage cache (#2130)
* Private method to flush a single task from the lineage cache

* Track parent->child relationships for faster flushing

* doc

* Only flush the newly ready task

* Flush() returns void

* x
2018-05-28 21:03:15 -07:00
caopeng428
bb8bfce403 bugfix: use array redis_primary_addr out of its scope (#2139) 2018-05-25 21:40:23 -07:00
Yuhong Guo
a8517cc82a Fix infinite retry in Push function. (#2133) 2018-05-25 01:16:44 -07:00
Yujie Liu
5c2b2c7b49 [JavaWorker] Changes to the directory under src for support java worker (#2093)
* Changes to the directory under src for support java worker
--------------------------
This commit includes changes to the directory under src, which is part of the java worker support of Ray.
It consists of the following changes:
 src/common/task.cc - just fix null pointer problem
 org_ray_spi_impl_DefaultLocalSchedulerClient.* - JNI support for local scheduler client, and the org_ray_spi_impl_DefaultLocalSchedulerClient.cc file is not autogenerated
2018-05-25 00:59:05 -07:00
Zongheng Yang
fa97acbc89 Integrate credis with Ray & route task table entries into credis. (#1841) 2018-05-24 23:35:25 -07:00
Philipp Moritz
225608ec66 Update arrow to latest master (#2100) 2018-05-24 00:26:13 -07:00
yuyiming
9ff3d57429 do not fetch from dead Plasma Manager (#2116) 2018-05-23 16:13:09 -07:00
Robert Nishihara
9b9ff19dd0 Use automatic memory management in Redis modules. (#1797) 2018-05-22 01:05:09 -07:00
eric-jj
eb078766d8 Performance fix (#2110) 2018-05-20 18:07:55 -07:00
Kunal Gosar
eba73449cc fix unused lambda capture (#2102) 2018-05-19 13:27:10 -07:00
Melih Elibol
f1da721522 [xray] Use pubsub instead of timeout for ObjectManager Pull. (#2079)
Use pubsub instead of timeout for Pull.
2018-05-18 21:35:12 -07:00
Yujie Liu
5918776dd4 [JavaWorker] Changes to the build system for support java worker (#2092)
* Changes to the build system for support java worker
--------------------------
This commit includes changes to the build system, which is part of the java worker support of Ray.
It consists of the following changes:
 - the changes of CMakeLists.txt files
 - the changes of the python setup.py and init files for the adaptation of the changed build system
 - move the location of local_scheduler_extension.cc for the adaptation of the changed build system, which may better support multi-language workers

* minor whitespace

* Linting
2018-05-18 19:09:23 -07:00
Stephanie Wang
71e5cca59f [xray] Fix bug in updating actor execution dependencies (#2064)
* [xray] FIX: bugs in actor execution

* comments

* Stronger check
2018-05-18 12:45:17 -07:00
Melih Elibol
25e7aa1e79 [xray] Better error messaging when pulling from self. (#2068)
* complain more loudly when object pulls from self.

* Add checks for node manager, and internal checks for object manager.

* linting
2018-05-18 10:26:47 -07:00
Robert Nishihara
15b72f9893 Fix compilation error for RAY_USE_NEW_GCS with latest clang. (#2086) 2018-05-17 23:10:02 -07:00
Melih Elibol
3c245f66d4 [xray] Corrects Error Handling During Push and Pull. (#2059)
* Makes bad status during Pull non-fatal.
Makes a bad status during Push fatal.

* pretty logs

* Stephanie's feedback.
2018-05-17 17:51:55 -07:00
Stephanie Wang
6ca122f723 [xray] Sophisticated task dependency management (#2035) 2018-05-17 17:18:30 -07:00
Stephanie Wang
796864d887 [xray] Lineage cache only requests notifications about remote parent tasks (#2066)
* Only request notifications about a parent task that is remote

* Fix typo

* Fix lineage cache test
2018-05-17 13:01:40 -07:00
Stephanie Wang
88fa98e851 [xray] Fix GCS table prefixes (#2065)
* Fix GCS table prefixes

* More explicit documentation
2018-05-16 13:15:03 -07:00
Stephanie Wang
ad48e47120 Don't crash on duplicate actor notifications (#2043) 2018-05-14 14:26:37 -07:00
Melih Elibol
3ac0c08daa use jobid_nil (#2044) 2018-05-13 14:22:09 -07:00
eric-jj
71997a481b Improve shared_ptr usage (#2030)
[xray] Improve shared_ptr usage
2018-05-11 20:05:04 -07:00
Stephanie Wang
a292d7ba32 [xray] Fix UniqueID hashing for object and task IDs. (#2017)
* Skip object prefix in UniqueIDHasher, choose shard based on hash

* lint
2018-05-10 21:56:12 -07:00
alonamid
32fa862408 add pthread linking (#1986) 2018-05-02 21:50:29 -07:00
eric-jj
34bc6ce6ea remove UniqueIDHasher (#1957)
* remove UniqueIDHasher

* Format the change

* remove unused line

* Fix format

* fix lint error

* fix linting whitespace
2018-04-30 06:31:23 -07:00
Philipp Moritz
af88fdefcf Incorporate C++ Buffer management and Seal global threadpool fix from arrow (#1950) 2018-04-25 22:53:44 -07:00
Philipp Moritz
dad465a2bf [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (#1944) 2018-04-23 23:51:25 -07:00
Melih Elibol
8264e64b18 Handle interrupts correctly for ASIO synchronous reads and writes. (#1929)
* handle interrupts correctly.

* linting

* handle interrupts on read_some/write_some.
2018-04-20 22:55:40 -07:00
Robert Nishihara
cffda73da1 Allow task_table_update to fail when tasks are finished. (#1927)
* Allow task_table_update to fail when tasks are finished.

* Add comment.
2018-04-20 11:34:29 -07:00
Stephanie Wang
aa07f1ce4e [xray] Workers blocked in a ray.get release their resources (#1920)
* [xray] Throttle task dispatch by required resources
* Pass in number of initial workers into raylet command
* Workers blocked in a ray.get release resources
2018-04-18 20:59:58 -07:00
Alexey Tumanov
1c965fcfeb Raylet task dispatch and throttling worker startup (#1912)
* separate task placement and task dispatch; throttle task dispatch with locally available resources

* keep track of workers being started/in flight and suppress starting extraneous workers

* cleanup comments

* remove early termination in task dispatch to support zero-resource actor tasks

* info -> debug

* add documentation

* linting

* mock the worker pool for testing

* some linting

* kill all workers in flight; clear the worker pool in dtor

* remove fixed todo

* lint
2018-04-18 10:58:11 -07:00
Eric Liang
7ab890f4a1 [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (#1848) 2018-04-16 16:58:15 -07:00
Stephanie Wang
2e25972d4d Preemptively push local arguments for actor tasks (#1901) 2018-04-16 16:26:59 -07:00
Melih Elibol
ddfc875149 Multithreading refactor for ObjectManager. (#1911)
* removes transfer service. adds separate pool for sends and receives.

* get rid of send/receive transfer counts.

* update comment.

* remove clang formatting.

* clang formatting.
2018-04-16 15:51:53 -07:00
Melih Elibol
cff37765b1 Addresses missed comments from multichunk object transfer PR. (#1908)
* Move object manager parameters to ray config,
object manager config bug fix.
addresses other comments from #1827.

* linting and uint?

* typos

* remove uint.
2018-04-15 21:35:51 -07:00