1812 commits

Author SHA1 Message Date
Yuhao Yang
ffa043d4b7 [tune] replace self.config (#6313) 2019-11-29 11:09:30 -08:00
Stephanie Wang
724a5e3909 Turn on direct calls for test_failure.py (#6291) 2019-11-28 12:28:30 -08:00
Eric Liang
b7b655c851 Also use NotifyDirectCallTaskBlock/Unblocked for plasma store accesses (#6249)
* wip

* fix it

* lint

* wip

* fix

* unblock

* flaky

* use fetch only flag

* Revert "use fetch only flag"

This reverts commit 56e938a0ee2024f5c99c9ab2d55fd35558fb15e1.

* restore error resolution

* use worker task id

* proto comments

* fix if
2019-11-27 22:46:15 -08:00
Simon Mo
22b305223a Build Docker Containers for Linux Wheels (#6233) 2019-11-27 17:05:36 -08:00
Stephanie Wang
2797c11b69 [direct task] For serialized object IDs, check with owner before declaring object unreconstructable (#6286)
* Track borrowed vs owned objects

* Serialize owner address with object ID

* serialize owner task id

* Deserialize object IDs

* Pass direct task ID instead of plasma ID

* it works

* Fix ref count test

* Add unit test

* update warning

* we own ray.put objects

* missing file

* doc

* Fix unit test

* comments

* Fix py2

* lint

* update
2019-11-27 15:31:44 -08:00
Edward Oakes
e4f9b3b7d9 Use process reaper for cleanup (#6253) 2019-11-26 22:00:08 -06:00
Eric Liang
30b2fc1d81 Fix actor creation hang due to race in SWAP queue (#6280) 2019-11-26 15:21:03 -08:00
Simon Mo
1ca8c427e3 Consistent Name for Process Title (#6276)
* Consistent naming for setproctitle

* Address comments

* Add debug/verbose mode

* Fix test
2019-11-26 11:56:28 -08:00
Robert Nishihara
ffb9c0ecae Fix bug in which remote function redefinition doesn't happen. (#6175) 2019-11-26 11:19:19 -06:00
Edward Oakes
7f8de61441 [hotfix] Remove python/ray/tests/__init__.py (#6279)
* Remove python/ray/tests/__init__.py for bazel

* Comment out checks
2019-11-25 17:04:20 -08:00
Eric Liang
64a3a7239e Set RAY_FORCE_DIRECT=1 for run_rllib_tests, test_basic (#6171) 2019-11-25 14:12:11 -08:00
Edward Oakes
e72aef2ba6 [hotfix] Fix building linux wheels 2019-11-25 12:45:31 -07:00
Simon Mo
c8b69727cd ray stop only kills process with ray keyword (#6257)
* Use psutil to kill processes

* Psutil as core requirement

* Revert "Psutil as core requirement"

This reverts commit d3235ce3d994d2bb7db39e3ad4a46049703898bb.

* Revert "Use psutil to kill processes"

This reverts commit de0ed874fed673f5e98715950688f418bbcc415c.

* Revert back to subproc

* Add comments, grep for ray as well

* SIGTERM
2019-11-24 16:32:07 -08:00
Eric Liang
e5b5c98558 Fix python PATH for build (#6260) 2019-11-24 15:32:06 -08:00
Eric Liang
53641f1f74 Move more unit tests to bazel (#6250)
* move more unit tests to bazel

* move to avoid conflict

* fix lint

* fix deps

* separate

* fix failing tests

* show tests

* ignore mismatch

* try combining bazel runs

* build lint

* remove tests from install

* fix test utils

* better config

* split up

* exclusive

* fix verbosity

* fix tests class

* cleanup

* remove flaky

* fix metrics test

* Update .travis.yml

* no retry flaky

* split up actor

* split basic test

* split up trial runner test

* split stress

* fix basic test

* fix tests

* switch to pytest runner for main

* make microbench not fail

* move load code to py3

* test is no longer package

* bazel to end
2019-11-24 11:43:34 -08:00
Simon Mo
aa8d5d2f6c Rate limit asyncio actor (#6242) 2019-11-24 11:39:28 -08:00
Yuhao Yang
f6a5baf844 [tune] minor doc fix (#6248) 2019-11-23 21:54:41 -08:00
Stephanie Wang
d2662fecea Miscellaneous bug fixes to throw unreconstructable errors for direct calls (#6245)
* Test cases

* Fix InPlasmaError

* raylet fixes to force errors for direct calls

* Disable lineage logging and task pending checks for direct calls

* move todo

* Clean up tests

* Fix bugs in object store for Contains and Delete

* Use direct call in tests

* Fixes, separate actor creation direct call from normal direct call spec
2019-11-23 15:05:49 -08:00
Eric Liang
b052bcf1fc Bazelify tune tests in travis (#6219) 2019-11-22 13:58:50 -08:00
Simon Mo
eb6a93c0f0 [hotfix] fix lint (#6236) 2019-11-21 18:30:57 -08:00
Eric Liang
7559fdb141 [rllib/tune] Cache get_preprocessor() calls, default max_failur… (#6211) 2019-11-21 15:55:56 -08:00
Stephanie Wang
d3227f2f2d Fix bug in direct task calls for objects that were evicted (#6216)
* Fix bug and add some checks

* rename
2019-11-21 15:38:31 -08:00
Simon Mo
29ba6bfc64 Basic Async Actor Call (#6183)
* Start trying to figure out where to put fibers

* Pass is_async flag from python to context

* Just running things in fiber works

* Yield implemented, need some debugging to make it work

* It worked!

* Remove debug prints

* Lint

* Revert the clang-format

* Remove unnecessary log

* Remove unnecessary import

* Add attribution

* Address comment

* Add test

* Missed a merge conflict

* Make test pass and compile

* Address comment

* Rename async -> asyncio

* Move async test to py3 only

* Fix ignore path
2019-11-21 11:56:46 -08:00
Philipp Moritz
a4437813eb [Projects] Unify hyphen vs underscore handling for arguments (#6208) 2019-11-20 23:52:41 -08:00
Stephanie Wang
db77595298 Fix segfault for task arguments passed by value (#6214)
* Fix null data

* rename
2019-11-20 22:02:18 -08:00
Stephanie Wang
c0be9e6738 Resolve dependencies locally before submitting direct actor tasks (#6191)
* Priority queue in direct actor transport by task number

* Move LocalDependencyResolver out to separate file, share with direct actor transport

* works

* Test case for ordering

* Cleanups

* Remove priority queue

* comment

* Share ClientFactoryFn with direct actor transport

* Unit test

* fix
2019-11-20 16:45:19 -08:00
Philipp Moritz
33c768ebe4 Fix worker signal.SIGTERM handler being installed from outside the main thread (#6176) 2019-11-20 11:14:28 -08:00
Ujval Misra
0010382cc7 [tune] Report failures in a separate table (#6160)
* Report errors in a separate table.

* Single error file.
2019-11-20 10:53:47 -08:00
ashione
a1744f67fe Add hostname to nodeinfo (#6156) 2019-11-19 15:03:46 +08:00
Ujval Misra
2965dc1b72 [tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.

* Trigger checkpoint deletes locally in Trainable

* Clean-up code.

* Minor changes.

* Track best checkpoint so far again

* Pulled checkpoint GC out of Trainable.

* Added comments, error logging.

* Immediate pull after checkpoint taken; rsync source delete on pull

* Minor doc fixes

* Fix checkpoint manager bug

* Fix bugs, tests, formatting

* Fix bugs, feature flag for force sync.

* Fix test.

* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.

* Fix bug: update IP of last_result.

* Fixed message.

* Added a lot of logging.

* Changes to ray trial executor.

* More bug fixes (logging after failure), better logging.

* Fix richards bug and logging

* Add comments.

* try-except

* Fix heapq bug.

* .

* Move handling of no available trials to ray_trial_executor (#1)

* Fix formatting bug, lint.

* Addressed Richard's comments

* Revert tests.

* fix rebase

* Fix trial location reporting.

* Fix test

* Fix lint

* Rebase, use ray.get w/ timeout, lint.

* lint

* fix rebase

* Address richard's comments
2019-11-18 01:14:41 -08:00
Stephanie Wang
66edebce3a Spillback scheduling for direct task calls (#6164)
* add dac

* remove caching

* rename return buffer

* cleanup

* add tests

* add perf

* fix

* flip

* remove

* remove it

* lint

* remove fork safety

* lint

* comments

* s/core/client

* wip

* remove

* fmt

* consistently return direct naming

* basic pass by ref

* fix bugs

* wip

* wip

* wip

* wip

* add test

* works now

* fix constructor

* fix merge

* add todo for perf

* fix single client test

* use lower n

* bazel

* faster

* fix core worker test

* init

* fix tests

* no plasma for direct call

* Update worker.py

* add order test

* fixes

* comments

* remove old assert

* lint

* add test

* Very wip

* wip

* add options for tasks

* add test

* fmt

* add backpressure

* remove idle prof event

* lint

* Fix 0 returns

* Set memcopy threads globally

* add benchmark

* Fix object exists

* Fix reference

* Remove return_buffer

* Add check

* add exit handler

* update benchmarks

* Fix compile error

* Fix NoReturn

* Use is instead of == for NoReturn

* fix

* Remove list comprehension

* Fix core worker test

* comment

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* fix merge error

* lint

* wip

* fix merge

* wip

* finish

* lint

* task interface

* add file

* add

* wip

* now works!

* updated

* wip

* dep resolution

* remove remote dep handling

* comments

* fix test_multithreading

* fix merge

* fix exit handling

* fix merge

* comments

* get fallback fetch working

* handle contains

* fix typo

* Skeleton for SubmitTask proto

* Update src/ray/common/id.h

Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>

* comments

* rename to core worker service

* lint

* fix compile

* wip

* update

* error code

* fix up and rename

* clean up call manager

* comments

* add test and cleanup deserialization

* fix pickle

* fix comments, lint

* test todo

* comments

* use shared ptr

* rename

* Update src/ray/protobuf/gcs.proto

Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>

* require transport type for ids; lint

* cleanup

* comments 1

* use worker available for real

* wip

* fix test

* resolve local dependencies test

* add num pending metric

* client factory

* unit test task submission

* wip

* fix bug

* rename

* Pass through node manager port, connect in raylet client

* finish rename

* Switch submit task to grpc

* fix crash

* Check port in use

* fix merge

* comments more

* doc

* Remove default port, set port randomly from driver

* add unique_ptr comment about TaskSpec

* lint

* fix test

* update

* fix lint

* GetMessageMutable should not be const

* iwyu

* fix const

* Update direct_task_transport_test.cc

* fix segfault

* Fix test

* Add RpcAddress, set in actor table data

* fix serialization

* fix lint

* Pass through task caller address

* Fix object manager test

* RpcAddress -> Address

* merge

* Port WorkerLease to grpc

* wip

* fix test

* add mem test

* update

* comments

* fix core worker tests

* fix

* remove old worker lease code

* First pass on spillback

* lint

* crash?

* Debug

* Fix task spec copy, extend test basic

* lint

* Port return worker to grpc

* lint

* Return worker to the correct raylet

* Only request worker if queued tasks

* A bit better failure handling

* Fix unit test

* Add unit test for spillback

* fix

* python test multinode

* update

* updates

* fix
2019-11-17 20:29:32 -08:00
Philipp Moritz
fc655acfee Fix linting on master branch (#6174) 2019-11-16 10:02:58 -08:00
Danyang Zhuo
30e2b6b91b Microbenchmark for inter-node object transfer (#6098) 2019-11-15 21:39:06 -08:00
Adam Gleave
e8cce3fdd4 [autoscaler]: automatically pull new docker image (#6111)
* Docker: automatically pull new image

* Fix missing value in schema

* Address review comments
2019-11-15 21:26:28 -08:00
Edward Oakes
dee696577f Fix passing object ids in local mode (#6170) 2019-11-15 15:46:39 -08:00
Edward Oakes
33040d734f Disable stopgap GC by default (#6165)
* disable stopgap gc by default

* fix gc tests
2019-11-15 15:42:59 -08:00
Hersh Godse
7aa06fb25c [tune] ExperimentalAnalysis in-memory cache (#5962) 2019-11-15 12:47:50 -08:00
Eric Liang
7d33e9949b Integrate ref count module into local memory store (#6122) 2019-11-15 10:52:19 -08:00
Richard Liaw
62cbc043b4 [tune] tbx logger (#6133)
* tbx

* add_hparams

* fix_hparams

* ok

* ok

* fix

* ok

* fix
2019-11-15 08:45:44 -08:00
Eric Liang
8ff393a7bd Handle exchange of direct call objects between tasks and actors (#6147) 2019-11-14 17:32:04 -08:00
Edward Oakes
385783fcec Ray on YARN + Skein Documentation (#6119) 2019-11-14 15:06:05 -08:00
Edward Oakes
e3b95dafeb Fix sigterm_handler (#6141) 2019-11-14 13:41:50 -08:00
Eric Liang
0a3623ded6 Fix memory store wait (#6152) 2019-11-14 10:17:30 -08:00
Ujval Misra
e3e3ad4b25 Add timeout param to ray.get (#6107) 2019-11-14 00:50:04 -08:00
Philipp Moritz
f24d96ec4f Revert "Try to enable dashboard (again) (#6069)" (#6159)
This reverts commit 4044af8520.
2019-11-13 12:32:12 -08:00
Eric Liang
b924299833 Add large scale regression test for RLlib (#6093) 2019-11-13 12:22:55 -08:00
Eric Liang
f3f86385d6 Minimal implementation of direct task calls (#6075) 2019-11-12 11:45:28 -08:00
Stephanie Wang
35d177f459 Use grpc for communication from worker to local raylet (task submission and direct actor args only) (#6118)
* Skeleton for SubmitTask proto

* Pass through node manager port, connect in raylet client

* Switch submit task to grpc

* Check port in use

* doc

* Remove default port, set port randomly from driver

* update

* Fix test

* Fix object manager test
2019-11-11 21:17:25 -08:00
Siyuan (Ryans) Zhuang
f48293f96d Fix deprecated warning (#6142) 2019-11-11 17:49:15 -08:00
Simon Mo
c75ada9e04 [Autoscaler][K8s] Enforce memory limit in k8s yaml (#6138)
* Enforce memory limit in k8s yaml

* Update python/ray/autoscaler/kubernetes/example-full.yaml

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Line wrap
2019-11-11 14:06:34 -08:00